Solr
Apache Solr indexing and search
I've installed solr on snapper and am trying to index the documentation on sockeye as part of the knowledge transfer from Doug.
I've written a script that walks the filesystem on sockeye below /home/ptagdev/ and generates an xml file describing the content and metadata of each file it encounters. Those xml files are then submitted to solr.
Reference this tutorial: http://lucene.apache.org/solr/tutorial.html
create a custom schema
I started with the example schema and tried to turn it into a description of files in a filesystem. The result is here: solr schema.xml for sockeye. The schema sits here in the filesystem:
/home/rday/downloads/apache-solr-1.3.0/example/solr/conf/schema.xml
indexing files
The perl script that walks sockeye's filesystem and emits xml is here: solr-sockeye.pl
A sample output xml is here: sample solr xml for sockeye
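The walk-and-emit step can be sketched in Python as below. This is not the Perl script itself, just a minimal illustration under assumptions: the field names id, extension, uid and mtime are guesses (the last three appear in the facet queries further down, id is a hypothetical unique key), the real schema.xml may define others, and file content is left out to keep the sketch short.

```python
import os
import xml.etree.ElementTree as ET
from datetime import datetime, timezone


def describe_file(path):
    """Build a Solr <add><doc> element for one file (field names are assumed)."""
    st = os.stat(path)
    mtime = datetime.fromtimestamp(st.st_mtime, tz=timezone.utc)
    fields = {
        "id": path,  # hypothetical unique key: the file's path
        "extension": os.path.splitext(path)[1].lstrip("."),
        "uid": str(st.st_uid),  # numeric owner id; the real script may store a name
        "mtime": mtime.strftime("%Y-%m-%dT%H:%M:%SZ"),  # UTC timestamp as used in Solr date fields
    }
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in fields.items():
        field = ET.SubElement(doc, "field", name=name)
        field.text = value
    return add


def walk_and_emit(root, out_dir):
    """Write one <file>.xml under out_dir for every file found below root."""
    written = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            src = os.path.join(dirpath, filename)
            rel = os.path.relpath(src, root)
            dst = os.path.join(out_dir, rel + ".xml")
            os.makedirs(os.path.dirname(dst), exist_ok=True)
            ET.ElementTree(describe_file(src)).write(
                dst, encoding="utf-8", xml_declaration=True
            )
            written.append(dst)
    return written
```

Keep out_dir outside of root, otherwise a later run would index the generated xml files as well.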
posting files to solr
The solr package I downloaded included this tool for posting files:
/home/rday/downloads/apache-solr-1.3.0/example/exampledocs/post.jar
It accepts a list of files as arguments and inserts them into the index. It is invoked like this:
[rday@snapper exampledocs]$ java -jar post.jar /home/rday/bin/solr/usr/pit/ptagdev/ftp_test.txt.xml
SimplePostTool: version 1.2
SimplePostTool: WARNING: Make sure your XML documents are encoded in UTF-8, other encodings are not currently supported
SimplePostTool: POSTing files to http://localhost:8983/solr/update..
SimplePostTool: POSTing file ftp_test.txt.xml
SimplePostTool: COMMITting Solr index changes..
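Judging from the output above, the tool POSTs each file to the update URL and then commits. The same thing could probably be done without java; a rough Python sketch follows. The function names are made up for illustration, and the Content-Type header and the <commit/> message are my assumption about what the update handler expects, not something taken from post.jar's source.

```python
import urllib.request

UPDATE_URL = "http://localhost:8983/solr/update"  # taken from the SimplePostTool output


def build_update_request(xml_bytes, url=UPDATE_URL):
    """Return a POST request carrying one XML update message."""
    return urllib.request.Request(
        url,
        data=xml_bytes,
        headers={"Content-Type": "text/xml; charset=utf-8"},  # assumed header
        method="POST",
    )


def post_files(paths, url=UPDATE_URL):
    """POST each XML file, then send <commit/> so the changes become searchable."""
    for path in paths:
        with open(path, "rb") as handle:
            urllib.request.urlopen(build_update_request(handle.read(), url)).close()
    urllib.request.urlopen(build_update_request(b"<commit/>", url)).close()
```

Calling post_files(["ftp_test.txt.xml"]) would need solr running on localhost; build_update_request can be inspected without a server.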
deleting files from solr
To empty the index, it should work to submit a document named delete.xml with this content to solr (the query *:* matches every document):
<delete><query>*:*</query></delete>
The post.jar tool in /home/rday/downloads/apache-solr-1.3.0/example/exampledocs should do the trick:
java -jar post.jar delete.xml
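A narrower query in the same message should delete only the matching documents. Here is a small sketch for generating delete.xml; the helper names are hypothetical, and the example query extension:txt assumes the schema has a searchable extension field.

```python
from xml.sax.saxutils import escape


def delete_by_query(query="*:*"):
    """Return a Solr delete message; the default query matches every document."""
    return "<delete><query>{}</query></delete>".format(escape(query))


def write_delete_file(path, query="*:*"):
    """Write the delete message to path so it can be handed to post.jar."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(delete_by_query(query))
```

For example, write_delete_file("delete.xml", "extension:txt") followed by java -jar post.jar delete.xml.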
searching solr
The admin console, which includes a query form, is here:
http://localhost:8983/solr/admin/
custom output
The output can be routed through a "response writer" and transformed by an XSLT stylesheet. An example of this is here:
http://localhost:8983/solr/select/?stylesheet=&q=pittag&wt=xslt&tr=example.xsl
A cgi wrapper for that url is here:
http://snapper.psmfc.org/cgi-bin/solr.pl
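The wrapper mostly has to build that URL from a search term. A minimal sketch of the URL-building part, not the actual solr.pl, is below. The function name is made up, and the empty stylesheet= parameter from the example URL is dropped on the assumption that it is not needed.

```python
from urllib.parse import urlencode

SELECT_URL = "http://localhost:8983/solr/select/"


def xslt_search_url(query, stylesheet="example.xsl", base=SELECT_URL):
    """Build a select URL that asks for the XSLT response writer."""
    params = [("q", query), ("wt", "xslt"), ("tr", stylesheet)]
    return base + "?" + urlencode(params)  # urlencode escapes the search term
```

xslt_search_url("pittag") gives back the example URL above, minus the empty parameter.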
schema browser
In the admin console, click schema browser and then fields to see each field, how it is configured, its top ten values, and a histogram of the value distribution.
solr statistics
This url http://localhost:8983/solr/admin/stats.jsp shows how many documents are currently in the index, so you can watch the progress of the indexing program.
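For watching progress from a script, a different route than the stats page is to run a match-all query with rows=0 and read the numFound attribute from the XML response. A sketch, with hypothetical function names and assuming the default XML response format:

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode


def count_url(base="http://localhost:8983/solr/select/"):
    """URL for a match-all query that returns no rows, only the hit count."""
    return base + "?" + urlencode({"q": "*:*", "rows": 0})


def parse_num_found(response_xml):
    """Extract numFound from the <result> element of a Solr XML response."""
    result = ET.fromstring(response_xml).find("result")
    if result is None:
        raise ValueError("no <result> element in response")
    return int(result.get("numFound"))
```

Fetching count_url() with urllib.request.urlopen and passing the body to parse_num_found should give the current document count.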
faceted search results
This query returns the top results and facet information about the query:
http://snapper.psmfc.org:8983/solr/select/?fl=*,score&q=rday&facet=true&facet.field=extension&facet.field=uid&facet.field=mtime
This query returns facet information based on a date field:
http://snapper.psmfc.org:8983/solr/select/?
&fl=*,score
&q=rday
&facet=true
&facet.field=extension
&facet.field=uid
&facet.field=mtime
&facet.query=mtime:[* TO 2000-01-01T00:00:00Z]
&facet.query=mtime:[2000-01-01T00:00:00Z TO 2002-01-01T00:00:00Z]
&facet.query=mtime:[2002-01-01T00:00:00Z TO *]
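Typing the date ranges by hand gets tedious, and the spaces and brackets need URL-encoding in a real request. A sketch that builds the same kind of query from a list of boundary dates follows; the function and parameter names are made up for illustration.

```python
from urllib.parse import urlencode

SELECT_URL = "http://snapper.psmfc.org:8983/solr/select/"


def facet_url(query, fields, date_field=None, boundaries=(), base=SELECT_URL):
    """Build a faceted select URL with optional date-range facet queries."""
    params = [("fl", "*,score"), ("q", query), ("facet", "true")]
    params += [("facet.field", name) for name in fields]
    if date_field and boundaries:
        # Open-ended first and last ranges, as in the example above.
        edges = ["*"] + list(boundaries) + ["*"]
        for low, high in zip(edges, edges[1:]):
            params.append(
                ("facet.query", "{}:[{} TO {}]".format(date_field, low, high))
            )
    return base + "?" + urlencode(params)
```

facet_url("rday", ["extension", "uid", "mtime"], "mtime", ["2000-01-01T00:00:00Z", "2002-01-01T00:00:00Z"]) produces the three ranges shown above in encoded form.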
