30 November 2011

Apache Solr: some optimization and troubleshooting


Checking for corrupt Indexes
Make a backup of the existing index before running the CheckIndex tool. First create a sample folder, then copy lucene-core-x.jar and the index files into it. Suppose the directory structure is:

../check-corrupt-index
|-- lucene-core-3.5.0.jar
`-- test-index
    |-- _1.fdt
    |-- _1.fdx
    |-- _1.fnm
    |-- _1.frq
    |-- _1.nrm
    |-- _1.prx
    |-- _1.tii
    |-- _1.tis
    ....
    |-- _z.tis
    |-- segments.gen
    `-- segments_12
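
The copy step can be scripted. The following is a minimal sketch, not a definitive procedure: the function name prepare_check_dir and its three arguments are hypothetical, and the folder names simply mirror the layout above.

```shell
# Copy a Lucene jar and an index directory into a scratch folder for CheckIndex.
# All three paths are supplied by the caller (hypothetical helper):
#   $1 = path to lucene-core-x.jar
#   $2 = index directory to copy
#   $3 = scratch folder, e.g. check-corrupt-index
prepare_check_dir() {
    jar=$1
    index=$2
    dest=$3
    # Refuse to reuse an existing copy so stale files never mix with new ones.
    if [ -e "$dest/test-index" ]; then
        echo "refusing to overwrite $dest/test-index" >&2
        return 1
    fi
    mkdir -p "$dest" || return 1
    cp "$jar" "$dest/" || return 1
    # -R copies the whole directory; the original index stays untouched.
    cp -R "$index" "$dest/test-index" || return 1
}
```

Working on the copy means the live index is never modified by the check.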

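To check whether the index is corrupt, run the following command from the check-corrupt-index folder so that the Lucene jar is on the classpath:

$ java -cp lucene-core-3.5.0.jar -ea:org.apache.lucene... org.apache.lucene.index.CheckIndex /opt/apache-solr-3.5.0/example/solr/data/check-corrupt-index/test-index/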

Here /opt/apache-solr-3.5.0/.. is the path to the index to check. The output will look like this:

Opening index @ /opt/apache-solr-3.5.0/example/solr/data/check-corrupt-index/test-index/

Segments file=segments_12 numSegments=10 version=3.5 format=FORMAT_3_1 [Lucene 3.1+]
1 of 10: name=_c docCount=2938
compound=false
hasProx=true
numFiles=8
size (MB)=38.219
diagnostics = {mergeFactor=10, os.version=2.6.18-194.el5, os=Linux, lucene.version=3.5.0 1204988 - simon - 2011-11-22 14:46:51, source=merge,
os.arch=i386, mergeMaxNumSegments=-1, java.version=1.6.0_11, java.vendor=Sun Microsystems Inc.}
no deletions
test: open reader.........OK
test: fields..............OK [11 fields]
test: field norms.........OK [5 fields]
test: terms, freq, prox...OK [171871 terms; 1354874 terms/docs pairs; 6045083 tokens]
test: stored fields.......OK [35110 total field count; avg 11.95 fields per doc]
test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]

....

10 of 10: name=_13 docCount=22
compound=false
hasProx=true
numFiles=8
size (MB)=2.192
diagnostics = {os.version=2.6.18-194.el5, os=Linux, lucene.version=3.5.0 1204988 - simon - 2011-11-22 14:46:51, source=flush,
os.arch=i386, java.version=1.6.0_11, java.vendor=Sun Microsystems Inc.}
no deletions
test: open reader.........OK
test: fields..............OK [11 fields]
test: field norms.........OK [5 fields]
test: terms, freq, prox...OK [6290 terms; 9779 terms/docs pairs; 35206 tokens]
test: stored fields.......OK [248 total field count; avg 11.273 fields per doc]
test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]

No problems were detected with this index.


The last line shows the status. If there is no problem, it says "No problems were detected with this index." Otherwise it shows an error message such as "org.apache.lucene.index.CorruptIndexException: did not read all bytes" or similar. To fix the problem, run the same command with the -fix option. Be aware that -fix removes the segments it cannot read, so any documents in them are lost; this is why the backup matters.

$ java -cp lucene-core-3.5.0.jar -ea:org.apache.lucene... org.apache.lucene.index.CheckIndex /opt/apache-solr-3.5.0/example/solr/data/check-corrupt-index/test-index -fix
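
Before reaching for -fix, the status line can be tested from a script. This is a rough sketch that only looks for the "No problems were detected" line quoted above; the function names report_is_clean and check_index are hypothetical, and the jar and index paths are passed in by the caller.

```shell
# Return 0 when a CheckIndex report (read from stdin) contains the clean-status line.
report_is_clean() {
    grep -q 'No problems were detected with this index'
}

# Run CheckIndex without -fix and print a one-line verdict.
#   $1 = path to the lucene-core jar, $2 = index directory
# Any outcome other than the clean-status line, including a failed java
# launch, is reported as "problems found".
check_index() {
    if java -cp "$1" -ea:org.apache.lucene... \
            org.apache.lucene.index.CheckIndex "$2" 2>&1 | report_is_clean; then
        echo "clean"
    else
        echo "problems found: review the full report before using -fix"
    fi
}
```

For example, `check_index lucene-core-3.5.0.jar test-index` run from the check-corrupt-index folder prints a single verdict instead of the full report.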


For more information check:
http://solr.pl/en/2011/01/17/checkindex-for-the-rescue

Reduce the number of files the index is made of
To make the index more compact by merging all the segment files into one segment, run the optimize command as follows:

curl 'http://localhost:8983/solr/update' --data-binary '<optimize/>' -H 'Content-type:text/xml; charset=utf-8'


It is good practice to run the optimize command on a schedule, for example once a day. Run it only on the master server, and at a time when the master is lightly loaded rather than during peak indexing. The optimize command is I/O-intensive, and this can and will affect the performance of the server you send it to. Also, remember to send the optimize command to every core you have deployed on your Solr server.
You should always measure the performance of Solr servers on optimized and non-optimized indexes to ensure that the optimization is really needed.
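
Sending optimize to every core can be wrapped in a small script. The sketch below assumes a multi-core setup where each core is reachable under /solr/<core>/update; the SOLR_BASE variable, the function names, and any core names you pass in are illustrative, not part of Solr.

```shell
# Base URL of the Solr server (assumption: the default example port).
SOLR_BASE="${SOLR_BASE:-http://localhost:8983/solr}"

# Print the update URL for one core; an empty name means a single-core setup.
update_url() {
    if [ -n "$1" ]; then
        printf '%s/%s/update\n' "$SOLR_BASE" "$1"
    else
        printf '%s/update\n' "$SOLR_BASE"
    fi
}

# Send <optimize/> to each core named in the arguments, stopping on the first failure.
optimize_cores() {
    for core in "$@"; do
        curl -sS "$(update_url "$core")" \
            --data-binary '<optimize/>' \
            -H 'Content-type:text/xml; charset=utf-8' || return 1
    done
}
```

A daily cron job on the master could then call something like `optimize_cores core0 core1`, with the core names replaced by your own.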

Dealing with a locked index
If no indexing process is running and the lock was left behind by an error, go to the directory containing the index files. One of the files in the /usr/share/solr/data/index/ directory is the one we are looking for. That file is:
lucene-fe3fc928a4bbfeb55082e49b32a70c10-write.lock
Remove that file and restart Jetty.
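
The lock file name contains a hash, so it differs between installations. As a hedged sketch, the helpers below (hypothetical names) find and remove any file ending in write.lock in a given index directory. Only use them after confirming that no indexing process is running.

```shell
# Print every Lucene write-lock file found directly inside the index directory.
#   $1 = index directory, for example /usr/share/solr/data/index
list_write_locks() {
    for f in "$1"/*write.lock; do
        # An unmatched glob stays literal, so check that the file really exists.
        if [ -e "$f" ]; then
            printf '%s\n' "$f"
        fi
    done
    return 0
}

# Delete the lock files and report each one that was removed.
remove_write_locks() {
    list_write_locks "$1" | while IFS= read -r lock; do
        rm -f -- "$lock" && printf 'removed %s\n' "$lock"
    done
}
```

Running `list_write_locks` first shows what would be deleted; Jetty still needs a restart afterwards.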

Merge Factor
The lower the mergeFactor setting, the longer indexing will take, but search speed improves because the index is kept in fewer segments. On the other hand, the higher the mergeFactor setting, the less time indexing will take, but search time may degrade because there will be more segments, and thus more files making up the index.

MergeFactor Tradeoffs
High value merge factor (e.g., 25):
    * Pro: Generally improves indexing speed
    * Con: Less frequent merges, resulting in a collection with more index files which may slow searching

Low value merge factor (e.g., 2):
    * Pro: Smaller number of index files, which speeds up searching.
    * Con: More segment merges slow down indexing.
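
One way to see which side of this tradeoff an index is on is to count its files and segments before and after changing mergeFactor. This is a minimal sketch: the function names are hypothetical, and num_segments assumes the "numSegments=" field that appears in the CheckIndex report shown earlier.

```shell
# Print the number of files in an index directory.
#   $1 = index directory
count_index_files() {
    find "$1" -type f | wc -l | tr -d ' '
}

# Extract the numSegments value from a CheckIndex report read on stdin.
num_segments() {
    sed -n 's/.*numSegments=\([0-9][0-9]*\).*/\1/p' | head -n 1
}
```

Comparing both numbers across a few mergeFactor values, together with measured indexing and query times, gives a concrete basis for choosing a setting.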
