Monday, December 10, 2012

Lucene - Updating index for an existing file


Updating an index can mean one of two things:
  • Adding a new file to existing index
  • Updating an existing file

Adding a new file to an existing index

Adding a new file is simple; please see my previous article (click here).

Updating an existing file

When you call IndexWriter.addDocument() for a document that is already in the index, it does not overwrite the previous document; instead, the index ends up with multiple copies of the same document.

Lucene does not modify an indexed document in place. To update an index incrementally you must first delete the documents that changed and then re-add them to the index. (IndexWriter.updateDocument(Term, Document) packages these two steps into a single call.) This example shows how to delete a file; you can then re-add the same file as described in my previous article (click here).
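The difference between a plain re-add and a delete-then-add can be sketched with a small in-memory stand-in. This is a toy model in plain Java, not Lucene's API: ToyIndex, Doc, deleteById, update and count are hypothetical names, chosen only to mirror the roles of addDocument and deleteDocuments(Term).

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Toy in-memory stand-in for an index; NOT Lucene's API. */
class ToyIndex {
    /** One indexed entry: a primary-key id plus the file contents. */
    static final class Doc {
        final String id;
        final String contents;
        Doc(String id, String contents) { this.id = id; this.contents = contents; }
    }

    private final List<Doc> docs = new ArrayList<>();

    /** Plays the role of addDocument: always appends, never overwrites. */
    void add(Doc doc) { docs.add(doc); }

    /** Plays the role of deleteDocuments(Term): removes every entry with this id. */
    int deleteById(String id) {
        int removed = 0;
        for (Iterator<Doc> it = docs.iterator(); it.hasNext();) {
            if (it.next().id.equals(id)) { it.remove(); removed++; }
        }
        return removed;
    }

    /** Update = delete the old copies, then add the new version. */
    void update(Doc doc) {
        deleteById(doc.id);
        add(doc);
    }

    /** Number of entries currently stored under this id. */
    int count(String id) {
        int n = 0;
        for (Doc d : docs) if (d.id.equals(id)) n++;
        return n;
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.add(new Doc("2", "old text"));
        index.add(new Doc("2", "new text"));      // naive re-add: two copies now
        System.out.println(index.count("2"));     // prints 2
        index.update(new Doc("2", "newer text")); // delete, then add
        System.out.println(index.count("2"));     // prints 1
    }
}
```

The point of the sketch is only that an update needs a key to delete by, which is why the example in this post stores an "id" field with every document.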

How to delete documents from the index?

IndexWriter allows you to delete by Term or by Query. The deletes are buffered and then periodically flushed to the index, and made visible once commit() or close() is called.
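The buffering behaviour described above can be pictured with a small model. This is a simplified sketch in plain Java, not Lucene code: BufferedDeleteModel and its methods are hypothetical, adds are treated as immediately visible for brevity, and the model ignores the fact that a real reader is a snapshot that has to be reopened.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Toy model of buffered deletes; NOT Lucene's API. */
class BufferedDeleteModel {
    // id -> file path, i.e. what a search currently sees
    private final Map<String, String> visible = new LinkedHashMap<>();
    // ids whose deletion has been requested but not yet committed
    private final Set<String> pendingDeletes = new HashSet<>();

    /** Simplification: adds become visible immediately. */
    void add(String id, String path) { visible.put(id, path); }

    /** Only records the request; nothing changes for searches yet. */
    void deleteDocuments(String id) { pendingDeletes.add(id); }

    /** Applies all buffered deletes so that searches stop seeing them. */
    void commit() {
        visible.keySet().removeAll(pendingDeletes);
        pendingDeletes.clear();
    }

    boolean isSearchable(String id) { return visible.containsKey(id); }

    public static void main(String[] args) {
        BufferedDeleteModel writer = new BufferedDeleteModel();
        writer.add("2", "Javascript.txt");
        writer.deleteDocuments("2");
        System.out.println(writer.isSearchable("2")); // prints true: delete is only buffered
        writer.commit();
        System.out.println(writer.isSearchable("2")); // prints false: delete is now visible
    }
}
```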

IndexReader can also delete documents, by Term or document number, but you must close any open IndexWriter before using IndexReader to make changes (and vice versa). IndexReader also buffers the deletions and does not write changes to the index until close() is called, but if you use that same IndexReader for searching, the buffered deletions will immediately take effect. Unlike IndexWriter's delete methods, IndexReader's methods return the number of documents that were deleted.

Generally it's best to use IndexWriter for deletions, unless 1) you must delete by document number, 2) you need your searches to immediately reflect the deletions or 3) you must know how many documents were deleted for a given deleteDocuments invocation.

If you would otherwise delete by document number but want to use IndexWriter, one common approach is to add a primary key field that holds a unique ID string for each document. You can then delete a single document by creating a Term containing that ID and passing it to IndexWriter's deleteDocuments(Term) method.

Once a document is deleted it will not appear in TermDocs nor TermPositions enumerations, nor any search results. Attempts to load the document will result in an exception. The presence of this document may still be reflected in the docFreq statistics, and thus alter search scores, though this will be corrected eventually as segments containing deletions are merged.
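Why docFreq can lag behind the search results is easier to see with a toy segment. The sketch below is plain Java, not Lucene code, and it assumes a drastically simplified segment in which each document holds a single term; SegmentModel, hits and merge are hypothetical names.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Toy model of one index segment; NOT Lucene's API. Each document holds one term. */
class SegmentModel {
    private List<String> docs = new ArrayList<>(); // list position = document number
    private BitSet deleted = new BitSet();         // documents marked as deleted

    void add(String term) { docs.add(term); }

    /** Deleting only marks the document; its data stays in the segment. */
    void delete(int docNum) { deleted.set(docNum); }

    /** Index-time statistic: still counts documents that are marked deleted. */
    int docFreq(String term) {
        int n = 0;
        for (String t : docs) if (t.equals(term)) n++;
        return n;
    }

    /** Search results skip documents that are marked deleted. */
    int hits(String term) {
        int n = 0;
        for (int i = 0; i < docs.size(); i++) {
            if (!deleted.get(i) && docs.get(i).equals(term)) n++;
        }
        return n;
    }

    /** Merging rewrites the segment without deleted documents and renumbers the rest. */
    void merge() {
        List<String> live = new ArrayList<>();
        for (int i = 0; i < docs.size(); i++) {
            if (!deleted.get(i)) live.add(docs.get(i));
        }
        docs = live;
        deleted = new BitSet();
    }

    public static void main(String[] args) {
        SegmentModel segment = new SegmentModel();
        segment.add("object");
        segment.add("object");
        segment.delete(1);
        System.out.println(segment.hits("object") + " " + segment.docFreq("object")); // prints "1 2"
        segment.merge();
        System.out.println(segment.hits("object") + " " + segment.docFreq("object")); // prints "1 1"
    }
}
```

The renumbering in merge() is also the reason document numbers make poor long-term keys, and why this post deletes by an "id" field instead.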

To know more, click here.

About the example :

As mentioned above, we will create a primary key field that holds a unique ID string for each document and store it in the index.

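// NOT_ANALYZED indexes the ID as a single exact term, so a delete by Term("id", ...) matches it verbatim
doc.add(new Field("id", "" + i, Field.Store.YES, Field.Index.NOT_ANALYZED));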

When you add a file to an existing index, you can take the current maximum ID, increment it, and store it with the new document, as shown below.

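// Get the current maximum ID; maxDoc() is one greater than the largest document number.
// Caution: maxDoc() can shrink once segments containing deletions are merged, so IDs derived this way may repeat.
IndexReader iReader = IndexReader.open(FSDirectory.open(new File(index)), true); // true = read-only
int i = iReader.maxDoc();
iReader.close();
i++;

doc.add(new Field("id", "" + i, Field.Store.YES, Field.Index.NOT_ANALYZED));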

Let's look at the delete method:

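public static void deleteIndex(String id) {
    System.out.println("Deleting index...." + id);
    try {
        // The term must match the value stored in the "id" field exactly
        Term term = new Term("id", id);
        Directory directory = FSDirectory.open(new File(index));
        // false = open the reader in read-write mode so that it can delete
        IndexReader indexReader = IndexReader.open(directory, false);
        indexReader.deleteDocuments(term);
        // Write the buffered deletions to the index, then release the reader
        indexReader.flush();
        indexReader.close();
    }
    catch (Exception e) {
        e.printStackTrace();
    }
}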

To run the complete example, you need to add lucene-core-3.0.2.jar to your classpath. Also check the Lucene API documentation for more details.



Console output :

Creating index....
C:\TestLucene\files\Java.txt
C:\TestLucene\files\Javascript.txt
C:\TestLucene\files\SQL.txt
Searching.... 'Object'
Found in :: 2 C:\TestLucene\files\Javascript.txt
Found in :: 1 C:\TestLucene\files\Java.txt
Updating index....
C:\TestLucene\newFiles\PHP.txt
Searching.... 'Object'
Found in :: 4 C:\TestLucene\newFiles\PHP.txt
Found in :: 2 C:\TestLucene\files\Javascript.txt
Found in :: 1 C:\TestLucene\files\Java.txt
Deleting index....2
Searching.... 'Object'
Found in :: 4 C:\TestLucene\newFiles\PHP.txt
Found in :: 1 C:\TestLucene\files\Java.txt