Lucene MoreLikeThis Java example

Lucene is an open source search library written in Java; Lucene.NET is its C# port.
Its MoreLikeThis class builds a query that quickly finds similar entries in an index.
It can also find similar documents simply by comparing a string against the indexed fields.
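
To give an intuition for what happens under the hood: MoreLikeThis picks the most distinctive terms of the input, weighting how often a term occurs in the text against how many indexed documents contain it, and turns the best ones into a query. The snippet below is only a plain-Java sketch of that idea, not Lucene's code. The class `InterestingTerms`, its methods, the tiny corpus and the exact idf formula are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene: ranks the terms of a text by tf * idf.
final class InterestingTerms {

    /** Returns up to maxTerms terms of text, most distinctive first. */
    static List<String> topTerms(String text, List<String> corpus, int maxTerms) {
        // Term frequency: how often each term occurs in the input text
        Map<String, Integer> termFreq = new HashMap<>();
        for (String term : tokenize(text)) {
            termFreq.merge(term, 1, Integer::sum);
        }

        Map<String, Double> score = new HashMap<>();
        for (Map.Entry<String, Integer> entry : termFreq.entrySet()) {
            // Document frequency: how many corpus entries contain the term
            int docFreq = 0;
            for (String doc : corpus) {
                if (tokenize(doc).contains(entry.getKey())) docFreq++;
            }
            if (docFreq == 0) continue; // a term unknown to the corpus cannot match anything
            // Assumed idf formula for this sketch: rarer terms get a higher weight
            double idf = Math.log((double) corpus.size() / docFreq) + 1.0;
            score.put(entry.getKey(), entry.getValue() * idf);
        }

        List<String> terms = new ArrayList<>(score.keySet());
        // Highest score first; ties are broken alphabetically to keep the order stable
        terms.sort((a, b) -> {
            int byScore = Double.compare(score.get(b), score.get(a));
            return byScore != 0 ? byScore : a.compareTo(b);
        });
        return new ArrayList<>(terms.subList(0, Math.min(maxTerms, terms.size())));
    }

    /** Lower-cases the text and splits it on non-word characters. */
    static List<String> tokenize(String s) {
        List<String> tokens = new ArrayList<>();
        for (String t : s.toLowerCase().split("\\W+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    public static void main(String[] args) {
        List<String> corpus = Arrays.asList(
                "doduck prototype your idea",
                "doduck love programming",
                "we do prototype",
                "we love challenge");
        // "idea" is the rarest known term, "please" is unknown and dropped
        System.out.println(topTerms("doduck prototype idea please", corpus, 2)); // [idea, doduck]
    }
}
```

In the real library the analyzer does the tokenizing and the index supplies the document frequencies; the sketch only shows why rare terms dominate the generated query.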

Lucene MoreLikeThis on fields

On Stack Overflow, when you write a question you are shown a list of similar questions:

MoreLikeThis Lucene real example on Stack Overflow

For this prototype, the documents in our Lucene index have the fields “id”, “title” and “content”.
Let’s search for documents whose “title” and “content” fields are related to the string “doduck prototype”.

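IndexReader reader = [...];
Analyzer analyzer = [...];
IndexSearcher indexSearcher = new IndexSearcher(reader);

MoreLikeThis mlt = new MoreLikeThis(reader);
mlt.setFieldNames(new String[]{"title", "content"});
mlt.setAnalyzer(analyzer);

// Named textReader so it does not clash with the IndexReader declared above
Reader textReader = new StringReader("doduck prototype");
// The second argument is the field name handed to the analyzer (null here, as in the full example)
Query query = mlt.like(textReader, null);

TopDocs topDocs = indexSearcher.search(query, 5);

for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
    Document aSimilarDocument = indexSearcher.doc(scoreDoc.doc);
    // Hits are sorted by score: the most similar documents come first
}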

Lucene MoreLikeThis on document ID

Sometimes the document is already in the Lucene index, for example when a user is looking at it.
In that case Lucene can search for related documents using the document ID.
Note that this is Lucene's internal document number (the value found in scoreDoc.doc), not the “id” field.

Stack Overflow does this too:

MoreLikeThis real example similar from document

IndexReader reader = [...];
Analyzer analyzer = [...];
IndexSearcher indexSearcher = new IndexSearcher(reader);
int currentlyReadDocumentID = [...]; // Lucene's internal document number
MoreLikeThis mlt = new MoreLikeThis(reader);
mlt.setFieldNames(new String[]{"title", "content"});
mlt.setAnalyzer(analyzer);

Query query = mlt.like(currentlyReadDocumentID);

TopDocs topDocs = indexSearcher.search(query, 5);

for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
    Document aSimilarDocument = indexSearcher.doc(scoreDoc.doc);
    // Hits are sorted by score: the most similar documents come first
}
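
One practical detail: a query built from a document that is already in the index will usually match that same document too, typically near the top. A minimal sketch of dropping it from the hit list follows, written in plain Java so it does not depend on Lucene types. `SelfHitFilter` and `withoutSource` are hypothetical names, and `rankedDocIds` stands in for the `scoreDoc.doc` values of the hits.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: removes the source document from a ranked hit list.
final class SelfHitFilter {

    /** Keeps the hits in their original order, skips sourceDocId and returns at most maxResults IDs. */
    static List<Integer> withoutSource(int[] rankedDocIds, int sourceDocId, int maxResults) {
        List<Integer> kept = new ArrayList<>();
        for (int docId : rankedDocIds) {
            if (docId == sourceDocId) continue;   // the document the user is already reading
            if (kept.size() == maxResults) break; // enough similar documents collected
            kept.add(docId);
        }
        return kept;
    }

    public static void main(String[] args) {
        int[] hits = {7, 3, 12, 5, 9, 1}; // made-up document numbers, best hit first
        System.out.println(withoutSource(hits, 7, 5)); // [3, 12, 5, 9, 1]
    }
}
```

With this approach it makes sense to ask the searcher for one hit more than needed (6 instead of 5), so that a full list remains after the source document is removed.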

Lucene MoreLikeThis full example

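import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.queries.mlt.MoreLikeThis;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

	// The members below belong to a class named Main
	public static void main(String[] args) throws IOException {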
		Main m = new Main();
		m.init();
		m.writerEntries();
		m.findSilimar("doduck prototype");
	}

	private Directory indexDir;
	private StandardAnalyzer analyzer;
	private IndexWriterConfig config;
	
	public void init() throws IOException{
		analyzer = new StandardAnalyzer(Version.LUCENE_42);
		config = new IndexWriterConfig(Version.LUCENE_42, analyzer);
		config.setOpenMode(OpenMode.CREATE_OR_APPEND);
		
		indexDir = new RAMDirectory(); //don't write on disk
		//indexDir = FSDirectory.open(new File("/Path/to/luceneIndex/")); //write on disk
	}
	
	public void writerEntries() throws IOException{
		IndexWriter indexWriter = new IndexWriter(indexDir, config);
		indexWriter.commit();
		
		Document doc1 = createDocument("1","doduck","prototype your idea");
		Document doc2 = createDocument("2","doduck","love programming");
		Document doc3 = createDocument("3","We do", "prototype");
		Document doc4 = createDocument("4","We love", "challenge");
		indexWriter.addDocument(doc1);
		indexWriter.addDocument(doc2);
		indexWriter.addDocument(doc3);
		indexWriter.addDocument(doc4);
		
		indexWriter.commit();
		indexWriter.forceMerge(100, true);
		indexWriter.close();
	}

	private Document createDocument(String id, String title, String content) {
		FieldType type = new FieldType();
		type.setIndexed(true);
		type.setStored(true);
		type.setStoreTermVectors(true); // lets MoreLikeThis.like(int docNum) read a document's terms without re-analyzing its stored fields
		
		Document doc = new Document();
		doc.add(new StringField("id", id, Store.YES));
		doc.add(new Field("title", title, type));
		doc.add(new Field("content", content, type));
		return doc;
	}


	private void findSilimar(String searchForSimilar) throws IOException {
		IndexReader reader = DirectoryReader.open(indexDir);
		IndexSearcher indexSearcher = new IndexSearcher(reader);

		MoreLikeThis mlt = new MoreLikeThis(reader);
		// The defaults (minimum term frequency 2, minimum document frequency 5)
		// would discard every term of such a tiny index, so lower both thresholds.
		mlt.setMinTermFreq(0);
		mlt.setMinDocFreq(0);
		mlt.setFieldNames(new String[]{"title", "content"});
		mlt.setAnalyzer(analyzer);

		Reader sReader = new StringReader(searchForSimilar);
		Query query = mlt.like(sReader, null);

		TopDocs topDocs = indexSearcher.search(query, 10);

		for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
			Document aSimilar = indexSearcher.doc(scoreDoc.doc);
			String similarTitle = aSimilar.get("title");
			String similarContent = aSimilar.get("content");

			System.out.println("====similar finded====");
			System.out.println("title: " + similarTitle);
			System.out.println("content: " + similarContent);
		}

		reader.close(); // release the index reader once the results have been read
	}

Result:

====similar finded====
title: doduck
content: prototype your idea
====similar finded====
title: doduck
content: love programming
====similar finded====
title: We do
content: prototype
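
The setMinTermFreq(0) and setMinDocFreq(0) calls in the full example matter for an index this small. By default MoreLikeThis ignores terms that occur fewer than 2 times in the source text or in fewer than 5 indexed documents, which here would leave nothing to search for. The following plain-Java sketch only illustrates that filtering step; `FrequencyThresholds`, its methods and the simple tokenizer are assumptions for illustration, not Lucene code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper, not part of Lucene: shows the effect of the two frequency thresholds.
final class FrequencyThresholds {

    /** Terms of text that occur at least minTermFreq times in it and in at least minDocFreq corpus entries. */
    static Set<String> candidateTerms(String text, List<String> corpus, int minTermFreq, int minDocFreq) {
        Map<String, Integer> termFreq = new TreeMap<>();
        for (String term : tokenize(text)) {
            termFreq.merge(term, 1, Integer::sum);
        }

        Set<String> kept = new TreeSet<>();
        for (Map.Entry<String, Integer> entry : termFreq.entrySet()) {
            if (entry.getValue() < minTermFreq) continue; // too rare in the source text
            int docFreq = 0;
            for (String doc : corpus) {
                if (tokenize(doc).contains(entry.getKey())) docFreq++;
            }
            if (docFreq < minDocFreq) continue;           // too rare in the corpus
            kept.add(entry.getKey());
        }
        return kept;
    }

    /** Lower-cases the text and splits it on non-word characters. */
    static List<String> tokenize(String s) {
        List<String> tokens = new ArrayList<>();
        for (String t : s.toLowerCase().split("\\W+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Title and content of the four sample documents, joined into one string each
        List<String> corpus = Arrays.asList(
                "doduck prototype your idea",
                "doduck love programming",
                "we do prototype",
                "we love challenge");
        System.out.println(candidateTerms("doduck prototype", corpus, 2, 5)); // []
        System.out.println(candidateTerms("doduck prototype", corpus, 0, 0)); // [doduck, prototype]
    }
}
```

On a large index the default thresholds are usually the better choice, because they keep one-off words and typos out of the generated query.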

Check out the full Lucene MoreLikeThis example on GitHub.

  • Rod Mccartney

    How can I substitute the strings in doc1,doc2,doc3, etc for files in a directory or even better for a database?