Detecting string similarity in Java and C#

Detecting when two texts are very similar can be useful in many cases.

With string similarity algorithms we can de-duplicate documents in a database, clean up documents, find near-duplicate HTML pages on a website, and much more.
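
As a rough illustration of the de-duplication use case, here is a minimal sketch in plain Java. Everything in it is an assumption made for the example: the class NearDuplicateFilter, the 0.7 threshold, and the wordOverlap score (a simple word-level Jaccard index used as a stand-in, not one of the Lucene algorithms). Any function that returns a similarity between 0 and 1 could be plugged in instead.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.BiFunction;

// Hypothetical helper, written for illustration only.
final class NearDuplicateFilter {

    // Keeps a document only if it is not too similar to one already kept.
    // Compares every document with every kept one, so it is O(n^2).
    static List<String> deduplicate(List<String> docs,
                                    BiFunction<String, String, Double> similarity,
                                    double threshold) {
        List<String> kept = new ArrayList<>();
        for (String doc : docs) {
            boolean duplicate = false;
            for (String other : kept) {
                if (similarity.apply(doc, other) >= threshold) {
                    duplicate = true;
                    break;
                }
            }
            if (!duplicate) {
                kept.add(doc);
            }
        }
        return kept;
    }

    // Stand-in similarity (Jaccard index on words): shared distinct words
    // divided by all distinct words. 1.0 means the same set of words.
    static double wordOverlap(String a, String b) {
        Set<String> wordsA = new HashSet<>(Arrays.asList(a.toLowerCase().split("\\s+")));
        Set<String> wordsB = new HashSet<>(Arrays.asList(b.toLowerCase().split("\\s+")));
        Set<String> union = new HashSet<>(wordsA);
        union.addAll(wordsB);
        if (union.isEmpty()) {
            return 1.0; // nothing to compare, treat as identical
        }
        wordsA.retainAll(wordsB); // wordsA now holds the intersection
        return (double) wordsA.size() / union.size();
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "the quick brown fox",
                "the quick brown fox jumps",
                "hello world");
        // The second document shares 4 of 5 distinct words with the first (0.8 >= 0.7).
        System.out.println(deduplicate(docs, NearDuplicateFilter::wordOverlap, 0.7));
        // prints [the quick brown fox, hello world]
    }
}
```

The threshold is the part to tune: too low and distinct documents get dropped, too high and near-duplicates slip through.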

Some popular algorithms for similarity detection are the Levenshtein distance, the n-gram distance and the Jaro-Winkler distance.

You can find the source code of those algorithms in the Java Lucene project, or in Lucene.Net for C#.

	import org.apache.lucene.search.spell.JaroWinklerDistance;
	import org.apache.lucene.search.spell.LevensteinDistance;
	import org.apache.lucene.search.spell.NGramDistance;

	// Wrapper class added so the snippet compiles; its name is arbitrary.
	// Newer Lucene releases spell the first class LevenshteinDistance.
	public class StringSimilarityDemo {

		// Reference text
		private static String wikiOK = "Wikipedia is a collaboratively edited, multilingual, free Internet encyclopedia supported by the non-profit Wikimedia Foundation";
		// Same text with a few words changed
		private static String wikiKO =  "wiki is a collaboratively edited, unilangual, pay Internet encyclopedia supported by the non-profit Wikimedia Foundation";
		// Same words as the reference, with the two halves swapped
		private static String wikiOKRevert = "Internet encyclopedia supported by the non-profit Wikimedia Foundation Wikipedia is a collaboratively edited, multilingual, free ";

		public static void main(String[] args) {
			LevensteinDistance levenstein = new LevensteinDistance();
			NGramDistance nGram  = new NGramDistance();
			JaroWinklerDistance jaroWinkler = new JaroWinklerDistance();

			// Compare the reference text with the edited one
			float simlOkKo = levenstein.getDistance(wikiOK, wikiKO);
			float simnOkKo = nGram.getDistance(wikiOK, wikiKO);
			float simjOkKo = jaroWinkler.getDistance(wikiOK, wikiKO);

			System.out.println("==wikiOK - wikiKO ==");
			System.out.println("levenstein: "+simlOkKo);
			System.out.println("nGram: "+simnOkKo);
			System.out.println("simjOkKo: "+simjOkKo);
			System.out.println();

			// Compare the reference text with the reordered one
			float simlrevert = levenstein.getDistance(wikiOK, wikiOKRevert);
			float simnrevert = nGram.getDistance(wikiOK, wikiOKRevert);
			float simjrevert = jaroWinkler.getDistance(wikiOK, wikiOKRevert);

			System.out.println("==wikiOK - wikiOKRevert ==");
			System.out.println("levenstein: "+simlrevert);
			System.out.println("nGram: "+simnrevert);
			System.out.println("simjOkKo: "+simjrevert);
		}
	}

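The output will be (despite the method name, getDistance returns a similarity score between 0 and 1, where 1 means the two strings are identical):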

==wikiOK - wikiKO ==
levenstein: 0.890625
nGram: 0.87890625
simjOkKo: 0.80534005

==wikiOK - wikiOKRevert ==
levenstein: 0.14728683
nGram: 0.13953489
simjOkKo: 0.8153731
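
To help read the Levenshtein scores above, here is a minimal sketch of a normalized Levenshtein similarity in plain Java. It is not Lucene's source code: the class LevenshteinSketch and its methods are made up for illustration, and it assumes the common normalization 1 - (edit distance / length of the longer string), which as far as I can tell is also what Lucene applies.

```java
// Hypothetical class, written for illustration only (not Lucene's code).
final class LevenshteinSketch {

    // Number of single-character insertions, deletions and substitutions
    // needed to turn a into b (classic two-row dynamic programming).
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // building the first j chars of b from "" costs j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // reducing the first i chars of a to "" costs i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev; // swap the rows for the next iteration
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Similarity in [0, 1]: 1 means identical, 0 means maximally different.
    static float similarity(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        if (longest == 0) {
            return 1.0f; // two empty strings are identical
        }
        return 1.0f - (float) editDistance(a, b) / longest;
    }

    public static void main(String[] args) {
        System.out.println(editDistance("kitten", "sitting")); // prints 3
        System.out.println(similarity("abcd", "abcf"));        // prints 0.75
        System.out.println(similarity("abcd", "cdab"));        // prints 0.0
    }
}
```

The last call shows why reordering is so expensive for an edit distance: "cdab" has the same characters as "abcd", yet every position needs an edit. This fits the low Levenshtein score for wikiOKRevert, while the Jaro-Winkler score stays close to the one for wikiKO.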