I’m working on a blog post describing how blog search engines like Technorati, PubSub, and Feedster could/should use language categorization to help deal with the chaos of tagging and full-text search. Google has done this for a long time now and Technorati has it in beta.

While I’m still working on the post I wanted to get the source out there so that developers could play with the code and give me feedback.

The project is called ngramcat (which stands for NGram Categorizer) and is based on a 1994 paper by Cavnar and Trenkle entitled N-Gram-Based Text Categorization.
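
For anyone who hasn't read the paper, here is a minimal sketch of the idea as I understand it: build a ranked list of the most frequent character n-grams for each language, do the same for the document, and pick the language whose ranking is closest by an "out-of-place" distance. This is not the ngramcat API. The class name `NGramSketch`, its method names, and the 300-entry profile cutoff are assumptions made for illustration, and the tokenizing and padding are simplified compared to the paper.

```java
import java.util.*;

// Hypothetical illustration only; not part of ngramcat.
final class NGramSketch {
    static final int MAX_PROFILE = 300; // assumed cutoff for profile length

    // Rank the n-grams (n = 1..5) of a text from most to least frequent.
    static List<String> profile(String text, int maxSize) {
        Map<String, Integer> counts = new HashMap<>();
        // Split on anything that is not a letter or apostrophe.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}']+")) {
            if (token.isEmpty()) continue;
            String padded = "_" + token + "_"; // mark word boundaries
            for (int n = 1; n <= 5; n++) {
                // Note: substring works on UTF-16 units, so characters outside
                // the BMP would be split; a real version should use code points.
                for (int i = 0; i + n <= padded.length(); i++) {
                    counts.merge(padded.substring(i, i + n), 1, Integer::sum);
                }
            }
        }
        List<String> ranked = new ArrayList<>(counts.keySet());
        // Highest count first; ties broken alphabetically for determinism.
        ranked.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return ranked.size() > maxSize
                ? new ArrayList<>(ranked.subList(0, maxSize))
                : ranked;
    }

    // Out-of-place distance: sum of rank differences, with a fixed
    // penalty for n-grams missing from the category profile.
    static int distance(List<String> doc, List<String> category) {
        Map<String, Integer> rank = new HashMap<>();
        for (int i = 0; i < category.size(); i++) rank.put(category.get(i), i);
        int total = 0;
        for (int i = 0; i < doc.size(); i++) {
            Integer r = rank.get(doc.get(i));
            total += (r == null) ? MAX_PROFILE : Math.abs(r - i);
        }
        return total;
    }

    // Return the key of the category profile closest to the text.
    static String classify(String text, Map<String, List<String>> profiles) {
        List<String> doc = profile(text, MAX_PROFILE);
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (Map.Entry<String, List<String>> e : profiles.entrySet()) {
            int d = distance(doc, e.getValue());
            if (d < bestDistance) {
                bestDistance = d;
                best = e.getKey();
            }
        }
        return best;
    }
}
```

In this sketch you would call `profile` once per training text, store the results in a map keyed by language code, and then pass that map to `classify` along with the text you want to identify.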

I’ve extended the approach from this paper to support Unicode and Asian languages including Farsi, Arabic, Chinese, Korean, and Japanese.

The API is pretty simple and if you’re a decent Java developer you should be able to figure it out. I’ve decided to release it as Open Source specifically because I need/want feedback.

An interesting side note is that I’ve used Wikipedia to provide the language categories. I just took a couple of large subjects like World War II and then grabbed the text of those articles in every language I could find.

I would really love some help building out the remaining languages. If you have a language that we don’t cover, just submit an LA.txt file (where LA is the ISO code for the language) and I’ll add it to the source.


  1. Apparently there is no code available at the posted site.
    A free Java implementation of that paper (not related to Technorati) is available at http://www.olivo.net/software/lc4j/

  2. Does NGram work on CJK? If the text is in Unicode, CJK should be fairly distinctive because the code ranges of CJK characters are well defined.

    I have looked at libtextcat before
    http://software.wise-guys.nl/libtextcat/

    It still has no Unicode support. Its corpus is based on Usenet messages.

    Yes, Wikipedia is incredibly great! In this case not for its content, but because it is a great technical resource for developing multilingual software!

  3. The one I wrote will. It supports all charsets and encodings.