I’m working on a blog post describing how blog search engines like Technorati, PubSub, and Feedster could (and should) use language categorization to help deal with the chaos of tagging and full-text search. Google has done this for a long time now, and Technorati has it in beta.
While I’m still working on the post, I wanted to get the source out there so that developers can play with the code and give me feedback.
The project is called ngramcat (short for NGram Categorizer) and is based on a 1994 paper by Cavnar and Trenkle entitled N-Gram-Based Text Categorization.
I’ve extended the paper’s approach to support Unicode and Asian languages including Farsi, Arabic, Chinese, Korean, and Japanese.
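For anyone who hasn’t read the paper, here is a minimal sketch of its core idea: build a rank-ordered profile of the most frequent character n-grams for each language, then pick the language whose profile is closest to the document’s by the "out-of-place" measure. This is not ngramcat’s actual API. The class name, method names, the cutoff of 300 n-grams, and the tokenization rule are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Illustrative sketch only: these names are hypothetical, not ngramcat's API.
final class NGramSketch {
    static final int MAX_N = 5;          // n-gram lengths 1..5, as in the paper
    static final int PROFILE_SIZE = 300; // assumed cutoff; tune for your data

    /** Maps each of the most frequent n-grams to its rank (0 = most frequent). */
    static Map<String, Integer> profile(String text) {
        Map<String, Integer> counts = new HashMap<>();
        // Assumed tokenization: split on anything that is not a Unicode letter or mark.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{M}]+")) {
            if (token.isEmpty()) continue;
            // Pad with '_' so word boundaries show up in the n-grams.
            // Work on code points so characters outside the BMP are not split.
            int[] cps = ("_" + token + "_").codePoints().toArray();
            for (int n = 1; n <= MAX_N; n++) {
                for (int i = 0; i + n <= cps.length; i++) {
                    counts.merge(new String(cps, i, n), 1, Integer::sum);
                }
            }
        }
        List<Map.Entry<String, Integer>> sorted = new ArrayList<>(counts.entrySet());
        // Most frequent first; ties broken alphabetically so ranks are deterministic.
        sorted.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        Map<String, Integer> ranks = new LinkedHashMap<>();
        for (int i = 0; i < sorted.size() && i < PROFILE_SIZE; i++) {
            ranks.put(sorted.get(i).getKey(), i);
        }
        return ranks;
    }

    /** Out-of-place distance: sum of rank differences, with a fixed penalty for misses. */
    static long distance(Map<String, Integer> doc, Map<String, Integer> category) {
        long total = 0;
        for (Map.Entry<String, Integer> e : doc.entrySet()) {
            Integer rank = category.get(e.getKey());
            total += (rank == null) ? PROFILE_SIZE : Math.abs(rank - e.getValue());
        }
        return total;
    }

    /** Returns the key of the closest category profile, or null if there are none. */
    static String classify(String text, Map<String, Map<String, Integer>> categories) {
        Map<String, Integer> doc = profile(text);
        String best = null;
        long bestDistance = Long.MAX_VALUE;
        for (Map.Entry<String, Map<String, Integer>> c : categories.entrySet()) {
            long d = distance(doc, c.getValue());
            if (d < bestDistance) {
                bestDistance = d;
                best = c.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Toy training text; real profiles need far more text per language.
        Map<String, Map<String, Integer>> categories = new LinkedHashMap<>();
        categories.put("en", profile("the quick brown fox jumps over the lazy dog"));
        categories.put("fr", profile("le renard brun rapide saute par dessus le chien paresseux"));
        System.out.println(classify("the dog jumps over the fox", categories));
    }
}
```

Because the sketch works on code points and Unicode letter classes rather than raw bytes, the same code runs on non-Latin scripts. Languages written without spaces between words simply produce longer tokens, which may or may not be how ngramcat handles them.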
The API is pretty simple, and if you’re a decent Java developer you should be able to figure it out. I’ve decided to release it as open source specifically because I need/want feedback.
An interesting side note is that I’ve used Wikipedia to provide the text for the language categories. I just took a couple of large subjects like World War II and then grabbed the text version of the article in every language I could find.
I would really love some help building out the remaining languages. If you have a language that we don’t cover, just submit an LA.txt file (where LA is the ISO code for the language) and I’ll add it to the source.