Tuesday, January 08, 2008

Character Cluster Based Thai Information Retrieval (TCC) paper is bullshit!

I really didn't want to write about this at first, but I can't help myself given the unacceptable lack of response from the author (presumably the main author of this paper, as his name appears first). The paper is mainly authored by Thanaruk Theeramunkong. It describes a scheme that the author claims can separate a series of Thai characters into a set of clusters. Each cluster represents an unambiguous block of characters that cannot be split any further as far as the Thai language is concerned. The clusters can then be composed into bigger forms, which are supposedly the meaningful Thai words.

After struggling through the paper, I found that the EBNF grammar it provides doesn't appear to work as advertised. I ran a number of simple sample words through it and found that the grammar doesn't do a darn thing useful. To get an idea of what I am talking about, please take a look at the grammar in the paper directly. Go to http://citeseer.ist.psu.edu/theeramunkong02character.html for more information about the paper.

OK... now let's get down to the bullshit business here. The first rule of the grammar says that a cluster can be "ก็", "อึ", "หึ", or one of several others (which are irrelevant to what I want to show you now). If you read the paper and pay careful attention, you'll see that this very first rule is already wrong. Why? Suppose I have the word "สะอึก". Its correct clusters should be "สะ" and "อึก". However, skipping the "สะ" part for now, the rule wrongly suggests that the cluster should stop at "อึ", leaving the letter "ก" dangling.
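To make the problem concrete, here is a minimal Python sketch of that first rule. It is not the paper's grammar: it assumes only the three literal alternatives quoted above ("ก็", "อึ", "หึ") and leaves out the rest of the rule, and the names FIRST_RULE_ALTERNATIVES and match_first_rule are made up for illustration.

```python
# Hypothetical stand-in for the first rule: only the three literal
# alternatives quoted in this post. The paper's rule lists more.
FIRST_RULE_ALTERNATIVES = ["ก็", "อึ", "หึ"]


def match_first_rule(text, start=0):
    """Return the alternative that matches text at position start, or None."""
    for alternative in FIRST_RULE_ALTERNATIVES:
        if text.startswith(alternative, start):
            return alternative
    return None


word = "สะอึก"
start = word.index("อ")  # skip the "สะ" part, as in the text above
cluster = match_first_rule(word, start)
leftover = word[start + len(cluster):]

print("cluster: ", cluster)
print("leftover:", leftover)
```

Under these assumptions the matcher closes the cluster right after the vowel mark, so the final consonant of "อึก" is left over instead of being part of the cluster.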

The above is just one example. If you are interested, try running other words you can think of through the grammar and you will see what I mean. Another word you might want to try is "สยบมาร".

When I ran into this, I wrote an email to the author at the address given in the paper. No reply. Nothing happened. Then I googled him and found a definitive email address of his. I wrote to him again and nothing happened. No reply, no clarification, nothing whatsoever.

There are two possibilities here. One is that he really didn't get my email, which I don't think is likely because he still works for Thammasat University and the email address is right on his faculty home page, where I believe other people would use it to contact him as well. The other possibility is that he doesn't know how to answer my questions because the grammar really doesn't work. Had he replied to me saying that the grammar is wrong, it would unfortunately have invalidated his paper. Keeping quiet and pretending that the email didn't reach him is the better deal for him.

Perhaps he thinks that most readers will simply skim through his paper and not bother trying the grammar out for real. That gives him the opportunity to make the paper as bullshit as possible without anyone noticing.

My second thought on this, however, is that he might actually have a real grammar that correctly clusters the characters. He just doesn't want to disclose it, so he puts the bullshit one in front of the reader's eyes and keeps the real one secret.

Sorry, K. Thanaruk Theeramunkong, if this blog post really bothers you, but it's the fact. If you feel that my comments here are bullshit too, please comment back and show that your bullshit TCC grammar really works.
