Wednesday, January 09, 2008

The TCC paper failed on its own example

Following my previous post about the TCC paper, I went back to the paper several times just to be really sure that I hadn't misunderstood anything the author wrote. The funny thing is that the more I went back to it, the more holes (or bullshit) I could see.

An example of this: as part of the motivation of the paper, the author raises an issue with the semantically ambiguous string "ตากลม". Here is an excerpt from the paper.

The fragile case for word segmentation is when it needs semantic background for making decision. For example, "ตากลม". can be segmented as "ตา-กลม" or "ตาก-ลม". The correct segmentation depends on the context. Moreover, if the sequence on focus is composed of unknown words, it is very hard to segment it into words, and only sistrings of the sequence can be kept instead. To overcome the problem, this research proposes a concept of character cluster, which is a unit smaller than a word but larger than a character. We call this 'Thai Character Cluster(TCC)'. The composition of TCC is unambiguous and can be defined by a set of rules.

If this paper were a stage magic show, the author would probably be very good at misdirection. Why? Look at the two important properties of the TCC in the excerpt above: a cluster is larger than a character, and its composition is unambiguous. In the paper, after the author talks about this fragile case, he presents his proposed TCC grammar with a bunch of details. And guess what, he then swiftly switches to another example ("พิสูจน์ได้ค่ะ") that his grammar happens to handle correctly. The misdirection is that he never shows how the grammar applies to the fragile case.

If you know Thai and apply his TCC concepts above, there are only two possible clusterings: "ตา-กลม" and "ตาก-ลม". Yes, they are exactly the same as the two word segmentations, which are of course ambiguous. This means that his proposed TCC is in fact still ambiguous, which defeats his own statement. And anyone who follows the grammar will see that it separates "ตา" correctly (though again ambiguously) and then fails on "กลม" because no rule applies.
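To see the word-level ambiguity concretely, here is a small Python sketch that lists every way to split a string into entries of a lexicon. The four-word lexicon and the name segmentations are my own assumptions for illustration; nothing here comes from the paper or implements its grammar.

```python
def segmentations(text, lexicon):
    """Return every way to split text into consecutive lexicon entries."""
    if not text:
        return [[]]  # exactly one way to split the empty string
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in lexicon:
            # Keep this head and try every split of the remainder.
            for rest in segmentations(text[end:], lexicon):
                results.append([head] + rest)
    return results


# Toy lexicon (an assumption): just the four words from the example.
TOY_LEXICON = {"ตา", "กลม", "ตาก", "ลม"}

for words in segmentations("ตากลม", TOY_LEXICON):
    print("-".join(words))
# prints:
# ตา-กลม
# ตาก-ลม
```

Both splits are valid against the lexicon, so nothing at this level can choose between them without context.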

In order to be unambiguous, the letter "ก" has to stand by itself, so the clusters should be "ตา-ก-ลม". The ambiguity can then be resolved at the word composition level. Careful readers of this blog will know instantly that I am misdirecting you. If you are one of them, you should have these questions: How could "ก" be a cluster by itself? Isn't a cluster supposed to be bigger than a letter? There you have it: another contradiction in the paper, hidden by misdirection.

With that said, I have no doubt about why the author never got back to this fragile case.

This is another example of why this paper is bullshit. I hope the author is reading my blog and proves me wrong on this.
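Here is a rough Python sketch of why "ก" would have to stand alone. A clustering can only support a word segmentation if every word boundary falls on a cluster boundary. The helper names boundaries and compatible are hypothetical, and the check is my own formulation, not something taken from the paper.

```python
from itertools import accumulate


def boundaries(pieces):
    """Character offsets at which each piece ends."""
    return set(accumulate(len(piece) for piece in pieces))


def compatible(clusters, words):
    """True if every word boundary is also a cluster boundary."""
    return boundaries(words) <= boundaries(clusters)


readings = [["ตา", "กลม"], ["ตาก", "ลม"]]

# With "ก" as its own cluster, both readings can still be composed.
print([compatible(["ตา", "ก", "ลม"], words) for words in readings])
# prints: [True, True]

# With "กลม" as one cluster, the reading "ตาก-ลม" is no longer reachable.
print([compatible(["ตา", "กลม"], words) for words in readings])
# prints: [True, False]
```

So the only clustering that keeps both readings open is the one that breaks the "larger than a character" property.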

Tuesday, January 08, 2008

Character Cluster Based Thai Information Retrieval (TCC) paper is bullshit!

I really didn't want to write about this at first, but I cannot help myself given the unacceptable lack of response from the author (perhaps the main author of this paper, as his name appears first). The paper is mainly authored by Thanaruk Theeramunkong. It describes a scheme that the author claims can separate a series of Thai characters into a set of clusters. The clusters represent unambiguous blocks of characters that cannot be split any further as far as the Thai language is concerned. They can be used to compose bigger forms, which are supposedly the meaningful Thai words.

After struggling through the paper, I found that the EBNF grammar provided doesn't appear to work as advertised. I ran a number of simple sample words through it and found that the grammar doesn't do a darn thing useful. To get an idea of what I am talking about, please take a look at the grammar in the paper directly. Go to http://citeseer.ist.psu.edu/theeramunkong02character.html for more information about the paper.

OK... now let's get down to the bullshit business. The first rule of the grammar says that a cluster can be "ก็", "อึ", "หึ", or one of several others (which are irrelevant to what I want to show you now). If you read the paper and pay careful attention, you'll see that this very first rule is already wrong. Why? Suppose I have the word "สะอึก". Its correct clusters should be "สะ" and "อึก". However, skipping the "สะ" part for now, the rule wrongly suggests that the cluster should stop at "อึ", leaving the letter "ก" dangling.
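To make the problem concrete, here is a minimal Python sketch. It is not the paper's grammar: it encodes only the three alternatives quoted above, and the names QUOTED_ALTERNATIVES and match_first_rule are made up for illustration.

```python
# Only the three alternatives quoted in this post (an assumption:
# the real first rule in the paper has more alternatives than these).
QUOTED_ALTERNATIVES = ["ก็", "อึ", "หึ"]


def match_first_rule(text):
    """Return (cluster, remainder) if text starts with a quoted alternative."""
    for alternative in QUOTED_ALTERNATIVES:
        if text.startswith(alternative):
            return alternative, text[len(alternative):]
    return None  # none of the quoted alternatives applies


# The second syllable of "สะอึก", with the "สะ" part skipped.
print(match_first_rule("อึก"))
# prints: ('อึ', 'ก')
```

The cluster ends at "อึ" and the final "ก" is left over, which is exactly the dangling letter described above.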

The above is just one example. If you are interested, run other words you can think of through the grammar and you will see what I mean. Another example you might want to try is "สยบมาร".

When I ran into this, I wrote an email to the author at the address given in the paper. No reply. Nothing happened. Then I googled him and found a definitive email address of his. I wrote to him again and nothing happened. No reply or clarification whatsoever.

There are two possibilities here. One is that he really didn't get my email, which I don't think is likely, because he still works for Thammasat University and the email address is right on his faculty home page, which I believe other people use to contact him as well. The other possibility is that he doesn't know how to answer my questions because the grammar really doesn't work. Had he replied to me saying that the grammar is wrong, it would unfortunately have invalidated his paper. Keeping quiet and pretending that the email never reached him is the better deal.

I think he perhaps assumes that most readers will simply skim through his paper and not bother trying the grammar out for real. This gives him the opportunity to make the paper as bullshit as possible without anyone noticing.

My second thought on this, however, is that he might actually have a real grammar that correctly clusters the characters. He just doesn't want to disclose it, so he puts the bullshit one in front of the reader's very eyes and keeps the real one secret.

Sorry, K. Thanaruk Theeramunkong, if this blog really bothers you, but these are the facts. If you feel that my comments here are bullshit too, please comment back and show that your bullshit TCC grammar really works.

Tuesday, January 01, 2008

Frustrated and satisfied at the same time

Why do I have the two feelings at the same time? It's because I just got some time to start working on my NLP research more seriously. I have now switched my focus to the Thai language because it's my native language, and I think that if what I discover turns out to be useful, there will be less competition, because far more people do NLP research on English than on Thai.

Anyhow, I'm satisfied because I now get to use Haskell, and I find it a very powerful language. I could define many rules for parsing Thai strings in not so many lines. The Parsec library is also very helpful and very customizable. It is one of many indispensable Haskell libraries. I am very satisfied with what I learned from using the language and the library.

However, I am frustrated with the Thai Character Cluster (TCC) rules I found on the Internet. After running them in my head against just a few test words, I couldn't see how they would do what the authors advertised. I just wrote an email to one of the authors, but I'm not sure I will get a reply, because I did that once with another author and nothing came back to my mailbox. If this time doesn't work either, I am going to do my own research and come up with my own rules instead.