Wednesday, January 09, 2008

The TCC paper failed on its own example

Following my previous post about the TCC paper, I went back to the paper several times just to be really sure that I hadn't misunderstood anything the author said in it. The funny thing is that the more I went back to it, the more holes (or bullshit) I could see.

One example: as part of the paper's motivation, the author raises the issue of the semantically ambiguous string "ตากลม". Here is an excerpt from the paper.

The fragile case for word segmentation is when it needs semantic background for making decision. For example, "ตากลม". can be segmented as "ตา-กลม" or "ตาก-ลม". The correct segmentation depends on the context. Moreover, if the sequence on focus is composed of unknown words, it is very hard to segment it into words, and only sistrings of the sequence can be kept instead. To overcome the problem, this research proposes a concept of character cluster, which is a unit smaller than a word but larger than a character. We call this 'Thai Character Cluster(TCC)'. The composition of TCC is unambiguous and can be defined by a set of rules.

If this paper were a stage magic show, the author would probably be a very good misdirector. Why? Let's look at the important properties of the TCC that I highlighted above, namely that a TCC is larger than a character and that its composition is unambiguous. In the paper, after talking about this fragile case, the author presents his proposed TCC grammar with a bunch of details. And guess what, he then swiftly switches to another example (i.e. "พิสูจน์ได้ค่ะ") on which his grammar happens to work correctly. The misdirection is that he never shows how the grammar applies to the fragile case.

If you know Thai and apply his concept of TCC as described above, there are only two possible clusterings: "ตา-กลม" and "ตาก-ลม". Yes, they are exactly the same as the two word segmentations, which are of course ambiguous. This means that his proposed TCC is in fact still ambiguous, which defeats his own statement. And anyone who follows the grammar would see that it separates "ตา" correctly (but, again, ambiguously) and then fails on "กลม" because no rule is applicable.
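To make the word-level ambiguity concrete, here is a minimal Python sketch that enumerates every way to split "ตากลม" into dictionary words. The function name `segmentations` and the four-word toy lexicon are my own assumptions for illustration; this is a plain exhaustive dictionary split, not the TCC grammar from the paper.

```python
def segmentations(text, lexicon):
    """Return every way to split `text` into words found in `lexicon`."""
    if not text:
        return [[]]  # one way to segment the empty string: no words
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in lexicon:
            # Keep this word and segment the remainder recursively.
            for rest in segmentations(text[end:], lexicon):
                results.append([head] + rest)
    return results


# Toy lexicon: an assumption for illustration, not taken from the paper.
LEXICON = {"ตา", "กลม", "ตาก", "ลม"}

for words in segmentations("ตากลม", LEXICON):
    print("-".join(words))
```

With this lexicon the sketch finds both "ตา-กลม" and "ตาก-ลม", and nothing in the string itself says which one is right. That is the ambiguity the paper sets out to remove.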

For the clustering to be unambiguous, the letter "ก" has to stand by itself, so the clusters should be "ตา-ก-ลม". The ambiguity can then be resolved at the word composition level. Careful readers of this blog will know instantly that I am misdirecting you. If you are one of those, you should have these questions: How could "ก" be a cluster by itself? Isn't a cluster supposed to be bigger than a letter? There you have it: another contradiction in the paper that was hidden by misdirection.

With that said, I have no doubt about why the author never got back to this fragile case.

This is another example of why this paper is bullshit. I hope the author is reading my blog and proves me wrong on this.
