Saturday, July 23, 2011

[Grammatical Inference: Colin de la Higuera]: Notes

In Grammar Induction, what really matters is the data and the relationship between the data and the induced grammar. In Grammatical Inference, by contrast, the learning process itself is central: it is the process that is examined and measured, not just its result.


GraIn has emerged as an independent field connecting bioinformatics, computational linguistics, formal language theory, machine learning and pattern recognition. GraIn is a task where the goal is to learn or infer a grammar (or some device that can generate, recognize or describe strings) for a language from all sorts of information about this language.
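To make "infer a device that recognizes strings" concrete, here is a minimal sketch that builds a prefix tree acceptor (PTA) from positive examples only. A PTA is a common starting point for state-merging learners, but it is only one possible illustration, not the method these notes are about. The names `build_pta` and `accepts` are my own, chosen for this example.

```python
def build_pta(samples):
    """Build a prefix tree acceptor from a list of positive example strings.

    States are integers and state 0 is the initial state.
    Returns (transitions, accepting), where transitions maps
    (state, symbol) -> state and accepting is a set of final states.
    """
    transitions = {}
    accepting = set()
    next_state = 1
    for word in samples:
        state = 0
        for symbol in word:
            key = (state, symbol)
            if key not in transitions:
                # First time this prefix is seen: create a fresh state.
                transitions[key] = next_state
                next_state += 1
            state = transitions[key]
        # The state reached at the end of a sample is accepting.
        accepting.add(state)
    return transitions, accepting


def accepts(pta, word):
    """Return True if the acceptor recognizes the given word."""
    transitions, accepting = pta
    state = 0
    for symbol in word:
        state = transitions.get((state, symbol))
        if state is None:
            return False
    return state in accepting


pta = build_pta(["ab", "abb", "b"])
print(accepts(pta, "abb"), accepts(pta, "ba"))
```

The PTA accepts exactly the sample strings and nothing else, so it does not generalize at all. Any interesting learning happens in what comes after, for instance when states are merged.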



Computational BioHanziology [1]

The basic elements in computational biology are strings or sequences describing DNA or proteins. A number of questions thus require the analysis of sets of strings and the extraction of (grammatical) rules and patterns.


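So if we convert Chinese characters into biological sequences, bioinformatics and Chinese-character informatics could become fields that exchange ideas with each other. This could also be a new area for BioNLP.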
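As a toy illustration of what such a conversion could look like, the sketch below writes each character's Unicode code point in base 4 over the alphabet {A, C, G, T}. This encoding is entirely my own assumption and carries no linguistic meaning. A useful encoding would more likely be built from components or radicals. The function names and the fixed width of 11 are hypothetical choices for this example.

```python
BASES = "ACGT"  # arbitrary assignment of base-4 digits 0..3 to nucleotides


def hanzi_to_nucleotides(text, width=11):
    """Encode each character as a fixed-width base-4 string over ACGT.

    4**11 = 4194304 exceeds the largest Unicode code point (0x10FFFF),
    so 11 letters per character are enough for any character.
    """
    pieces = []
    for ch in text:
        code_point = ord(ch)
        digits = []
        for _ in range(width):
            digits.append(BASES[code_point % 4])
            code_point //= 4
        # Digits were produced least significant first, so reverse them.
        pieces.append("".join(reversed(digits)))
    return "".join(pieces)


def nucleotides_to_hanzi(sequence, width=11):
    """Invert hanzi_to_nucleotides for sequences it produced."""
    chars = []
    for i in range(0, len(sequence), width):
        code_point = 0
        for base in sequence[i:i + width]:
            code_point = code_point * 4 + BASES.index(base)
        chars.append(chr(code_point))
    return "".join(chars)


encoded = hanzi_to_nucleotides("漢字")
print(encoded, nucleotides_to_hanzi(encoded))
```

Because the mapping is reversible, no information is lost. Whether sequence-analysis tools would find anything meaningful in such strings is a separate, open question.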


We are dealing with strings, trees and graphs. When dealing with strings, we typically manipulate a 4-letter alphabet {A,T,G,C}, or a 20-letter alphabet if we intend to group the nucleotides three by three (as codons) into amino acids in order to define proteins. There are trees involved in the patterns or in the secondary structure, and even graphs in the tertiary structure.
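The step from the 4-letter alphabet to the 20-letter one can be sketched as follows. This assumes the standard genetic code (NCBI translation table 1) and reads the DNA string in a single frame from its first letter. The names `CODON_TABLE` and `translate` are illustrative, and `*` marks a stop codon.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)

# Map each of the 64 three-letter codons to a one-letter amino acid code.
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a DNA string codon by codon, ignoring a trailing partial codon.

    Raises KeyError if the string contains letters other than A, C, G, T.
    """
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


print(translate("ATGGCCTAA"))
```

The 64 codons collapse onto 20 amino acids plus a stop signal, which is where the 20-letter alphabet comes from.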



Learning Language

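What does "learning a language" mean? To answer this question, we inevitably have to define "language" first, or, more conveniently, look at how "language" is represented. Seen this way, one important kind of material we have to look at is linguistic data.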

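One formal way of looking at linguistic data is to treat it as the combinatorics of strings (stringology). From there, we can go on to ask what mechanisms, or algorithms, are possible for "language learning".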

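From a machine learning perspective, I think the way humans learn language and their experience of it are certainly worth consulting, but whether that is the best way to learn a language is hard to say. Of course, we first need a reasonable baseline, or definition, for the evaluation of learning. (As an aside, I feel that the vast majority of discussions hinge on definitions and standpoints.)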

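The Chinese character system is a marvelous thing. I was fascinated from the moment I started thinking about it, and I tried to tackle it in my doctoral dissertation. Later I realized I had overreached: with something of such historical depth, anyone lacking a correspondingly deep grounding in both the humanities and computation will only bring noise and unease to the field of Chinese character studies and its social-psychological effects. In this it is much like the Dharma. It points straight at the most profound basic truths of the universe and of human life, but when it is expounded and spread by people whose practice is not yet mature, it ends up being "stigmatized" instead. Actually, that is reasonable too. If it were that easy to grasp, what would we still be doing buried in piles of books, right?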

It is said that ten thousand hours make an expert. I should get started.