if 0 {[Richard Suchenwirth] 2003-10-26 - Machine translation, conversion of text in one source language S into the equivalent text in another target language T by computers, was one of the earliest dream applications - much work was put in it, and after 50 years the translation quality is still moderate to ridiculous (try Babelfish on a non-trivial text).

In this weekend fun project, I want to demonstrate aspects of machine translation very simply. Let S=German (my mother tongue), T=English (so you can judge the outcome), and start with

 Dies ist ein einfacher Satz

The simplest approach is word-by-word translation. The "dictionary" is very simply structured and takes care of German inflection suffixes already (which is a bit wasteful, as it leads to n-plication of entries). It contains a few words more that will be needed in later examples:}

 array set de_en {
    anderen other   anderer other
    Dies this       diesen this
    ein a           einen a
    einfacher simple
    Ich I           ist is
    Satz sentence
    übersetzen translate
    will want
 }

if 0 {Our first "translator" requires the input string to be a well-formed list, and leaves words not in the dictionary (e.g. proper names) untranslated:}

 proc translate0 {string dictName} {
    upvar 1 $dictName dict
    set res {}
    foreach word $string {
        if [info exists dict($word)] {set word $dict($word)}
        lappend res $word
    }
    set res
 }

if 0 {
 % translate0 "Dies ist ein einfacher Satz" de_en
 this is a simple sentence

Perfect! But.. well, it was simple indeed. The next test comes out worse:

 % translate0 "Dies ist ein anderer einfacher Satz" de_en
 this is a other simple sentence

Hm... obviously, two German words "ein anderer" should make one English word, "another".
Before bothering with multi-word sequence lookup (which is a powerful technique that enhances translation quality), we can fix this one by post-processing, for which we start a map of strings in a global variable - we can always add to it later:}

 set postprocess {{a other} another}
 #--
 proc translate1 {string dictName} {
    upvar 1 $dictName dict
    string map $::postprocess [translate0 $string dict]
 }

if 0 {
 % translate1 "Dies ist ein anderer einfacher Satz" de_en
 this is another simple sentence

Good enough. But the next test

 % translate1 "Ich will diesen Satz übersetzen" de_en
 I want this sentence translate

brings us to the limits of word-by-word translation. Instead, we have to parse the input into a tree structure, do transformations on that tree, and finally render the target text. A simple parser can be borrowed from [Syntax parsing in Tcl]:
}

 proc parse1 {s rules} {
    while 1 {
        set fired 0
        foreach {lhs rhs} $rules {
            if [llength [set t [lmatch [lskim $s] $rhs]]] {
                foreach {from to} $t break
                set s [lreplace $s $from $to \
                    [concat $lhs [lrange $s $from $to]]]
                incr fired
            }
        }
        if !$fired break ;# no rule could be applied in this round
    }
    set s
 }

if 0 {Here's a starter kit of German syntax rules (including lexicals) that's sufficient for the example:}

 set rules {
    s   {np vp}
    np  {det n}
    np  {det adj n}
    np  {pn}
    vp  {vm np v}
    pn  "Ich"
    vm  "will"
    det "diesen"
    det "einen"
    adj "anderen"
    n   "Satz"
    v   "übersetzen"
 }
 #-- ..and some helper procedures:
 proc lskim L {
    # returns all first elements of a list of lists
    set res {}
    foreach i $L {lappend res [lindex $i 0]}
    set res
 }
 proc lmatch {list sublist} {
    # returns start and end positions of first occurrence of sublist in
    # list, or an empty list if not found
    set lsub [llength $sublist]
    for {set i 0} {$i<=[llength $list]-$lsub} {incr i} {
        set to [expr {$i+$lsub-1}]
        if {[lrange $list $i $to]==$sublist} {
            return [list $i $to]
        }
    }
 }
 proc atomar? list {expr {$list eq [lindex $list 0]}}

if 0 {This is the result of the parsing, the tree coming as a nested list:

 % parse1 "Ich will diesen Satz übersetzen" $rules
 {s {np {pn Ich}} {vp {vm will} {np {det diesen} {n Satz}} {v übersetzen}}}

Until here we were concerned with single-language analysis. The differences between S and T have to be taken care of in rules on how to rewrite, or transform, the parse tree. Transformation rules can be specified as pairs ''from to'' - "if the parse tree contains the sequence ''from'', convert it to the sequence ''to''", where integers denote list elements of the input (counting from 1), and any other string, marked with %%, is a word to be inserted. I use the %% markup to keep the words that are already in T from possible further "translation":
}

 set de_en_xform {
    {vm np v} {1 %%to 3 2}
 }

if 0 {The transformation takes a parse tree and rules, traverses the tree recursively, and returns a modified parse tree:}

 proc xform {ptree rules} {
    array set xform $rules
    if [atomar? $ptree] {return $ptree}
    set t [lskim $ptree]
    set children [lrange $t 1 end]
    if [info exists xform($children)] {
        set tree2 [lindex $t 0] ;# top node category
        foreach part $xform($children) {
            if [string is integer -strict $part] {
                lappend tree2 [lindex $ptree $part]
            } else {
                lappend tree2 [list - $part] ;# literal
            }
        }
        set ptree $tree2
    }
    set res {}
    foreach part $ptree {lappend res [xform $part $rules]}
    set res
 }

if 0 {Here's how the transformed tree comes out, again as a nested list:

 % xform $pt $de_en_xform
 {s {np {pn Ich}} {vp {vm will} {- %%to} {v übersetzen} {np {det diesen} {n Satz}}}}

The "terminals" (or leaves of the parse tree) are still the German words, but the word order has been changed to the English one, and the "to" has been inserted in the right place.
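
To see the rule format in isolation, here is a minimal sketch. It assumes a hypothetical helper, reorder, which is not part of the code on this page, and it only mirrors the inner loop of xform for a single node, without recursion or rule lookup:

```tcl
# Hypothetical helper (an illustration only, not part of the original code):
# apply one reordering pattern to a node given as a list whose first
# element is the category and whose remaining elements are the children.
proc reorder {node pattern} {
    set out [list [lindex $node 0]]          ;# keep the node category
    foreach part $pattern {
        if {[string is integer -strict $part]} {
            lappend out [lindex $node $part] ;# 1-based index of a child
        } else {
            lappend out [list - $part]       ;# literal word to insert
        }
    }
    return $out
}

set vp {vp {vm will} {np {det diesen} {n Satz}} {v übersetzen}}
puts [reorder $vp {1 %%to 3 2}]
```

Because element 0 of the node is its category, the integers in a pattern address the children directly, which is why the rules count from 1. The printed node should have the same shape as the vp subtree in the xform output.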
So it's an easy job to finalize the translation by another tree traversal:}

 proc translate2 {string dictName} {
    upvar 1 $dictName dict
    set parseTree [parse1 $string $::rules]
    set xformTree [lindex [xform $parseTree $::de_en_xform] 0]
    string map {%% ""} [translate1 [leaves $xformTree] dict]
 }
 #-- This returns the list of the terminal nodes (leaves) of a tree:
 proc leaves tree {
    if [atomar? $tree] {
        set tree
    } else {
        set res {}
        foreach child [lrange $tree 1 end] {
            lappend res [leaves $child]
        }
        join $res
    }
 }

if 0 {
 % translate2 "Ich will diesen Satz übersetzen" de_en
 I want to translate this sentence
 % translate2 "Ich will einen anderen Satz übersetzen" de_en
 I want to translate another sentence

Phew - made it... and these are only the first steps into non-trivial MT... More work is required, but obviously it's fun too to do it in Tcl. Many years ago I did it in Prolog, but somehow in Tcl I feel more free and easy...

As a first summary, machine translation could involve the following steps:

   * Preprocessing (not shown - case, punctuation, etc.)
   * Lemmatizing (not shown - analyzing word forms, extracting stem + ending)
   * Assigning categories to lemmas (e.g. noun, verb)
   * Parsing into a parse tree
   * Transformation of the parse tree
   * Generation of the target text from leaves of the transformed tree
   * Post-processing (e.g. fixing "a other")

Don't mistake these experiments for a translation system fit for practical use - it's just a toy I've played with. But maybe it can help to better understand the problems and possible solutions of machine translation.
}

-----

[RobertAbitbol] Hi Richard! Automatic translation will always produce garbage. I read this page with lots of interest. You are on the right track but you'll need more than what you put in to achieve perfect translations. I have been working on the problems of machine translation and I am the only linguist who was ever able to produce perfect results on a translation software.
My software is user-assisted and written in straight C. I never touched Prolog; I know it was supposed to be '''the''' language for translation software. Mind you, it took me a good number of years to achieve this task. I will be marketing it after I work on the dictionaries. I am on AIM pnb120. I can give you some advice on how to set up a wiki for a German dictionary. And I can negotiate my first franchise with you. (I'd be getting a percentage.) You don't need any capital, any money to start. Well, really not much. Anyone interested, I can set you up on any starting language to any language, except French and English. (I am taking care of them.)

'''LES''': So you say "automatic translation will always produce garbage", but you "were able to produce perfect results on a translation software". Well, make up your mind.

[Robert Abitbol] Ahah LES! I did not say I did automatic translation. Read well!

'''LES''': Then I don't see what you meant.

[Robert Abitbol] Well, my software does user-assisted translation. The user answers questions. Automatic translation means the computer doesn't ask the user any questions. It just translates (usually word for word) and it makes a lot of mistakes. You know, in this field we use our jargon and we don't realize that we are not being understood by other people. It's my fault, I should have given the explanation in my text. Sorry! :-)

'''LES''': I know [machine translation] and I know [CAT]: [computer-assisted translation] (which usually involves '''translation memory'''). But I still can't even imagine what kind of achievement you seem to have made. Your words "perfect results" and "user-assisted" have clarified nothing to me.

'''RA''': I obviously cannot give you the secrets of my achievement. I could show you results when the time comes. In the meantime, if you are interested, I invite you to get on my dictionary project. If you have a mother tongue other than French and English, you are in.

[RS] likes to point out that this is a free Wiki (freedom of expression, and of charge, for us) for "publishing" open-source code produced with open-source software (Tcl/Tk and others), so let's not talk secrets and money :)

I respect open source people. But I am not ready to go open source myself and reveal publicly all my trade secrets! :-) After all, this project was a good 7 years of my life! I am an old man (I am 50). Open source is not for me. I am a commercial person. Money makes the world go round. :-) [Robert Abitbol]

'''LES''': In other words, you just came here to brag a little.

No, bragging is not my forte. Mind you, when I was 21 I was in Mexico and an American guy was calling me the French bragger. Arrogant would have been a better term. I really don't see how I could have bragged. In fact I probably wouldn't have mentioned my translation software had I not seen this Machine Translation page!

No, I am here for many reasons. First, I visit all sorts of wikis looking for features for the c2 page: InnovativeWikiFeatures. So I wrote about Wikit already (#7, 8 and 18). I like this wiki somehow. It's fast and I love the free links [].

The second reason is I have just discovered TCL and I find this language amazing in the sense that you download an exe and you are up and running. This is not what many programs can do. This is a compilable language obviously. I think this language can kick ass.

The third reason is I have downloaded a gui.tcl module from [tk outline] and I was wondering if I could change the code and make it an executable I could put anywhere and that would enable me to use a file without going to Windows Explorer. In fact, since I am not a programmer, I was looking for a person that could do it for me. I have a small budget for that.

And those are basically the reasons I am here. [Robert Abitbol]

----

[Arts and crafts of Tcl-Tk programming]