Version 27 of Language identification

Updated 2005-12-25 19:20:38

if 0 {Richard Suchenwirth 2003-10-25 - Of the thousands of languages in the world, we usually know one very well, a few more or less, others we at least can identify when we see them, and still others leave us in the dark. In this fun project I want to find out how to let Tcl do language identification, so for a given string the ISO code (e.g. en for English, de for German) of its language is returned.

Languages like Arabic, Greek, Hebrew, Japanese, or Korean can be identified easily by the Unicode ranges of their scripts. So I concentrated on languages using the Latin alphabet, mostly European, but with some Malay thrown in for fun (or rather for trouble, to make things more difficult).

The idea was to collect typical substrings for each language and count how often they occur in the string in question. These may be characteristic single letters, word parts, or even frequent words. If they're very characteristic, they may be specified two or more times, and thus score more. The sum of such hits is the score, and the language (or languages) with the highest score is the result. The "feature substrings" are collected in a paired list of language codes and substring lists:

JCG: Corrected a few typos in some Spanish sentences.

ECS: Added Portuguese.}
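if 0 {As an illustration of the remark about non-Latin scripts, here is a minimal sketch that is not part of the original code. The name script'identify, the chosen Unicode ranges, and the mapping from script to language code are assumptions made for this example; a script does not always determine a single language (Arabic script is also used for Persian and Urdu, for instance).}

```tcl
# Sketch only: map a few Unicode ranges to an ISO 639-1 code.
# The ranges and the proc name are assumptions for illustration.
set scripts {
    el {[\u0370-\u03ff]}
    he {[\u0590-\u05ff]}
    ar {[\u0600-\u06ff]}
    ja {[\u3040-\u30ff]}
    ko {[\uac00-\ud7a3]}
}
proc script'identify {string scripts} {
    set best ""
    set max 0
    foreach {code re} $scripts {
        # count the characters of the string that fall into this range
        set n [regexp -all -- $re $string]
        if {$n > $max} {
            set max $n
            set best $code
        }
    }
    # empty string if none of the listed scripts occurs
    return $best
}
puts [script'identify "\u0391\u03b8\u03ae\u03bd\u03b1" $scripts]
```

if 0 {The ja entry covers only Hiragana and Katakana, because Han characters alone would not separate Japanese from Chinese. A string in the Latin alphabet yields the empty string and would then go on to the feature scoring.}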

if 0 {LES: The strings for Portuguese are a good start, but I don't agree with some of them, because substrings like 'nt' are easily found in other languages too. RS: Sure, but they still help to raise the score. Compare "ch" in de, en, es, fr.

Another thing: the "special" characters in de, es, fr and it looked OK the first time I viewed this page. Now they all look like weird symbols. Could someone's browser have fudged it when saving changes?

Hmm... I think I fixed it, such is the power of backup. But hey, Edésio, there seems to be something wrong with your browser.

BR: Your browser needs to support UTF-8 correctly in HTML forms if you want to edit non-ASCII text on this wiki. So which browser do you actually use?}

 set features {
    de {ä ö ü ���� ���� ch ck sch ei en ge ung w}
    en {th y sh ch in ing ed en ns rs w with from and}
    es {á é ión dad es ll ch qu j ya os as üe con ya}
    fr {é é ê è ch ei eu iè oi x y qu au ou de d' et ir te}
    ga {an á bh bh dh gc iai iai mb mh n- ó uai uai ú}
    it {è è ù da " e " re di il gl io ione ioni cc cch z tt una qu}
    ms {ah ak j jan ke ngan pu se uan ang}
    nl {ij ij ing tj sj z aa ee ei ou oe oe met ge sch te baar}
    pt {lhã lhõ lha lhe lh nha nhã nho nhõ nhe nt ndo de ão ões ss rr os as ch ça ção ções ico ns apro}
 }

# LES gets carried away and suggests:

 # pt {" de " " do " " da " " os " " as " " e " " que " " em " " nos " " nas " " na " "de " "te " " se " " às " " aos " " com " lhã lhõ lha lhe lh nha nhã nho nhõ nhe ndo ão õe ãe ssa sse ssi sso ssu rra rre rri rro rru ch ça ção ções ico}

# (removed: de nt os as ns apro ss ões)
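if 0 {A note on the quoted features with spaces, such as " de ": since every feature is used as a regular expression, a literal space on both sides misses a word at the very start or end of the string. Tcl's regular expressions also offer the word-boundary escapes \m and \M. The sketch below only compares the two spellings on a made-up sample string; it is an alternative to consider, not what the feature lists on this page use.}

```tcl
# Sketch: a space-delimited feature versus a word-boundary feature.
# The sample string is made up for this comparison.
set s "de nada, fora de casa"
set spaced  [regexp -all -nocase -- { de } $s]     ;# misses the leading "de"
set bounded [regexp -all -nocase -- {\mde\M} $s]   ;# \m = word start, \M = word end
puts "spaced: $spaced bounded: $bounded"
```

if 0 {Here the space-delimited form counts 1 hit and the word-boundary form counts 2.}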

if 0 {The identification code itself is fairly straightforward:}

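 proc lang'identify {string features} {
    set res {}
    foreach {language regexps} $features {
        # score = total number of matches of all feature regexps
        set score 0
        foreach regexp $regexps {
            incr score [regexp -all -nocase -- $regexp $string]
        }
        # languages without any hit are left out of the result
        if {$score} {lappend res [list $language $score]}
    }
    # return the {language score} pairs, best match first
    return [lsort -decreasing -integer -index 1 $res]
 }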
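if 0 {When tuning the feature lists it helps to see which features actually fired. The following helper is a sketch added for illustration; the name lang'explain is hypothetical and not part of the original code. It takes the feature list of one language and returns each matching feature with its hit count, using the same regexp call as lang'identify.}

```tcl
# Hypothetical helper (not in the original page): list every feature of
# one language that matched, together with its hit count.
proc lang'explain {string regexps} {
    set res {}
    foreach re $regexps {
        set hits [regexp -all -nocase -- $re $string]
        if {$hits > 0} {lappend res [list $re $hits]}
    }
    return $res
}
# the English feature list, copied from the features variable
set en {th y sh ch in ing ed en ns rs w with from and}
puts [lang'explain "Store in a dry place, protect from light" $en]
```

if 0 {For this sample the features y, in, and from match once each, which adds up to an English score of 3. A feature that is listed twice, like bh in the ga list, shows up twice in the breakdown as well.}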

if 0 {But the most important part is the testing. I collected sample phrases from food packages, which are often pretty multilingual in Germany, and from other sources. I kept adding sample strings and tuning the above feature set until the target language came first most of the time (short strings can't always be identified correctly):

KPV: If you need examples in various unusual languages you could use Google's advanced search page and specify different languages to search for. }

 set ntests 0
 set nproblems 0
 set score 0
 foreach {language string} {
    de {Dies ist ein Beispiel für einen deutschen Satz}
    de {Knusprige Weizenflocken mit Schokoladengeschmack}
    de {Waffeln mit feiner Haselnusscremefüllung}
    de {Vor dem Öffnen bitte schütteln}
    de {Trocken lagern, vor Licht schützen}
    de {Sicherheitsinformationen für Netzkabel und Zubehör}
    de {Auf dieses Produkt wird eine zwölfmonatige Garantie gegeben}
    en {This is an example for an English sentence}
    en {Wafers filled with hazelnut creme}
    en {Store in a dry place, protect from light}
    en {Safety precautions for power cords and accessories}
    en {This product is warranted for the period of twelve months}
    en {Worldwide telephone numbers}
    en {For continuous quality improvement, calls may be monitored}
    es {Él que no espera vencer, ya está vencido}
    es {Copos de trigo tostados con chocolate}
    es {Barquillos rellenos de crema de avellanas}
    es {Precauciones de seguridad para cables de alimentación y accesorios}
    es {Este producto está garantizado por un período de doce meses}
    es {Esta garantía no cubre ninguno de los siguientes casos}
    es {En caso necesario, la lista de nuestros Servicios Autorizados
        está disponible}
    fr {Voilà un autre exemple pour une phrase francaise}
    fr {Gaufrettes fourrées à la noisette}
    fr {Agiter avant d'ouvrir}
    fr {Conserver au réfrigerateur une fois ouvert et consommer dans les
        jours qui suivent}
    fr {Protéger contre la lumière}
    fr {A consommer de préférence avant fin:}
    fr {Précautions de sécurité concernant les cordons d'alimentation}
    fr {Cet appareil est couvert par une garantie de douze mois}
    ga {Eolas an Chuairteora do Láithreáin Oidhreachta}
    ga {Tá roinnt bríonna leis an bhfocal Dúchas}
    ga {Tugann na milliúin daione cuairt ar ár n-ionaid gach bliain}
    ga {Tá na hAmanna oscailte sa bhfoilseachán seo i gceart ag am priondála}
    ga {Ionad Cuairteora Pháirc an Fhionnuisce}
    ga {Tógadh an caisleán le linn na mblianta 1870-73}
    ga {Deirtear gur tógadh an foirgneamh is sine anseo 400 bliain ó shin}
    it {Questo è un altro esempio per una frase italiana}
    it {Fiocchi di frumento al cioccolato}
    it {Wafers ripieni di crema alla nocciola}
    it {Agitare prima di aprire}
    it {Da consumare preferibilmente entro il:}
    it {Una volta aperto tenere in frigo e consumare entro qualche giorno}
    it {Precauzioni relative alla sicurezza per i cavi di alimentazione}
    it {Questo prodotto è garantito per un periodo di dodici mesi}
    it {Se non è vero, è ben trovato}
    ms {Pendahuluan cetak yang dibarahui}
    ms {Dia pun hendak ikut saya ke kedai}
    ms {Orang itu pun membuat kerjanya dengan cepat}
    ms {Pukul sembilan setengah malam}
    ms {Jepan pun kalah dalam pertandingan bolasepak Piala Merdeka}
    ms {Meja ini baik, tetapi meja itu pun baik juga}
    ms {Cik Pun pun turut serta dalam pertandingan itu}
    nl {Het is tijd om op te staan, vandaag is het zaterdag}
    nl {Wafeltjes met hazelnootcrèmevulling}
    nl {Tegen licht beschermen. Tenminste houdbaar tot einde:}
    nl {Dit produkt is gegarandeerd voor een periode van twaalf maanden}
    nl {De garantie is alleen geldig wanneer de garantiekaart volledig is ingevuld}
    nl {In probleemgevallen kunt U nadere informatie verkrijgen}
    nl {Reparaties onder garantie moeten door servicecentra worden uitgevoerd}
    pt {Alguns quatrilhões de ítens de informação, formando amostragens de}
    pt {Essas pulsações eram armazenadas em mnemocircuitos idênticos}
    pt {Na praça da cidade, a fila tinha se formado às cinco da manhã com os}
    pt {sentido humorístico. Tinha um fim. Estava armado da lista telefônica.}
    pt {aproximar da Estação Ether e poderei aproveitar a caminhada para}
 } {
    incr ntests
    set res [join [lang'identify $string $features]]
    if {[lindex $res 0] ne $language} {
        puts "$string\n   $res **** should have been: $language"
        incr nproblems
    } elseif {[llength $res]==2} {
        incr score [lindex $res 1]
    } elseif {[lindex $res 1] > [lindex $res 3]} {
        incr score [expr {[lindex $res 1]-[lindex $res 3]}]
    } else {
        puts "$string\n   $res **** ambiguous: $language"
        incr nproblems
    }
 }
 puts "score: [expr {30*$score/$ntests}] Passed:[expr {$ntests-$nproblems}]/$ntests"
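if 0 {The test loop accepts a result only if the top score is strictly higher than the second one. The same rule can be wrapped into a small proc that returns just the ISO code, as stated in the goal at the top of this page. This is a sketch; the name lang'best is hypothetical and not part of the original code. It takes the ranked list that lang'identify returns.}

```tcl
# Hypothetical helper: reduce a ranked list of {language score} pairs,
# as returned by lang'identify, to a single language code.
# Returns "" if the list is empty or the two best scores are tied.
proc lang'best {ranked} {
    if {[llength $ranked] == 0} {return ""}
    set lang [lindex $ranked 0 0]
    set top  [lindex $ranked 0 1]
    if {[llength $ranked] > 1 && [lindex $ranked 1 1] >= $top} {
        return ""
    }
    return $lang
}
puts [lang'best {{en 5} {de 2}}]
```

if 0 {With the full code loaded, a call would look like: lang'best [lang'identify $string $features]}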

GS (040612): Some hints for going further can be found on Gertjan van Noord's web site [L1 ].

