Language identification

Richard Suchenwirth 2003-10-25 - Of the thousands of languages in the world, we usually know one very well, a few more or less, others we at least can identify when we see them, and still others leave us in the dark. In this fun project I want to find out how to let Tcl do language identification, so for a given string the ISO code (e.g. en for English, de for German) of its language is returned.

Languages like Arabic, Greek, Hebrew, Japanese, and Korean can be easily identified by the Unicode ranges of their scripts. So I concentrated on languages using the Latin alphabet, mostly European, but with some Malay thrown in for fun (or rather trouble, to make things more difficult).
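As a rough illustration of the script-based shortcut, here is a minimal sketch. The name lang'byScript and the chosen ranges are assumptions made for this example, not part of the original code. A script only hints at a language: Arabic script is also used for Persian and Urdu, and Japanese text written purely in kanji would not match the kana range used here.

```tcl
# Hypothetical helper: guess a language code from the script in use.
# Returns the code of the first range that matches, or "" if none does.
proc lang'byScript {string} {
    foreach {code range} {
        el {[\u0370-\u03FF]}
        he {[\u0590-\u05FF]}
        ar {[\u0600-\u06FF]}
        ja {[\u3040-\u30FF]}
        ko {[\uAC00-\uD7A3]}
    } {
        # The \uXXXX escapes are interpreted by the regexp engine itself
        if {[regexp $range $string]} {return $code}
    }
    return ""
}

puts [lang'byScript "\u03b1\u03b2\u03b3"]   ;# Greek letters alpha beta gamma
```

Strings for which this returns an empty result would then go on to the substring scoring described below.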

The idea was to collect typical substrings for each language, and see how often they occur in the string in question. These may be characteristic single letters, word parts, or even frequent words. If they're very characteristic, they may be specified two or more times, and thus score more. The sum of such hits is the score, and the language (or languages) with the highest score is the result. The "feature substrings" are collected in a paired list:

JCG: Corrected a few typos in some Spanish sentences.

ECS: Added Portuguese.

LES: The strings for Portuguese are a good start, but I don't agree with some of them because they can easily be found in other languages, like 'nt'. RS: Sure, but if they help to raise the score.. compare "ch" in de, en, es, fr.

Another thing: the "special" characters in de, es, fr and it looked OK the first time I viewed this page. Now they all look like weird symbols. Could someone's browser have fudged it when saving changes?

Hmm... I think I fixed it, such is the power of backup. But hey, Edésio, there seems to be something wrong with your browser.

BR Your browser needs to support UTF-8 correctly in HTML forms if you want to edit non-ASCII text on this wiki. So which browser do you actually use?

Zarutian 18. June 2007: added Icelandic

set features {
    de {ä ö ü ß ß ch ck sch ei en ge ung w}
    en {th y sh ch in ing ed en ns rs w with from and}
    es {á é ión dad es ll ch qu j ya os as üe con ya}
    fr {é é ê è ch ei eu iè oi x y qu au ou de d' et ir te}
    is {á í ú ó ý é þ æ ð ö}
    ga {an á bh bh dh gc iai iai mb mh n- ó uai uai ú}
    it {è è ù da " e " re di il gl io ione ioni cc cch z tt una qu}
    ms {ah ak j jan ke ngan pu se uan ang ber ku}
    nl {ij ij ing tj sj z aa ee ei ou oe oe met ge sch te baar}
    pt {lhã lhõ lha lhe lh nha nhã nho nhõ nhe nt ndo de ão ões ss rr os as ch ça ção ções ico ns apro}
}

# LES gets carried away and suggests:

# pt {" de " " do " " da " " os " " as " " e " " que " " em " " nos " " nas " " na " "de " "te " " se " " às " " aos " " com " lhã lhõ lha lhe lh nha nhã nho nhõ nhe ndo ão õe ãe ssa sse ssi sso ssu rra rre rri rro rru ch ça ção ções ico}
# (removed: de nt os as ns apro ss ões)

The identification code itself is fairly straightforward:

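proc lang'identify {string features} {
    set res {}
    foreach {language regexps} $features {
        # Sum the hits of every feature; repeated features count repeatedly
        set score 0
        foreach regexp $regexps {
            # -- guards against a feature that starts with a dash
            incr score [regexp -all -nocase -- $regexp $string]
        }
        # Report only languages with at least one hit
        if {$score} {lappend res [list $language $score]}
    }
    # Highest score first
    return [lsort -decreasing -integer -index 1 $res]
}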
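To show what the result looks like, here is a small usage sketch. The procedure is repeated so the snippet runs on its own, and the two-language feature list named demo is deliberately tiny and made up for this illustration. Note how the shared feature "ch" scores for both languages.

```tcl
# Copy of lang'identify, repeated so that this snippet is self-contained.
proc lang'identify {string features} {
    set res {}
    foreach {language regexps} $features {
        set score 0
        foreach regexp $regexps {
            incr score [regexp -all -nocase -- $regexp $string]
        }
        if {$score} {lappend res [list $language $score]}
    }
    return [lsort -decreasing -integer -index 1 $res]
}

# Illustrative feature set only, not the tuned list from this page.
set demo {
    de {sch ch ei}
    en {th sh ch}
}
puts [lang'identify {Knusprige Weizenflocken mit Schokoladengeschmack} $demo]
# prints: {de 5} {en 2}
```

The German score is made up of two hits each for "sch" and "ch" plus one for "ei"; the English score comes from "ch" alone.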

But most important is the testing. I collected sample phrases from food packages, which are often pretty multilingual in Germany, and other sources. I kept adding sample strings, and tuning the above feature set until the target language came first most of the time (short strings can't always be identified correctly):

KPV: If you need examples in various unusual languages you could use Google's advanced search page and specify different languages to search for.

set ntests 0
set nproblems 0
set score 0
foreach {language string} {
    de {Dies ist ein Beispiel für einen deutschen Satz}
    de {Knusprige Weizenflocken mit Schokoladengeschmack}
    de {Waffeln mit feiner Haselnusscremefüllung}
    de {Vor dem Öffnen bitte schütteln}
    de {Trocken lagern, vor Licht schützen}
    de {Sicherheitsinformationen für Netzkabel und Zubehör}
    de {Auf dieses Produkt wird eine zwölfmonatige Garantie gegeben}
    en {This is an example for an English sentence}
    en {Wafers filled with hazelnut creme}
    en {Store in a dry place, protect from light}
    en {Safety precautions for power cords and accessories}
    en {This product is warranted for the period of twelve months}
    en {Worldwide telephone numbers}
    en {For continuous quality improvement, calls may be monitored}
    es {Él que no espera vencer, ya está vencido}
    es {Copos de trigo tostados con chocolate}
    es {Barquillos rellenos de crema de avellanas}
    es {Precauciones de seguridad para cables de alimentación y accesorios}
    es {Este producto está garantizado por un período de doce meses}
    es {Esta garantía no cubre ninguno de los siguientes casos}
    es {En caso necesario, la lista de nuestros Servicios Autorizados
        está disponible}
    fr {Voilà  un autre exemple pour une phrase francaise}
    fr {Gaufrettes fourrées à la noisette}
    fr {Agiter avant d'ouvrir}
    fr {Conserver au réfrigerateur une fois ouvert et consommer dans les
        jours qui suivent}
    fr {Protéger contre la lumière}
    fr {A consommer de préférence avant fin:}
    fr {Précautions de sécurité concernant les cordons d'alimentation}
    fr {Cet appareil est couvert par une garantie de douze mois}
    ga {Eolas an Chuairteora do Láithreáinin Oidhreachta}
    ga {Tá roinnt bríonna leis an bhfocal Dúchas}
    ga {Tugann na milliúin daione cuairt ar ár n-ionaid gach bliain}
    ga {Tá na hAmanna oscailte sa bhfoilseachán seo i gceart ag am priondála}
    ga {Ionad Cuairteora Pháirc an Fhionnuisce}
    ga {Tógadh an caisleán le linn na mblianta 1870-73}
    ga {Deirtear gur tógadh an foirgneamh is sine anseo 400 bliain ó shin}
    it {Questo è un altro esempio per una frase italiana}
    it {Fiocchi di frumento al cioccolato}
    it {Wafers ripieni di crema alla nocciola}
    it {Agitare prima di aprire}
    it {Da consumare preferibilmente entro il:}
    it {Una volta aperto tenere in frigo e consumare entro qualche giorno}
    it {Precauzioni relative alla sicurezza per i cavi di alimentazione}
    it {Questo prodotto è garantito per un periodo di dodici mesi}
    it {Se non è vero, è ben trovato}
    ms {Pendahuluan cetak yang dibarahui}
    ms {Dia pun hendak ikut saya ke kedai}
    ms {Orang itu pun membuat kerjanya dengan cepat}
    ms {Pukul sembilan setengah malam}
    ms {Jepun pun kalah dalam pertandingan bolasepak Piala Merdeka}
    ms {Meja ini baik, tetapi meja itu pun baik juga}
    ms {Cik Pun pun turut serta dalam pertandingan itu}
    nl {Het is tijd om op te staan, vandaag is het zaterdag}
    nl {Wafeltjes met hazelnootcrèmevulling}
    nl {Tegen licht beschermen. Tenminste houdbaar tot einde:}
    nl {Dit produkt is gegarandeerd voor een periode van twaalf maanden}
    nl {De garantie is alleen geldig wanneer de garantiekaart volledig is ingevuld}
    nl {In probleemgevallen kunt U nadere informatie verkrijgen}
    nl {Reparaties onder garantie moeten door servicecentra worden uitgevoerd}
    pt {Alguns quatrilhões de ítens de informação, formando amostragens de}
    pt {Essas pulsações eram armazenadas em mnemocircuitos idênticos}
    pt {Na praça da cidade, a fila tinha se formado às cinco da manhã com os}
    pt {sentido humorístico. Tinha um fim. Estava armado da lista telefônica.}
    pt {aproximar da Estação Ether e poderei aproveitar a caminhada para}
} {
    incr ntests
    set res [join [lang'identify $string $features]]
    if {[lindex $res 0] ne $language} {
        puts "$string\n   $res **** should have been: $language"
        incr nproblems
    } elseif {[llength $res]==2} {
        incr score [lindex $res 1]
    } elseif {[lindex $res 1] > [lindex $res 3]} {
        incr score [expr {[lindex $res 1]-[lindex $res 3]}]
    } else {
        puts "$string\n   $res **** ambiguous: $language"
        incr nproblems
    }
}
puts "score: [expr {30*$score/$ntests}] Passed:[expr {$ntests-$nproblems}]/$ntests"
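The loop above rewards a correct identification by the margin between the winner and the runner-up. That rule can be isolated as a minimal sketch; the name lang'margin is an assumption for this example and does not appear in the original code.

```tcl
# Hypothetical helper: margin by which the top-ranked language wins.
# $ranked is a result in the form returned by lang'identify,
# e.g. {{de 5} {en 2}}, already sorted by decreasing score.
proc lang'margin {ranked} {
    if {[llength $ranked] == 0} {return 0}
    set best [lindex $ranked 0 1]
    # A single candidate wins by its full score
    if {[llength $ranked] == 1} {return $best}
    return [expr {$best - [lindex $ranked 1 1]}]
}

puts [lang'margin {{de 5} {en 2}}]          ;# prints 3
puts [lang'margin {{ms 9}}]                 ;# prints 9
puts [lang'margin {{en 2} {ms 2} {nl 1}}]   ;# prints 0, a tie
```

A margin of 0 corresponds to the "ambiguous" branch of the test loop.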

GS (040612) Some hints for going further can be found at the Gertjan van Noord web site [1].


slebetman The opening line of this common Malay rhyme fails the test and is identified as English, with a result of {en 2} {nl 1}:

set teststring {dua tiga kucing berlari}

Adding "ber" and "ku" to the list of ms identifiers improves the match to {en 2} {ms 2} {nl 1}, which is still only a tie with English. However, the full rhyme passes the original test with a score of {ms 5} {en 4} {ga 4} {nl 2} {it 1}. Here's the full rhyme:

set teststring {
    dua tiga kucin berlari,
    mana nak sama si kucing belang,
    dua tiga boleh ku cari,
    mana nak sama si adik seorang.
}

The addition of "ber" and "ku" improves the result further with a score of {ms 9}. So I added them to the features list above.
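The effect of the two new features on the opening line can be checked in isolation. This is a minimal sketch; lang'score is a hypothetical helper that counts hits the same way lang'identify does, and the variable names are made up for the example.

```tcl
# Hypothetical helper: total hits of one feature list in a string.
proc lang'score {string feats} {
    set score 0
    foreach f $feats {
        incr score [regexp -all -nocase -- $f $string]
    }
    return $score
}

set line {dua tiga kucing berlari}
set old  {ah ak j jan ke ngan pu se uan ang}   ;# ms features before the change
set new  [concat $old {ber ku}]                ;# ms features after the change

puts [lang'score $line $old]   ;# prints 0
puts [lang'score $line $new]   ;# prints 2
```

The two extra hits come from "ber" in "berlari" and "ku" in "kucing".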


You can get UTF-8 samples of a boatload of languages here: http://unicode.org/udhr/