Hanglish

Richard Suchenwirth 2001-02-06 -- Hangul is the Korean writing system. Each syllable is written as a roughly square block of its constituent letters ("jamo"), arranged left to right or top to bottom. Transliteration is the element-by-element conversion of text from one writing system to another (when the target is the Latin alphabet, it is also called romanization). "Hanglish" is the name of the following romanization scheme from The Lish family, chosen by analogy with Greeklish, which is often used on the Net to write Greek in Latin letters.

There is an ISO agreement (ISO/TC46/SC2/WG4, 1992) on Hangul transliteration from which I slightly deviate:

  • let only L stand for "r,l"
  • let only Q stand for "('),ng", the circle jamo - at least Q looks circular, and an empty string would be ugly in a transliteration. All other consonants are unchanged.
  • let W stand for "eu", let E stand for "eo"
  • let EI stand for "e" (two distinct graphemes, easily segmented)
  • for the bottom/right diphthongs, don't use "w-" indiscriminately for U and O. Instead, write OA for "wa", UE for "weo", UI for "wi", WI for "yi"
  • the palatal vowels "ya, yeo, yo, yu" can hardly be segmented into the extra dot and the base vowel. So, use the remaining letters for these graphemes: V for "ya", X for "yeo", Y for "yo", Z for "yu"
  • thus, the "e" above is rendered as EI, "ye" as XI. While XI is easily segmented, the "ae/yae" diphthongs are better treated as single graphemes.

One could still express the composition with AI for "ae", VI for "yae". To best suit OCR/interpretation needs, however, I prefer to use the two left-over letters: F for "yae", R for "ae".
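As a sketch, the vowel assignments above can be tabulated by pairing ISO-style vowel names with the Hanglish letters in jamo order. The Hanglish column is the vowel list used in the code on this page. The names "wae", "oe" and "we" are not mentioned in the text and are assumed here from the usual jamo naming. The proc name isoVowel2hanglish is hypothetical, and Tcl 8.5 or later is assumed.

```tcl
# ISO-style vowel names in jamo order ("wae", "oe", "we" are assumed names)
set isoVowels      {a ae ya yae eo e yeo ye o wa wae oe yo u weo we wi yu eu yi i}
# Hanglish letters for the same 21 vowels, as used by the procs on this page
set hanglishVowels {A R V F E EI X XI O OA OR OI Y U UE UEI UI Z W WI I}

# Hypothetical helper: look up the Hanglish spelling of one ISO-style vowel.
# Returns an empty string if the vowel name is not in the table.
proc isoVowel2hanglish {iso} {
    global isoVowels hanglishVowels
    set i [lsearch -exact $isoVowels [string tolower $iso]]
    if {$i < 0} {return ""}
    return [lindex $hanglishVowels $i]
}

puts [isoVowel2hanglish weo]   ;# prints UE
puts [isoVowel2hanglish yae]   ;# prints F
```

Keeping both lists in the same order means the position alone carries the mapping, which is also how the conversion procs index their tables.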

After so much theory, here's the code:

 proc hangul2hanglish {numuc} {
    # takes a numeric Unicode value (scan %c yields one from a character, from 8.1b1)
    set ncount [expr {21*28}] ;# vowels * trailing consonants per lead consonant
    set index [expr {$numuc - 0xAC00}] ;# offset of Unicode 2.0 Hangul
    append res [lindex {G GG N D DD L M B BB S SS Q J JJ C K T P H}\
            [expr {$index/$ncount}]]
    append res [lindex {A R V F E EI X XI O OA OR OI Y U UE UEI UI Z W WI I}\
            [expr {($index%$ncount)/28}]]
    append res [lindex {"" G GG GS N NJ NH D L LG LM LB LS LT LP LH \
            M B BS S SS Q J C K T P H}\
            [expr {$index%28}]]
    return $res
 }
 proc hanglish2uc {hanglish} {
    # convert a Hanglish string to one Unicode 2.0 Hangul if possible
    set L ""; set V "" ;# in case regexp doesn't hit
    regexp {^([GNDLMBSQJCKTPH]+)([ARVFEIXOYUZW]+)([GNDLMBSQJCKTPH]*)$} \
            [string toupper $hanglish] ->  L V T 
    ;# lead consonant - vowel - trail cons.
    if {$L=="" || $V==""} {return $hanglish}
    set l [lsearch {G GG N D DD L M B BB S SS Q J JJ C K T P H} $L]
    set v [lsearch {A R V F E EI X XI O OA OR OI Y U UE UEI UI Z W WI I} $V]
    set t [lsearch {"" G GG GS N NJ NH D L LG LM LB LS LT LP LH  \
            M B BS S SS Q J C K T P H} $T] ;# trailing consonants
    if {$l<0 || $v<0 || $t<0} {return $hanglish} ;# some part is not in the tables
    set uc [expr {$l*21*28 + $v*28 + $t + 0xAC00}]
    return [format %c $uc]
 }
 proc hanglish {args} {
    # tolerant converter: makes Unicode 2.0 Hangul where possible
    set res ""
    foreach i $args {
        set word ""
        set s $i ;# work on a copy, so an unconvertible word comes back unchanged
        foreach {from to} {
            ai r vi f
        } {regsub -all -nocase $from $s $to s}
        foreach j [split $s "-"] {
            set t [hanglish2uc $j]
            if {$j eq $t} {set word $i; break} ;# all syllables must fit
            append word $t
        }
        lappend res $word
    }
    return $res
 }
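The index arithmetic in hangul2hanglish can also be looked at in isolation. The following is a minimal sketch, assuming Tcl 8.5 or later and input consisting of one precomposed Hangul syllable; the proc name syllableIndices is hypothetical. It starts from a character rather than a number by using scan %c.

```tcl
# Hypothetical helper: split one precomposed Hangul syllable into the three
# table indices (lead consonant, vowel, trailing consonant).
proc syllableIndices {char} {
    scan $char %c code                  ;# character -> numeric code point
    set index [expr {$code - 0xAC00}]   ;# offset from the first Hangul syllable
    if {$index < 0 || $index >= 19*21*28} {
        error "not a precomposed Hangul syllable: $char"
    }
    set lead  [expr {$index / (21*28)}]        ;# 0..18
    set vowel [expr {($index % (21*28)) / 28}] ;# 0..20
    set trail [expr {$index % 28}]             ;# 0..27, 0 = no trailing consonant
    return [list $lead $vowel $trail]
}

puts [syllableIndices \uC11C]   ;# prints 9 4 0   (S, E, no trail)
puts [syllableIndices \uC6B8]   ;# prints 11 13 8 (Q, U, L)
```

The three numbers are exactly the positions that lindex picks from the lead, vowel and trail lists above.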

Usage example: [hanglish Se-qul] produces the Hangul for South Korea's capital, Seoul. Note that the circle jamo is written as Q, although it is silent at the beginning of a syllable (at the end, it is pronounced /ng/).
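To make the arithmetic behind this example concrete, here is a short worked calculation for the two syllables, using the same formula as hanglish2uc. The indices are the positions of S, E, Q, U and L in the lists of that proc, counted from zero; the variable names are only for illustration.

```tcl
# SE: lead S is index 9, vowel E is index 4, no trailing consonant (index 0)
set se  [expr {0xAC00 + 9*21*28 + 4*28 + 0}]
# QUL: lead Q is index 11, vowel U is index 13, trail L is index 8
set qul [expr {0xAC00 + 11*21*28 + 13*28 + 8}]

puts [format "U+%04X U+%04X" $se $qul]   ;# prints U+C11C U+C6B8
```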


These routines have been incorporated into taiku, see taiku goes multilingual, which also introduces some liberalisations: for Q you can write NG, or you can omit it in syllable-initial position, so se-ul has the same effect there. For conversion in both directions, with a GUI, see A little Hangul converter.
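The two liberalisations could be approximated by a preprocessing step that rewrites relaxed input into strict Hanglish. This is only a sketch of the idea, not the code used in taiku; the proc names relaxSyllable and relaxWord are hypothetical, and Tcl 8.5 or later is assumed.

```tcl
# Hypothetical helper: turn one relaxed syllable into strict Hanglish.
proc relaxSyllable {syl} {
    set s [string toupper $syl]
    # NG is accepted as another spelling of Q (no Hanglish cluster contains NG)
    set s [string map {NG Q} $s]
    # a syllable that starts with a vowel letter gets the silent initial Q
    if {[regexp {^[ARVFEIXOYUZW]} $s]} {set s Q$s}
    return $s
}

# Hypothetical helper: apply relaxSyllable to each hyphen-separated syllable.
proc relaxWord {word} {
    set out {}
    foreach syl [split $word -] {lappend out [relaxSyllable $syl]}
    return [join $out -]
}

puts [relaxWord se-ul]   ;# prints SE-QUL
```

The result can then be passed to a strict converter such as the hanglish proc above.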