[Richard Suchenwirth] 2001-02-06 -- Hangul is the Korean writing system. Each syllable is represented by an often square arrangement of its constituent letters ("jamo"), placed either left-to-right or top-to-bottom. Transliteration is the element-by-element conversion of text from one writing system to another (if the target is English/Latin, it is also called romanization). "Hanglish" is the name of the following romanization scheme from [The Lish family], chosen by analogy with [Greeklish], which is often used on the Net to write Greek in Latin letters.

There is an ISO agreement (ISO/TC46/SC2/WG4, 1992) on Hangul transliteration, from which I deviate slightly:

   * let only L stand for "r,l"
   * let only Q stand for "('),ng" - at least it looks circular. Empty strings are ugly in transliteration. All other consonants are unchanged.
   * let W stand for "eu", let E stand for "eo"
   * let EI stand for "e" (two distinct graphemes, easily segmented)
   * for the bottom/right diphthongs, don't use "w-" indiscriminately for U and O. Instead, use OA for "wa", UE for "weo", UI for "wi", WI for "yi"
   * the palatal vowels "ya, yeo, yo, yu" can hardly be segmented into the extra dot and the base vowel. So, use the remaining letters for these graphemes: V for "ya", X for "yeo", Y for "yo", Z for "yu"
   * thus, the "e" above is rendered as EI, "ye" as XI. While XI is easily segmented, the "ae/yae" diphthongs would rather come as single graphemes. One could still express the composition with AI for "ae", VI for "yae". For best adaptation to OCR/interpretation needs, however, I prefer to use the two left-over letters: F for "yae", R for "ae".
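The vowel assignments listed above can be summarized as a lookup table. The following is only a sketch: the variable name `vowelMap` is made up for illustration, it assumes Tcl 8.5 or later (for `dict`), and it covers only the assignments spelled out in the list plus the plain vowels a, o, u, i, which keep their own letters.

```tcl
# Sketch only: conventional romanization -> Hanglish letter(s) for vowels.
# vowelMap is a hypothetical name, not part of the procs on this page.
set vowelMap {
    a A    ae R    ya V    yae F    eo E    e EI    yeo X    ye XI
    o O    wa OA   yo Y    u U      weo UE  wi UI   yu Z     eu W
    yi WI  i I
}

# Look up a few of the less obvious assignments
foreach rom {eo yeo wa yi} {
    puts "$rom -> [dict get $vowelMap $rom]"
}
```

Every value is built from single capital letters, so each vowel grapheme can be picked out of a syllable without ambiguity.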
After so much theory, here's the conversion code:
======
proc hangul2hanglish {numuc} {
    # takes a numeric Unicode so far (until scan works, from 8.1b1)
    set ncount [expr {21*28}]
    set index [expr {$numuc - 0xAC00}] ;# offset of Unicode 2.0 Hangul
    # lead consonant: 19 choices
    append res [lindex {G GG N D DD L M B BB S SS Q J JJ C K T P H} \
        [expr {$index/$ncount}]]
    # vowel: 21 choices
    append res [lindex {A R V F E EI X XI O OA OR OI Y U UE UEI UI Z W WI I} \
        [expr {($index%$ncount)/28}]]
    # trailing consonant(s): 28 choices, the first being "none"
    append res [lindex {"" G GG GS N NJ NH D L LG LM LB LS LT LP LH \
        M B BS S SS Q J C K T P H} \
        [expr {$index%28}]]
    return $res
}
proc hanglish2uc {hanglish} {
    # convert a Hanglish string to one Unicode 2.0 Hangul if possible
    set L ""; set V ""; set T "" ;# in case regexp doesn't hit
    regexp {^([GNDLMBSQJCKTPH]+)([ARVFEIXOYUZW]+)([GNDLMBSQJCKTPH]*)$} \
        [string toupper $hanglish] -> L V T ;# lead consonant - vowel - trail cons.
    if {$L=="" || $V==""} {return $hanglish}
    set l [lsearch -exact {G GG N D DD L M B BB S SS Q J JJ C K T P H} $L]
    set v [lsearch -exact {A R V F E EI X XI O OA OR OI Y U UE UEI UI Z W WI I} $V]
    set t [lsearch -exact {"" G GG GS N NJ NH D L LG LM LB LS LT LP LH \
        M B BS S SS Q J C K T P H} $T] ;# trailing consonants
    if {$l<0 || $v<0 || $t<0} {return $hanglish} ;# some part is not in its table
    set uc [expr {$l*21*28 + $v*28 + $t + 0xAC00}]
    return [format %c $uc]
}
proc hanglish {args} {
    # tolerant converter: makes Unicode 2.0 Hangul where possible
    set res ""
    foreach i $args {
        set word ""
        # accept the two-letter spellings AI and VI for R and F
        foreach {from to} {ai r vi f} {regsub -all $from $i $to i}
        foreach j [split $i "-"] {
            set t [hanglish2uc $j]
            if {$j eq $t} {set word $i; break} ;# all syllables must fit
            append word $t
        }
        lappend res $word
    }
    return $res
}
======
''Usage example:'' [[hanglish Se-qul]] produces the Hangul for South Korea's capital, Seoul.
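To make the index arithmetic in the procs concrete, here is a small worked example that does not depend on them. It is a sketch under stated assumptions: the syllable HAN is my own choice of example, and the variable names `lead`, `vowel`, `trail` are illustrative. The indices are the positions of H, A and N in the three tables used above.

```tcl
# Sketch only: compose and decompose one precomposed Hangul syllable by hand.
# H is index 18 in the lead table, A is index 0 in the vowel table,
# N is index 4 in the trailing table (index 0 there means "no trailing consonant").
set lead 18
set vowel 0
set trail 4

# Compose: 21 vowels times 28 trailing forms per lead consonant
set code [expr {0xAC00 + ($lead*21 + $vowel)*28 + $trail}]
puts [format "U+%04X" $code]

# Decompose again, the same way hangul2hanglish does
set index [expr {$code - 0xAC00}]
set parts [list [expr {$index / (21*28)}] \
                [expr {($index % (21*28)) / 28}] \
                [expr {$index % 28}]]
puts $parts
```

This prints U+D55C followed by 18 0 4, which shows that the three table indices survive the round trip through the code point.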
Note that the circle jamo is written as Q, although it is silent at the beginning of a syllable (at the end, it is /ng/).
----
These routines have been incorporated into ''taiku'', see [taiku goes multilingual], which also introduces liberalisations: for Q you can write NG, or you can omit it in syllable-initial position, so ''se-ul'' has the same effect there. Also, going both ways, and with a GUI: [A little Hangul converter].

<> Arts and crafts of Tcl-Tk programming | A little Korean editor | Characters