Version 10 of bidi rendering

Updated 2008-06-27 09:36:05 by APN

Richard Suchenwirth - bidi is short for bidirectional, and is used especially where right-to-left (r2l) text (e.g. Arabic, Hebrew - see Heblish) is embedded in left-to-right text or vice versa. The Unicode standard specifies that text is always stored in logical order, so in memory (or in files), Arabic and Hebrew run in the order of ascending addresses, exactly like left-to-right text, and it's the task of the rendering software to restore the habitual direction.

Tk has no bidi facilities yet, so Unicode characters, even those from r2l scripts, must be supplied naively left-to-right if they are to appear correctly on screen. In taiku (taiku goes multilingual) I ran into the problem that my simple "printer driver", which lets IE do the work, failed on Hebrew: IE has bidi handling, and thus wrongly re-reverses the Hebrew words. OK, so here's a simple bidi handler that reverses sequences of r2l characters. The conditions for prepending a character to a swapping sequence are:

(1) character is in Arabic or Hebrew range of Unicode, or:

(2) character is in a set of "direction-independent" ones (presently dash and space) - then it is only prepended if the buffer is not empty.

All other characters append the buffer to the output and flush it before being appended themselves. As IE is also able to render Arabic into context-dependent glyphs, no glyph selection was necessary here (otherwise, see A simple Arabic renderer). Handling of mixed lines with both English and Arabic words is satisfactory. Embedded Indo-Arabic digits seem to turn off IE's bidi totally, so better avoid them for now.

 proc fixBidi s {
    set res {}; set buf {}
    foreach c [split $s ""] {
        if {[r2l $c] || ([regexp {[ -]} $c] && $buf ne "")} {
            set buf $c$buf ;# prepend
        } else {
            append res $buf$c ;# empty buffer won't hurt
            set buf {}
        }
    }
    append res $buf ;# in case some were left at end of line
 }
 proc r2l c {
    scan $c %c uc
    expr {$uc >= 0x05b0 && $uc <= 0x065f
       || $uc >= 0x066a && $uc <= 0x06bf
    } ;# Arabic context glyphs need not be reverted for IE
 }

The description and the code seem to be at odds. It seems that the test for space and dash should be removed from r2l, and the test for '$c!=""' should be '$buf!=""'. (Jeff Epler) - RS: Thanks, fixed.
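
To see how the buffer and the output evolve, the loop can be instrumented. The following is only a sketch: traceBidi and codes are hypothetical helper names, not part of the code above. traceBidi repeats the same two conditions and the same code point ranges inline so that it runs on its own, and codes prints strings as hex code points so that the result can be checked regardless of how the terminal itself treats r2l text.

 # Hypothetical helper: the hex code points of a string, as a list
 proc codes s {
    set out {}
    foreach c [split $s ""] {
        scan $c %c uc
        lappend out [format %04x $uc]
    }
    return $out
 }
 # Hypothetical instrumented copy of the fixBidi loop: same conditions,
 # same ranges, but reports buffer and output after every character
 proc traceBidi s {
    set res {}; set buf {}
    foreach c [split $s ""] {
        scan $c %c uc
        set isR2l [expr {($uc >= 0x05b0 && $uc <= 0x065f)
                      || ($uc >= 0x066a && $uc <= 0x06bf)}]
        if {$isR2l || ([regexp {[ -]} $c] && $buf ne "")} {
            set buf $c$buf ;# prepend, so the run ends up reversed
        } else {
            append res $buf$c ;# flush the run, then the character itself
            set buf {}
        }
        puts "[codes $c]: buf=[codes $buf] res=[codes $res]"
    }
    append res $buf
 }
 # "ab " followed by alef, bet, gimel
 puts [codes [traceBidi "ab \u05d0\u05d1\u05d2"]]
 # last line printed: 0061 0062 0020 05d2 05d1 05d0

The space after "ab" goes straight to the output because the buffer is still empty at that point; the three Hebrew letters are collected in the buffer in reverse order and only flushed at the end of the line.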

The next routine "fixes" a file with partial bidi content in place, so that it can be rendered correctly by bidi-aware software such as Outlook, WordPad, PowerPoint, or IE.


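 proc fixBidiFile {filename} {
    # Read the whole file as UTF-8; -nonewline drops the final newline,
    # so that split does not yield a spurious empty last line
    set fp [open $filename]
    fconfigure $fp -encoding utf-8
    set data [read -nonewline $fp]
    close $fp
    # Rewrite the same file in place, line by line
    set fp [open $filename w]
    fconfigure $fp -encoding utf-8
    foreach line [split $data \n] {
       puts $fp [fixBidi $line]
    }
    close $fp
 }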

Devanagari i: The Indian alphabets normally run from left to right. In at least the Devanagari alphabet there is one special case, though: the vowel sign for short i is written to the left of its consonant, even though it follows it as a phoneme. A good example word is हिन्दी HindI (the second i being a long one), where the letter order from left to right is iHndI. On Windows XP, some renderers for Devanagari do this transposition automatically; others (including the one that Tk uses) don't.

For a fix to this problem, a regsub should be enough to locally reverse the character order:

 regsub -all {(.)\u093f} $str \u093f\\1 str
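
As a small check of what this substitution does, here is a sketch that applies it to the word Hindi written with \u escapes in storage order, and then prints the resulting code points. The variable names are only for illustration.

 # ha, short-i sign, na, \u094D, da, long-i sign (storage order)
 set str "\u0939\u093f\u0928\u094d\u0926\u0940"
 regsub -all {(.)\u093f} $str \u093f\\1 str
 # Collect the code points to check the new order
 set out {}
 foreach c [split $str ""] {
    scan $c %c uc
    lappend out [format %04x $uc]
 }
 puts $out
 # prints: 093f 0939 0928 094d 0926 0940

The short-i sign \u093f now precedes the ha, while the long-i sign \u0940 at the end is left alone. Note that the pattern swaps the sign with the single preceding character only; whether that is enough for consonant clusters joined by \u094D is not covered by this sketch.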

Another problem in Tk rendering of Devanagari, for which I have no solution, can also be demonstrated on Hindi हिन्दी . The "n" in proper rendering does not have its vertical bar, but rather just sticks to the following "d". The Tk rendering provides a full "n" with subscript \u094D (silent vowel indicator), like हिन् दी . I'm not sure how much this inhibits readability, or how "wrong" it looks to experienced users of Devanagari... - RS

2008-06-27 APN: It looks completely wrong to me. Though I'm not a native Hindi speaker per se, I'm close enough, I think. For what it's worth, Firefox 2 also has this rendering problem. I'm surprised, because I had thought (naively, I now realize) that Windows does all the rendering and thus all apps would display the same way.

