Unicode file reader auto-detects and reads files in 16-bit Unicode representation
Richard Suchenwirth 1999-07-23: In order to auto-detect and read files (e.g. Tcl sources with Hebrew literals) in 16-bit Unicode representation, I wrote the following:
aspect 2014-02-12: WARNING: The following code is (unusually for RS!) rather buggy. It's a good example for newcomers of some easy mistakes to make:
Doubtless it worked for RS, as these bugs are mostly harmless, but such errors add up to hard-to-trace bugs when you start distributing code ...
    proc file:uread {fn} {
        set encoding ""
        set f [open $fn r]
        if {[info tclversion]>=8.1} {
            gets $f line
            if {[regexp \xFE\xFF $line]||[regexp \xFF\xFE $line]} {
                fconfigure $f -encoding unicode
                set encoding unicode
            }
            seek $f 0 start ;# rewind -- real reading is still to come
        }
        set text [read $f [file size $fn]]
        close $f
        if {$encoding=="unicode"} {
            regsub -all "\uFEFF|\uFFFE" $text "" text
        }
        return $text
    }
Works on both ASCII and Unicode files, but not on byte-swapped ones: FFFE seems to be handled in the code, but the swapping itself is not yet ;-( See also: Unicode and UTF-8
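One way to make the detection step less dependent on the channel's default encoding is to look at the first two bytes in binary mode. The following is only a sketch: the name bomOf and its return values are made up for illustration, and it assumes Tcl 8.4 or later for the eq operator.

```tcl
# Hypothetical helper: report which 16-bit byte order mark, if any, starts a file.
proc bomOf {fn} {
    set f [open $fn r]
    fconfigure $f -translation binary   ;# raw bytes, no encoding or EOL translation
    set head [read $f 2]
    close $f
    if {$head eq "\xFE\xFF"} { return bigEndian }
    if {$head eq "\xFF\xFE"} { return littleEndian }
    return none
}
```

A caller could compare the result with $tcl_platform(byteOrder) to decide whether the data needs swapping before it is decoded.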
Frank Pilhofer contributed the following swapper, which operates on string data that might be a whole Unicode file, in comp.lang.tcl:
Fortunately, swapping is pretty easy in Tcl, at least in LOC:
    private method wordswap {data} {
        binary scan $data s* elements
        return [binary format S* $elements]
    }
jima: I think it is better to use:
binary scan $data c* elements
Can any expert test my (jima's) point?
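To try both ideas outside an [incr Tcl] class, the swapper can be written as a plain proc and compared with a byte-wise variant. This is a sketch under an assumption: it guesses that the c* suggestion means scanning single bytes and exchanging each pair by hand. The name wordswapBytes is made up for the comparison.

```tcl
# Frank Pilhofer's swapper as a plain proc: read little-endian 16-bit
# words, write them back big-endian.
proc wordswap {data} {
    set elements {}
    binary scan $data s* elements
    return [binary format S* $elements]
}

# Hypothetical byte-wise variant: scan single bytes and exchange each pair.
proc wordswapBytes {data} {
    set bytes {}
    binary scan $data c* bytes
    set out {}
    foreach {a b} $bytes {
        if {$b eq ""} break   ;# odd trailing byte is dropped, as with s*
        lappend out $b $a
    }
    return [binary format c* $out]
}

binary scan [wordswap [binary format H* 01020304]] H* hex1
binary scan [wordswapBytes [binary format H* 01020304]] H* hex2
puts "$hex1 $hex2"   ;# 02010403 02010403
```

Both produce the same bytes for this input; the s*/S* form is shorter and avoids the explicit loop.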
So I'm now using the following code for reading:
    global tcl_platform
    if {[binary scan $data S bom] == 1} {
        if {$bom == -257} {
            if {$tcl_platform(byteOrder) == "littleEndian"} {
                set data [wordswap [string range $data 2 end]]
            } else {
                set data [string range $data 2 end]
            }
        } elseif {$bom == -2} {
            if {$tcl_platform(byteOrder) == "littleEndian"} {
                set data [string range $data 2 end]
            } else {
                set data [wordswap [string range $data 2 end]]
            }
        } elseif {$tcl_platform(byteOrder) == "littleEndian"} {
            set data [wordswap $data]
        }
    }
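The fragment above leaves $data as 16-bit words in the platform's native byte order, still as raw bytes. A possible next step is to decode those bytes into a Tcl string. The sketch below folds the same BOM logic into one proc; the names swap16 and utf16ToString are invented here, and the encoding name is an assumption noted in the comments.

```tcl
# Hypothetical plain-proc swapper, same technique as wordswap.
proc swap16 {data} {
    set words {}
    binary scan $data s* words
    return [binary format S* $words]
}

# Hypothetical wrapper: strip a BOM, normalize to native order, decode.
proc utf16ToString {data} {
    global tcl_platform
    set order bigEndian   ;# as in the fragment above: no BOM means big-endian
    if {[binary scan $data S bom] == 1} {
        if {$bom == -257} {
            # bytes FE FF: big-endian data
            set data [string range $data 2 end]
        } elseif {$bom == -2} {
            # bytes FF FE: little-endian data
            set order littleEndian
            set data [string range $data 2 end]
        }
    }
    if {$order ne $tcl_platform(byteOrder)} {
        set data [swap16 $data]
    }
    # Assumption: "unicode" names the native-order 16-bit encoding
    # (Tcl 8.1 to 8.6); fall back to "utf-16" where only that name exists.
    set enc unicode
    if {[lsearch -exact [encoding names] unicode] < 0} { set enc utf-16 }
    return [encoding convertfrom $enc $data]
}

puts [utf16ToString [binary format H* fffe48006900]]   ;# Hi
```

Here $data is expected to come from a channel configured with -translation binary, so that no encoding conversion has happened before the BOM check.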
Slightly off-topic note: The code example above tests for the Tcl version with
if {[info tclversion] >= 8.1} ...
A better way of testing that is to use:
if {[package vcompare [package provide Tcl] 8.1] >= 0} ...
That will continue working if Tcl releases are ever labeled with version numbers more than two levels deep, or if/when a minor release > 9 is released.
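A short illustration of the difference, using a hypothetical minor release 10 (a sketch; the version numbers are picked only to show the effect):

```tcl
# Numeric comparison reads the versions as floating-point numbers.
puts [expr {8.10 >= 8.9}]          ;# 0, because 8.10 is just the number 8.1
# package vcompare compares component by component.
puts [package vcompare 8.10 8.9]   ;# 1, because minor 10 is newer than minor 9
```

The first form would wrongly conclude that 8.10 is older than 8.9, which is the failure mode described above.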
RS:
Sure. I admit yours is The Right Way ;-) -- only it's about double as long as mine... Maybe I'm pampered, but I've grown to expect it could be done even simpler, so that frequent constructs are nicely wrapped:
    proc version {"of" pkg op vers} {
        expr [package vcompare [package provide $pkg] $vers] $op 0
    }
Then we can write this sugar: (cf Salt and Sugar)
if [version of Tcl >= 8.1] {...