'''Unicode file reader''' auto-detects and reads files in 16-bit Unicode representation.

** See Also **

   * [Unicode and UTF-8]

** Description **

[Richard Suchenwirth] 1999-07-23: In order to auto-detect and read files (e.g. Tcl sources with Hebrew literals) in 16-bit Unicode representation, I wrote the following:

[aspect] 2014-02-12: '''WARNING''': The following code is (unusually for [RS]!) rather buggy. It's a good example for newcomers of some easy mistakes to make:

   * the initial `gets` reads `$fn` in `encoding system`, which could be anything (and might not let BOMs through unscathed)
   * the main `read` expects a number of characters in its second argument, but `file size` counts bytes
   * strings are compared with `==` instead of `eq`
   * `info tclversion` is not documented as returning a number (see [Donald Porter]'s comment below)

Doubtless it worked for RS, as these bugs are mostly harmless, but such errors add up to hard-to-trace bugs when you start distributing code ...

======
proc file:uread {fn} {
    set encoding ""
    set f [open $fn r]
    if {[info tclversion]>=8.1} {
        gets $f line
        if {[regexp \xFE\xFF $line]||[regexp \xFF\xFE $line]} {
            fconfigure $f -encoding unicode
            set encoding unicode
        }
        seek $f 0 start ;# rewind -- real reading is still to come
    }
    set text [read $f [file size $fn]]
    close $f
    if {$encoding=="unicode"} {
        regsub -all "\uFEFF|\uFFFE" $text "" text
    }
    return $text
}
======

This works on both ASCII and Unicode files, but not on byte-swapped ones: FFFE is recognized in the code, but the swapping itself is not yet done ;-(
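
A minimal sketch of how the points in the warning above might be addressed, assuming Tcl 8.5 or later (for the `rb` open mode and the `u` modifier of `binary scan`). It is not RS's code: the name `ureadSketch` is hypothetical, files without a BOM are assumed to be in the system encoding, and each 16-bit unit is turned into one character, so surrogate pairs are not combined. The file is read in binary mode so the BOM arrives intact, `read` is called without a size, and strings are compared with `eq`.

```tcl
# Hypothetical helper, not part of the original page.
proc ureadSketch {fn} {
    set f [open $fn rb]              ;# binary mode: bytes arrive untranslated
    set data [read $f]               ;# no size argument: read to end of file
    close $f

    # Choose the byte order from the BOM, if there is one.
    set bom [string range $data 0 1]
    if {$bom eq "\xFE\xFF"} {
        set fmt Su*                  ;# big-endian unsigned 16-bit units
    } elseif {$bom eq "\xFF\xFE"} {
        set fmt su*                  ;# little-endian unsigned 16-bit units
    } else {
        # Assumption: no BOM means an ordinary text file in the system encoding.
        return [encoding convertfrom [encoding system] $data]
    }

    set units {}
    binary scan [string range $data 2 end] $fmt units

    # One character per 16-bit unit (surrogate pairs are not combined).
    set text ""
    foreach u $units {
        append text [format %c $u]
    }
    return $text
}
```

Because the byte order is chosen by the scan format rather than by the platform, no separate swapping step is needed in this variant.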
----
[Frank Pilhofer] contributed the following swapper, which operates on a string ''data'' that might be a whole Unicode file, in [comp.lang.tcl]:

Fortunately, swapping is pretty easy in Tcl, at least in LOC:

======
private method wordswap {data} {
    binary scan $data s* elements
    return [binary format S* $elements]
}
======

[jima]: I think it is better to use:

======
binary scan $data c* elements
======

Can any expert check my ([jima]) point?

So I'm now using the following code for reading:

======
global tcl_platform
# S scans a signed big-endian 16-bit value: bytes FE FF give -257, bytes FF FE give -2
if {[binary scan $data S bom] == 1} {
    if {$bom == -257} {
        # BOM says big-endian data
        if {$tcl_platform(byteOrder) == "littleEndian"} {
            set data [wordswap [string range $data 2 end]]
        } else {
            set data [string range $data 2 end]
        }
    } elseif {$bom == -2} {
        # BOM says little-endian data
        if {$tcl_platform(byteOrder) == "littleEndian"} {
            set data [string range $data 2 end]
        } else {
            set data [wordswap [string range $data 2 end]]
        }
    } elseif {$tcl_platform(byteOrder) == "littleEndian"} {
        # no BOM: big-endian data is assumed
        set data [wordswap $data]
    }
}
======

----
[Donald Porter]: '''Slightly off-topic note''': The code example above tests for the Tcl version with

======
if {[info tclversion] >= 8.1} ...
======

A better way of testing that is to use:

======
if {[package vcompare [package provide Tcl] 8.1] >= 0} ...
======

That will continue working if Tcl releases are ever labeled with version numbers more than two levels deep, or if/when a minor version number greater than 9 is released (compared as numbers, 8.10 equals 8.1 and is lower than 8.9).

----
[RS]: Sure. I admit yours is The Right Way ;-) -- only it's about twice as long as mine... Maybe I'm pampered, but I've grown to expect it could be done even simpler, so that frequent constructs are nicely wrapped:

======
proc version {"of" pkg op vers} {
    expr [package vcompare [package provide $pkg] $vers] $op 0
}
======

Then we can write this sugar (cf. [Salt and Sugar]):

======
if [version of Tcl >= 8.1] {...
======

<<categories>> Characters | File