Version 4 of Unicode file reader

Updated 2004-12-21 17:30:39 by suchenwi

Richard Suchenwirth 1999-07-23 - In order to auto-detect and read files (e.g. Tcl sources with Hebrew literals) in 16-bit Unicode representation, I wrote the following:

  proc file:uread {fn} {
    set encoding ""
    set f [open $fn r]
    if {[info tclversion]>=8.1} {
        gets $f line
        if {[regexp \xFE\xFF $line]||[regexp \xFF\xFE $line]} {
            fconfigure $f -encoding unicode
            set encoding unicode
        }
        seek $f 0 start ;# rewind -- real reading is still to come
    }
    set text [read $f [file size $fn]]
    close $f
    if {$encoding=="unicode"} {
        regsub -all "\uFEFF|\uFFFE" $text "" text
    }
    return $text
  }

Works on both ASCII and Unicode files, though not on byte-swapped ones: the swapped mark FFFE is recognized and stripped by the code, but the bytes themselves are not yet swapped ;-( See also: Unicode and UTF-8


Frank Pilhofer contributed the following swapper in the comp.lang.tcl newsgroup. It operates on string data, which might be a whole Unicode file:

Fortunately, swapping is pretty easy in Tcl, at least in LOC:

    private method wordswap {data} {
        binary scan $data s* elements
        return [binary format S* $elements]
    }

So I'm now using the following code for reading:

    global tcl_platform
    if {[binary scan $data S bom] == 1} {
        if {$bom == -257} {
            if {$tcl_platform(byteOrder) == "littleEndian"} {
                set data [wordswap [string range $data 2 end]]
            } else {
                set data [string range $data 2 end]
            }
        } elseif {$bom == -2} {
            if {$tcl_platform(byteOrder) == "littleEndian"} {
                set data [string range $data 2 end]
            } else {
                set data [wordswap [string range $data 2 end]]
            }
        } elseif {$tcl_platform(byteOrder) == "littleEndian"} {
            set data [wordswap $data]
        }
    }
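The fragment above starts from a variable data and stops before the bytes are turned into a Tcl string. As a rough sketch of how the pieces might fit together, the following reads the file in binary mode, applies the same BOM logic, and decodes the result. The name readUnicode16 is made up for this illustration, and wordswap is written as a plain proc rather than an [incr Tcl] method. It assumes a Tcl 8.x interpreter whose unicode encoding expects 16-bit units in the platform's byte order, and, like the code above, it assumes that a file without a BOM is big-endian.

```tcl
# Plain-proc version of the swapper shown above.
proc wordswap {data} {
    binary scan $data s* elements       ;# read 16-bit units as little-endian
    return [binary format S* $elements] ;# write them back as big-endian
}

# Hypothetical helper (name not from the original): read a 16-bit
# Unicode file and return its contents as a Tcl string.
proc readUnicode16 {fn} {
    global tcl_platform
    set f [open $fn r]
    fconfigure $f -translation binary   ;# raw bytes, no decoding yet
    set data [read $f]
    close $f

    # Assumption carried over from the code above: no BOM means big-endian.
    set fileOrder bigEndian
    if {[binary scan $data S bom] == 1} {
        set bom [expr {$bom & 0xFFFF}]  ;# make the signed value unsigned
        if {$bom == 0xFEFF} {
            set data [string range $data 2 end]
        } elseif {$bom == 0xFFFE} {
            set fileOrder littleEndian
            set data [string range $data 2 end]
        }
    }
    # Swap only when the file's byte order differs from the platform's.
    if {$fileOrder != $tcl_platform(byteOrder)} {
        set data [wordswap $data]
    }
    # Assumes the "unicode" encoding is available (Tcl 8.x).
    return [encoding convertfrom unicode $data]
}
```

Working out the file's byte order first and comparing it with tcl_platform(byteOrder) once gives the same decisions as the nested if above, with a single call to wordswap.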

Slightly off-topic note: The code example above tests for the Tcl version with

  if {[info tclversion] >= 8.1} ...

A better way of testing that is to use:

  if {[package vcompare [package provide Tcl] 8.1] >= 0} ...

That will continue to work if Tcl releases are ever labeled with version numbers more than two levels deep, or if a minor version greater than 9 (such as 8.10) is ever released.
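To make the difference concrete, here is a small illustration using the made-up version number 8.10, which does not refer to any actual release. A numeric comparison reads the version as a floating-point number, while package vcompare compares it component by component:

```tcl
# expr treats both operands as floating-point numbers, so 8.10 is just 8.1
puts [expr {8.10 >= 8.9}]          ;# prints 0

# package vcompare compares the minor versions 10 and 9 as integers
puts [package vcompare 8.10 8.9]   ;# prints 1

# version numbers with more than two levels are handled as well
puts [package vcompare 8.1.1 8.1]  ;# prints 1
```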

Donald Porter


Sure. I admit yours is The Right Way ;-) -- only it's about twice as long as mine... Maybe I'm pampered, but I've grown to expect that it could be done even more simply, so that frequent constructs are nicely wrapped:

   proc version {"of" pkg op vers} {
        expr [package vcompare [package provide $pkg] $vers] $op 0
   }

Then we can write this sugar: (cf Salt and Sugar)

   if [version of Tcl >= 8.1] {...

RS
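Because the operator has to be substituted into the expression, the expr in version cannot be braced, so whatever is passed as op ends up being evaluated. One possible variant, offered only as a sketch, checks op against a fixed list of comparison operators first. The list and the error message are choices made for this example, not part of the original proc:

```tcl
proc version {of pkg op vers} {
    # Accept only plain comparison operators, since $op goes into an
    # unbraced expr below (the list is an assumption for this sketch).
    if {[lsearch -exact {< <= == != >= >} $op] < 0} {
        error "unsupported comparison operator \"$op\""
    }
    expr [package vcompare [package provide $pkg] $vers] $op 0
}

if {[version of Tcl >= 8.1]} {
    puts "Tcl 8.1 or later"
}
```

The calling syntax stays the same, so the sugar still reads as before.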


See also Unicode and UTF-8


Category Characters