Java UTF Socket Communication

Java supports and encourages socket communication in (a slightly modified form of) UTF-8. To this end, its data streams provide two methods: readUTF() [L1 ] and writeUTF() [L2 ]. Talking from Tcl to a Java process that uses these methods is easy. First, put your socket in binary mode with a command such as the following:

 fconfigure $socket -translation binary

Setting -translation binary is a shortcut: it turns off end-of-line translation (no CRLF conversion at the end of your lines) and sets the channel encoding to binary, so bytes pass through unchanged.

The rest of the code on this page comes from a discussion between DKF and EF in the chatroom. The two procedures below should help you read and write data on a socket connected to a Java process that uses the methods above to communicate with the outside world. Error handling is minimal.

 proc writeJavaUtf {stream string} {
     # Encode first: the length prefix counts bytes, not characters.
     set data [encoding convertto utf-8 $string]
     if {[string length $data] > 0xffff} {
         error "string too long after encoding"
     }
     # S = 16-bit integer in big-endian order, as Java expects.
     set len [binary format S [string length $data]]
     puts -nonewline $stream $len$data
     flush $stream
 }

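 proc readJavaUtf {stream} {
     set len [read $stream 2]
     # binary scan returns the number of conversions performed;
     # anything but 1 means fewer than two bytes arrived (e.g. EOF).
     if {[binary scan $len S length] != 1} {
         error "short read on length prefix"
     }
     # S scans as signed; mask to recover the unsigned 16-bit length.
     set data [read $stream [expr {$length & 0xffff}]]
     return [encoding convertfrom utf-8 $data]
 }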
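
To try the two procedures, it helps to see what the Java peer does. The following is a minimal sketch, not taken from the original discussion: the class name UtfRoundTrip is made up for illustration, and in-memory streams stand in for the streams a real program would get from socket.getOutputStream() and socket.getInputStream().

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical helper class, for illustration only.
class UtfRoundTrip {
    // Writes s with writeUTF() and reads it back with readUTF().
    static String roundTrip(String s) throws IOException {
        // Assumption: a byte array stands in for the socket.
        ByteArrayOutputStream wire = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(wire)) {
            out.writeUTF(s);      // these bytes are what readJavaUtf receives
        }
        try (DataInputStream in = new DataInputStream(
                new ByteArrayInputStream(wire.toByteArray()))) {
            return in.readUTF();  // this is what writeJavaUtf must satisfy
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(roundTrip("gr\u00fc\u00df dich"));
    }
}
```

In a real peer, only the two stream constructors change; the writeUTF() and readUTF() calls stay the same.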

Lars H: For normal UTF-8 communication, one would of course rather use

 fconfigure $socket -encoding utf-8

(hmm... does anyone know what the -translation should be?), but these Java methods prefix each string sent with its length in bytes as a 16-bit unsigned big-endian integer (two octets), which is why there is all that hassle.
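
The two-octet prefix can be seen by dumping what writeUTF() actually produces. This is only a sketch: the class UtfFrame and its methods frame and payloadLength are hypothetical names, and a byte array again stands in for the socket.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical helper class, for illustration only.
class UtfFrame {
    // Returns the exact bytes writeUTF() emits for s.
    static byte[] frame(String s) throws IOException {
        ByteArrayOutputStream wire = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(wire);
        out.writeUTF(s);
        out.flush();
        return wire.toByteArray();
    }

    // Decodes the two-octet prefix: big-endian, unsigned.
    static int payloadLength(byte[] frame) {
        return ((frame[0] & 0xff) << 8) | (frame[1] & 0xff);
    }

    public static void main(String[] args) throws IOException {
        // 'h' takes one byte and U+00E9 takes two, so the prefix is 00 03.
        for (byte b : frame("h\u00e9")) {
            System.out.printf("%02x ", b & 0xff);
        }
        System.out.println();   // prints: 00 03 68 c3 a9
    }
}
```

The prefix counts encoded bytes rather than characters, which is why writeJavaUtf above measures the string only after encoding convertto.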

Would it be worth recording some of the original discussion on this page? It could be interesting to know which alternatives were considered. For example, it says in the docs linked to above that Java uses the same modified encoding of U+0000 as Tcl does internally. Does that mean there should be a

  string map {\x00 \xC0\x80} $data

in the above writeJavaUtf?
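
One way to look into this question from the Java side is to check what writeUTF() emits for U+0000 and what readUTF() accepts. The sketch below uses a made-up class name, ModifiedUtfNul, and in-memory streams instead of a socket. According to the DataInput documentation, readUTF() treats any byte matching 0xxxxxxx as a one-byte character, so it appears that the stock reader takes a plain 0x00 as well as C0 80; a stricter reader on the Java side might not.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical helper class, for illustration only.
class ModifiedUtfNul {
    // Returns the bytes writeUTF() emits for s (prefix included).
    static byte[] encode(String s) throws IOException {
        ByteArrayOutputStream wire = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(wire);
        out.writeUTF(s);
        out.flush();
        return wire.toByteArray();
    }

    // Feeds a complete frame (prefix + payload) to readUTF().
    static String decode(byte[] frame) throws IOException {
        return new DataInputStream(new ByteArrayInputStream(frame)).readUTF();
    }

    public static void main(String[] args) throws IOException {
        // Java writes U+0000 as the two bytes C0 80, never as 00.
        for (byte b : encode("\u0000")) {
            System.out.printf("%02x ", b & 0xff);
        }
        System.out.println();   // prints: 00 02 c0 80

        // A frame with length 1 and a plain 00 payload, as a writer
        // without the string map would send it.
        byte[] plain = {0x00, 0x01, 0x00};
        System.out.println(decode(plain).equals("\u0000"));   // prints: true
    }
}
```

If that holds for the Java peer in use, the string map matters mainly for byte-for-byte compatibility with what Java itself would have written.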