Version 8 of string bytelength

Updated 2013-10-20 22:58:17 by AMG

string bytelength string

See Also

string
string length

Description

Returns a decimal string giving the number of bytes used to represent string in memory. Because UTF-8 uses one to three bytes to represent Unicode characters, the byte length will not be the same as the character length in general. The cases where a script cares about the byte length are rare. Refer to the Tcl_NumUtfChars manual entry for more details on the UTF-8 representation.
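The difference between the two lengths can be sketched as follows. This is a minimal illustration, assuming Tcl 8.x (where [string bytelength] is available); the variable names are made up for the example, and \u escapes are used so the script does not depend on the encoding of the source file.

```tcl
# Compare character length with the byte length of the internal representation.
# Assumes Tcl 8.x; the variable names are illustrative only.
set ascii  "abc"       ;# three ASCII characters
set accent "\u00e9"    ;# e with acute accent, one character
set euro   "\u20ac"    ;# euro sign, one character

foreach name {ascii accent euro} {
    set s [set $name]
    puts [format "%-6s chars=%d bytes=%d" $name \
        [string length $s] [string bytelength $s]]
}
```

On Tcl 8.6 this reports 3 characters and 3 bytes for abc, 1 character and 2 bytes for the accented e, and 1 character and 3 bytes for the euro sign.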

In almost all cases, you should use [string length] instead, including when determining the length of a Tcl ByteArray object. One use of [string bytelength] appears on the tcom page, where a binary blob is being generated and [string bytelength] is used to get the length of the blob, with the stated aim of not forcing an internal string representation to be generated. That reasoning is backwards: [string length] on a pure ByteArray returns the byte count directly, whereas [string bytelength] must generate the string representation in order to measure it.

Note that (perhaps confusingly) [string bytelength] should not be used with binary data. This command measures how long the UTF-8 representation of a string is in bytes. For binary data you don't want conversion to UTF-8, so you don't want [string bytelength] either. Use [string length] instead.

US: Proof for the sceptical:

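# Build a list of all 256 byte values, then pack them into a binary string.
set cl {}
for {set n 0} {$n < 256} {incr n} {
  lappend cl $n
}
set str [binary format c* $cl]
# One element per byte: prints 256.
puts "len : [string length $str]"
# Length of the generated string representation: prints 385 on Tcl 8.x,
# because byte 0 and bytes 128-255 each take two bytes internally.
puts "blen: [string bytelength $str]"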

DKF: It's not even real UTF-8. It's the length of Tcl's internal encoding which is almost-UTF8 (i.e., it is consistently denormalized in certain ways). The only possible use of string bytelength is answering the question “How much memory is allocated to hold this value's bytes field?”
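One way to see the difference from real UTF-8, and to count bytes in an actual encoding, is sketched below. It assumes Tcl 8.x and relies on the NUL character, which Tcl's internal encoding stores as the two-byte sequence C0 80, whereas real UTF-8 uses a single zero byte. The variable names are illustrative only.

```tcl
# NUL is one character, but Tcl 8.x stores it internally as two bytes (C0 80).
set nul "\u0000"
set internal [string bytelength $nul]

# To count bytes in real UTF-8, encode explicitly and measure the resulting
# binary value with [string length].
set utf8 [string length [encoding convertto utf-8 $nul]]

puts "internal: $internal"
puts "utf-8   : $utf8"
```

On Tcl 8.6 this should print 2 for the internal length and 1 for the UTF-8 length. The same [encoding convertto] plus [string length] pattern works for any encoding name that [encoding names] lists.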

Example

string bytelength abc ;# -> 3

Questions

AMG: "UTF-8 uses one to three bytes to represent Unicode characters." This is true only for the BMP. Characters above U+FFFF take four bytes each in standard UTF-8 (the original definition of UTF-8 allowed sequences of up to six bytes). Does Tcl support such characters yet?