====== if 0 { ======

'''Creating an Invalid [Tcl_Obj] via [SQLite]''' explores the interactions of [SQLite] and [Tcl].

** Description **

According to [SQLite] documentation, a '''blob''' that is '''cast''' to '''text''' is "interpreted" as text encoded in the database encoding.  What that means is that SQLite '''assumes''' that the '''blob''' value is properly encoded and assigns the type '''text''' to it.  Because it doesn't verify that the value is properly encoded, it is possible to store into the database an invalid value of type "text", i.e. one that is not properly encoded.  Later on, [SQLite] may create an invalid [Tcl_Obj] by storing the invalid text into the "bytes" field of a [Tcl_Obj].  The "bytes" field is supposed to store a string encoded in modified utf-8.  If the data in "bytes" is not modified utf-8, various Tcl routines fail in a variety of ways.  The following example illustrates some of that behaviour.

First, some setup:

====== }
#! /usr/bin/env tclsh
package require sqlite3
sqlite3 db :memory:
db eval {
    create table t1 (
        vals
    )
}
variable data1 bb
if 0 { ======

In a Tcl [binary%|%ByteArray], each byte represents a single Unicode character, so $data1b is the single character \ubb:

====== }
variable data1b [binary format H* $data1]
if 0 { ======

If a value has a [binary%|%ByteArray] internal representation but no string representation, SQLite uses the bytes in the array as a '''blob'''.  In the current case, $data1b has no string representation, so SQLite treats it as a blob, making this cast a no-op:

====== }
set castasblob1 [db eval {select cast($data1b as blob)}]
if 0 { ======

When $data1b is cast as text, SQLite makes no effort to verify that $data1b is validly encoded according to the current database encoding.  It simply changes the type to '''text'''.  In the current case, a single byte with the value 0xbb is stored as an (invalid) text value.
====== }
db eval {insert into t1 values (cast($data1b as text))}
if 0 { ======

Tcl requires the "bytes" field of a value to be valid modified utf-8, but here SQLite stores the single byte 0xbb to the "bytes" field of $data2.  This is not valid utf-8, but in subsequent operations Tcl routines assume that it is:

====== }
set data2 [db onecolumn {select vals from t1}]
if 0 { ======

The new value does not compare as equal to the old value because the "bytes" value of \ubb is the valid utf-8 sequence \xc2\xbb, but the "bytes" value that SQLite created is \xbb:

====== }
puts [list {strings equal?} [expr {$data2 eq $data1b}]]
if 0 { ======

To further illustrate, the first character of the two strings does not match:

====== }
set char1 [string index $data1b 0]
set char2 [string index $data2 0]
puts [list {first characters equal?} [expr {$char1 eq $char2}]]
if 0 { ======

In the next example, Tcl happens to successfully convert 0xbb back to the original character because even though 0xbb isn't valid utf-8, the utf-8 decoder happens to be implemented such that it falls back to interpreting 0xbb as a single-byte value that is a Unicode code point.
The "utf-8-decoded" character is now equal to the original character:

====== }
set char2u [encoding convertfrom utf-8 $char2]
puts [list {utf-8 decoded character equal?} [expr {$char1 eq $char2u}]]
if 0 { ======

Convert \ubb to its Unicode code point:

====== }
scan $data1b %c ord1
if 0 { ======

[scan] also happens to convert 0xbb to \ubb, so the ordinal values of the two characters are equal:

====== }
scan $data2 %c ord2
puts [list {ordinals are equal?} [expr {$ord1 eq $ord2}]]
if 0 { ======

An attempt to print the invalid value results in the error, "error writing stdout: invalid argument":

====== }
catch {
    puts $data2
} cres
puts [list {error printing value returned by sqlite:} $cres]
if 0 { ======

$castasblob1, which was created earlier, is equal to $data1b:

====== }
puts [list {earlier result of casting as blob equals original?} [
    expr {$data1b eq $castasblob1}]]
if 0 { ======

A side effect of the previous operations is that a string representation has been generated for $data1b, so SQLite no longer treats it as a blob, and casting it to a blob is no longer a no-op.  Now, casting $data1b as a blob produces the utf-8 encoding of \ubb:

====== }
set data3 [db eval {select cast($data1b as blob)}]
puts [list {result of casting as blob equals original?} [expr {$data1b eq $data3}]]
set data1b_utf [encoding convertto utf-8 $data1b]
puts [list {instead, result of casting is the utf-8 encoding of the original value?} [
    expr {$data1b_utf eq $data3}]]
if 0 { ======

To summarize the findings:  If a blob that is not valid utf-8 is cast to text and then stored, the database then contains text data that is not in the database encoding.  SQLite uses Tcl_NewStringObj to store this data into the "bytes" field of a Tcl_Obj, breaking the rule that the "bytes" field contains a utf-8 string.  One could say that the client got what the client requested, but it is odd that a Tcl script can produce an invalid Tcl_Obj.
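The rule in question can be sketched at the script level (this snippet is illustrative and not part of the example above):  any string that Tcl itself constructs survives a round trip through its utf-8 encoding, which is exactly the property the value returned by SQLite lacks:

====== }
# Illustrative sketch: a valid Tcl string round-trips through utf-8.
set roundtrip [encoding convertfrom utf-8 [encoding convertto utf-8 \ubb]]
puts [list {utf-8 round trip preserves a valid Tcl string?} [expr {$roundtrip eq "\ubb"}]]
if 0 { ======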
If SQLite performed a conversion during the cast, the way it does with numeric values, instead of assuming that the blob bytes are properly encoded, it could convert them to text using the database encoding.  Since every single byte has a valid utf-8 encoding, this conversion would never fail.

** Page Authors **

   [PYK]:

====== }
======
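====== if 0 { ======

** Appendix **

A minimal sketch of the conversion proposed above, assuming the database encoding is utf-8 as in the example; the variable names here are illustrative and not part of the example.  Under the lenient decoder described earlier, decoding the blob bytes always yields some string, and re-encoding that string produces valid utf-8:

====== }
# Illustrative: decode the blob bytes, then re-encode as valid utf-8.
set blobbytes [binary format H* bb]
set text [encoding convertfrom utf-8 $blobbytes]
puts [list {decoded to a single character?} [expr {[string length $text] == 1}]]
puts [list {re-encoded as two utf-8 bytes?} [expr {
    [string length [encoding convertto utf-8 $text]] == 2}]]
if 0 { ======

With such a conversion, the "bytes" field handed to Tcl would always contain valid modified utf-8.

====== }
======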