======
if 0 {
======
'''Creating an Invalid [Tcl_Obj] via [SQLite]''' explores
the interaction between [SQLite] and [Tcl].
** Description **
According to the [SQLite] documentation, a '''blob''' that is '''cast''' to
'''text''' is "interpreted" as text encoded in the database encoding. What
that means is that SQLite '''assumes''' that the '''blob''' value is
properly encoded, and simply assigns the type '''text''' to it. Because it
does not verify that the value really is properly encoded, it is possible to
store into the database a value of type "text" that is invalid, i.e. not
properly encoded. Later on, [SQLite] may create an invalid [Tcl_Obj] by
storing the invalid text into the "bytes" field of a [Tcl_Obj]. The "bytes"
field is supposed to contain a string encoded in modified utf-8, and if it
does not, various Tcl routines fail in a variety of ways. The following
example illustrates some of that behaviour.
First, some setup:
======
}
#! /usr/bin/env tclsh
package require sqlite3
sqlite3 db :memory:
db eval {
    create table t1 (
        vals
    )
}
variable data1 bb
if 0 {
======
In a Tcl [binary%|%ByteArray], each byte corresponds to the Unicode character
with the same code point, so $data1b is the single character \ubb:
======
}
variable data1b [binary format H* $data1]
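# A quick check of the mapping described above, done on a separate throwaway
# value so that $data1b itself keeps its pure ByteArray representation (no
# string representation) for the steps below: the single byte 0xbb reads back
# as the single character \ubb.
set data1check [binary format H* $data1]
puts [list {byte 0xbb reads back as \ubb?} [expr {$data1check eq "\ubb"}]]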
if 0 {
======
If a value has a `ByteArray` internal representation, but no string
representation, SQLite uses the bytes in the array as a '''blob'''. In the
current case, $data1b has no string representation, so SQLite treats it as a
blob, making this cast a no-op.
======
}
set castasblob1 [db eval {select cast($data1b as blob)}]
if 0 {
======
When $data1b is cast as text, SQLite makes no effort to verify that $data1b
is validly encoded according to the current database encoding. It simply
changes the type to '''text'''. In the current case, a single byte with the
value 0xbb is stored as an (invalid) text value.
======
}
db eval {insert into t1 values (cast($data1b as text))}
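# A peek at what was actually stored: hex() renders the bytes of the stored
# value, and shows that the "text" row holds the single byte 0xbb rather than
# the two-byte utf-8 sequence c2 bb.
puts [list {hex of stored value} [db onecolumn {select lower(hex(vals)) from t1}]]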
if 0 {
======
Tcl requires the "bytes" field of a value to be valid utf-8, but here SQLite
stores the single byte 0xbb into the "bytes" field of $data2. This is not
valid utf-8, but in subsequent operations Tcl routines assume that it is.
======
}
set data2 [db onecolumn {select vals from t1}]
if 0 {
======
The new value does not compare as equal to the old value because the "bytes"
representation of \ubb is the valid utf-8 sequence \xc2\xbb, but the "bytes"
value that SQLite created is \xbb:
======
}
puts [list {strings equal?} [expr {$data2 eq $data1b}]]
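# For reference, the valid utf-8 encoding of \ubb is the two-byte sequence
# c2 bb; the lone byte bb that SQLite placed in the "bytes" field cannot be
# observed directly from script level.
puts [list {utf-8 bytes of original character} [
    binary encode hex [encoding convertto utf-8 $data1b]]]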
if 0 {
======
To further illustrate, the first character of the two strings does not match:
======
}
set char1 [string index $data1b 0]
set char2 [string index $data2 0]
puts [list {first characters equal?} [expr {$char1 eq $char2}]]
if 0 {
======
In the next example, Tcl happens to successfully convert 0xbb back to the
original character because even though 0xbb isn't valid utf-8, the utf-8
decoder happens to be implemented such that it falls back to interpreting
0xbb as a single-byte value that is a Unicode code point. The "utf-8-decoded"
character is now equal to the original character:
======
}
set char2u [encoding convertfrom utf-8 $char2]
puts [list {utf-8 decoded character equal?} [expr {$char1 eq $char2u}]]
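# The same fallback can be seen directly: in this Tcl build, decoding the
# lone (invalid) byte 0xbb as utf-8 yields \ubb instead of raising an error.
puts [list {lone byte 0xbb decodes to \ubb?} [
    expr {[encoding convertfrom utf-8 [binary format H* bb]] eq "\ubb"}]]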
if 0 {
======
Convert \ubb to its Unicode code point:
======
}
scan $data1b %c ord1
if 0 {
======
[scan] also happens to convert the byte 0xbb to \ubb, so the ordinal
values of the two characters are equal:
======
}
scan $data2 %c ord2
puts [list {ordinals are equal?} [expr {$ord1 eq $ord2}]]
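# For reference, the shared code point, shown in hex:
puts [list {shared code point} [format U+%04X $ord1]]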
if 0 {
======
An attempt to print the invalid value results in the error:
"error writing stdout: invalid argument"
======
}
catch {
    puts $data2
} cres
puts [list {error printing value returned by sqlite:} $cres]
if 0 {
======
$castasblob1, which was created earlier, is equal to $data1b:
======
}
puts [list {earlier result of casting as blob equals original?} [
expr {$data1b eq $castasblob1}]]
if 0 {
======
A side effect of previous operations is that a string representation has been
generated for $data1b, so SQLite no longer treats it as a blob, and casting
to a blob is no longer a no-op. Now, casting $data1b as a blob produces the
utf-8 encoding of \ubb.
======
}
set data3 [db eval {select cast($data1b as blob)}]
puts [list {result of casting as blob equals original?} [expr {$data1b eq $data3}]]
set data1b_utf [encoding convertto utf-8 $data1b]
puts [list {instead, result of casting is the utf-8 encoding of the original value?} [
expr {$data1b_utf eq $data3}]]
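# The blob produced by the cast now contains the two utf-8 bytes of \ubb:
puts [list {hex of blob produced by cast} [binary encode hex $data3]]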
if 0 {
======
To summarize the findings:

If a blob that is not valid utf-8 is cast to text and then stored, the
database ends up containing text data that is not in the database encoding.
SQLite then uses Tcl_NewStringObj to store this data into the "bytes" field
of a Tcl_Obj, breaking the rule that the "bytes" field contains a valid
utf-8 string.

One could say that the client got what the client requested, but it is odd
that a Tcl script can produce an invalid Tcl_Obj.

If SQLite performed a conversion during the cast, the way it does with
numeric values, instead of assuming that the blob bytes are properly
encoded, it would convert them to text using the database encoding. Since
every single byte value has a valid utf-8 encoding, such a conversion could
never fail.
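A script-level alternative, shown here only as a hedged sketch: decode the
blob bytes explicitly before storing them, so that the value handed to SQLite
is already a proper Tcl string. Here the bytes are assumed to be iso8859-1,
where every byte maps to the character with the same code point, so the
decoding can never fail:
======
}
set safe [encoding convertfrom iso8859-1 [binary format H* $data1]]
db eval {insert into t1 values ($safe)}
set back [db onecolumn {select vals from t1 order by rowid desc limit 1}]
puts [list {explicitly decoded value round-trips intact?} [expr {$back eq $safe}]]
if 0 {
======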
** Page Authors **
[PYK]:
======
}
======