Creating an Invalid Tcl_Obj via SQLite

======
if 0 {
======
'''Creating an Invalid [Tcl_Obj] via [SQLite]''' explores 
the interactions of [SQLite] and [Tcl]



** Description **
According to [SQLite] documentation, a '''blob''' that is '''cast''' to
'''text''' is "interpreted" as text encoded in the database encoding.  What
that means is that SQLite '''assumes''' that the '''blob''' value is
properly encoded, and assigns the type '''text''' to it.  Because it doesn't
verify that the value is properly encoded, it is possible to store into the
database a value of type "text" that is invalid, i.e. not properly encoded.
Later on, [SQLite] may create an invalid [Tcl_Obj] by storing the invalid text
into the "bytes" field of a [Tcl_Obj].  The "bytes" field is supposed to store a
string encoded in modified utf-8.  If the data in "bytes" is not modified
utf-8, various Tcl routines fail in a variety of ways. The following example
illustrates some of that behaviour.


First, some setup:

======
}
#! /usr/bin/env tclsh

package require sqlite3

sqlite3 db :memory:

db eval {
        create table t1 (
                vals
        )

}
variable data1 bb 

if 0 {
======
In a Tcl `[binary%|%ByteArray]`, each byte corresponds to the Unicode character
with the same code point, so $data1b, created below, is the single character \ubb:
======
}

variable data1b [binary format H* $data1]
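
if 0 {
======
An aside that is not part of the original demonstration: Tcl 8.6 and later
provide the introspection command `::tcl::unsupported::representation`, which
describes the internal representation of a value without generating a string
representation for it.  If it is available, it can confirm that $data1b is a
pure bytearray with no string representation:
======
}

catch {
        puts [list {internal representation of the value:} [
                ::tcl::unsupported::representation $data1b]]
}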

if 0 {
======
If a value has a `ByteArray` internal representation, but no string
representation, SQLite uses the bytes in the array as a '''blob'''.  In the
current case, $data1b has no string representation, so SQLite treats it as a
blob, making this cast a no-op.
======
}
set castasblob1 [db eval {select cast($data1b as blob)}]
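
if 0 {
======
A small added check: the blob that came back can be shown as hex (using
`binary encode`, available in Tcl 8.6 and later) without touching $data1b, so
$data1b still has no string representation.  Its single byte is the original
0xbb:
======
}

puts [list {cast-as-blob result, as hex:} [binary encode hex $castasblob1]]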

if 0 {
======
When $data1b is cast as text, SQLite makes no effort to verify that $data1b
is validly encoded according to the current database encoding.  It simply
changes the type to '''text'''.  In the current case, a single byte with the
value 0xbb is stored as an (invalid) text value.
======
}

db eval {insert into t1 values (cast($data1b as text))}
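
if 0 {
======
Another added check: asking SQLite itself, via typeof() and hex(), what was
stored shows that the type is now '''text''' even though the content is still
the single raw byte 0xbb:
======
}

puts [list {stored value as seen by sqlite (typeof, hex):} [
        db eval {select typeof(vals), hex(vals) from t1}]]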

if 0 {
======
Tcl requires the "bytes" field of a value to be valid utf-8, but here SQLite
stores the single byte 0xbb into the "bytes" field of $data2.  This is not
valid utf-8, but in subsequent operations Tcl routines assume that it is.
======
}
set data2 [db onecolumn {select vals from t1}]

if 0 {
======
The new value does not compare as equal to the old value because the "bytes"
value of \ubb is the valid utf-8 sequence \xc2\xbb but the "bytes" value 
that SQLite created is \xbb: 
======
}

puts [list {strings equal?} [expr {$data2 eq $data1b}]]
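
if 0 {
======
To back up the preceding claim without touching the suspect value: the valid
utf-8 encoding of the character \ubb is the two-byte sequence \xc2\xbb:
======
}

puts [list {utf-8 encoding of \ubb, as hex:} [
        binary encode hex [encoding convertto utf-8 \u00bb]]]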

if 0 {
======
To further illustrate, the first character of the two strings does not match:
======
}
set char1 [string index $data1b 0]
set char2 [string index $data2 0]
puts [list {first characters equal?} [expr {$char1 eq $char2}]]

if 0 {
======
In the next example, Tcl happens to successfully convert 0xbb back to the
original character because even though 0xbb isn't valid utf-8, the utf-8
decoder happens to be implemented such that it falls back to interpreting 0xbb
as the character with that code point.  The "utf-8-decoded" character is now
equal to the original character:
======
}

set char2u [encoding convertfrom utf-8 $char2]
puts [list {utf-8 decoded character equal?} [expr {$char1 eq $char2u}]]

if 0 {
======
Convert \ubb to its unicode code point:
======
}

scan $data1b %c ord2
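
if 0 {
======
For reference, an added line that prints that code point in hexadecimal form:
======
}

puts [list {code point of the original character:} [format U+%04X $ord2]]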

if 0 {
======
[scan] also happens to convert 0xbb to \ubb, so the ordinal
values of the two characters are equal:
======
}

scan $data2 %c ord1
puts [list {ordinals are equal?} [expr {$ord1 eq $ord2}]]


if 0 {
======
An attempt to print the invalid value results in the error:
    "error writing stdout: invalid argument"
======
}

catch {
        puts $data2
} cres
puts [list {error printing value returned by sqlite:} $cres]
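
if 0 {
======
For contrast, an added check: printing the properly-encoded original value
succeeds (assuming the channel encoding can represent the character), so the
failure above is a property of the invalid value rather than of [puts] or of
the channel:
======
}

puts [list {printing the original value:} $data1b]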


if 0 {
======
$castasblob1, which was created earlier, is equal to $data1b
======
}

puts [list {earlier result of casting as blob equals original?} [
    expr {$data1b eq $castasblob1}]]


if 0 {
======
A side effect of previous operations is that a string representation has been
generated for $data1b, so sqlite no longer treats it as a blob, and casting
to a blob is no longer a no-op.  Now, casting $data1b as a blob produces the
utf-8 encoding of \ubb.
======
}
set data3 [db eval {select cast($data1b as blob)}]
puts [list {result of casting as blob equals original?} [expr {$data1b eq $data3}]]
set data1b_utf [encoding convertto utf-8 $data1b]
puts [list {instead, result of casting is the utf-8 encoding of the original value?} [
        expr {$data1b_utf eq $data3}]]

if 0 {
======
To summarize the findings:

        If a blob that is not valid utf-8 is cast to text and then stored, the
        database now contains text data that is not in the database encoding.

        SQLite uses Tcl_NewStringObj to store this data into the "bytes" field
        of a Tcl_Obj, breaking the rule that the "bytes" field contains a utf-8
        string.

One could say that the client got what the client requested, but it is odd
that a Tcl script can produce an invalid Tcl_Obj.

If SQLite performed a conversion during the cast, the way it does with numeric
values, then instead of assuming that the blob bytes are properly formatted it
would convert them to text using the database encoding.  Since every single
byte value has a valid utf-8 encoding, this conversion would never fail.
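
A minimal sketch of that idea in Tcl terms, assuming a utf-8 database encoding
and using throwaway variable names: interpret each blob byte as the character
with the same code point, which is what the iso8859-1 encoding does; the
result always has a valid utf-8 encoding:
======
}

set rawblob [binary format H* bb]
set astext [encoding convertfrom iso8859-1 $rawblob]
puts [list {utf-8 encoding of the converted blob, as hex:} [
        binary encode hex [encoding convertto utf-8 $astext]]]

if 0 {
======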



** Page Authors **

   [PYK]:   

======
}
======