Creating an Invalid Tcl_Obj via SQLite

======
if 0 {
======
'''Creating an Invalid [Tcl_Obj] via [SQLite]''' explores 
the interactions of [SQLite] and [Tcl]



** Description **

According to [SQLite] documentation, a '''blob''' that is '''cast''' to
'''text''' is "interpreted" as text encoded in the database encoding.  What
that means is that SQLite '''assumes''' that the blob value is
properly-encoded, and assigns the type '''text''' to it.  Because it doesn't
verify that the value is properly encoded, it is possible to store into the
database an invalid value of type "text", i.e. a value that is not in the
database encoding.
Later on, [SQLite] may create an invalid [Tcl_Obj] by storing the invalid text
into the "bytes" field of a [Tcl_Obj].  The "bytes" field is supposed to store a
string encoded in modified utf-8.  If the data in "bytes" is not modified
utf-8, various Tcl routines fail in a variety of ways. The following example
illustrates some of that behaviour.


First, some setup:

======
}
#! /usr/bin/env tclsh

package require sqlite3

sqlite3 db :memory:

db eval {
        create table t1 (
                vals
        )

}
variable data1 bb 

if 0 {
======
In a Tcl `[binary%|%ByteArray]`, each byte represents a single Unicode character,
so $data1b is the single character \ubb
======
}

variable data1b [binary format H* $data1]
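As a quick sanity check, here is a pure-Tcl sketch (variable names mirror the script above, but it stands alone) showing that the value produced by `binary format` is a single character whose code point is 0xbb:

```tcl
# Rebuild the one-byte ByteArray exactly as the script above does.
set data1  bb
set data1b [binary format H* $data1]

# One byte in the array means one character in the string,
# and that character's code point is 0xbb.
puts [string length $data1b]            ;# 1
scan $data1b %c code
puts [format 0x%02x $code]              ;# 0xbb
```

Note that asking for the string length already forces generation of a string representation, which is why the sketch is kept separate from the script proper.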

if 0 {
======
If a value has a `ByteArray` internal representation, but no string
representation, SQLite uses the bytes in the array as a '''blob'''.  In the
current case, $data1b has no string representation, so SQLite treats it as a
blob, making this cast a no-op.
======
}
set castasblob1 [db eval {select cast($data1b as blob)}]

if 0 {
======
When $data1b is cast as text, SQLite makes no effort to verify that $data1b
is validly encoded according to the current database encoding.  It simply
changes the type to '''text'''.  In the current case, a single byte with the
value 0xbb is stored as a(n invalid) text value.
======
}

db eval {insert into t1 values (cast($data1b as text))}

if 0 {
======
Tcl requires the "bytes" field of a value to be valid utf-8, but here SQLite
stores the single byte 0xbb into the "bytes" field of $data2.  This is not
valid utf-8, but in subsequent operations Tcl routines assume that it is.
======
}
set data2 [db onecolumn {select vals from t1}]

if 0 {
======
The new value does not compare as equal to the old value because the "bytes"
value of \ubb is the valid utf-8 sequence \xc2\xbb but the "bytes" value 
that SQLite created is \xbb: 
======
}

puts [list {strings equal?} [expr {$data2 eq $data1b}]]
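The mismatch can be seen with plain Tcl, independent of SQLite: the valid utf-8 encoding of \ubb is the two bytes \xc2\xbb, not the lone byte that SQLite stored:

```tcl
# Encode U+00BB as utf-8 and show the resulting bytes in hex.
set valid [encoding convertto utf-8 \ubb]
binary scan $valid H* hex
puts $hex                               ;# c2bb
```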

if 0 {
======
To further illustrate, the first character of the two strings does not match:
======
}
set char1 [string index $data1b 0]
set char2 [string index $data2 0]
puts [list {first characters equal?} [expr {$char1 eq $char2}]]

if 0 {
======
In the next example, Tcl happens to successfully convert 0xbb back to the
original character here because even though 0xbb isn't valid utf-8, the utf-8
decoder happens to be implemented such that it falls back to interpreting 0xbb
as a single-byte value that is a unicode code point.  The "utf-8-decoded"
character is now equal to the original character:
======
}

set char2u [encoding convertfrom utf-8 $char2]
puts [list {utf-8 decoded character equal?} [expr {$char1 eq $char2u}]]
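That fallback can also be triggered directly.  The behaviour is version-dependent: Tcl 8.x decodes the lone byte leniently, while Tcl 9's strict default encoding profile raises an error instead, so this sketch handles both outcomes:

```tcl
# Feed the invalid lone byte 0xbb to the utf-8 decoder.
if {[catch {encoding convertfrom utf-8 "\xbb"} decoded]} {
    # Tcl 9 (strict profile): the invalid byte is rejected.
    puts "decoder rejected the byte: $decoded"
} else {
    # Tcl 8.x: the decoder falls back to code point 0xbb.
    scan $decoded %c code
    puts [format 0x%02x $code]          ;# 0xbb
}
```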

if 0 {
======
Convert \ubb to its unicode code point:
======
}

scan $data1b %c ord2

if 0 {
======
[scan] also happens to convert 0xbb to \ubb, so the ordinal
values of the two characters are equal:
======
}

scan $data2 %c ord1
puts [list {ordinals are equal?} [expr {$ord1 eq $ord2}]]


if 0 {
======
An attempt to print the invalid value results in the error:
    "error writing stdout: invalid argument"
======
}

catch {
        puts $data2
} cres
puts [list {error printing value returned by sqlite:} $cres]


if 0 {
======
$castasblob1, which was created earlier, is equal to $data1b
======
}

puts [list {earlier result of casting as blob equals original?} [
    expr {$data1b eq $castasblob1}]]


if 0 {
======
A side effect of the previous operations is that a string representation has
been generated for $data1b, so SQLite no longer treats it as a blob, and
casting to a blob is no longer a no-op.  Now, casting $data1b as a blob
produces the utf-8 encoding of \ubb.
======
}
set data3 [db eval {select cast($data1b as blob)}]
puts [list {result of casting as blob equals original?} [expr {$data1b eq $data3}]]
set data1b_utf [encoding convertto utf-8 $data1b]
puts [list {instead, result of casting is the utf-8 encoding of the original value?} [
        expr {$data1b_utf eq $data3}]]

if 0 {
======
To summarize the findings:

        If a blob that is not valid utf is cast to text and then stored, the
        database now contains text data that is not in the database encoding.

        SQLite uses Tcl_NewStringObj to store this data into the "bytes" field
        of a Tcl_Obj, breaking the rule that the "bytes" field contains a utf-8
        string.

One could say that the client got what the client requested, but it is odd
that a Tcl script can produce an invalid Tcl_Obj.

If SQLite performed a conversion during the cast, the way it does with numeric
values, instead of assuming that the blob bytes are properly formatted, it
would convert them to text using the database encoding.  Since every single
byte has a valid utf-8 encoding, that conversion would never fail.
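On one reading of that suggestion, a never-failing conversion maps each blob byte to the code point with the same value, which is exactly the iso8859-1 decoding.  A minimal sketch (`blob2text` is a hypothetical helper, not anything SQLite provides):

```tcl
# Hypothetical blob-to-text cast: interpret each blob byte as the
# code point with the same value.  iso8859-1 maps byte N to code
# point N one-to-one, so this succeeds for every possible byte.
proc blob2text {blob} {
    encoding convertfrom iso8859-1 $blob
}

set text [blob2text "\xbb"]
scan $text %c code
puts [format 0x%02x $code]              ;# 0xbb
```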



** Page Authors **

   [PYK]:   

======
}
======