Creating an Invalid Tcl_Obj via SQLite

if 0 {

Creating an Invalid Tcl_Obj via SQLite explores an interaction between SQLite and Tcl

Description

According to the SQLite documentation, a blob that is cast to text is "interpreted" as text encoded in the database encoding. What that means is that SQLite assumes the blob value is properly encoded and simply assigns it the type "text". Because it doesn't verify that the value is properly encoded, it is possible to store into the database an invalid value, i.e., a value of type "text" that is not in the database encoding. Later on, SQLite may create an invalid Tcl_Obj by storing the invalid text into the "bytes" field of a Tcl_Obj. The "bytes" field is supposed to contain a string encoded in modified utf-8; when it doesn't, various Tcl routines fail in a variety of ways. The following example illustrates some of that behaviour.

First, some setup:

}
#! /usr/bin/env tclsh

package require sqlite3

sqlite3 db :memory:

db eval {
        create table t1 (
                vals
        )

}
variable data1 bb 

if 0 {

When a Tcl ByteArray is interpreted as a string, each byte corresponds to a single Unicode code point in the range \u0000 through \u00ff, so $data1b, created below, is the single character \ubb

}

variable data1b [binary format H* $data1]
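
if 0 {

As a quick check of the ByteArray claim above, using a fresh value rather than $data1b itself, so that no string representation is generated for $data1b (that matters later on this page):

}

puts [list {fresh bytearray is one character long?} [
        expr {[string length [binary format H* bb]] == 1}]]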

if 0 {

If a value has a ByteArray internal representation, but no string representation, SQLite uses the bytes in the array as a blob. In the current case, $data1b has no string representation, so SQLite treats it as a blob, making this cast a no-op.

}
set castasblob1 [db eval {select cast($data1b as blob)}]

if 0 {

When $data1b is cast as text, SQLite makes no effort to verify that $data1b is validly encoded according to the current database encoding. It simply changes the type to text. In the current case, a single byte with the value 0xbb is stored as an (invalid) text value.

}

db eval {insert into t1 values (cast($data1b as text))}

if 0 {

Tcl requires the "bytes" field of a value to be valid modified utf-8, but here SQLite stores the single byte 0xbb into the "bytes" field of $data2. This is not valid utf-8, but in subsequent operations Tcl routines assume that it is.

}
set data2 [db onecolumn {select vals from t1}]

if 0 {

The new value does not compare as equal to the old value because the "bytes" value of \ubb is the valid utf-8 sequence \xc2\xbb but the "bytes" value that SQLite created is \xbb:

}

puts [list {strings equal?} [expr {$data2 eq $data1b}]]
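
if 0 {

The byte-level difference can be made visible directly. This is an illustrative aside: encoding the character \ubb as utf-8 yields the two-byte sequence \xc2\xbb, whereas SQLite handed back the single raw byte \xbb:

}

binary scan [encoding convertto utf-8 \ubb] H* hexubb
puts [list {utf-8 bytes of \ubb:} $hexubb]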

if 0 {

To further illustrate, the first character of the two strings does not match:

}
set char1 [string index $data1b 0]
set char2 [string index $data2 0]
puts [list {first characters equal?} [expr {$char1 eq $char2}]]

if 0 {

In the next example, Tcl happens to successfully convert 0xbb back to the original character: even though 0xbb isn't valid utf-8, the utf-8 decoder is implemented such that it falls back to interpreting 0xbb as a single byte whose value is a Unicode code point. The "utf-8-decoded" character is therefore equal to the original character:

}

set char2u [encoding convertfrom utf-8 $char2]
puts [list {utf-8 decoded character equal?} [expr {$char1 eq $char2u}]]

if 0 {

Convert \ubb to its Unicode code point:

}

scan $data1b %c ord1

if 0 {

scan also happens to convert 0xbb to \ubb, so the ordinal values of the two characters are equal:

}

scan $data2 %c ord2
puts [list {ordinals are equal?} [expr {$ord1 eq $ord2}]]


if 0 {

An attempt to print the invalid value results in the error:

    "error writing stdout: invalid argument"
}

catch {
        puts $data2
} cres
puts [list {error printing value returned by sqlite:} $cres]


if 0 {

$castasblob1, which was created earlier, is equal to $data1b

}

puts [list {earlier result of casting as blob equals original?} [
    expr {$data1b eq $castasblob1}]]


if 0 {

A side effect of previous operations is that a string representation has been generated for $data1b, so SQLite no longer treats it as a blob, and casting to a blob is no longer a no-op. Now, casting $data1b as a blob produces the utf-8 encoding of \ubb.

}
set data3 [db eval {select cast($data1b as blob)}]
puts [list {result of casting as blob equals original?} [expr {$data1b eq $data3}]]
set data1b_utf [encoding convertto utf-8 $data1b]
puts [list {instead, result of casting is the utf-8 encoding of the original value?} [
        expr {$data1b_utf eq $data3}]]

if 0 {

To summarize the findings:

	If a blob that is not valid utf-8 is cast to text and then stored, the
	database now contains text data that is not in the database encoding.

        SQLite uses Tcl_NewStringObj to store this data into the "bytes" field
        of a Tcl_Obj, breaking the rule that the "bytes" field contains a utf-8
        string.

One could say that the client got what the client requested, but it is odd that a Tcl script can produce an invalid Tcl_Obj.

If SQLite performed a conversion during the cast, the way it does with numeric values, instead of assuming that the blob bytes are properly formatted, it would convert them to text using the database encoding. Since every single byte value, interpreted as a code point, has a valid utf-8 encoding, this conversion would never fail.
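
That claim can be checked directly: interpreting a byte as a code point and encoding the result as utf-8 always succeeds. This is illustrative only, and is not what SQLite currently does:

}

set cp [format %c 0xbb]
binary scan [encoding convertto utf-8 $cp] H* cphex
puts [list {utf-8 encoding of code point 0xbb:} $cphex]

if 0 {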

Page Authors

PYK
}