Version 9 of Creating an Invalid Tcl_Obj via SQLite

Updated 2020-06-21 08:40:13 by pooryorick
if 0 {

Creating an Invalid Tcl_Obj via SQLite explores the interactions of SQLite and Tcl

Description

According to the SQLite documentation, a blob that is cast to text is "interpreted" as text encoded in the database encoding. What that means is that SQLite assumes the blob value is properly encoded and simply assigns the type text to it. Because it doesn't verify that the value is properly encoded, it is possible to store into the database a value of type "text" that is invalid, i.e. not properly encoded. Later on, SQLite may create an invalid Tcl_Obj by storing the invalid text into the "bytes" field of a Tcl_Obj. The "bytes" field is supposed to store a string encoded in modified utf-8. If the data in "bytes" is not modified utf-8, various Tcl routines fail in a variety of ways. The following example illustrates some of that behaviour.

First, some setup:

}
#! /usr/bin/env tclsh

package require sqlite3

sqlite3 db :memory:

db eval {
        create table t1 (
                vals
        )

}
variable data1 bb 

if 0 {

In a Tcl ByteArray, each byte represents a single Unicode character in the range \u0000 through \u00ff, so $data1b is the single character \ubb

}

variable data1b [binary format H* $data1]

if 0 {

If a value has a ByteArray internal representation but no string representation, SQLite uses the bytes in the array as a blob. In the current case, $data1b has no string representation, so SQLite treats it as a blob, making this cast a no-op.

}
set castasblob1 [db eval {select cast($data1b as blob)}]

if 0 {

When $data1b is cast as text, SQLite makes no effort to verify that $data1b is validly encoded according to the current database encoding. It simply changes the type to text. In the current case, a single byte with the value 0xbb is stored as an (invalid) text value.

}

db eval {insert into t1 values (cast($data1b as text))}

if 0 {

Tcl requires the "bytes" field of a value to be valid modified utf-8, but here SQLite stores the single byte 0xbb into the "bytes" field of $data2. This is not valid utf-8, but in subsequent operations Tcl routines assume that it is.

}
set data2 [db onecolumn {select vals from t1}]

if 0 {

The new value does not compare as equal to the old value because the "bytes" value of \ubb is the valid utf-8 sequence \xc2\xbb, but the "bytes" value that SQLite created is \xbb:

}

puts [list {strings equal?} [expr {$data2 eq $data1b}]]

if 0 {

To further illustrate, the first character of the two strings does not match:

}
set char1 [string index $data1b 0]
set char2 [string index $data2 0]
puts [list {first characters equal?} [expr {$char1 eq $char2}]]

if 0 {

In the next example, Tcl happens to successfully convert 0xbb back to the original character because even though 0xbb isn't valid utf-8, the utf-8 decoder happens to be implemented such that it falls back to interpreting 0xbb as a single-byte value that is a Unicode code point. The "utf-8-decoded" character is then equal to the original character:

}

set char2u [encoding convertfrom utf-8 $char2]
puts [list {utf-8 decoded character equal?} [expr {$char1 eq $char2u}]]

if 0 {

Convert \ubb to its Unicode code point:

}

scan $data1b %c ord1

if 0 {

scan also happens to interpret the byte 0xbb as \ubb, so the ordinal values of the two characters are equal:

}

scan $data2 %c ord2
puts [list {ordinals are equal?} [expr {$ord1 eq $ord2}]]


if 0 {

An attempt to print the invalid value results in the error:

    "error writing stdout: invalid argument"
}

catch {
        puts $data2
} cres
puts [list {error printing value returned by sqlite:} $cres]


if 0 {

$castasblob1, which was created earlier, is equal to $data1b:

}

puts [list {earlier result of casting as blob equals original?} [
    expr {$data1b eq $castasblob1}]]


if 0 {

A side effect of the previous operations is that a string representation has been generated for $data1b, so SQLite no longer treats it as a blob, and casting it to a blob is no longer a no-op. Now, casting $data1b as a blob produces the utf-8 encoding of \ubb.

}
set data3 [db eval {select cast($data1b as blob)}]
puts [list {result of casting as blob equals original?} [expr {$data1b eq $data3}]]
set data1b_utf [encoding convertto utf-8 $data1b]
puts [list {instead, result of casting is the utf-8 encoding of the original value?} [
        expr {$data1b_utf eq $data3}]]

if 0 {

To summarize the findings:

        If a blob that is not valid utf-8 is cast to text and then stored, the
        database now contains text data that is not in the database encoding.

        SQLite uses Tcl_NewStringObj to store this data into the "bytes" field
        of a Tcl_Obj, breaking the rule that the "bytes" field contains a utf-8
        string.

One could say that the client got what the client requested, but it is odd that a Tcl script can produce an invalid Tcl_Obj.

Instead of assuming that the blob bytes are properly formatted, SQLite could perform a conversion during the cast, the way it does with numeric values, converting the bytes to text using the database encoding. Since every single byte value has a valid utf-8 encoding, this conversion would never fail.
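
As a sketch of that proposal (blobtotext is a hypothetical name, and iso8859-1 stands in for a single-byte database encoding; this is not SQLite's actual implementation), decoding each blob byte via the database encoding always yields a valid Tcl string:

}

proc blobtotext blob {
        # Decode the blob bytes using a single-byte database encoding.
        # Every byte value maps to a Unicode code point, so this
        # conversion can never fail.
        encoding convertfrom iso8859-1 $blob
}

set blob3 [binary format H* bb]
set text3 [blobtotext $blob3]
puts [list {decoded blob is the original character?} [expr {$text3 eq "\ubb"}]]

if 0 {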

Page Authors

PYK
}