Version 11 of Interning a String

Updated 2006-09-21 17:17:11 by AK

>Strick> In some languages ( e.g. [L1 ] ) you can "intern" a string, which gives you one unique "interned instance" of the string with that value.

Here's a proc to do it in Tcl:

 proc intern s {
        if { ! [info exists ::INTERN($s)] } { set ::INTERN($s) $s }
        set ::INTERN($s)
 }

I used Tcl a dozen years before needing this, but today i wrote a program that used many of the same long strings (file pathnames) in some Tcl lists.

There's two different reasons you might want to intern a string:

  • To make string comparisons as fast as pointer comparisons
  • To save memory, when strings with the same value are used many times

Most often you intern strings for the first reason, but in Tcl that can't work: there are no separate "value" and "address" comparisons. Today I interned for the second reason.

MS remarks that Tcl has a shared pool of literal strings. Any code within proc bodies reuses literals from the shared pool. In a sense, all literal strings are already "interned", and this mechanism is mainly useful for strings that are generated at runtime.

Lars H remarks that all direct copies of a string are shared automatically (it's the same Tcl_Obj). For example, in the result of

  proc lineindex {filename} {
     set res {}
     set F [open $filename r]
     while {![eof $F]} {
        lappend res [list $filename [tell $F]]
        gets $F ; # Skip ahead one line
     }
     close $F
     return $res
  }

there is one list element for every line of the file, but all of them share the Tcl_Obj for the $filename.

NEM has interned strings for the first reason (fast comparisons). In Tcl, I'd do that like this:

 set INTERN_COUNTER 0
 proc intern str {
     if {![info exists ::INTERN($str)]} {
         set ::INTERN($str) [incr ::INTERN_COUNTER]
     }
     return $::INTERN($str)
 }

That returns a unique integer for each string and you can then do integer tests for equality. Interned strings are used often for programming language interpreters to represent symbols/ atoms/variables names to optimise for equality checks and hash-table lookups.


A quick look at tclExecute.c shows that the byte code interpreter in fact contains the trivial optimization of string compare for comparing an instance to itself - JBR

 "tclExecute.c" line 2457 of 6089 
       if (valuePtr == value2Ptr) {
            /*
               In the pure equality case, set lengths too for
               the checks below (or we could goto beyond it).
             */
            iResult = s1len = s2len = 0;
        } else if ((valuePtr->typePtr == &tclByteArrayType)

Interning a string using method 1 above optimized string compare and memory usage.


AK I have done Tcl-level interning of strings as a form of compression. Pathnames again, several thousand times the same ones. Thing versions of the same file in a version control system, for example.


[ Category String Processing | Category Performance ]