emoji with Tcl/Tk 8.6

Description

Description

Emojis are graphical symbols used in digital communication, representing emotions, objects, or concepts. They are encoded in Unicode, a standard for text representation in computing systems. Unicode assigns a unique code point to each emoji. The Basic Multilingual Plane (BMP) is the first group of characters in Unicode, encompassing code points from U+0000 to U+FFFF. Most commonly used emojis are within the BMP; however, newer or more complex emojis, like those with skin tone variations and additional features, are located outside the BMP, requiring more advanced encoding techniques like surrogate pairs in UTF-16 encoding.

Tcl 8.6 and the changes regarding unicode and emoji

Tcl/Tk 8.6.13 Release Announcement: November 22, 2022, Unicode 15
Tcl/Tk 8.6.12 Release Announcement: November 5, 2021, Support for Unicode 14
Tcl/Tk 8.6.11 Release Announcement: December 31, 2020, Support for Unicode 13
Tcl/Tk 8.6.10 Release Announcement: November 21, 2019, Unicode 12.0, Partial support for emoji in Tk text displays (see Tk demo)

Introduction

greg (2024.1.1,2024.01.08) Tried something with Tcl and emoji. I made the procs and examples to understand more about Unicode and emoji. Before using emoji in production code, you should read the tips beforehand.

calculateSurrogatePair and convertSurrogatesToCodePoint

The calculateSurrogatePair procedure in Tcl is designed to calculate surrogate pairs for Unicode code points beyond the Basic Multilingual Plane (BMP). It takes a hexadecimal Unicode code point as input, optionally prefixed with 'U+', '\u', or '\U'. The procedure outputs the corresponding high and low surrogate values in various formats, including strings and lists, depending on the specified output format. This function requires Tcl version 8.6.10 or higher and is useful for handling Unicode characters, like emojis, which are not represented in the BMP.

#! /usr/bin/env tclsh

# emojicalc.tcl
# 2024.01.07

# This procedure requires Tcl 8.6.10 or hight
if {[package vcompare [package provide Tcl] 8.6.10] >= 0} {
 # example input: U+1F600 or 1F600 or \U1F600
 #   but also because it is used: \u1F600 or \U+1F600
 # outputFormat:symbol:0,s u-Format:1,u Numbers:2,n List:5,l {default}
 #   position in list: symbol:0, u-Format:1,Numbers:2
 # return a list or symbol or u-Format or Numbers
 # Check whether the Tcl version is at least 8.6.10
 proc calculateSurrogatePair {codePointHex {outputFormat l}} {
  # Regular expression that optionally recognizes and removes "U+","\u","\U"
  regexp {\\?[Uu]\+?([0-9A-F]+)} $codePointHex -> codePointHex

  # Converts the hexadecimal code point to a number
  scan $codePointHex %x codePoint

  # Subtracts 0x10000 to move the code point into the modifiable area
  set adjustedCodePoint [expr {$codePoint - 0x10000}]

  # Calculates the high surrogate
  set highSurrogate [expr {0xD800 + ($adjustedCodePoint >> 10)}]

  # Calculates the low surrogate
  set lowSurrogate [expr {0xDC00 + ($adjustedCodePoint & 0x3FF)}]

  # Converts the surrogates back to hexadecimal and returns them
  set uSurrogate [format "%04X %04X" $highSurrogate $lowSurrogate]
  set uSymbol "\\u[lindex $uSurrogate 0]\\u[lindex $uSurrogate 1]"
  set Numbers "[lindex $uSurrogate 0] [lindex $uSurrogate 1]"

  #output
  switch -- $outputFormat {
   0 -
   s {
    return {*}$uSymbol
   }
   1 -
   u {
    return $uSymbol
   }
   2 -
   n {
    return $Numbers
   }
   5 -
   l {
    return [list {*}$uSymbol $uSymbol $Numbers]
   }
  }
 }

 # example input: "D83D DE00" or {D83D DE00} or "\\uD83D\\uDE00" or {\uD83D\uDE00}
 # outputFormat: u-Format:1,u Numbers:2,n U+ Format:3,p List:5,l {default}
 #   Symbol:0,s no sense only of completeness
 #   position in list: Symbol:0 u-Format:1,Numbers:2
 # return a list or symbol or u-Format or Numbers or U+ Format
 proc convertSurrogatesToCodePoint {arg {outputFormat l}} {
  # Normalize the input to get uniform surrogate pairs
  set normalizedArg [regsub -all {\\+u|{|}|\s+} $arg ""]

  set highSurrogate [string range $normalizedArg 0 3]
  set lowSurrogate [string range $normalizedArg 4 end]

  # Convert surrogate pairs from hexadecimal to decimal
  scan $highSurrogate %x highValue
  scan $lowSurrogate %x lowValue

  # Subtract the start values of high and low surrogates
  set highValue [expr {$highValue - 0xD800}]
  set lowValue [expr {$lowValue - 0xDC00}]

  # Combine the values and add 0x10000 to get the original code point
  set codePoint [expr {($highValue << 10) + $lowValue + 0x10000}]

  # Convert the code point back to hexadecimal format
  switch -- $outputFormat {
   0 -
   s {
    return {*}[format "\\U%04X" $codePoint]
   }
   1 -
   u {
    return [format "\\U%04X" $codePoint]
   }
   2 -
   n {
    return [format "%04X" $codePoint]
   }
   3 -
   p {
    return [format "U+%04X" $codePoint]
   }
   5 -
   l {
    return [list {*}[format "\\u%04X" $codePoint] [format "\\U%04X" $codePoint] \
    [format "%04X" $codePoint] [format "U+%04X" $codePoint]]
   }
  }

 }

#Example, no symbol under Windows (console)
if {[info script] eq $argv0} {
 puts "Tclversion: [info patchlevel] encoding: [encoding system]\n"
 puts [calculateSurrogatePair U+1F600]
 puts [calculateSurrogatePair U+1F600 l]
 puts [calculateSurrogatePair U+1F600 s]
 puts [lindex [calculateSurrogatePair U+1F600 l] 0]
 puts [calculateSurrogatePair U+1F600 u]
 puts [lindex [calculateSurrogatePair U+1F600 l] 1]
 puts [calculateSurrogatePair U+1F600 n]
 puts [lindex [calculateSurrogatePair U+1F600 l] 2]
 puts [calculateSurrogatePair [convertSurrogatesToCodePoint "\\uD83D\\uDE00" u] ]
 puts \n
 puts [convertSurrogatesToCodePoint [calculateSurrogatePair U+1F600 u]]
 puts [convertSurrogatesToCodePoint [calculateSurrogatePair U+1F600 n]]
 puts [convertSurrogatesToCodePoint "D83D DE00"]
 puts [convertSurrogatesToCodePoint "{D83D DE00}"]
 puts [convertSurrogatesToCodePoint {\uD83D \uDE00}]
 puts [convertSurrogatesToCodePoint "\\uD83D\\uDE00"]
 puts [convertSurrogatesToCodePoint {\\uD83D \\uDE00}]
 puts [convertSurrogatesToCodePoint {\uD83D\uDE00}]
 puts [convertSurrogatesToCodePoint {\uD83D\uDE00} s]
 puts [convertSurrogatesToCodePoint {\uD83D\uDE00} u]
 puts [convertSurrogatesToCodePoint {\uD83D\uDE00} n]
 puts [convertSurrogatesToCodePoint {\uD83D\uDE00} p]
}

} else {
 error "This procedure requires Tcl 8.6.10 or higher"
}

Example 1

No output of emoji on Windows.

#! /usr/bin/env tclsh

# emojiexample1.tcl
# 2024-01-06
# Example 1
# not symbol with Windows 
set scriptDir [file dirname [info script]]
set emojicalc emojicalc.tcl
source [file join $scriptDir $emojicalc]

puts "Tclversion: [info patchlevel] encoding: [encoding system]\n"

set codePointHex "U+1F600"
set surrogatePair [calculateSurrogatePair $codePointHex]

puts "Symbol, Surrogate Pair for $codePointHex: $surrogatePair"
puts "\\uD83D\\uDE00: \uD83D\uDE00"
puts  "surrogatePair: $surrogatePair" 
puts \n

puts "Lines a-f, different types of access"
puts -nonewline "a : " 
puts [lindex $surrogatePair 1]
puts -nonewline "b \": "
puts "[lindex $surrogatePair 1]"
puts "c subst: [subst [lindex $surrogatePair 1]]"
puts "d1 catch: [catch {{*}[lindex $surrogatePair 1]} err]"
puts "d2 catch err: $err"
puts -nonewline "e \{*\}: "
puts  {*}[lindex $surrogatePair 1]
puts  "f \": {*}[lindex $surrogatePair 1]"
puts \n

puts "# outputFormat:symbol:0,s u-Format:1,u Numbers:2,n List:5,l {default}" 
foreach a {0 s 1 u 2 n 5 l} {
puts "$a: [calculateSurrogatePair $codePointHex $a]"
}

puts "#position in List: symbol:0, u-Format:1,Numbers:2"
foreach a {0 1 2} {
puts "$a: [lindex [calculateSurrogatePair $codePointHex] $a]"
}

Example 2

#! /usr/bin/env tclsh

package require Tk
# emojiexample2.tcl
# 2024.01.06
# Example 2

set scriptDir [file dirname [info script]]
set emojicalc emojicalc.tcl
source [file join $scriptDir $emojicalc]

text .text -width 70 -height 20 -yscrollcommand ".scroll set"
scrollbar .scroll -orient verti -command [list .text yview]
grid .text -row 0 -column 0 -sticky nsew
grid .scroll -row 0 -column 1 -sticky ns
grid columnconfigure . 0 -weight 1
grid rowconfigure . 0 -weight 1

#excerpt from the website
#https://wiki.tcl-lang.org/page/emoji
set uList {
  set uc_grinning_face \u1F600
  set uc_grinning_face_with_big_eyes \u1F603
  set uc_grinning_face_with_sweat \u1F605
  set uc_rolling_on_the_floor_laughing \u1F923
  set uc_handbag \u1F45C
  set uc_clutch_bag \u1F45D
  set uc_shopping_bags \u1F6CD
  set uc_backpack \u1F392
  set uc_mans_shoe \u1F45E
  set uc_running_shoe \u1F45F
  set uc_hiking_boot \u1F97E
  set uc_flat_shoe \u1F97F
  set uc_high-heeled_shoe \u1F460
  set uc_womans_sandal \u1F461
  set uc_womans_boot \u1F462
}

proc transformList {uList} {
  set nList {}
  foreach line [split $uList \n] {
    set line [string trim $line]
    if {$line eq ""} { continue }

    # Checks whether the line corresponds to the pattern "set varName \uxxxx", also with leading spaces
    if {[regexp {^\s*set\s+([\w-]+)\s+\\u([0-9A-Fa-f]+)$} $line -> varName unicodeValue]} {
      lappend nList $varName $unicodeValue
    } else {
      puts "Warning: Line does not match the expected pattern: $line"
    }
  }
  return $nList
}

.text insert end  "Tclversion: [info patchlevel] Encoding: [encoding system]\n"

foreach { desc uni} [transformList $uList] {
  .text insert end "[calculateSurrogatePair $uni] $desc \\$uni"
  .text insert end \n
}

Example 3

#! /usr/bin/env tclsh

package require Tk
# emojiexample3.tcl
# 2024.01.06
# # Example 3
# 
set scriptDir [file dirname [info script]]
set emojicalc emojicalc.tcl
source [file join $scriptDir $emojicalc]

text .text -width 80 -height 40 -yscrollcommand ".scroll set"
scrollbar .scroll -orient verti -command [list .text yview]
grid .text -row 0 -column 0 -sticky nsew
grid .scroll -row 0 -column 1 -sticky ns
grid columnconfigure . 0 -weight 1
grid rowconfigure . 0 -weight 1

.text insert end   "Tclversion: [info patchlevel] Encoding: [encoding system]"
.text insert end  \n
.text insert end  \n

set codePointHex "U+1F600"
set surrogatePair [calculateSurrogatePair $codePointHex]

.text insert end  "Symbol, Surrogate Pair for $codePointHex: $surrogatePair"
.text insert end  \n
.text insert end  "\\uD83D\\uDE00: \uD83D\uDE00"
.text insert end  \n
.text insert end   "surrogatePair: $surrogatePair"
.text insert end  \n
.text insert end  \n
.text insert end  "Lines a-f, different types of access"
.text insert end  \n
.text insert end  "a : " 
.text insert end  [lindex $surrogatePair 1]
.text insert end  \n
.text insert end  "b \": " 
.text insert end  "[lindex $surrogatePair 1]"
.text insert end  \n
.text insert end  "c subst: [subst [lindex $surrogatePair 1]]"
.text insert end  \n
.text insert end  "d1 catch: [catch {{*}[lindex $surrogatePair 1]} err]"
.text insert end  \n
.text insert end  "d2 catch err: $err"
.text insert end  \n
.text insert end  "e \{*\}: " 
.text insert end   {*}[lindex $surrogatePair 1]
.text insert end  \n
.text insert end   "f \": {*}[lindex $surrogatePair 1]"
.text insert end  \n
.text insert end  \n

.text insert end  "# outputFormat:symbol:0,s u-Format:1,u Numbers:2,n List:5,l {default}"
.text insert end  \n
foreach a {0 s 1 u 2 n 5 l} {
 .text insert end  "$a: [calculateSurrogatePair $codePointHex $a]"
 .text insert end  \n
}
.text insert end  \n
.text insert end  "#position in List: symbol:0, u-Format:1,Numbers:2"
.text insert end  \n
 foreach a {0 1 2} {
 .text insert end  "$a: [lindex [calculateSurrogatePair $codePointHex] $a]"
 .text insert end  \n
}

discussion

LES on 01-01-2024: None of this works for me. I also tried replacing all the puts statements with insertions into a text widget. I just get garbled text in every case. What I think is very strange is that I can copy and paste just about any Unicode symbol into a Tk text widget. So they are "displayable." Why can't we generate them? greg on 01-01-2024: It works for me under Linux. I had not tested it with Windows. LES I'm on Linux too. greg I have Linux with Tcl 8.6.13. Works with alited, geany and in the terminal directly.

greg on 01-01-2024.Works for me with Linux and with Tcl/Tk 8.6.10, Tcl/Tk 8.6.13 (see also Tcl interprets two adjacent surrogate code points as a character encoded using UTF-16 ). The symbols are not displayed under Windows. There is an error in the code or maybe it does not work with my procedure. If I understood the discussion in the link above correctly, my approach of using surrogate pairs in Tcl is not recommended for programs anyway.

greg on 02-01-2024: Check whether the Tcl version is at least 8.6.10 for the proc. Example3 works for me under Windows and Tcl/Tk 8.6.13

Two additional tcl scripts with procs for displaying the emoji from the data file of Unicode.org

convertToUnicodeEscapes

The convertToUnicodeEscapes procedure in Tcl converts a string of hexadecimal numbers into Unicode escape sequences. It supports various output formats including symbols, Unicode formats, numbers, and lists. This function is essential for processing and representing Unicode characters in Tcl scripts.

#! /usr/bin/env tclsh

# unicodekonv.tcl
# 2024.01.07
# Function to convert a string of hexadecimal numbers into Unicode escape sequences
# outputFormat:symbol:0,s u-Format:1,u Numbers:2,n List:5,l {default}
#  position in list: symbol:0, u-Format:1,Numbers:2
proc convertToUnicodeEscapes {hexString {outputFormat l}} {
    # Remove any whitespace from the input
    set cleanedHexString [regsub -all {\s+} $hexString ""]

    # Replace each group of four hexadecimal digits with a Unicode escape sequence
    regsub -all {([0-9A-Fa-f]{4})} $cleanedHexString {\\u\1} result

    # Return the result based on the value of 's'
    switch -- $outputFormat {
        0 -
        s {
            return {*}$result
        }
        1 -
        u {
            return $result
        }
        5 -
        l {
            return [list {*}$result $result]
        }
    }
}

if {[info script] eq $argv0} {
    # Examples to proc convertToUnicodeEscapes
    set example1 "23F0"
    set example2 "0023 FE0F 20E3"

    set unicode1 [convertToUnicodeEscapes $example1]
    set unicode2 [convertToUnicodeEscapes $example2]
    set unicode2a [lindex  [convertToUnicodeEscapes $example2] 0]
    set unicode3 [convertToUnicodeEscapes $example2 s]
    set unicode4 [convertToUnicodeEscapes $example2 u]
    set unicode5 [convertToUnicodeEscapes $example2 l]

    puts "1: $unicode1"
    puts "2: $unicode2"
    puts "2a: $unicode2a"
    puts "3: $unicode3"
    puts "4: $unicode4"
    puts "5: $unicode5"
}

parseEmojiSequencesFile

The parseEmojiSequencesFile procedure in Tcl parses a data file to create a dictionary of emoji sequences. It reads the file, extracts each line's code points, type, description, and comments, and stores them in a structured format. This function is useful for organizing and accessing detailed emoji information in Tcl applications.

#! /usr/bin/env tclsh

#emojisequences.tcl
# 2024.01.06
# parser data file

#expand 231A..231B to a list  {231A 231B} 
proc expandSeq {seq} {
    # Check if the input is a sequence or a single value
    if {[regexp {(.*)\.\.(.*)} $seq -> startHex endHex]} {
        # Convert the hexadecimal values to decimal values
        set startDec [scan $startHex %x]
        set endDec [scan $endHex %x]

        # Initialize the list
        set list [list]

        # Iterate through the sequence and add each value to the list
        for {set i $startDec} {$i <= $endDec} {incr i} {
            lappend list [format %X $i]
        }
    } else {
        # For a single value, add it directly to the list
        set list [list $seq]
    }

    # Return the list
    return $list
}


# Format:
# code_point(s) ; type_field ; description # comments
proc parseEmojiSequencesFile {filename} {
    if {![file exists $filename]} {
        error "File $filename not found"
    }
    set fileData [open $filename]
    fconfigure $fileData -encoding utf-8
    set emojiDict {}

    while {[gets $fileData line] >= 0} {
        # Skip comment lines and empty lines
        if {[regexp {^#|^\s*$} $line]} continue
        incr linecount ;#count row
        # Break down the line into its components
        set semiSplit [split $line ";"]
        if {[llength $semiSplit] != 3} {
            puts "error in $linecount"
            puts $line
        }
        #set semiNumberSign [split [lindex $semiSplit 2] "#"]
        set codepoints [lindex $semiSplit 0]
        set type [lindex $semiSplit 1]

        set rest [lindex $semiSplit 2]
        set firstHashPos [string first "#" $rest]
        if {$firstHashPos == -1} {
            set description $rest
            set comments ""
        } else {
            set description [string range $rest 0 [expr {$firstHashPos - 1}]]
            set comments [string range $rest [expr {$firstHashPos + 1}] end]
        }
        foreach codepoint [expandSeq $codepoints] {
            # Add the data to the dict
            dict set emojiDict [string trim $codepoint] [list $type $description $comments]
        }
    }
    close $fileData
    #       puts "rows: $linecount dict size: [dict size $emojiDict]"
    return $emojiDict
}

Tcl/Tk script for displaying the emoji from the data file of Unicode.org

#! /usr/bin/env tclsh

#emojiseqexample.tcl
#2024.01.06
package require Tk
#requirements Tcl/Tk 8.6.13
#Tcl/Tk 8.6.13 Release Announcement
#November 22, 2022
#Unicode 15
#
#https://unicode.org/Public/emoji/15.0/
#e.q. download under ../unicode.org
#
# 3 Tcl Scripts with the procs
#emojisequences.tcl
#unicodekonv.tcl
#emojicalc.tcl
#
#datafile emoji-sequences.txt 
#from https://unicode.org/Public/emoji/15.0/
set scriptDir [file dirname [info script]]
set dataDir unicode.org
set emojisequencesTxt emoji-sequences.txt

#source three file with the procs
source [file join $scriptDir emojisequences.tcl]
source [file join $scriptDir unicodekonv.tcl]
source [file join $scriptDir emojicalc.tcl]

#text widget
text .text -width 180 -height 22 -yscrollcommand ".scroll set"
scrollbar .scroll -orient verti -command [list .text yview]
grid .text -row 0 -column 0 -sticky nsew
grid .scroll -row 0 -column 1 -sticky ns
grid columnconfigure . 0 -weight 1
grid rowconfigure . 0 -weight 1

# Read in the file and create the dictionary
set emojiDict [parseEmojiSequencesFile [file join $scriptDir $dataDir $emojisequencesTxt]]

# Example of how to use the dictionary
.text insert end   "Tclversion: [info patchlevel] Encoding: [encoding system]"
.text insert end  \n
.text insert end  \n
set i 0
foreach codepoint [dict keys $emojiDict] {
  incr i
  set type [lindex [dict get $emojiDict $codepoint] 0]
  set description [string trim [lindex [dict get $emojiDict $codepoint] 1]]
  # commentspart:only the symbol from comment
  regexp {\((\S+)\)$}  [lindex [dict get $emojiDict $codepoint] 2] -> commentspart

  # unikonv
  set alluni [list]
  set length [llength $codepoint]
  for {set pos 0} {$pos < $length} { incr pos} {
    if {[string length [lindex $codepoint $pos]] == "4" } {
      set unikonv$pos [lindex $codepoint $pos]
    } else {
      set unikonv$pos [calculateSurrogatePair [lindex $codepoint $pos] s]
    }
    lappend alluni [set unikonv$pos]
  }
  set unikonv [convertToUnicodeEscapes $alluni s]

  # unikonv the symbol created using the procs
  # codepoint, commentspart, description from datafile
  .text insert end "$i Codepoint: $codepoint, unikonv: $unikonv ,Comments: $commentspart, Description: $description "
  .text insert end \n
}