emoji with Tcl/Tk 8.6
Emojis are graphical symbols used in digital communication, representing emotions, objects, or concepts. They are encoded in Unicode, a standard for text representation in computing systems. Unicode assigns a unique code point to each emoji. The Basic Multilingual Plane (BMP) is the first group of characters in Unicode, encompassing code points from U+0000 to U+FFFF. Most commonly used emojis are within the BMP; however, newer or more complex emojis, like those with skin tone variations and additional features, are located outside the BMP, requiring more advanced encoding techniques like surrogate pairs in UTF-16 encoding.
greg (2024.1.1,2024.01.08) Tried something with Tcl and emoji. I made the procs and examples to understand more about Unicode and emoji. Before using emoji in production code, you should read the tips beforehand.
The calculateSurrogatePair procedure in Tcl is designed to calculate surrogate pairs for Unicode code points beyond the Basic Multilingual Plane (BMP). It takes a hexadecimal Unicode code point as input, optionally prefixed with 'U+', '\u', or '\U'. The procedure outputs the corresponding high and low surrogate values in various formats, including strings and lists, depending on the specified output format. This function requires Tcl version 8.6.10 or higher and is useful for handling Unicode characters, like emojis, which are not represented in the BMP.
#! /usr/bin/env tclsh # emojicalc.tcl # 2024.01.07 # This procedure requires Tcl 8.6.10 or hight if {[package vcompare [package provide Tcl] 8.6.10] >= 0} { # example input: U+1F600 or 1F600 or \U1F600 # but also because it is used: \u1F600 or \U+1F600 # outputFormat:symbol:0,s u-Format:1,u Numbers:2,n List:5,l {default} # position in list: symbol:0, u-Format:1,Numbers:2 # return a list or symbol or u-Format or Numbers # Check whether the Tcl version is at least 8.6.10 proc calculateSurrogatePair {codePointHex {outputFormat l}} { # Regular expression that optionally recognizes and removes "U+","\u","\U" regexp {\\?[Uu]\+?([0-9A-F]+)} $codePointHex -> codePointHex # Converts the hexadecimal code point to a number scan $codePointHex %x codePoint # Subtracts 0x10000 to move the code point into the modifiable area set adjustedCodePoint [expr {$codePoint - 0x10000}] # Calculates the high surrogate set highSurrogate [expr {0xD800 + ($adjustedCodePoint >> 10)}] # Calculates the low surrogate set lowSurrogate [expr {0xDC00 + ($adjustedCodePoint & 0x3FF)}] # Converts the surrogates back to hexadecimal and returns them set uSurrogate [format "%04X %04X" $highSurrogate $lowSurrogate] set uSymbol "\\u[lindex $uSurrogate 0]\\u[lindex $uSurrogate 1]" set Numbers "[lindex $uSurrogate 0] [lindex $uSurrogate 1]" #output switch -- $outputFormat { 0 - s { return {*}$uSymbol } 1 - u { return $uSymbol } 2 - n { return $Numbers } 5 - l { return [list {*}$uSymbol $uSymbol $Numbers] } } } # example input: "D83D DE00" or {D83D DE00} or "\\uD83D\\uDE00" or {\uD83D\uDE00} # outputFormat: u-Format:1,u Numbers:2,n U+ Format:3,p List:5,l {default} # Symbol:0,s no sense only of completeness # position in list: Symbol:0 u-Format:1,Numbers:2 # return a list or symbol or u-Format or Numbers or U+ Format proc convertSurrogatesToCodePoint {arg {outputFormat l}} { # Normalize the input to get uniform surrogate pairs set normalizedArg [regsub -all {\\+u|{|}|\s+} $arg ""] set highSurrogate [string range $normalizedArg 0 3] set lowSurrogate [string range $normalizedArg 4 end] # Convert surrogate pairs from hexadecimal to decimal scan $highSurrogate %x highValue scan $lowSurrogate %x lowValue # Subtract the start values of high and low surrogates set highValue [expr {$highValue - 0xD800}] set lowValue [expr {$lowValue - 0xDC00}] # Combine the values and add 0x10000 to get the original code point set codePoint [expr {($highValue << 10) + $lowValue + 0x10000}] # Convert the code point back to hexadecimal format switch -- $outputFormat { 0 - s { return {*}[format "\\U%04X" $codePoint] } 1 - u { return [format "\\U%04X" $codePoint] } 2 - n { return [format "%04X" $codePoint] } 3 - p { return [format "U+%04X" $codePoint] } 5 - l { return [list {*}[format "\\u%04X" $codePoint] [format "\\U%04X" $codePoint] \ [format "%04X" $codePoint] [format "U+%04X" $codePoint]] } } } #Example, no symbol under Windows (console) if {[info script] eq $argv0} { puts "Tclversion: [info patchlevel] encoding: [encoding system]\n" puts [calculateSurrogatePair U+1F600] puts [calculateSurrogatePair U+1F600 l] puts [calculateSurrogatePair U+1F600 s] puts [lindex [calculateSurrogatePair U+1F600 l] 0] puts [calculateSurrogatePair U+1F600 u] puts [lindex [calculateSurrogatePair U+1F600 l] 1] puts [calculateSurrogatePair U+1F600 n] puts [lindex [calculateSurrogatePair U+1F600 l] 2] puts [calculateSurrogatePair [convertSurrogatesToCodePoint "\\uD83D\\uDE00" u] ] puts \n puts [convertSurrogatesToCodePoint [calculateSurrogatePair U+1F600 u]] puts [convertSurrogatesToCodePoint [calculateSurrogatePair U+1F600 n]] puts [convertSurrogatesToCodePoint "D83D DE00"] puts [convertSurrogatesToCodePoint "{D83D DE00}"] puts [convertSurrogatesToCodePoint {\uD83D \uDE00}] puts [convertSurrogatesToCodePoint "\\uD83D\\uDE00"] puts [convertSurrogatesToCodePoint {\\uD83D \\uDE00}] puts [convertSurrogatesToCodePoint {\uD83D\uDE00}] puts [convertSurrogatesToCodePoint {\uD83D\uDE00} s] puts [convertSurrogatesToCodePoint {\uD83D\uDE00} u] puts [convertSurrogatesToCodePoint {\uD83D\uDE00} n] puts [convertSurrogatesToCodePoint {\uD83D\uDE00} p] } } else { error "This procedure requires Tcl 8.6.10 or higher" }
No output of emoji on Windows.
#! /usr/bin/env tclsh # emojiexample1.tcl # 2024-01-06 # Example 1 # not symbol with Windows set scriptDir [file dirname [info script]] set emojicalc emojicalc.tcl source [file join $scriptDir $emojicalc] puts "Tclversion: [info patchlevel] encoding: [encoding system]\n" set codePointHex "U+1F600" set surrogatePair [calculateSurrogatePair $codePointHex] puts "Symbol, Surrogate Pair for $codePointHex: $surrogatePair" puts "\\uD83D\\uDE00: \uD83D\uDE00" puts "surrogatePair: $surrogatePair" puts \n puts "Lines a-f, different types of access" puts -nonewline "a : " puts [lindex $surrogatePair 1] puts -nonewline "b \": " puts "[lindex $surrogatePair 1]" puts "c subst: [subst [lindex $surrogatePair 1]]" puts "d1 catch: [catch {{*}[lindex $surrogatePair 1]} err]" puts "d2 catch err: $err" puts -nonewline "e \{*\}: " puts {*}[lindex $surrogatePair 1] puts "f \": {*}[lindex $surrogatePair 1]" puts \n puts "# outputFormat:symbol:0,s u-Format:1,u Numbers:2,n List:5,l {default}" foreach a {0 s 1 u 2 n 5 l} { puts "$a: [calculateSurrogatePair $codePointHex $a]" } puts "#position in List: symbol:0, u-Format:1,Numbers:2" foreach a {0 1 2} { puts "$a: [lindex [calculateSurrogatePair $codePointHex] $a]" }
#! /usr/bin/env tclsh package require Tk # emojiexample2.tcl # 2024.01.06 # Example 2 set scriptDir [file dirname [info script]] set emojicalc emojicalc.tcl source [file join $scriptDir $emojicalc] text .text -width 70 -height 20 -yscrollcommand ".scroll set" scrollbar .scroll -orient verti -command [list .text yview] grid .text -row 0 -column 0 -sticky nsew grid .scroll -row 0 -column 1 -sticky ns grid columnconfigure . 0 -weight 1 grid rowconfigure . 0 -weight 1 #excerpt from the website #https://wiki.tcl-lang.org/page/emoji set uList { set uc_grinning_face \u1F600 set uc_grinning_face_with_big_eyes \u1F603 set uc_grinning_face_with_sweat \u1F605 set uc_rolling_on_the_floor_laughing \u1F923 set uc_handbag \u1F45C set uc_clutch_bag \u1F45D set uc_shopping_bags \u1F6CD set uc_backpack \u1F392 set uc_mans_shoe \u1F45E set uc_running_shoe \u1F45F set uc_hiking_boot \u1F97E set uc_flat_shoe \u1F97F set uc_high-heeled_shoe \u1F460 set uc_womans_sandal \u1F461 set uc_womans_boot \u1F462 } proc transformList {uList} { set nList {} foreach line [split $uList \n] { set line [string trim $line] if {$line eq ""} { continue } # Checks whether the line corresponds to the pattern "set varName \uxxxx", also with leading spaces if {[regexp {^\s*set\s+([\w-]+)\s+\\u([0-9A-Fa-f]+)$} $line -> varName unicodeValue]} { lappend nList $varName $unicodeValue } else { puts "Warning: Line does not match the expected pattern: $line" } } return $nList } .text insert end "Tclversion: [info patchlevel] Encoding: [encoding system]\n" foreach { desc uni} [transformList $uList] { .text insert end "[calculateSurrogatePair $uni] $desc \\$uni" .text insert end \n }
#! /usr/bin/env tclsh package require Tk # emojiexample3.tcl # 2024.01.06 # # Example 3 # set scriptDir [file dirname [info script]] set emojicalc emojicalc.tcl source [file join $scriptDir $emojicalc] text .text -width 80 -height 40 -yscrollcommand ".scroll set" scrollbar .scroll -orient verti -command [list .text yview] grid .text -row 0 -column 0 -sticky nsew grid .scroll -row 0 -column 1 -sticky ns grid columnconfigure . 0 -weight 1 grid rowconfigure . 0 -weight 1 .text insert end "Tclversion: [info patchlevel] Encoding: [encoding system]" .text insert end \n .text insert end \n set codePointHex "U+1F600" set surrogatePair [calculateSurrogatePair $codePointHex] .text insert end "Symbol, Surrogate Pair for $codePointHex: $surrogatePair" .text insert end \n .text insert end "\\uD83D\\uDE00: \uD83D\uDE00" .text insert end \n .text insert end "surrogatePair: $surrogatePair" .text insert end \n .text insert end \n .text insert end "Lines a-f, different types of access" .text insert end \n .text insert end "a : " .text insert end [lindex $surrogatePair 1] .text insert end \n .text insert end "b \": " .text insert end "[lindex $surrogatePair 1]" .text insert end \n .text insert end "c subst: [subst [lindex $surrogatePair 1]]" .text insert end \n .text insert end "d1 catch: [catch {{*}[lindex $surrogatePair 1]} err]" .text insert end \n .text insert end "d2 catch err: $err" .text insert end \n .text insert end "e \{*\}: " .text insert end {*}[lindex $surrogatePair 1] .text insert end \n .text insert end "f \": {*}[lindex $surrogatePair 1]" .text insert end \n .text insert end \n .text insert end "# outputFormat:symbol:0,s u-Format:1,u Numbers:2,n List:5,l {default}" .text insert end \n foreach a {0 s 1 u 2 n 5 l} { .text insert end "$a: [calculateSurrogatePair $codePointHex $a]" .text insert end \n } .text insert end \n .text insert end "#position in List: symbol:0, u-Format:1,Numbers:2" .text insert end \n foreach a {0 1 2} { .text insert end "$a: [lindex [calculateSurrogatePair $codePointHex] $a]" .text insert end \n }
LES on 01-01-2024: None of this works for me. I also tried replacing all the puts statements with insertions into a text widget. I just get garbled text in every case. What I think is very strange is that I can copy and paste just about any Unicode symbol into a Tk text widget. So they are "displayable." Why can't we generate them? greg on 01-01-2024: It works for me under Linux. I had not tested it with Windows. LES I'm on Linux too. greg I have Linux with Tcl 8.6.13. Works with alited, geany and in the terminal directly.
greg on 01-01-2024.Works for me with Linux and with Tcl/Tk 8.6.10, Tcl/Tk 8.6.13 (see also Tcl interprets two adjacent surrogate code points as a character encoded using UTF-16 ). The symbols are not displayed under Windows. There is an error in the code or maybe it does not work with my procedure. If I understood the discussion in the link above correctly, my approach of using surrogate pairs in Tcl is not recommended for programs anyway.
greg on 02-01-2024: Check whether the Tcl version is at least 8.6.10 for the proc. Example3 works for me under Windows and Tcl/Tk 8.6.13
The convertToUnicodeEscapes procedure in Tcl converts a string of hexadecimal numbers into Unicode escape sequences. It supports various output formats including symbols, Unicode formats, numbers, and lists. This function is essential for processing and representing Unicode characters in Tcl scripts.
#! /usr/bin/env tclsh # unicodekonv.tcl # 2024.01.07 # Function to convert a string of hexadecimal numbers into Unicode escape sequences # outputFormat:symbol:0,s u-Format:1,u Numbers:2,n List:5,l {default} # position in list: symbol:0, u-Format:1,Numbers:2 proc convertToUnicodeEscapes {hexString {outputFormat l}} { # Remove any whitespace from the input set cleanedHexString [regsub -all {\s+} $hexString ""] # Replace each group of four hexadecimal digits with a Unicode escape sequence regsub -all {([0-9A-Fa-f]{4})} $cleanedHexString {\\u\1} result # Return the result based on the value of 's' switch -- $outputFormat { 0 - s { return {*}$result } 1 - u { return $result } 5 - l { return [list {*}$result $result] } } } if {[info script] eq $argv0} { # Examples to proc convertToUnicodeEscapes set example1 "23F0" set example2 "0023 FE0F 20E3" set unicode1 [convertToUnicodeEscapes $example1] set unicode2 [convertToUnicodeEscapes $example2] set unicode2a [lindex [convertToUnicodeEscapes $example2] 0] set unicode3 [convertToUnicodeEscapes $example2 s] set unicode4 [convertToUnicodeEscapes $example2 u] set unicode5 [convertToUnicodeEscapes $example2 l] puts "1: $unicode1" puts "2: $unicode2" puts "2a: $unicode2a" puts "3: $unicode3" puts "4: $unicode4" puts "5: $unicode5" }
The parseEmojiSequencesFile procedure in Tcl parses a data file to create a dictionary of emoji sequences. It reads the file, extracts each line's code points, type, description, and comments, and stores them in a structured format. This function is useful for organizing and accessing detailed emoji information in Tcl applications.
#! /usr/bin/env tclsh #emojisequences.tcl # 2024.01.06 # parser data file #expand 231A..231B to a list {231A 231B} proc expandSeq {seq} { # Check if the input is a sequence or a single value if {[regexp {(.*)\.\.(.*)} $seq -> startHex endHex]} { # Convert the hexadecimal values to decimal values set startDec [scan $startHex %x] set endDec [scan $endHex %x] # Initialize the list set list [list] # Iterate through the sequence and add each value to the list for {set i $startDec} {$i <= $endDec} {incr i} { lappend list [format %X $i] } } else { # For a single value, add it directly to the list set list [list $seq] } # Return the list return $list } # Format: # code_point(s) ; type_field ; description # comments proc parseEmojiSequencesFile {filename} { if {![file exists $filename]} { error "File $filename not found" } set fileData [open $filename] fconfigure $fileData -encoding utf-8 set emojiDict {} while {[gets $fileData line] >= 0} { # Skip comment lines and empty lines if {[regexp {^#|^\s*$} $line]} continue incr linecount ;#count row # Break down the line into its components set semiSplit [split $line ";"] if {[llength $semiSplit] != 3} { puts "error in $linecount" puts $line } #set semiNumberSign [split [lindex $semiSplit 2] "#"] set codepoints [lindex $semiSplit 0] set type [lindex $semiSplit 1] set rest [lindex $semiSplit 2] set firstHashPos [string first "#" $rest] if {$firstHashPos == -1} { set description $rest set comments "" } else { set description [string range $rest 0 [expr {$firstHashPos - 1}]] set comments [string range $rest [expr {$firstHashPos + 1}] end] } foreach codepoint [expandSeq $codepoints] { # Add the data to the dict dict set emojiDict [string trim $codepoint] [list $type $description $comments] } } close $fileData # puts "rows: $linecount dict size: [dict size $emojiDict]" return $emojiDict }
#! /usr/bin/env tclsh #emojiseqexample.tcl #2024.01.06 package require Tk #requirements Tcl/Tk 8.6.13 #Tcl/Tk 8.6.13 Release Announcement #November 22, 2022 #Unicode 15 # #https://unicode.org/Public/emoji/15.0/ #e.q. download under ../unicode.org # # 3 Tcl Scripts with the procs #emojisequences.tcl #unicodekonv.tcl #emojicalc.tcl # #datafile emoji-sequences.txt #from https://unicode.org/Public/emoji/15.0/ set scriptDir [file dirname [info script]] set dataDir unicode.org set emojisequencesTxt emoji-sequences.txt #source three file with the procs source [file join $scriptDir emojisequences.tcl] source [file join $scriptDir unicodekonv.tcl] source [file join $scriptDir emojicalc.tcl] #text widget text .text -width 180 -height 22 -yscrollcommand ".scroll set" scrollbar .scroll -orient verti -command [list .text yview] grid .text -row 0 -column 0 -sticky nsew grid .scroll -row 0 -column 1 -sticky ns grid columnconfigure . 0 -weight 1 grid rowconfigure . 0 -weight 1 # Read in the file and create the dictionary set emojiDict [parseEmojiSequencesFile [file join $scriptDir $dataDir $emojisequencesTxt]] # Example of how to use the dictionary .text insert end "Tclversion: [info patchlevel] Encoding: [encoding system]" .text insert end \n .text insert end \n set i 0 foreach codepoint [dict keys $emojiDict] { incr i set type [lindex [dict get $emojiDict $codepoint] 0] set description [string trim [lindex [dict get $emojiDict $codepoint] 1]] # commentspart:only the symbol from comment regexp {\((\S+)\)$} [lindex [dict get $emojiDict $codepoint] 2] -> commentspart # unikonv set alluni [list] set length [llength $codepoint] for {set pos 0} {$pos < $length} { incr pos} { if {[string length [lindex $codepoint $pos]] == "4" } { set unikonv$pos [lindex $codepoint $pos] } else { set unikonv$pos [calculateSurrogatePair [lindex $codepoint $pos] s] } lappend alluni [set unikonv$pos] } set unikonv [convertToUnicodeEscapes $alluni s] # unikonv the symbol created using the procs # codepoint, commentspart, description from datafile .text insert end "$i Codepoint: $codepoint, unikonv: $unikonv ,Comments: $commentspart, Description: $description " .text insert end \n }