Version 14 of BUG - 'string length' count also NON visible chars

Updated 2024-03-07 10:55:48 by aotto1968


update 7 mar 2024

The CORE problem is not just string length; the CORE problem is how a string is represented in Tcl

  • (and all other programming languages as well)

Let's start simple: the problem is the same as the multibyte-character problem that came up ~30 years ago.

  • at the beginning there was only ASCII and a string was a char*; after a while people figured out that there are languages with more chars than plain ASCII.
  • It took ~20 years to implement UTF-8 and finally UTF-16 in all programming languages.

Tcl uses UTF-8 as its script encoding and a fixed-width internal string representation for indexing (16-bit units, i.e. UTF-16-like, in Tcl 8.x) because

  • a string is only useful if index operations such as string index $str 0 can be performed, and a fixed-width representation is indexable while UTF-8 is not.

The string length problem does not stand alone; index-based operations are affected as well, for example:

  • string index or
  • string range or
  • format %40s

None of these work as expected on a string that contains terminal codes.
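
To make the index problem concrete, here is a minimal sketch. The variable names and the chosen red/reset sequences are only illustrative assumptions; any SGR sequence behaves the same way.

```tcl
# Sketch: how an embedded ANSI color sequence skews index-based commands.
set esc   "\x1b"
set red   "${esc}\[1;31m"     ;# 7 characters, none of them visible
set reset "${esc}\[0m"        ;# 4 characters, none of them visible
set msg   "${red}abc${reset}" ;# 14 characters, 3 of them visible

# string index returns the ESC control character, not the visible "a".
set first [string index $msg 0]

# format counts characters: the colored string is already longer than the
# field width of 10, so no padding is added and the column is misaligned.
set padded [format "%-10s|" $msg]
set plain  [format "%-10s|" "abc"]

puts [expr {$first eq $esc}]    ;# 1
puts [string length $padded]    ;# 15
puts [string length $plain]     ;# 11
```

On a terminal both formatted strings show "abc", but only the plain one is followed by seven spaces before the "|".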

The CORE problem is that the terminal control characters are embedded in the string instead of being stored as a character attribute.

  • the terminal is the main presentation layer of Tcl.
  • the GUI toolkit is the main presentation layer of Tk.

The current string length in Tcl does something in between string bytelength and a visual length: a multibyte char is reduced to a single char, but every invisible console control char is counted like a char.
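
A small sketch of the three different counts. The helper name charsAndBytes is made up for this example, and the byte count is taken from the UTF-8 encoded form via encoding convertto instead of string bytelength.

```tcl
# Sketch: character count versus UTF-8 byte count for one visible character.
# charsAndBytes is a hypothetical helper, not a Tcl built-in.
proc charsAndBytes {s} {
    set chars [string length $s]
    set bytes [string length [encoding convertto utf-8 $s]]
    return [list $chars $bytes]
}

set plain   "a"                   ;# one visible char
set umlaut  "\u00e4"              ;# a-umlaut: one char, two bytes in UTF-8
set colored "\x1b\[32ma\x1b\[0m"  ;# green "a": still one visible char

puts [charsAndBytes $plain]     ;# 1 1
puts [charsAndBytes $umlaut]    ;# 1 2
puts [charsAndBytes $colored]   ;# 10 10
```

In all three cases the terminal shows exactly one character. The multibyte case is folded into one char, the color case is not.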

solution

The solution is the same as in UTF-16 or Tk: every char has to be represented by a data structure that holds

  1. the multibyte-char (utf16)
  2. the terminal-encoding (color,bold,underline,...)

I call this utf32 (my own name for the idea, not the existing Unicode UTF-32 encoding)

puts would then operate as follows:

  1. connected to a terminal: just translate the utf32 string into a stream understood by the terminal.
  2. connected to a file: just write the utf32 string into the file
    • (perhaps the utf32 can be encoded into utf8)
  3. finally, you can write a terminal-colored string into a file and read it back with the color information still available.

I understand that this is not a job for Tcl alone, because all programming languages are involved. That is why I think a utf32 consortium would be a good idea, so that programmers of all languages can start working together to implement this.

finally

  1. I expect 'string length' to return only the visible length
  2. I expect additional commands that add the extra attributes to a string, like:
    • append str [string color "123" blue]
      • the new string is "${str}123", where every char of "123" carries the additional blue attribute
    • append str [string bold "abc"]
  3. I expect every editor, like vim, to open a utf32 or utf32→utf8 file properly and display the additional attributes.
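
The idea can be approximated in plain Tcl today. This is only a rough sketch under stated assumptions: the astr namespace, its command names, and the small attribute table are all invented for illustration, and nothing like this exists in the Tcl core. An attributed string is modeled as a list with one {char attributes} pair per character.

```tcl
# Sketch of a per-character attribute representation (hypothetical API).
namespace eval astr {
    # Assumed subset of SGR codes, for illustration only.
    variable SGR
    array set SGR {bold 1 underline 4 red 31 green 32 blue 34}

    # Build an attributed string: one {char attrs} pair per character.
    proc make {text args} {
        set out {}
        foreach ch [split $text ""] {
            lappend out [list $ch $args]
        }
        return $out
    }

    # The length is the number of characters; attributes never count.
    proc length {attributed} {
        return [llength $attributed]
    }

    # Translate the attributed string into a stream for an ANSI terminal.
    # Deliberately naive: every attributed char gets its own sequence.
    proc render {attributed} {
        variable SGR
        set out ""
        foreach pair $attributed {
            lassign $pair ch attrs
            if {[llength $attrs] == 0} {
                append out $ch
                continue
            }
            set codes {}
            foreach a $attrs { lappend codes $SGR($a) }
            set sgr [join $codes {;}]
            append out "\x1b\[${sgr}m" $ch "\x1b\[0m"
        }
        return $out
    }
}

set str [astr::make "id="]
lappend str {*}[astr::make "123" blue bold]
puts [astr::length $str]    ;# 6
puts [astr::render $str]
```

Because the attributes live beside the characters, indexing and length work on visible characters only, and the escape sequences appear only at output time.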

original problem

Hi,

I use a library to format and print debugging messages, and it also uses the Linux terminal color codes.

The CORE problem is that string length .. is used to format the message and finally to add some additional information etc., BUT string length .. counts the NON-visible color codes as well as the visible chars, which results in a MIS-formatted string

example: (color is not visible in this thread !! - but all is "green")

 -> outTXT<multi-line>
 ->   |   |DEBUG[1]: aa   -> LIST                                     |  <- count color codes
 ->   |   |DEBUG[1]:      ->   | a1                   : 111           |  <- count color codes
 ->   |   |DEBUG[1]:      ->   | a2                   : 222           |  <- count color codes
 ->   |   |DEBUG[1]:      ->   | a3                   : 333 |

a color-code is something like

  ## ------------------------------------------------------------------------
  ## DEBUG helper
  ##  Black        0;30     Dark Gray     1;30
  ##  Red          0;31     Light Red     1;31
  ##  Green        0;32     Light Green   1;32
  ##  Brown/Orange 0;33     Yellow        1;33
  ##  Blue         0;34     Light Blue    1;34
  ##  Purple       0;35     Light Purple  1;35
  ##  Cyan         0;36     Light Cyan    1;36
  ##  Light Gray   0;37     White         1;37
  ##
  ##  8bit (256) colors: https://stackoverflow.com/questions/4842424/list-of-ansi-color-escape-sequences
  variable CL_COLOR
  set  CL_COLOR(red)        "\x1b\[1;31m"
  set  CL_COLOR(green)      "\x1b\[1;32m"
  set  CL_COLOR(yellow)     "\x1b\[1;33m"
  set  CL_COLOR(blue)       "\x1b\[1;34m"
  set  CL_COLOR(purple)     "\x1b\[38;5;206m"
  set  CL_COLOR(cyan)       "\x1b\[1;36m"
  set  CL_COLOR(lightcyan)  "\x1b\[38;5;51m"
  set  CL_COLOR(white)      "\x1b\[1;37m"
  set  CL_COLOR(grey)       "\x1b\[38;5;254m"
  set  CL_COLOR(orange)     "\x1b\[38;5;202m"
  set  CL_COLOR(no)         ""
  variable  CL_RESET        "\x1b\[0;m"

code:

  set tstmsg  "$lib_debug::CL_COLOR(red)123$lib_debug::CL_RESET"
  puts "$tstmsg"
  puts [string length $tstmsg]

result:

  123        (the color of "123" is red)
  15         (this should be 3)

Something is MISSING in Tcl: a string length ... that counts only the !! VISIBLE !! chars
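
Until something like that exists, a workaround can be sketched in a few lines. The proc names stripSgr, visibleLength and padVisible are made up here, and the sketch assumes that the only invisible content is SGR color sequences of the form ESC [ ... m. Other terminal sequences, wide characters and combining characters are not handled.

```tcl
# Remove SGR sequences (ESC, "[", digits and semicolons, "m").
proc stripSgr {s} {
    regsub -all {\x1b\[[0-9;]*m} $s "" s
    return $s
}

# Count only what is left after the color codes are removed.
proc visibleLength {s} {
    return [string length [stripSgr $s]]
}

# Left-justify to a visible width, ignoring color codes when counting.
proc padVisible {s width} {
    set missing [expr {$width - [visibleLength $s]}]
    if {$missing > 0} {
        append s [string repeat " " $missing]
    }
    return $s
}

set tstmsg "\x1b\[1;31m123\x1b\[0;m"
puts [string length $tstmsg]    ;# 15
puts [visibleLength $tstmsg]    ;# 3
puts "[padVisible $tstmsg 10]|"
```

padVisible stands in for format %-10s in the debug library: the padding is computed from the visible length, so the "|" lands in the same column whether or not the text is colored.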

Kind regards, ao


GWL The definition of what counts as a !! VISIBLE !! char depends on the device that is displaying the string. This is not something Tcl is missing. This is very application specific.


JMN Processing strings containing ANSI SGR sequences can be pretty tricky. A good first step is to use a regex to split into a list of plaintext sections and code sections - possibly each individual ANSI code or possibly all lumped together depending on what you want to do.

If you want to manipulate things and join blocks of text together - you need to do ANSI resets and replays, which is possible, but escalates the complexity quickly.

Perl has Text::ANSI::Util and others, which are good for some ideas.

For my ANSI handling in Tcl I do a Perl-style split to get a list that is always either zero-length or odd-length, always starting and ending with plaintext. You can then do a foreach {plaintext code} $ansisplits {dostuff...} which makes things manageable.
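
A minimal sketch of such a split, not JMN's actual code: the proc name ansiSplit is made up, and only SGR sequences (ESC [ ... m) are recognized as codes.

```tcl
# Split a string into an alternating list:
#   plaintext code plaintext code ... plaintext
# The result is empty for an empty string and odd-length otherwise.
proc ansiSplit {s} {
    if {$s eq ""} { return {} }
    set re {\x1b\[[0-9;]*m}
    set out {}
    set pos 0
    foreach range [regexp -all -inline -indices -- $re $s] {
        lassign $range from to
        # Plaintext before this code (may be the empty string).
        lappend out [string range $s $pos [expr {$from - 1}]]
        # The code itself.
        lappend out [string range $s $from $to]
        set pos [expr {$to + 1}]
    }
    # Trailing plaintext (may be the empty string).
    lappend out [string range $s $pos end]
    return $out
}

# Example "dostuff": upper-case the text, leave the codes untouched.
set ansisplits [ansiSplit "\x1b\[1;32mok\x1b\[0m done"]
set result ""
foreach {plaintext code} $ansisplits {
    append result [string toupper $plaintext] $code
}
puts [llength $ansisplits]    ;# 5
puts $result
```

Because the list always ends with plaintext, the last foreach iteration simply sees an empty code, so no special case is needed at the end of the string.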