BUG - 'string length' count also NON visible chars


update 7 mar 2024

The CORE problem is not just string length; the CORE problem is how a string is represented in Tcl

(and in all other programming languages as well).

Let's start simple: the problem is the same as the multibyte-character problem raised ~30 years ago.

In the beginning there was only ASCII and a string was a char*. After a while people figured out that there are languages with more characters than plain ASCII.
It took ~20 years to get UTF-8 and finally UTF-16 into all programming languages.

Tcl uses UTF-8 as the script encoding and UTF-16 internally as the string representation, because

a string is only useful if index operations such as string index $str 0 can be performed, and UTF-16 is indexable while UTF-8 is not.

string length is not the only problem; index-based operations such as

string index or
string range or
format %40s

don't work either.
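
A minimal sketch of the effect, assuming SGR color codes like the ones in the original report further down (ESC is written as \033; the variable names are made up for this example):

```tcl
# Two cells with the same visible text; one is wrapped in SGR color codes.
set plain "abc"
set green "\033\[1;32mabc\033\[0m"

# format pads by Tcl character count, so the colored cell gets no padding:
# its 14 characters already exceed the requested width of 8.
set cell_plain [format "|%-8s|" $plain]
set cell_green [format "|%-8s|" $green]

puts [string length $plain]   ;# 3
puts [string length $green]   ;# 14
puts $cell_plain              ;# |abc     |
puts $cell_green              ;# shows as |abc| on a terminal, with abc in green

# Index operations see the escape sequence too: index 0 is ESC, not "a".
puts [scan [string index $green 0] %c]   ;# 27
```

The two cells end up with different visible widths, which is exactly the misalignment in the debug output below.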

The CORE problem is that the terminal control characters are embedded in the string instead of being stored as a character attribute.

The terminal is the main presentation layer of Tcl.
The GUI toolkit is the main presentation layer of Tk.

The current string length in Tcl does something in between string bytelength and a visual length: a multibyte character is reduced to a single character, but an invisible console control character is still counted as a character.

solution

The solution is the same as for UTF-16 or Tk: every character has to be represented by a data structure that holds

the multibyte character (UTF-16)
the terminal encoding (color, bold, underline, ...)

I call this utf32.

puts then operates as follows:

connected to a terminal, it just translates the utf32 string into a stream the terminal understands.
connected to a file, it just writes the utf32 string to the file
- (perhaps the utf32 can be encoded as utf8)
finally, you can write a terminal-colored string to a file and read it back with the color information still available.

I understand that this is not a job for Tcl alone, because all programming languages are involved. That is why I think a utf32 consortium would be good, so that programmers of all languages can get together and implement this.

finally

I expect 'string length' to return only the visible length.
I expect additional commands that add the extra attributes to the string, like:
- append str [string color "123" blue]
  - the new string is "${str}123", where every char of "123" carries the additional blue attribute
- append str [string bold "abc"]
I expect every editor, like vim, to open a utf32 or utf32→utf8 file properly and display the additional attributes.
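
To make the proposal concrete, here is a rough sketch in plain Tcl of what "one data structure per character" could look like. It is only an illustration under assumptions: the procs astr_new, astr_length and astr_render are hypothetical names invented for this sketch (they are not existing or planned Tcl commands), the string is modelled as a list of {char attributes} pairs, and only a handful of attributes are mapped to SGR codes.

```tcl
# Hypothetical representation: a list of {char attributes} pairs.
# None of these procs exist in Tcl; the names are made up for this sketch.
proc astr_new {text {attrs {}}} {
    set out {}
    foreach ch [split $text ""] {
        lappend out [list $ch $attrs]
    }
    return $out
}

# Length counts cells only, so attributes never affect it.
proc astr_length {astr} {
    return [llength $astr]
}

# Render for a terminal: emit a reset plus the new SGR sequence
# whenever the attributes change from one cell to the next.
proc astr_render {astr} {
    array set sgr {bold 1 underline 4 red 31 green 32 blue 34}
    set out ""
    set current {}
    foreach cell $astr {
        lassign $cell ch attrs
        if {$attrs ne $current} {
            append out "\033\[0m"
            set codes {}
            foreach a $attrs { lappend codes $sgr($a) }
            if {[llength $codes]} {
                append out "\033\[[join $codes {;}]m"
            }
            set current $attrs
        }
        append out $ch
    }
    if {$current ne {}} { append out "\033\[0m" }
    return $out
}

set str [astr_new "abc"]
lappend str {*}[astr_new "123" blue]
lappend str {*}[astr_new "xy" {bold red}]
puts [astr_length $str]    ;# 8
puts [astr_render $str]
```

Writing such a value to a file would need its own serialization, which this sketch does not attempt.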

JMN I admire the inventiveness of this approach - but I'm not sure this heads in the right direction.

It may be useful to distinguish between what I call 'rendered' ANSI - which has no movements or terminal status control sequences; just the character attributes (ANSI SGR sequences) and raw ANSI input streams - which can jump all over the place in the terminal, and include things like cursor save and cursor restore which remember the attributes and row,column from one point and then jump back there.

The addition of a new way to represent characters with attributes won't solve the need to process complex ANSI sequences and manipulate those strings.

I don't consider the code sequences as being per character as such, at least not until they are in some way 'rendered'. Look at an old ANSI graphic image and see all the moving around it does for an example. Modern terminal applications also jump around and use cursor_save/restore etc.

original problem

Hi,

I use a library to format and print debugging messages, which also uses the Linux terminal color codes.

The CORE problem is that string length .. is used to format the message and then add some additional information etc., BUT string length .. counts the NON-visible color codes as well as the visible chars, which results in a MIS-formatted string.

example: (color is not visible in this thread !! - but all is "green")

 -> outTXT<multi-line>
 ->   |   |DEBUG[1]: aa   -> LIST                                     |  <- count color codes
 ->   |   |DEBUG[1]:      ->   | a1                   : 111           |  <- count color codes
 ->   |   |DEBUG[1]:      ->   | a2                   : 222           |  <- count color codes
 ->   |   |DEBUG[1]:      ->   | a3                   : 333 |

a color-code is something like

  ## ------------------------------------------------------------------------
  ## DEBUG helper
  ##  Black        0;30     Dark Gray     1;30
  ##  Red          0;31     Light Red     1;31
  ##  Green        0;32     Light Green   1;32
  ##  Brown/Orange 0;33     Yellow        1;33
  ##  Blue         0;34     Light Blue    1;34
  ##  Purple       0;35     Light Purple  1;35
  ##  Cyan         0;36     Light Cyan    1;36
  ##  Light Gray   0;37     White         1;37
  ##
  ##  8bit (256) colors: https://stackoverflow.com/questions/4842424/list-of-ansi-color-escape-sequences
  variable CL_COLOR
  set  CL_COLOR(red)        "\033\[1;31m"
  set  CL_COLOR(green)      "\033\[1;32m"
  set  CL_COLOR(yellow)     "\033\[1;33m"
  set  CL_COLOR(blue)       "\033\[1;34m"
  set  CL_COLOR(purple)     "\033\[38;5;206m"
  set  CL_COLOR(cyan)       "\033\[1;36m"
  set  CL_COLOR(lightcyan)  "\033\[38;5;51m"
  set  CL_COLOR(white)      "\033\[1;37m"
  set  CL_COLOR(grey)       "\033\[38;5;254m"
  set  CL_COLOR(orange)     "\033\[38;5;202m"
  set  CL_COLOR(no)         ""
  variable  CL_RESET        "\033\[0;m"

code:

  set tstmsg  "$lib_debug::CL_COLOR(red)123$lib_debug::CL_RESET"
  puts "$tstmsg"
  puts [string length $tstmsg]

result:

  123        (the color of "123" is red)
  15         (this should be 3)

Something is MISSING in Tcl: a string length ... that counts only the !! VISIBLE !! chars.

Regards, ao

GWL The definition of what is a !! VISIBLE !! char depends on the device that is displaying the string. This is not something Tcl is missing. This is very application specific.

JMN Processing strings containing ANSI SGR sequences can be pretty tricky. A good first step is to use a regex to split into a list of plaintext sections and code sections - possibly each individual ANSI code or possibly all lumped together depending on what you want to do.

If you want to manipulate things and join blocks of text together - you need to do ANSI resets and replays, which is possible, but escalates the complexity quickly.

Perl has Text::ANSI::Util and others which is good for some ideas.

For my ANSI handling in Tcl I do a Perl-style split to get a list that is always either empty or of odd length, always starting and ending with plaintext. You can then do a foreach {plaintext code} $ansisplits {dostuff...} which makes things manageable.
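
The split-then-iterate idea can be sketched in a few lines of plain Tcl. This is not the punk::ansi implementation, only an illustration under assumptions: sgr_split and visible_length are hypothetical names, only SGR sequences (ESC [ parameters m) are recognised, and unlike the split described above this version returns a one-element list for an empty input.

```tcl
# Split a string into an odd-length list: plaintext, code, plaintext, ...
# Only SGR sequences (ESC [ params m) are recognised; cursor movement and
# other ANSI sequences are left inside the plaintext parts.
proc sgr_split {text} {
    set parts {}
    set start 0
    foreach range [regexp -all -inline -indices {\u001b\[[0-9;]*m} $text] {
        lassign $range first last
        lappend parts [string range $text $start [expr {$first - 1}]]
        lappend parts [string range $text $first $last]
        set start [expr {$last + 1}]
    }
    lappend parts [string range $text $start end]
    return $parts
}

# Count only the plaintext parts; the trailing "code" of the last pair
# is simply empty because the list has odd length.
proc visible_length {text} {
    set n 0
    foreach {plaintext code} [sgr_split $text] {
        incr n [string length $plaintext]
    }
    return $n
}

set tstmsg "\033\[1;31m123\033\[0;m"
puts [string length $tstmsg]    ;# 15
puts [visible_length $tstmsg]   ;# 3
```

For the test message from the original report this gives 3 instead of 15. Wide characters, combining marks and cursor movement are not handled.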

I have some libraries in the works to make this stuff easier in Tcl. Here is an example session:

   % package require punk::ansi
   0.1.1
   % namespace path punk::ansi
   % set tstmsg [a+ red bold]123a\u03005[a]
   123à5
   % string length $tstmsg
   17
   % ansistring length $tstmsg
   6
   % ansistring COUNT $tstmsg
   5
   %

Note - as GWL hinted - what counts as length is application dependent.

The a with diacritic is 2 characters in terms of Tcl string length - but it's combined and sometimes useful in some contexts to treat as one.
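
A small sketch of that difference, under a deliberately narrow assumption: only the Combining Diacritical Marks block (U+0300 to U+036F) is treated as combining, which is far from full grapheme cluster handling. The proc name base_length is made up for this example.

```tcl
# "a" followed by U+0300 COMBINING GRAVE ACCENT renders as one glyph,
# but Tcl counts two characters.
set s "a\u0300"
puts [string length $s]    ;# 2

# Drop marks from the Combining Diacritical Marks block before counting.
proc base_length {text} {
    regsub -all {[\u0300-\u036f]} $text {} stripped
    return [string length $stripped]
}
puts [base_length $s]               ;# 1
puts [base_length "123a\u03005"]    ;# 5
```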

This gets even more complex with Unicode grapheme clusters. punk::ansi etc don't fully support grapheme clusters yet - I think some additions to Tcl's regex handler to understand unicode character classes and properties would help there.

I have the COUNT submethod of ansistring to count visible glyphs + standard control string such as \n

If you are ok to play with alpha-level software, you can grab the modules from: https://www.gitea1.intx.com.au/jn/punkshell

There are still a lot of cross dependencies between modules. To use punk::ansi you'll currently need also punk::char, textutil::wcswidth, punk::encmime

The full system is an ANSI & Unicode aware shell that is coming along reasonably well. The punk::ansi module can read old 90s-style ANSI art and mostly render it quite nicely to a static text block that can then be placed side by side with others (i.e. ANSI movements are handled and translated to a static 80-wide block).