The CORE problem is not just string length; the CORE problem is how strings are represented in Tcl.
Let's start simple: this is the same kind of problem as the multibyte-character problem raised about 30 years ago.
Tcl uses UTF-8 as the script encoding and UTF-16 internally as the string representation because
The string length problem does not stand alone; there is also the index-operation problem, like:
that doesn't work.
The CORE problem is that the terminal control characters are embedded in the string itself instead of being stored as a character attribute.
The current string length in Tcl does something in between string bytelength and a visual length: a multibyte character is counted as a single character, but every invisible console control character is also counted as a character.
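To make the "in between" point concrete, here is a minimal sketch that compares the three counts for one string. It assumes the only invisible characters are SGR colour sequences (ESC [ ... m), and it obtains the byte count with encoding convertto rather than string bytelength; the variable names are only for illustration.

```tcl
# Compare three notions of "length" for one coloured string.
# ESC [ 1;31 m switches to bold red, ESC [ 0 m resets the attributes.
set esc "\u001b"
set msg "${esc}\[1;31m\u00e4${esc}\[0m"   ;# a red "ä"

# Characters as Tcl counts them (escape sequences included).
set chars [string length $msg]

# Bytes after encoding to UTF-8; "ä" takes two bytes there.
set bytes [string length [encoding convertto utf-8 $msg]]

# Characters left once the SGR sequences are stripped.
set visible [string length [regsub -all {\u001b\[[0-9;]*m} $msg {}]]

puts "chars=$chars bytes=$bytes visible=$visible"
# prints: chars=12 bytes=13 visible=1
```

The character count sits between the other two: the two-byte "ä" is one character, but each of the eleven escape-sequence characters is counted as well.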
The solution is the same as in UTF-16 or Tk: every char has to be represented by a data structure that holds
I call this utf32
puts then operates as:
I understand that this is not a job for Tcl alone, because all programming languages are affected. That is why I think a utf32 consortium would be a good idea, so that programmers of all languages can start implementing this together.
JMN I admire the inventiveness of this approach - but I'm not sure this heads in the right direction.
It may be useful to distinguish between two things. The first is what I call 'rendered' ANSI, which has no movements or terminal status control sequences, just the character attributes (ANSI SGR sequences). The second is raw ANSI input streams, which can jump all over the place in the terminal and include things like cursor save and cursor restore, which remember the attributes and row,column from one point and then jump back there.
The addition of a new way to represent characters with attributes won't solve the need to process complex ANSI sequences and manipulate those strings.
I don't consider the code sequences as being per character as such, at least not until they are in some way 'rendered'. Look at an old ANSI graphic image and see all the moving around it does for an example. Modern terminal applications also jump around and use cursor_save/restore etc.
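As a rough illustration of that distinction, the sketch below labels each CSI sequence in a string as "sgr" (attributes only) or "other" (movement, save/restore, erase and so on). The proc name classifyCsi is hypothetical, and the pattern only covers sequences of the form ESC [ parameters final-byte; other escape types are ignored.

```tcl
# Hypothetical helper: label every CSI sequence found in $text.
# "sgr"   = final byte m, i.e. colour/attribute changes only
# "other" = anything else (cursor movement, save/restore, erase, ...)
proc classifyCsi {text} {
    set result {}
    # ESC [ , parameter bytes 0-?, intermediate bytes space-/, final byte @-~
    foreach seq [regexp -all -inline {\u001b\[[0-?]*[ -/]*[@-~]} $text] {
        if {[string index $seq end] eq "m"} {
            lappend result sgr
        } else {
            lappend result other
        }
    }
    return $result
}

set esc "\u001b"
# red text, cursor up two lines, save cursor position, reset attributes
puts [classifyCsi "${esc}\[31mhi${esc}\[2A${esc}\[s${esc}\[0m"]
# prints: sgr other other sgr
```

A string that yields only "sgr" entries is 'rendered' in the sense used above and can be measured by stripping codes; anything with "other" entries needs an actual terminal model first.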
Hi,
I use a library to format and print debugging messages, and it also uses the Linux terminal color codes.
The CORE problem is that string length is used to format the message and finally add some additional information etc., BUT string length counts the NON-visible color codes as well as the visible chars, which results in a MIS-formatted string.
example: (color is not visible in this thread !! - but all is "green")
    -> outTXT<multi-line>
    -> | |DEBUG[1]: aa -> LIST | <- count color codes
    -> | |DEBUG[1]: -> | a1 : 111 | <- count color codes
    -> | |DEBUG[1]: -> | a2 : 222 | <- count color codes
    -> | |DEBUG[1]: -> | a3 : 333 |
a color-code is something like
    ## ------------------------------------------------------------------------
    ## DEBUG helper
    ##   Black         0;30     Dark Gray     1;30
    ##   Red           0;31     Light Red     1;31
    ##   Green         0;32     Light Green   1;32
    ##   Brown/Orange  0;33     Yellow        1;33
    ##   Blue          0;34     Light Blue    1;34
    ##   Purple        0;35     Light Purple  1;35
    ##   Cyan          0;36     Light Cyan    1;36
    ##   Light Gray    0;37     White         1;37
    ##
    ## 8bit (256) colors: https://stackoverflow.com/questions/4842424/list-of-ansi-color-escape-sequences
    ## \033 is the ESC character that starts every sequence
    variable CL_COLOR
    set CL_COLOR(red)       "\033\[1;31m"
    set CL_COLOR(green)     "\033\[1;32m"
    set CL_COLOR(yellow)    "\033\[1;33m"
    set CL_COLOR(blue)      "\033\[1;34m"
    set CL_COLOR(purple)    "\033\[38;5;206m"
    set CL_COLOR(cyan)      "\033\[1;36m"
    set CL_COLOR(lightcyan) "\033\[38;5;51m"
    set CL_COLOR(white)     "\033\[1;37m"
    set CL_COLOR(grey)      "\033\[38;5;254m"
    set CL_COLOR(orange)    "\033\[38;5;202m"
    set CL_COLOR(no)        ""
    variable CL_RESET       "\033\[0;m"
code:
    set tstmsg "$lib_debug::CL_COLOR(red)123$lib_debug::CL_RESET"
    puts "$tstmsg"
    puts [string length $tstmsg]
result:
    123   (the color of "123" is red)
    15    (this should be 3)
There is something MISSING in Tcl, and that is a string length ... that counts only the !! VISIBLE !! chars.
Regards, ao
GWL The definition of a !! VISIBLE !! char depends on the device that displays the string. This is not something Tcl is missing; it is very application specific.
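For the narrow case in the question (a terminal, SGR color codes only, one column per remaining character), a script-level workaround is short. The sketch below is an assumption-laden example, not a Tcl built-in: visibleLength and padRight are hypothetical helper names, and wide or combining characters are not handled.

```tcl
# Hypothetical helpers, not part of Tcl. Only SGR sequences (ESC [ ... m)
# are treated as invisible; every other character counts as one column.
proc visibleLength {text} {
    string length [regsub -all {\u001b\[[0-9;]*m} $text {}]
}

# Pad with spaces on the right until the visible width reaches $width.
proc padRight {text width} {
    set missing [expr {$width - [visibleLength $text]}]
    if {$missing > 0} {
        append text [string repeat " " $missing]
    }
    return $text
}

set red    "\u001b\[1;31m"
set reset  "\u001b\[0;m"
set tstmsg "${red}123${reset}"

puts [string length $tstmsg]     ;# 15
puts [visibleLength $tstmsg]     ;# 3
puts "|[padRight $tstmsg 8]|"    ;# red 123 followed by five spaces
```

Using padRight instead of format %-8s keeps the right-hand border of a debug table aligned, because the padding is computed from the visible width rather than from string length.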
JMN Processing strings containing ANSI SGR sequences can be pretty tricky. A good first step is to use a regex to split into a list of plaintext sections and code sections - possibly each individual ANSI code or possibly all lumped together depending on what you want to do.
If you want to manipulate things and join blocks of text together - you need to do ANSI resets and replays, which is possible, but escalates the complexity quickly.
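A very reduced sketch of "reset and replay" for multi-line text follows. It assumes the only codes present are SGR sequences, that a reset is written as ESC[m, ESC[0m or ESC[0;m, and that everything seen since the last reset can simply be replayed in order. Combined parameters such as 0;31 are not parsed, which is exactly where the real complexity starts. The proc name selfContainedLines is hypothetical.

```tcl
# Hypothetical helper: make each line of a coloured block self-contained,
# so lines can be placed next to other text without colours bleeding.
proc selfContainedLines {text} {
    set reset "\u001b\[0m"
    set active ""            ;# SGR codes seen since the last reset
    set out {}
    foreach line [split $text \n] {
        set replay $active   ;# state carried over from the previous lines
        foreach code [regexp -all -inline {\u001b\[[0-9;]*m} $line] {
            if {[regexp {^\u001b\[0?;?m$} $code]} {
                set active ""          ;# plain reset clears the state
            } else {
                append active $code    ;# remember the code for replay
            }
        }
        # replay the inherited state, then close the line with a reset
        lappend out "${replay}${line}${reset}"
    }
    return $out
}

set esc "\u001b"
foreach line [selfContainedLines "${esc}\[31mred one\nstill red${esc}\[0m plain"] {
    puts $line
}
```

The second line comes out prefixed with the red code that was still active at the end of the first line, and both lines end with a reset.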
Perl has Text::ANSI::Util and others, which are good for some ideas.
For my ANSI handling in Tcl I do a Perl-style split to get a list that is always either zero-length or odd-length, always starting and ending with plaintext. You can then do a foreach {plaintext code} $ansisplits {dostuff...} which makes things manageable.
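The split described above could be sketched as follows. This is not the punk::ansi implementation; splitAnsi is a hypothetical name, and only SGR sequences (ESC [ ... m) are recognised as codes.

```tcl
# Hypothetical helper: split into {plaintext code plaintext ... plaintext}.
# The result is empty for an empty string and odd-length otherwise.
proc splitAnsi {text} {
    if {$text eq ""} { return {} }
    set parts {}
    set start 0
    foreach range [regexp -all -inline -indices {\u001b\[[0-9;]*m} $text] {
        lassign $range first last
        # plaintext before the code (may be the empty string)
        lappend parts [string range $text $start [expr {$first - 1}]]
        # the code itself
        lappend parts [string range $text $first $last]
        set start [expr {$last + 1}]
    }
    # trailing plaintext (may be the empty string)
    lappend parts [string range $text $start end]
    return $parts
}

set ansisplits [splitAnsi "\u001b\[1;31m123\u001b\[0;m"]
set visible 0
foreach {plaintext code} $ansisplits {
    incr visible [string length $plaintext]
}
puts "[llength $ansisplits] elements, $visible visible chars"
# prints: 5 elements, 3 visible chars
```

Because the list always ends with plaintext, the last foreach iteration sees an empty code, so the loop body needs no special case for the end of the string.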
I have some libraries in the works to make this stuff easier in Tcl. Here is an example session:
    % package require punk::ansi
    0.1.1
    % namespace path punk::ansi
    % set tstmsg [a+ red bold]123a\u03005[a]
    123à5
    % string length $tstmsg
    17
    % ansistring length $tstmsg
    6
    % ansistring COUNT $tstmsg
    5
    %
Note - as GWL hinted - what counts as length is application dependent.
The a with diacritic is 2 characters in terms of Tcl string length (the base letter plus the combining mark \u0300), but the two are rendered combined, and in some contexts it is useful to treat them as one.
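A small plain-Tcl sketch of that difference, without punk::ansi: it counts characters from the Combining Diacritical Marks block (U+0300 to U+036F) and subtracts them. That block is an assumption chosen for this example; other combining ranges and full grapheme clusters are not covered.

```tcl
# "a" followed by COMBINING GRAVE ACCENT renders as one glyph.
set text "123a\u03005"

set all   [string length $text]                  ;# every code point
set marks [regexp -all {[\u0300-\u036f]} $text]   ;# combining marks only

puts "chars=$all combining=$marks glyphs=[expr {$all - $marks}]"
# prints: chars=6 combining=1 glyphs=5
```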
This gets even more complex with Unicode grapheme clusters. punk::ansi etc don't fully support grapheme clusters yet - I think some additions to Tcl's regex handler to understand unicode character classes and properties would help there.
I have the COUNT submethod of ansistring to count visible glyphs plus standard control characters such as \n.
If you are OK with playing with alpha-level software, you can grab the modules from: https://www.gitea1.intx.com.au/jn/punkshell
There are still a lot of cross dependencies between modules. To use punk::ansi you'll currently also need punk::char, textutil::wcswidth and punk::encmime.
The full system is an ANSI- and Unicode-aware shell that is coming along reasonably well. The punk::ansi module can read old '90s-style ANSI art and mostly render it quite nicely to a static textblock that can then be placed side by side with others (i.e. ANSI movements are handled and translated to a static 80-wide block).