encoding

'''`[https://www.tcl-lang.org/man/tcl/TclCmd/encoding.htm%|%encoding]`''', a [tcl commands%|%built-in] Tcl [command], manages the conversion of strings to and from particular encodings, and controls how the definitions of particular encodings are located and what Tcl believes the [operating system%|%OS] uses for encoding [string%|%strings].



** See Also **

   [http://mimersbrunn.sourceforge.net/tcl_charset_iana%|%Mapping of Tcl-encoding names to IANA-list]:   

   [https://groups.google.com/forum/#!msg/comp.lang.tcl/L2qI8xQ0KcA/-d7YU1u_gDAJ%|%encoding to and from UTF-8]:   An example by [KBK].
   [https://groups.google.com/forum/#!topic/comp.lang.tcl/998_dCHL5Gk%|%Chinese characters in file name]:   [HaO] 2012-01-30: After a question about Chinese characters in a DOS box, in the [Microsoft Windows%|%windows] wish console and in [tkcon], [Donal Fellows] explains the console use of encodings.

   [Bug in regsub with "special chars"?]:   [AMG]: Having the wrong source encoding can interact badly with [regular expressions].

   [encodiff]:   

   [Encoding table generator]:   Also has code to work around a bug/feature: if an encoding file gives `0000` as target for a character, `[encoding convertfrom]` seems to fall through to iso8859-1 and get the Unicode from there... 

   [register new encodings at runtime]:   
   [http://www.tcl.tk/cgi-bin/tct/tip/258.html%|%TIP 258]:   Refined the behaviour of Tcl's `encoding` routines.

   [ycl%|%ycl string encode/decode]:   Alternative routines that return an error rather than losing information.




** Synopsis **

    :   '''[encoding convertfrom]''' ?''encoding''? ''data''

    :   '''[encoding convertto]''' ?''encoding''? ''string''

    :   '''[encoding dirs]''' ?''directoryList''?

    :   '''[encoding names]'''

    :   '''[encoding system]''' ?''encoding''?




** Description **


[HaO] 2020-05-13: The encoding "unicode" may be 16 or 32 bit and little or big endian, depending on the compile options and platform endianness. See also [https://core.tcl-lang.org/tips/doc/trunk/tip/547.md%|%TIP 547].
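
The width and byte order of a given build can be checked empirically. The following is a minimal sketch, not part of the note above; it assumes an interpreter that still provides the "unicode" encoding (as Tcl 8.6 does), and the variable names are only illustrative.

======
# Encode a single character and look at the raw bytes.
set bytes [encoding convertto unicode A]
binary scan $bytes H* hex
puts "[string length $bytes] bytes: $hex"
# 2 bytes indicate a 16-bit build, 4 bytes a 32-bit build.
# A leading "41" indicates little endian, a trailing "41" big endian.
======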

*** Internal encodings ***

The following encodings are available in Tcl 8.6 without encoding files:

   *   identity (deprecated, default encoding for Tcl 8.0 to Tcl 8.5)
   *   iso8859-1 (default encoding for Tcl 8.6)
   *   utf-8
   *   unicode

They are hard-coded in the file ''generic/tclEncoding.c''.

Information source: Don Porter on the core list 2020-05-11 titled "Re: [TCLCORE] Default fallback encoding in TCL8.6.10 not identity but iso8859-1 ?".
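
One way to see which encodings survive without encoding files is to empty the search path for a moment. This is only a sketch under assumptions: it presumes that `encoding names` reflects a changed search path immediately, and the result will also contain any encoding the interpreter has already loaded (typically the system encoding), so it should be read as "roughly the built-ins". The variable names are illustrative.

======
# Temporarily empty the search path so that no .enc files can be found.
set savedDirs [encoding dirs]
encoding dirs {}
set withoutFiles [lsort [encoding names]]
# Restore the original search path before doing anything else.
encoding dirs $savedDirs
puts $withoutFiles
======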

*** Why is iso8859-1 the default ***

[HaO] 2020-05-13: Kevin Kenny gave the following explanation on the core list on 2020-05-11 titled "Re: [TCLCORE] Default fallback encoding in TCL8.6.10 not identity but iso8859-1 ?":

ISO8859-1 enjoys the uniquely privileged position that 'binary' is the same encoding.
The first 256 code points of Unicode are precisely the 256 byte values of ISO8859-1.
The conversion between Tcl's byte arrays and Tcl's strings is therefore also ISO8859-1 equivalent.

The implication is that ISO8859-1 is at least lossless.
In particular, every UTF-8 string has some interpretation as an ISO8859-1 string, and on any Unix-based system, that's enough to represent path names so that encoding tables can be read.

The 'identity' encoding was a rather muddled attempt to produce the same effect, which did not work, and resulted in malformed UTF-8 if any data outside the seven-bit ISO646 range were present.

It's pretty subtle, but ISO8859-1 is pretty much The Right Thing anywhere that we thought 'identity' might be correct.
(With the exception of some code internal to the test suite whose purpose is to test that the Core does not crash if handed a malformed CESU-8 string by a C extension.)
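
The lossless property described above can be checked directly. A minimal sketch (the variable names are illustrative): every byte value from 0 to 255 is decoded as iso8859-1, each resulting code point is compared with the byte value, and the text is encoded back again.

======
# Build a byte array holding every byte value 0..255.
set values {}
for {set i 0} {$i < 256} {incr i} {lappend values $i}
set bytes [binary format c* $values]

# Decoding as iso8859-1 should map byte N to code point N ...
set text [encoding convertfrom iso8859-1 $bytes]
set same 1
for {set i 0} {$i < 256} {incr i} {
    if {[scan [string index $text $i] %c] != $i} {set same 0}
}

# ... and encoding again should give back the original bytes.
set roundTrip [expr {[encoding convertto iso8859-1 $text] eq $bytes}]
puts "code points match: $same, round trip: $roundTrip"
# prints: code points match: 1, round trip: 1
======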

** Discussion **

In a recent discussion in the [tcl chatroom%|%chat room], the frustration of having text in an unknown encoding, 
and the wish for a utility that turns it into something nearly readable, led to these suggestions:

[RS]: Larry: Yes. One might wrap that functionality in a text widget, into which you paste the suspicious page, and have a listbox to choose all [[encoding names] offers... a double click on a listbox item converts the text contents. 
[RS]: Rolf: "encoding convertfrom foo" turns the questionable characters into UTF-8, which you can inspect directly in your text widget. Writing it to a text file, or stdout, involves an intrinsic "encoding convertto [[encoding system]", which often is iso8859-1 or cp1252. So if the input had Russian or Greek characters, these will not come through to the system encoding, but will be replaced by question marks.
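
Without any GUI, the same idea can be sketched as a loop over candidate encodings. This is an illustration only: the proc name `tryEncodings` and the default candidate list are assumptions, not something proposed in the chat. Encodings that are not installed, or that reject the input, are skipped.

======
# Decode the same bytes with several candidate encodings.
# Returns a flat list: encoding name, decoded text, ...
proc tryEncodings {bytes {candidates {cp1252 iso8859-1 iso8859-15 macRoman cp437 utf-8}}} {
    set result {}
    foreach enc $candidates {
        if {$enc ni [encoding names]} continue
        # Some interpreters raise an error on invalid input; skip those.
        if {[catch {encoding convertfrom $enc $bytes} text]} continue
        lappend result $enc $text
    }
    return $result
}

# Sample bytes containing cp1252-style "smart" punctuation.
foreach {enc text} [tryEncodings "\x93Hi\x94 \x96 it\x92s"] {
    puts [format "%-12s %s" $enc $text]
}
======

The output is meant to be judged by eye: the candidate whose line reads naturally is the likely source encoding.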

Alas, the situation is this: [LV] regularly receives emails, files, web pages, etc. without an encoding specified. While most of the text appears correct, the punctuation is ''skewed'': it shows up as ? or \x92, etc.,
instead of things like " or ' or -, and so forth.  

All he wants is to make the weird punctuation marks readable (he figures that in the creators' editors they look like various ''fancy'' punctuation characters).

Here's the code [LV] is currently trying...

======
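# translate those obnoxious windows characters to something better
# (testing the result of gets avoids printing an extra empty line at end of file)
while {[gets stdin buffer] >= 0} {
    # the list parser turns \xNN and \" in the braced map into the actual characters
    set buffer2 [string map { \xa9 (c) \xd7 (c) \x85 - \x91 ' \x92 ' \x93 \" \x94 \" } $buffer]
    puts $buffer2
}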
======
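
An alternative, assuming the input really is cp1252 (which the byte values \x91 to \x94 suggest, but which is not confirmed here), is to decode the bytes first and then replace the resulting Unicode punctuation. A minimal sketch; the proc name `asciiPunct` and the choice of replacements are illustrative.

======
# Decode cp1252 bytes, then map typographic punctuation to plain ASCII.
proc asciiPunct {bytes} {
    set text [encoding convertfrom cp1252 $bytes]
    string map [list \
        \u2018 ' \u2019 ' \
        \u201C \" \u201D \" \
        \u2013 - \u2014 - \
        \u2026 ...] $text
}

puts [asciiPunct "\x93hi\x94 \x96 it\x92s\x85"]
# prints: "hi" - it's...
======

Decoding first means the map works on characters rather than on raw byte values, so it does not depend on the system encoding.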

** Examples **

The Euro sign is represented in Windows cp1252 as the byte \x80. 
If you receive such data, you can obtain the real Euro sign with

======
encoding convertfrom cp1252 \x80
======

Back you go with

======
encoding convertto cp1252 \u20AC
======
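
To make the byte level visible, the result of `encoding convertto` can be dumped as hex. A small sketch comparing three encodings; the choice of encodings is only for illustration and assumes the usual encoding files are installed.

======
# Show the bytes that represent the Euro sign in different encodings.
foreach enc {cp1252 iso8859-15 utf-8} {
    binary scan [encoding convertto $enc \u20AC] H* hex
    puts "$enc: $hex"
}
# prints:
# cp1252: 80
# iso8859-15: a4
# utf-8: e282ac
======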

----

To find out which encoding is used for communication with the OS (including file I/O):

======
encoding system
======

----

To see available encodings:

======
encoding names
======



** Converting non-[ASCII] Characters **

Can I use the `encoding` command (or some appropriate `[fconfigure] -encoding`) to take a Tcl source file (*.tcl) 
in an arbitrary encoding and output a well-formed Tcl source file which is pure [ASCII],
i.e. with all chars > 127 converted to \uhhhh Unicode escape sequences?

[RS]: Sure, but some assembly required - like this:

======
proc u2x s {
    set res {} 
    foreach i [split $s {}] {
        scan $i %c int
        if {$int < 128} {
            append res $i
        } else {
            append res \\u[format %04.4X $int]
        }
    }
    set res
}
set fp [open $filename]
fconfigure $fp -encoding $originalEncoding
set data [u2x [read $fp [file size $filename]]]
close $fp
set fp2 [open $newFilename w]
puts -nonewline $fp2 $data
close $fp2 
======

The "u2x" functionality is easily done, but it is also built in somewhere in Tk: 
on Unix, codes for which no font has a character are substituted in "\uxxxx" style 
(Windows mostly shows an empty rectangle). 
See [Unicode and UTF-8].

----
[RS] revisits this page on 2004-03-19 and now would rather write it this way:

======
proc u2x s {
    set res {} 
    foreach c [split $s {}] {
        scan $c %c int
        append res [expr {$int<128? $c :"\\u[format %04.4X $int]"}]
    }
    set res
}
======
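
A short usage sketch for `u2x`, with the proc repeated so that the block runs on its own. The reverse direction shown here uses `subst`, which is an addition for illustration and not part of the original answer; note that `subst` also expands any other backslash sequences in the text.

======
proc u2x s {
    set res {}
    foreach c [split $s {}] {
        scan $c %c int
        append res [expr {$int<128? $c :"\\u[format %04.4X $int]"}]
    }
    set res
}

# Escape the two non-ASCII letters of a German word.
set ascii [u2x "Gr\u00fc\u00dfe"]
puts $ascii   ;# prints Gr\u00FC\u00DFe

# Going back: let subst expand only the backslash sequences.
set back [subst -nocommands -novariables $ascii]
puts [expr {$back eq "Gr\u00fc\u00dfe"}]   ;# prints 1
======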



** Encodability **
Not every [unicode] character can be represented unambiguously in every encoding. 
In such cases, a fallback character (e.g. ?) is inserted instead, but on re-conversion the original character is lost. 
Here is a little tester for such cases, demonstrated with some German and Greek 
(out of laziness, produced by [Greeklish]) letters. 
Notice the quoted way of using a command result as an argument default, 
which assumes that the name of the system encoding is a single word ([RS]):
======
 proc encodable "s {enc [encoding system]}" {
    string eq $s [encoding convertfrom $enc [encoding convertto $enc $s]]
 }
======

======none
% encodable äöü ; # That's "\u00e4\u00f6\u00fc" in case this example needs to be repaired
1
% encodable [greeklish Aqhnai]
0
% encodable [greeklish Aqhnai] iso8859-7
1
% encodable äöü iso8859-7
0
======
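
The same round-trip test can be turned into a conversion that fails loudly instead of silently inserting a fallback character. This is a minimal sketch under that assumption; the proc name `strictConvertTo` is hypothetical and is not an existing Tcl command.

======
# Convert to the given encoding, but raise an error if information would be lost.
proc strictConvertTo {enc s} {
    set bytes [encoding convertto $enc $s]
    if {[encoding convertfrom $enc $bytes] ne $s} {
        error "string cannot be represented losslessly in encoding \"$enc\""
    }
    return $bytes
}

# The Euro sign is not part of iso8859-1, so this is reported as an error.
puts [catch {strictConvertTo iso8859-1 \u20AC} msg]   ;# prints 1

# A-umlaut is representable and becomes the single byte e4.
binary scan [strictConvertTo iso8859-1 \u00e4] H* hex
puts $hex   ;# prints e4
======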

See also:

   [https://core.tcl-lang.org/tcl/tktview/535705ffffffffffffff%|%RFE #535705], Tcl interface to stop on encoding errors:   

   [remove diacritic]:   




** Misc **
----
***Font problems***


Note that the lack of a mapping from some other encoding into Unicode can lead to effects 
that appear to be [font problems] though they can be fixed by the creation of a new encoding, 
as that page illustrates.

***When do I need to include the .enc files in a distribution?***

You need them if you want to use any encoding that is not built in or provided by another source (for example, Tk).

But what if you're not, but you are (for example) writing a Tk application 
that's going to be displaying UTF-8, or using the message catalog to localize for other languages, 
and want it to display correctly on systems in different countries?  Are they needed then?

[HaO] 2020-05-13: To my knowledge: no, all this may be handled internally.
Message catalogue files are sourced as utf-8, which is an internal encoding. The character drawing mechanism does not need encodings (I am not sure about that, and this may be platform dependent).
If you want to use the console, you need the system encoding to output non-ASCII characters (for example: "puts ÄÖÜ" on German Windows requires cp1252, which is not an internal encoding). Maybe the Windows system console will evolve to Unicode (like many Linux consoles nowadays), and then this will not be required any more.
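
Before shipping, it may help to check which of the encodings an application relies on are actually visible to the interpreter. A minimal sketch; the proc name `missingEncodings` and the list of required encodings are assumptions for illustration.

======
# Return those names from the required list that Tcl does not know about.
proc missingEncodings {required} {
    set missing {}
    foreach enc $required {
        if {$enc ni [encoding names]} {lappend missing $enc}
    }
    return $missing
}

puts "system encoding: [encoding system]"
puts "search path:     [encoding dirs]"
puts "missing:         [missingEncodings [list [encoding system] cp1252 utf-8]]"
======

Running this inside the packaged application, rather than in a full Tcl installation, shows whether the .enc files it needs were included.
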

***IANA versus TCL encoding names***
[MG]: I've never really looked at encodings before, and just started doing so for the first time while trying to hack something into a client app. The problem I've just run into is that the server is using the encoding name "iso-8859-1" (which seems to be the correct IANA name), while Tcl calls it "iso8859-1" (without the first hyphen). Is there any reason for this? Or any simple (but correct) way around it? (My immediate thought was to try matching against the Tcl encoding names with the first hyphen edited out, if the encoding can't be found - definitely simple, but also definitely not correct...) Any help would be greatly appreciated. Thanks.

'''A.'''

This is indeed a problem... [http://naviserver.cvs.sourceforge.net/naviserver/naviserver/nsd/encoding.c?view=markup]
[LV] 2007-10-18: Have you submitted a bug report at http://tcl.sf.net/ ?

[ZB] 2008-03-04: Maybe don't be too fast with fixing that "bug"; pay attention, that "iso-8859-1" is used f.e. in web pages headers, while operating systems are using "iso8859-1" (f.e. Linux and xBSD systems). Such description is used by converting utilities (like iconv). I don't know the reason - but it seems to be kind of "convention". So there seems not to be any need for change.

[HaO] 2020-05-13: You may use the (not published) routine of the http package to convert IANA names to TCL names:

======
% package require http
2.9.1
% ::http::CharsetToEncoding iso-8859-1
iso8859-1
======
Description in the file: Tries to map a given IANA charset to a tcl encoding.  If no encoding can be found, returns binary.
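
The fallback that [MG] describes can be sketched in a few lines. This is a heuristic only, as the question itself admits: the proc name `ianaToTcl` and the three rewrite rules are assumptions chosen for illustration and cover just a few common IANA spellings.

======
# Try to find a Tcl encoding name for an IANA charset name.
# Returns an empty string if no candidate is known to Tcl.
proc ianaToTcl {charset} {
    set name [string tolower $charset]
    set candidates [list $name]
    # iso-8859-1 -> iso8859-1
    if {[regexp {^iso-(\d+-\d+)$} $name -> rest]} {lappend candidates iso$rest}
    # windows-1252 -> cp1252
    if {[regexp {^windows-(\d+)$} $name -> page]} {lappend candidates cp$page}
    # us-ascii -> ascii
    if {$name eq "us-ascii"} {lappend candidates ascii}
    foreach candidate $candidates {
        if {$candidate in [encoding names]} {return $candidate}
    }
    return {}
}

puts [ianaToTcl ISO-8859-1]   ;# prints iso8859-1
puts [ianaToTcl UTF-8]        ;# prints utf-8
======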

<<categories>> tcl commands | Binary Data | Human Language | Arts and Crafts of Tcl-Tk Programming