Utf-8 difference between Windows and Mac?

OK, I have a small test file that contains utf-8 codes. Here it is (the language is Wolof):

Fˆndeen d‘kk la bu ay wolof aki seereer a fa nekk. DigantŽem ak
Cees jur—om-benni kilomeetar la. MbŽyum gerte ‘pp ci diiwaan bi mu

That is what it looks like in a vanilla editor, but in hex it is:

$ xxd test.txt
0000000: 46cb 866e 6465 656e 2064 e280 986b 6b20  F..ndeen d...kk 
0000010: 6c61 2062 7520 6179 2077 6f6c 6f66 2061  la bu ay wolof a
0000020: 6b69 2073 6565 7265 6572 2061 2066 6120  ki seereer a fa 
0000030: 6e65 6b6b 2e20 4469 6761 6e74 c5bd 656d  nekk. Digant..em
0000040: 2061 6b0d 0a43 6565 7320 6a75 72e2 8094   ak..Cees jur...
0000050: 6f6d 2d62 656e 6e69 206b 696c 6f6d 6565  om-benni kilomee
0000060: 7461 7220 6c61 2e20 4d62 c5bd 7975 6d20  tar la. Mb..yum 
0000070: 6765 7274 6520 e280 9870 7020 6369 2064  gerte ...pp ci d
0000080: 6969 7761 616e 2062 6920 6d75 0d0a       iiwaan bi mu..

The second character [cb86] is a non-standard coding for a-grave [à] which is found quite consistently in web documents, although in 'real' utf-8, a-grave would be c3a0. Real utf-8 works beautifully on Macs and under Windows.

I handle the fake utf-8 by using a character map which included the pair { ˆ à } because that little caret is what cb86 generates, and everything works fine ON A MAC for displaying text (in a text widget) like this:

Fàndeen dëkk la bu ay wolof aki seereer a fa nekk. Digantéem ak
Cees juróom-benni kilomeetar la. Mbéyum gerte ëpp ci diiwaan bi mu

On a PC, using the same (shared) file, the first three characters read in are 46 cb 20 (using no fconfigure). I have run through ALL the possible encodings and can never get the same map to work. [There are twenty that will allow 46 cb 86.]

Sorry this is so long, but if anyone has a clue, I would love to hear it.

Tel Monks

DGP I assume your 'character map' is an application of the string map command. Take care that it is mapping the character you believe it is. In particular, check whether you are mapping \u005e or \u02c6, both of which are variants of ^ but which Tcl string processing treats as distinct characters because they are distinct Unicode code points.
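
One way to check which caret a string really contains is to print the code point of each character. This is only a sketch: the proc name codepoints is made up for the example, and the sample strings are written with \u escapes so the script itself stays plain ASCII.

```tcl
# Return the Unicode code point of every character in a string.
proc codepoints {s} {
    set out {}
    foreach ch [split $s ""] {
        scan $ch %c cp
        lappend out [format U+%04X $cp]
    }
    return $out
}

# ASCII circumflex followed by MODIFIER LETTER CIRCUMFLEX ACCENT
puts [codepoints "^\u02c6"]   ;# U+005E U+02C6

# Only a map keyed on the code point actually present has any effect.
set sample "F\u02c6ndeen"
puts [string equal [string map [list ^ \u00e0] $sample] $sample]                  ;# 1 (unchanged)
puts [string equal [string map [list \u02c6 \u00e0] $sample] "F\u00e0ndeen"]      ;# 1 (mapped)
```

Running codepoints on the value captured from the file shows directly which of the two characters the map has to name.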


telgo - 2010-07-24 13:04:10

Thanks for your response.

Here is the whole of the code area:

        set in [open $arg1 r]
        fconfigure $in -encoding utf-8
        set body [read -nonewline $in]
        if {$platform eq "windows" && $arg1 eq "test.txt" } {
                regexp {^F(.*)ndeen\y} $body matched thing
                puts $matched
                set thinghex [convert.string.to.hex $thing]
                puts "<$thing>=$thinghex"
                set body [ string map { $thing à Ž é ‘ ë – ñ „ Ñ — ó ƒ É è Ë Ë? à} $body]
                puts $body
        } else {
                set body [ string map { ˆ à Ž é ‘ ë – ñ „ Ñ — ó ƒ É è Ë Ë? à} $body]
        }

I did this in an attempt to ensure that WHATEVER was in that position would be map translated. I get a <c6> but the a-grave never appears - it stays as caret. The console and the display show the same thing (as I would expect)

DGP Wow. Where to begin?

1) It is not helpful if you provide exact code, but then provide only paraphrased descriptions of what it outputs. If you don't reproduce the verbatim output, you're wasting our time.

2) You appear to have forgotten what brace-quoting does to variable substitution. (It prevents it!) So, your [string map { $thing ...} ...] does not do what you think it will.

3) You have non-ASCII literal characters in this script. It matters a great deal how this script finds its way into the interp for evaluation. If the encoding in control of that operation is not one that can support all the characters in your script, you will get surprising results.

4) The real root of the whole problem is attempting to use the utf-8 encoding to read files that are not stored in that encoding. What is the real encoding and/or format of these files? Instead of complaining how hard it is to pound nails with a screwdriver, why not get the hammer you really need?
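
Points 2 and 3 can be illustrated with a small sketch. It assumes the captured character is U+02C6 and uses only a few pairs from the map; the variable names wrong and right are invented for the example. Building the map with [list] lets $thing be substituted, and writing the non-ASCII characters as \u escapes means the encoding used to load the script no longer matters.

```tcl
set thing \u02c6              ;# stand-in for the character captured by regexp
set body  "F\u02c6ndeen"

# Braces suppress substitution: the map key here is the literal text "$thing".
set wrong [string map {$thing X} $body]

# [list] builds the map after substitution has taken place.
set right [string map [list $thing \u00e0 \u017d \u00e9 \u2018 \u00eb] $body]

puts [string equal $wrong $body]             ;# 1: nothing was replaced
puts [string equal $right "F\u00e0ndeen"]    ;# 1: U+02C6 became a-grave
```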


telgo - 2010-07-24 13:26:58

I am happy we can do this dialogue, but should we try a more direct method, without boring other people?

1) I work and do email on a Mac - and the output was on a PC - so I was lazy about copying the entire output. Here it is:

Fàndeen
<à>=c6
Fàndeen dëkk la bu ay wolof aki seereer a fa nekk. Digantéem ak
Cees juróom-benni kilomeetar la. Mbéyum gerte ëpp ci diiwaan bi mu

2) I see your point. Good one. Let me reflect on that.

3) Remember - this all works perfectly on a Mac, so the non-ASCII characters do not seem to bother that.

4) And exactly how would I find out what the "real" encoding of these files is? I believe these files were originally web pages, but know nothing more. I do know that there is a consistent mapping going on, and I can produce the fake-unicode inputs that are bothering me, but thought using just one as an example would do.

You seem to have lost patience with me. That's too bad.
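
On question 4, when one garbled word and its intended spelling are both known, a brute-force search over the encodings Tcl knows can suggest what happened to the file. The following is a rough sketch under that assumption, not a general detector: the variable names seen, wanted and hits are invented for the example, and several encoding pairs may match a single short word.

```tcl
set seen   "F\u02c6ndeen"   ;# what reading the file as utf-8 produces
set wanted "F\u00e0ndeen"   ;# what the text is supposed to say

set hits {}
foreach wrong [encoding names] {
    # Undo the suspected wrong decoding; skip encodings that cannot do it.
    if {[catch {encoding convertto $wrong $seen} bytes]} continue
    foreach real [encoding names] {
        # Re-read the recovered bytes with each candidate original encoding.
        if {[catch {encoding convertfrom $real $bytes} text]} continue
        if {$text eq $wanted} { lappend hits [list $wrong $real] }
    }
}

# Each line printed is a pair: <encoding wrongly applied> <likely original>.
foreach pair [lsort $hits] { puts $pair }
```

Testing the surviving pairs against more words from the file should narrow the list further.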


telgo - 2010-07-24 13:28:08

OK, I now see that when the output is put on a Mac and pasted into email, it looks OK. What on earth is going on with the Windows machine (Vista)?


telgo - 2010-07-25 03:10:46

For the sake of possibly helping another with this problem, I will just say that someone on a more helpful forum pointed out that the solution is:

        set fixed [encoding convertfrom macRoman [encoding convertto cp1252 $body]]

These files may have been created on an old Mac and ported to a PC, then later back to a Mac.
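
The one-liner can be taken apart to see why it works. The sketch below uses four of the garbled characters from the test file, written as \u escapes; the variable names are only illustrative. The result is consistent with text that started out as MacRoman bytes, was at some point decoded as cp1252, and was then saved as utf-8.

```tcl
# The garbled characters as they arrive after reading with -encoding utf-8:
# U+02C6, U+017D, U+2018, U+2014
set body "\u02c6\u017d\u2018\u2014"

# Step 1: turn each character back into the single byte cp1252 assigns to it.
set bytes [encoding convertto cp1252 $body]
binary scan $bytes H* hex
puts $hex                                               ;# 888e9197

# Step 2: decode those same bytes as MacRoman.
set fixed [encoding convertfrom macRoman $bytes]
puts [string equal $fixed "\u00e0\u00e9\u00eb\u00f3"]   ;# 1: a-grave, e-acute, e-diaeresis, o-acute
```

This also explains the hex dump: cb 86 is the ordinary utf-8 encoding of U+02C6, which is the character cp1252 places at byte 88, and byte 88 in MacRoman is a-grave.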