Pages containing invalid UTF-8 sequences

KBK 2007-06-12

Owing to various problems over the years, the Wiki is known to have a number of pages containing invalid UTF-8 sequences. People who are interested in improving the Wiki are invited to attempt to repair the text of these pages.

Note that all invalid UTF-8 sequences have been replaced with the character � (\ufffd); searching for that character will locate the damage within a page. ([L1] is a link to pages with problems - and this page too!)
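
For instance, a minimal Tcl sketch of such a search (assuming the page text is in a variable named page - not part of the Wiki's actual code):

  # report the index of each \ufffd in the page text
  set start 0
  while {[set where [string first \ufffd $page $start]] >= 0} {
      puts "replacement character at index $where"
      set start [expr {$where + 1}]
  }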


KBK 2007-06-29 The problem with the remove diacritic page is that the testing for "valid" UTF-8 is intentionally overzealous. When I reviewed the damaged pages, a great many of them contained the dreaded "double encoding" - ISO8859-1 expanded to UTF-8, with the result interpreted as ISO8859-1 and expanded to UTF-8 a second time. The result of this "double encoding" is that a character such as é (\u00e9) would be expanded into the two-byte UTF-8 sequence C3 A9, and that sequence would be interpreted as the spurious combination \u00c3\u00a9. The page in question was, as far as I can tell, the only case of either of the characters \u00c2 (upper-case Latin letter A with circumflex) and \u00c3 (upper-case Latin letter A with tilde) appearing on the Wiki other than as the result of this process; these two characters are extremely uncommon even in natural languages that use them. (French, for instance, often omits accents from capital letters other than É.) So it seemed wise to reject these two characters, rather than have, say, broken browsers silently convert ü to the presumptively valid pair of characters \u00c3\u00bc (upper-case Latin letter A with tilde followed by the vulgar fraction ¼).
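
A minimal Tcl 8.x sketch of how this double encoding arises (an illustration only, not the Wiki's actual code):

  set s \u00e9                                 ;# é, a single character
  set once [encoding convertto utf-8 $s]      ;# bytes C3 A9 - correct UTF-8
  # a confused client treats those bytes as ISO8859-1 text and encodes them again:
  set twice [encoding convertto utf-8 $once]  ;# bytes C3 83 C2 A9 - double encoded
  # decoding once now yields the spurious pair \u00c3\u00a9, which displays as "Ã©"
  puts [encoding convertfrom utf-8 $twice]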

Given the large number of browsers out there that appear to get it wrong, I really don't know what else to do. I'm open to suggestions.

LV Perhaps, in cases where a character could possibly be correct, the user should be asked to confirm with an "are you certain?" type prompt.

Lars H: Try adding a hidden field (like the O field used for page versions to detect edit conflicts) to the edit page form, containing some non-ASCII characters (e.g. those already occurring in the page). If the browser gets the text to edit wrong, there's a fair chance it gets all form fields wrong in the same way. Since the server knows what went out in this extra field, it can verify that it gets the same thing back.
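
A rough Tcl sketch of that idea (the field name _canary_ and the surrounding form handling are hypothetical, not the Wiki's actual code):

  # when generating the edit form, send out known non-ASCII characters:
  set canary "\u00e9\u00fc"
  append form "<input type='hidden' name='_canary_' value='$canary'>"

  # when processing the submission (assuming the decoded fields are in the dict query):
  if {[dict get $query _canary_] ne $canary} {
      # the browser mangled the encoding; reject the edit or warn the user
  }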

Hmm... Looking at the code for this edit page, there is a hidden input named _charset_ which doesn't appear to have any value:

  <input type='hidden' name='_charset_'>

Is this an incomplete implementation of the idea I propose?

Lars H: My edit #124 was bad -- attempting repair. Oddly, this browser (Safari) didn't have the encoding problem with the old Wiki.

Lars H: Edit trying to diagnose encoding problem. Will surely disturb the contents further.


jdc 29-nov-2007: I used the following script on the wiki database to detect invalid UTF-8 sequences:

lappend auto_path /home/decoster/tcl/Wub/Utilities

package require Mk4tcl
package require utf8

mk::file open db wikit.tkd

# Scan every page: get its raw bytes and look for the first invalid UTF-8 sequence.
mk::loop i db.pages {
    lassign [mk::get $i name page] name page
    set data [encoding convertto identity $page]
    set point [utf8::findbad $data]
    if { $point >= 0 && $point < [string length $page] - 1 } {
        # Report the page name, the offset of the damage, and up to
        # 50 characters of context leading up to it.
        puts "\[$name\] at position $point:"
        puts "======"
        puts [encoding convertfrom identity [string range $data [expr {$point-50}] $point]]
        puts "======"
    }
}

mk::file close db

exit

This reported the following pages:

bad utf8: db.pages!2957 / 9075
bad utf8: db.pages!2987 / 2143
bad utf8: db.pages!4588 / 5130
bad utf8: db.pages!8410 / 292
bad utf8: db.pages!8442 / 5608
bad utf8: db.pages!8788 / 886
bad utf8: db.pages!9112 / 4925
bad utf8: db.pages!9281 / 554
bad utf8: db.pages!12169 / 4736
bad utf8: db.pages!14525 / 2935
bad utf8: db.pages!15412 / 4059
bad utf8: db.pages!15599 / 3036
bad utf8: db.pages!19658 / 310
bad utf8: db.pages!19693 / 9485

LV Is there any way for the above code to display a bit of context? Or is there some option in the various web browsers to show which character position of the page is being displayed? It's just tough to figure out what needs to be fixed from the info here. And is that utf8 package available here on the wiki someplace?

jdc I updated the script so it generates wiki markup you can paste here for easier access. An example:

timeentry at position 18117:

pace behavior, I suppose I could hold off on the 00

DKF - The above pages are:

   * tkWorld 0.2 - fixed
   * Tclworld - fixed
   * Oratcl Logon Dialog - fixed
   * Traffic lights - fixed
   * iFile: a little file system browser - fixed
   * iRead: a Gutenberg eBook reader - fixed
   * Steve Redler IV - fixed
   * A triangle toy - fixed
   * timeliner - fixed
   * XO - fixed
   * Extending the eTcl console - fixed
   * colorChooser for pocketPC/etcl - fixed
   * Wiki UTF-8 problem test - fixed
   * Newton-Raphson Equation Solver - fixed

I fix such problems by cut-and-pasting the wikitext version of the page into Emacs and using interactive regexp search (C-M-s) to find anything matching [^^J -~] (where ^J is a literal newline character, typed with C-q C-j) - that is, any character outside printable ASCII. That highlights problems rapidly and efficiently.
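
For those without Emacs at hand, a rough Tcl equivalent of the same search (assuming the wikitext is in a variable named text):

  # list every character that is neither a newline nor printable ASCII
  foreach pair [regexp -all -indices -inline {[^\n -~]} $text] {
      lassign $pair first last
      puts "suspect character at index $first: [string range $text $first $last]"
  }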