[KBK] 2007-06-12

Owing to various problems that happened over the years, the Wiki is known to have a number of [Pages containing invalid UTF-8 sequences].  People who are interested in improving the Wiki are invited to attempt to repair the text of these pages.

Note that all invalid UTF-8 sequences have been replaced with the character   (\ufffd) - searching for that character will locate the damage within the page.
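
A minimal sketch of that search, in Python rather than the Wiki's own Tcl. The function name `find_damage` is made up for illustration, and the sketch assumes you have the raw bytes of a page to hand; it also flags any bytes that are still invalid UTF-8, since decoding with `errors="replace"` turns those into \ufffd as well.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def find_damage(raw):
    """Return 1-based (line, column) pairs where U+FFFD appears in the page.

    `raw` is the page as bytes. Bytes that are not valid UTF-8 are decoded
    to U+FFFD too, so they are reported alongside the markers already stored.
    """
    text = raw.decode("utf-8", errors="replace")
    hits = []
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col_no, ch in enumerate(line, start=1):
            if ch == REPLACEMENT:
                hits.append((line_no, col_no))
    return hits


if __name__ == "__main__":
    # One stored marker on line 1, one stray invalid byte on line 3.
    page = "caf\ufffd\nok\n".encode("utf-8") + b"\xe9"
    for line_no, col_no in find_damage(page):
        print("damage at line %d, column %d" % (line_no, col_no))
```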

This page seems to be damaged repeatedly, with the \ufffd character above being replaced by a \u00fd (ý) character.  The matter needs investigation.

Go ahead and remove entries from this list as pages are repaired.

   * [Wikit Problems] - 2007-06-05 06:23:25 UTC - ([DKF]: trying to fix this page a bit at a time runs into the wiki's anti-corruption checker, and some characters never seem to have been recorded correctly in the page history so recovering them is hard.)
   * [remove diacritic] - 2005-12-18 01:31:52 UTC - inspection of the ISO-8859-2 table suggests that the missing character is U+00C2 (capital A with circumflex).  Wikit rejects my edit, reporting bad UTF-8 (though my browser is set to UTF-8).  KJN.  [CMcC] provides another datapoint - the character U+00C2 doesn't seem to be accepted as valid UTF-8.  Is this the accurate encoding for the character?  ([KBK] 2007-06-29 - See below.)
   * [A little RSS reaper] - 2005-03-21 01:17:26 UTC - ([DKF]: Doesn't need much of a fix, but I don't know what the fix should be...)
   * [Man Tcl - po polsku] - 2005-02-18 18:36:25 UTC - ([DKF]: Needs a Polish speaker to fix!)

[LES] tried to fix [Wikit Problems] on 2007-06-14, but it only got worse, although other pages were fixed successfully.

'''CLEANED:''' (I'm not removing them from the list, I hope someone can double check.)
   * ''all checked clean - [DKF]''

----
[DKF] 2007-06-29

Current status is that ''almost'' all pages are fixed. The remaining ones are difficult to fix, either because they require extensive changes by someone who understands the subject matter and language, or because of wiki software issues. If you can help, '''we would welcome it!'''

Also, don't forget to look at http://wiki.tcl.tk/_search?S=%EF%BF%BD*&_charset_=UTF-8
----
[KBK] 2007-06-29 The problem with the [remove diacritic] page is that the testing for "valid" UTF-8 is intentionally overzealous.  When I reviewed the damaged pages, a great many of them contained the dreaded "double encoding": ISO8859-1 expanded to UTF-8, with the result interpreted as ISO8859-1 and expanded to UTF-8 a second time.  The result of this "double encoding" is that a character such as ü (\u00fc) would be expanded into the two-byte UTF-8 sequence C3 BC, and that sequence would be interpreted as the spurious combination \u00c3\u00bc.  The page in question was, as far as I can tell, the ''only'' case of either of the characters \u00c2 (upper-case Latin letter A with circumflex) and \u00c3 (upper-case Latin letter A with tilde) appearing on the Wiki ''other'' than as the result of this process; these two characters are extremely uncommon even in natural languages that use them.  (French, for instance, often omits accents from capital letters other than É.)  So it seemed wise to reject these two characters, rather than having, say, broken browsers silently convert ü to the presumptively valid pair of characters \u00c3\u00bc (upper-case Latin letter A with tilde followed by the vulgar fraction ¼).
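
For anyone who wants to reproduce the effect, here is a small Python sketch of the double encoding described above and of one way to undo a single round of it. The function names are invented for illustration; this is not the Wiki's actual checker, only a demonstration of the byte-level mechanics.

```python
def double_encode(text):
    """Simulate the bug: the UTF-8 bytes are misread as ISO8859-1 text."""
    return text.encode("utf-8").decode("latin-1")


def undo_double_encode(text):
    """Reverse one round of double encoding.

    If the text does not fit in ISO8859-1, or its bytes are not valid
    UTF-8, it cannot be the product of this bug, so it is returned as is.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


if __name__ == "__main__":
    damaged = double_encode("\u00fc")          # u with diaeresis
    print([hex(ord(c)) for c in damaged])      # code points of the damaged pair
    print(undo_double_encode(damaged) == "\u00fc")
    # A lone capital A with circumflex is left alone: byte C2 by itself
    # is not a complete UTF-8 sequence.
    print(undo_double_encode("\u00c2") == "\u00c2")
```

The repair is only a heuristic. A genuine \u00c2 or \u00c3 that happens to be followed by a character in the range \u0080 to \u00bf is indistinguishable from damage, which is the ambiguity that motivates rejecting those two characters outright.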

Given the large number of browsers out there that appear to get it wrong, I really don't know what else to do.
I'm open to suggestions.

[LV] Perhaps in the case where there is a possibility of a character being correct, the user should be prompted with an "are you certain" type prompt.

[Lars H]: Try adding a hidden field (like the O field used for page versions to detect edit conflicts) to the edit page form, which contains some non-ASCII characters (e.g. those occurring in the page already). If the browser gets it wrong for the text to edit, there's a fair chance it gets all form fields wrong in the same way. Since the server can know what went out in this extra field, it can verify that it gets the same thing back.

Hmm... Looking at the code for this edit, there is a hidden item named _charset_ which doesn't appear to have any value:
  <input type='hidden' name='_charset_'>
Is this an incomplete implementation of the idea I propose?
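
The server side of the hidden-field check could look roughly like the Python sketch below. Everything in it is an assumption made for illustration: the probe string, the function name `check_canary`, and the idea that the form handler receives the field as an already decoded string. As a side note, an empty hidden field named `_charset_` is a separate convention: browsers that support it fill in the name of the encoding they used for the submission.

```python
# Hypothetical probe: a few non-ASCII characters sent out in a hidden field.
CANARY = "\u00e5 \u00e9 \u00fc \u20ac"


def check_canary(returned):
    """Classify the value the browser sent back for the hidden field.

    Returns "ok" if it is unchanged, "double-encoded" if it looks like
    UTF-8 bytes that were misread as ISO8859-1, and "damaged" otherwise.
    """
    if returned == CANARY:
        return "ok"
    if returned == CANARY.encode("utf-8").decode("latin-1"):
        return "double-encoded"
    return "damaged"


if __name__ == "__main__":
    print(check_canary(CANARY))
    print(check_canary(CANARY.encode("utf-8").decode("latin-1")))
    print(check_canary("? ? ? ?"))
```

An edit whose probe comes back as anything other than "ok" could then be refused, or repaired if the damage is of the recognisable double-encoded kind.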

[Lars H]: My edit #124 was bad -- attempting repair. Oddly, this browser (Safari) didn't have the encoding problem with the old Wiki.

[Lars H]: Edit trying to diagnose encoding problem. Will surely disturb the contents further.
----
[[
[Category Characters] |
[Category Wikit]
]]