Version 4 of Wikit DB Repair

Updated 2007-05-09 13:21:30 by wdb

CMcC 9May07 - this is the code I used to repair the wiki from history.

Can someone advise as to why it mucks up unicode in titles?

   package require Mk4tcl
   package require fileutil

   encoding system utf-8

   set dbf [lindex $argv 0]
   set histdir [lindex $argv 1]

   # Keep only the newest history file for each page id.
   # Each file name is split on "-" into id, date and who.
   foreach f [glob -tails -directory $histdir *] {
       if {[string match .* $f]} {
           continue
       }
       lassign [split $f -] id date who
       if {![info exists diffs($id)]
           || $date > [lindex $diffs($id) 0]
       } {
           set diffs($id) [list $date $id $who $f]
       }
   }

   mk::file open db $dbf

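   # Apply the newest revision of each page, in page-id order.
   foreach id [lsort -integer [array names diffs]] {
       #lappend repairs [lindex $diffs($id) 1]
       lassign $diffs($id) date id1 who f
       # Split on real newlines (\n, not \\n, which would split on "\" and "n").
       set content [split [fileutil::cat -encoding utf-8 [file join $histdir $f]] \n]
       # The first line holds the title; the page text starts at the fifth line.
       set title [lindex $content 0]
       set content [join [lrange $content 4 end] \n]
       if {$id >= [mk::view size db.pages]} {
           # Page is missing from the db: append it as a new row.
           set title [string trim [lindex [split $title :] 1]]
           puts "adding $id '$title'"
           mk::row append db.pages name $title page $content date $date who $who
       } else {
           # Page exists: overwrite its text, date and author.
           puts "modding $id"
           mk::set db.pages!$id page $content date $date who $who
       }
   }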

   mk::file commit db
   mk::file close db

wdb Just a guess: as far as I understand, Metakit stores ASCII only. Perhaps it makes sense to convert non-ASCII characters to Tcl escapes such as \u004f before writing to the db, and to run subst -novariables -nocommands on $title after reading it back?
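
A minimal sketch of that escape/unescape round trip, in case it helps with testing the idea. The proc names uescape and uunescape are hypothetical (they are not part of Wikit or Mk4tcl), and whether Metakit actually needs this is still only a guess. Backslashes are escaped too, so that subst does not misread a literal backslash in a title; characters outside the 16-bit range are not handled.

   # Hypothetical helper: replace non-ASCII characters (and backslashes)
   # with \uXXXX escapes so that the result is plain ASCII.
   proc uescape {s} {
       set out ""
       foreach c [split $s ""] {
           scan $c %c code
           if {$code > 127 || $c eq "\\"} {
               append out [format {\u%04x} $code]
           } else {
               append out $c
           }
       }
       return $out
   }

   # Hypothetical helper: undo uescape. Only backslash substitution is
   # performed, so "$" and "[" in a title are left alone.
   proc uunescape {s} {
       subst -novariables -nocommands $s
   }

The title would then be passed through uescape before mk::row append, and through uunescape after it is read back from db.pages.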


EMJ Page contents have also been mucked up - see e.g. http://wiki.tcl.tk/18008, which I had fixed not long ago; it also seems to have changed its page number (was 18012). Also, if you look at http://wiki.tcl.tk/_ref/17213 you will see many pages listed that do not actually contain such a reference. I edited a couple, which forced them off the list, but most of the rest do not contain the reference and are still listed.