wiki-reaper

Michael A. Cleverly 2002-11-19: My browser seems unable to copy and paste more than 4,096 characters at a time, which makes copying and pasting large chunks of code from the Wiki difficult. Here is An HTTP robot in Tcl that fetches page(s) from the Wiki and spits out just the stuff between <pre></pre> tags (which, generally speaking, is just Tcl code). It also trims the first leading space that the Wiki requires.
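The basic idea can be sketched in a few lines of plain Tcl. This is only an illustration and not the code on this page: the proc name extract-pre is made up, it assumes <pre> tags carry no attributes, and it does not decode HTML entities.

```tcl
# Hypothetical helper, not part of wiki-reaper: collect the text of every
# <pre>...</pre> element and strip one leading space from each line.
proc extract-pre html {
    set blocks {}
    # -all -inline returns a flat list: full match, captured body, ...
    foreach {_ body} [regexp -all -inline -nocase {<pre>(.*?)</pre>} $html] {
        set lines {}
        foreach line [split $body \n] {
            # Remove a single leading space, if there is one.
            regsub {^ } $line {} line
            lappend lines $line
        }
        lappend blocks [join $lines \n]
    }
    return $blocks
}

set html "<p>intro</p><pre> puts hello\n  puts world</pre><pre> set x 1</pre>"
foreach block [extract-pre $html] {
    puts $block
    puts ---
}
```

The pattern is kept to a single non-greedy quantifier on purpose: in Tcl's regular expressions the first quantifier in a branch decides whether the whole branch matches greedily, so a greedy quantifier placed before (.*?) would make one match swallow every block on the page.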

Due to DDoS mitigation this program no longer works reliably.

Usage

First you have to get the code below into a file somehow. You have to start somewhere ;-). Save the code on this page into a file called "wiki-reaper".
Make certain that the file can be found when you attempt to run it. On Unix-like systems, that means making it executable and putting it into one of the directories in $PATH.
Running wiki-reaper wiki-reaper causes wiki-reaper to fetch itself... :-)

Previous Revisions

edit 34, replaced 2014-12-30: the original code by Michael A. Cleverly

edit 81, replaced 2018-10-17: the last version for the old wiki

Code

#!/usr/bin/env tclsh

package require Tcl 8.5-10

package require cmdline    1
package require htmlparse  1
package require http       2
package require ncgi       1
package require struct     2
package require textutil   0-2
package require treeql     1
package require try        1

namespace eval wiki-reaper {
    variable version {5.2.0 (2018-10-18)}
    variable useCurl 0
    variable hostname wiki.tcl-lang.org

    try {
        package require tls
        http::register https 443 [list tls::socket -tls1 1]
    } on error err {
        try {
            exec curl --version
            set useCurl 1
        } on error _ {
            error {wiki-reaper needs either a cURL executable\
                   or TclTLS to work}
        }
        unset err
    }
}

proc wiki-reaper::url-encode str {
    variable reserved {
        { } +
        !  %21
        #  %23
        $  %24
        &  %26
        '  %27
        (  %28
        )  %29
        *  %2A
        +  %2B
        ,  %2C
        /  %2F
        :  %3A
        ;  %3B
        =  %3D
        ?  %3F
        @  %40
        [  %5B
        ]  %5D
    }
    return [string map $reserved $str]
}

proc wiki-reaper::url-decode str {
    return [ncgi::decode $str]
}

proc wiki-reaper::output args {
    set ch stdout
    switch -exact -- [llength $args] {
        1 { lassign $args data }
        2 { lassign $args ch data}
        default {
            error {wrong # args: should be "output ?channelId? string"}
        }
    }
    # Don't throw an error if $ch is closed halfway through.
    catch { puts $ch $data }
}

proc wiki-reaper::fetch url {
    variable useCurl
    # The cookie is necessary when you want to retrieve page history.
    set cookie wikit_e=wiki-reaper
    if {$useCurl} {
        set data [exec curl --silent --fail --cookie $cookie $url]
    } else {
        set connection [http::geturl $url -headers [list Cookie $cookie]]
        set data [http::data $connection]
        http::cleanup $connection
    }
    return $data
}

proc wiki-reaper::history-url page {
    variable hostname
    set url https://$hostname/history/[url-encode $page]
    return $url
}

proc wiki-reaper::page-url {page revision} {
    variable hostname
    set url https://$hostname/revision/[url-encode $page]?V=$revision
    return $url
}

proc wiki-reaper::with-parsed {html documentVarName treeqlCmdName script} {
    upvar 1 $documentVarName doc

    try {
        set doc [struct::tree]
        htmlparse::2tree $html $doc
        treeql $treeqlCmdName -tree $doc

        uplevel 1 $script
    } finally {
        catch { $treeqlCmdName destroy }
        catch { $doc destroy }
    }
}

proc wiki-reaper::parse-history html {
    with-parsed $html doc tq {
        tq query \
            tree \
            withatt type td \
            children \
            get data
        set revHistory [tq result]
    }

    if {![regexp {\?V=(\d+)} [lindex $revHistory 0] _ latest]} {
        error {can't parse page history}
    }

    return [dict create \
        latest $latest \
    ]
}

proc wiki-reaper::error-if-not-found html {
    if {[regexp {<title>Page not found</title>} $html]} {
        if {![regexp {Page '(.+?)' could not be found.} \
                     $html \
                     _ \
                     page]} {
            error [list wiki page wasn't found plus \
                        error page can't be parsed]
        }
        error [list wiki page [htmlparse::mapEscapes $page] wasn't found]
    }
    return $html
}

proc wiki-reaper::extract-code-blocks html {
    with-parsed $html doc tq {
        tq query \
            tree \
            withatt type pre \
            children \
            withatt type PCDATA \
            get data
        set encodedCodeBlocks [tq result]
    }

    set codeBlocks {}
    foreach encodedCodeBlock $encodedCodeBlocks {
        lappend codeBlocks [htmlparse::mapEscapes $encodedCodeBlock]
    }

    return $codeBlocks
}

proc wiki-reaper::parse-revision-page html {
    set parsed {}

    dict set parsed codeBlocks [extract-code-blocks $html]

    if {![regexp {Updated (\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d) by} \
                 $html \
                 _ \
                 timestamp]} {
        error [list can't extract timestamp from revision page]
    }
    dict set parsed timestamp $timestamp

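    # Fail with a clear message instead of an unset-variable error below.
    if {![regexp {<title>Version (\d+) of ([^<]+)</title>} \
                 $html \
                 _ \
                 revision \
                 title]} {
        error [list can't extract revision and title from revision page]
    }
    dict set parsed revision $revision
    dict set parsed title $title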

    return $parsed
}

proc wiki-reaper::get-page-title-by-number n {
    variable hostname

    set redirect [error-if-not-found [fetch https://$hostname/$n]]
    if {![regexp {href="/page/([^"]+)"} $redirect _ encodedPage]} {
        error [list can't parse redirect page]
    }

    return [url-decode [htmlparse::mapEscapes $encodedPage]]
}

proc wiki-reaper::print {pageData {hashbang 0}} {
    set allBlocks {{} all}
    set now [clock format [clock seconds] -format {%Y-%m-%d %H:%M:%S} -gmt 1]

    if {$hashbang} {
        output {#! /usr/bin/env tclsh}
    }

    output #####
    output #
    output "# \"[dict get $pageData revisionPage title]\"\
              ([dict get $pageData pageUrl])"

    set block [dict get $pageData block]
    if {$block in $allBlocks} {
        output "# All code blocks"
    } else {
        output "# Code block $block"
    }

    output #
    output "# Wiki page revision [dict get $pageData revisionPage revision],\
              updated [dict get $pageData revisionPage timestamp] GMT"
    output "# Tcl code harvested on $now GMT"
    output #
    output #####

    if {$block ni $allBlocks} {
        output {}
    }

    set i 0
    foreach codeBlock [dict get $pageData revisionPage codeBlocks] {
        if {$block in $allBlocks} {
            output "\n# Code block $i\n"
            output $codeBlock
        } elseif {$i == $block} {
            # Print only the code block that was asked for.
            output $codeBlock
        }
        incr i
    }

    output {# EOF}
}

proc wiki-reaper::reap {page block {revision {}} {flags {}}} {
    if {[string is integer -strict $page]} {
        set page [get-page-title-by-number $page]
    }

    set history [parse-history [error-if-not-found [fetch [history-url $page]]]]
    set latest [dict get $history latest]

    if {$revision in {{} latest}} {
        set revision $latest
    }

    if {![string is integer -strict $revision]} {
        error [list can't understand revision $revision]
    }
    if {$revision < 0} {
        error [list revision can't be negative]
    }
    if {$revision > $latest} {
        output stderr [list warning: revision $revision greater than \
                            latest $latest]
    }

    set pageUrl [page-url $page $revision]
    set html [error-if-not-found [fetch $pageUrl]]
    set revisionPage [parse-revision-page $html]

    return [dict create \
        history $history \
        revisionPage $revisionPage \
        pageUrl $pageUrl \
        block $block \
    ]
}

proc wiki-reaper::main {argv} {
    variable protocol
    variable version

    set options {
        {x "Output '#!/usr/bin/env tclsh' as the first line"}
        {v "Print version and exit"}
    }
    set usage "?options? page ?codeBlock? ?revision?"

    if {$argv in {/? -? -h -help --help}} {
        output stderr [cmdline::usage $options $usage]
        exit 0
    }
    try {
        set flags [cmdline::getoptions argv $options $usage]
    } on error err {
        output stderr $err
        exit 1
    }
    lassign $argv page block revision

    if {[dict get $flags {v}]} {
        output $version
        exit 0
    }
    if {$page eq {}} {
        output stderr [cmdline::usage $options $usage]
        exit 0
    }

    set reaped [reap $page $block $revision $flags]
    print $reaped [dict get $flags {x}]
}

proc wiki-reaper::main-script? {} {
    # From https://wiki.tcl-lang.org/page/main%20script.
    global argv0

    if {[info exists argv0]
        && [file exists [info script]]
        && [file exists $argv0]} {
        file stat $argv0        argv0Info
        file stat [info script] scriptInfo
        expr {$argv0Info(dev) == $scriptInfo(dev) &&
              $argv0Info(ino) == $scriptInfo(ino)}
    } else {
        return 0
    }
}

if {[wiki-reaper::main-script?]} {
    wiki-reaper::main $argv
}

Security

Anyone can edit the wiki, so the code may change between when you look at it and when you download it. Be sure to inspect the code you fetch with wiki-reaper before you run it.

Discussion

jcw 2002-11-22:

This could be the start of something more, maybe...

I've been thinking about how to make the wiki work for securely re-usable snippets of script code. Right now, doing a copy-and-paste is tedious (the above solves that), but also risky: what if someone decides to play tricks and hide some nasty change in there. That prospect is enough to make it quite tricky to re-use any substantial pieces, other than after careful review - or simply as inspiration for re-writing things.

Can we do better? Maybe we could. What if a "wiki snippet repository" were added to this site - here's a bit of thinking-out-loud:

if verbatim text (text in <pre>...</pre> form) starts off with a certain marker, it gets recognized as being a "snippet"
snippets are stored in a separate read-only area, and remain forever accessible, even if the page changes subsequently
the main trick is that snippets get stored on basis of their MD5 sum
each snippet also includes: the wiki page#, the IP of the submitter, timestamp, and a tag
the tag is extracted from the special marker that introduces a snippet, it's a "name" for the snippet, to help deal with multiple snippets on a page

Now where does this all lead to? Well, it's rough thinking, but here's a couple of comments about it:

if you have an MD5, you can retrieve a snippet, without risk of it being tampered with, by a URL, say http://mini.net/wikisnippet/<this-is-the-32-character-md5-in-hex>
the IP stored with it is the IP of the person making the change, and creating the snippet in the first place, so it is a reliable indicator of the source of the snippet
if you edit a page and don't touch snippet contents, nothing happens to them
if you do alter one, it gets a new MD5 and other info, and gets stored as a new snippet
if you delete one, it stops being on the page, but the old one is retrievable as before
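As a rough illustration of the proposal above, here is a minimal in-memory sketch of a snippet store keyed by MD5. Nothing like this exists on the wiki: the snippets namespace, its procs, and the page number, IP address and tags in the example are all made up, and the md5 package from Tcllib is assumed to be installed.

```tcl
package require md5   ;# from Tcllib

# Hypothetical snippet store; the key of a snippet is the MD5 of its text.
namespace eval snippets {
    variable store [dict create]
}

# Lower-case hex MD5 of the UTF-8 encoding of $text.
proc snippets::digest text {
    return [string tolower [::md5::md5 -hex [encoding convertto utf-8 $text]]]
}

# Store a snippet with its metadata and return its key.
proc snippets::put {text page ip tag} {
    variable store
    set key [digest $text]
    # Snippets are write-once: identical text always maps to the same entry.
    if {![dict exists $store $key]} {
        dict set store $key [dict create \
            text $text page $page ip $ip tag $tag \
            timestamp [clock seconds]]
    }
    return $key
}

# Retrieve a snippet, checking that its text still matches the key.
proc snippets::get key {
    variable store
    if {![dict exists $store $key]} {
        error "no snippet with key $key"
    }
    set text [dict get $store $key text]
    if {[digest $text] ne $key} {
        error "snippet $key failed its integrity check"
    }
    return $text
}

# Made-up example values: page 1234, a documentation-range IP, a tag.
set key [snippets::put {puts hello} 1234 192.0.2.1 greeting]
puts "$key -> [snippets::get $key]"
```

Editing a snippet changes its text and therefore its key, so the old entry stays retrievable under the old key, which is the property described above. MD5 is used here only because the proposal names it.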

Does this mean all authentication is solved? No. It becomes a separate issue to manage snippet-md5's, but what the author needs to do is pick a way to get that to others. I could imagine that in basic use, authors maintain an annotated list of snippets on their page - but frankly, this becomes a matter of key management. How do you tell someone that you have something for them if the channel cannot be trusted? Use a different channel: chat, email, a secured (https/ssl) website, whatever.

This approach would not solve everything. But what it would do is that *if* I have a snippet reference, and I trust it, then I can get the contents at any time. Snippets will have the same property as wiki pages that they never break, but with the added property that they never change.

On top of that, who knows... a catalog? With user logins, authentication, pgp, anything can be done. A tool to grab the snippet given its md5, and a tool to locate snippets based on simple tag or content searches, it's all pretty trivial to do.

Is this a worthwhile avenue to explore further? You tell me... :o)

SB 2002-11-23: If you for a minute forget about the validation of code integrity and think about the possibility to modify program code independent of location, then it sounds like a very good idea. An example is to show progress of coding. The start is a very simple example code, then the example is slightly modified to show how the program can be improved. With this scheme, every improvement of code can be backtracked to the very beginning, and, hence, work as a tutorial for new programmers. If we then think about trust again, there are too many options for code fraud that I do not know.

escargo 2002-11-23: I have to point out that the IP address of the source is subject to a bunch of qualifications. Leaving out the possibility of the IP address being spoofed, I get different IP addresses because of the different locations I use to connect to the wiki; with subnet masking it's entirely possible that my IP addresses could look very different at different times even when I am connected from the same system.

Aside from that issue, could such a scheme be made to work well with a version of the unknown proc and mounting the wiki, or part of the wiki, through VFS? This gets back to the TIP dealing with metadata for a repository.

This in turn leads me to wonder, how much of a change would it be to add a page template capability to the wiki? In practice now, when we create a new page, it is always the same kind of page. What if there was a policy change that allowed for creating each new page selected from a specific set of types of pages. The new snippet page would be one of those types. Each new page would have metadata associated with it. Instead of editing pages always in a text box, maybe there would be a generated form. Is that possible? How hard would it be? This could lead from a pure wiki to a web-based application, but I don't know if that is a bad thing or not. Just a thought. (Tidied up 5 May 2003 by escargo.)

LV 2003-05-05: with regards to the snippet ideas above, I wonder if, with the addition of CSS support here on the wiki, some sort of specialized marking would not only enable snipping code, but would also enable some sort of special display as well - perhaps color coding to distinguish proc names from variables from data from comments, etc.

CJU 2004-03-07: In order to do that, you would need to add quite a bit of extra markup to the HTML. I once saw somewhere that one of the unwritten "rules" of wikit development was that preformatted text should always be rendered untouched from the original wiki source (with the exception of links for URLs). I don't particularly agree with it, but as long as it's there, I'm not inclined to believe that the developer(s) are willing to change.

Now, straying away from your comment for a bit, I would rather have each preformatted text block contain a link to the plaintext within that block. This reaping is an entertaining exercise, but it's really just a work-around for the fact that getting just the code out of an HTML page is inconvenient for some people. I came to this conclusion when I saw a person suggest that all reapable pages on the wiki should have hidden markup so that the reaper could recognize whether the page was reapable or not. To me, it's a big red flag when you're talking about manually editing hundreds or thousands of pages to get capability that should be more or less automatic.

I'm looking at toying around with wikit in the near future, so I'll add this to my list of planned hacks.

LV 2007-10-08:

Well, I changed the mini.net reference to tcl.wiki. But there is a bug that results in punctuation being encoded. I don't know why that wasn't a problem before. But I changed one string map into a call to ::htmlparse::mapEscapes to take care of the problem.
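For reference, a small sketch of what that call does to encoded punctuation. The sample string is made up, and the htmlparse package from Tcllib is assumed to be installed.

```tcl
package require htmlparse   ;# from Tcllib

# A made-up line of Tcl as it might appear inside the HTML of a page,
# with its punctuation written as HTML entities.
set encoded {if {$a &lt; $b &amp;&amp; $c &gt; 0} { puts &quot;ok&quot; }}

# mapEscapes turns the entities back into the characters they stand for.
set decoded [htmlparse::mapEscapes $encoded]
puts $decoded
```

A hand-written string map only handles the entities it lists, which is presumably how some punctuation slipped through before the change.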

tb 2009-06-16

Hm... - I still get unmapped escape sequences, when reaping from this page, using kbskit-8.6. I don't get them, when reaping from a running wikit. Am I missing something?

LV 2009-06-17 07:37:08:

Is anyone still using this program? Do any of the wiki's enhancements from the past year or two provide a way to make this type of program easier?

jdc 2009-06-17 08:34:23:

Fetching <pagenumber>.txt will get you the Wiki markup. Best start from there when you want to parse the wiki pages yourself. Another option is to fetch <pagenumber>.code to only get the code blocks. Or use TWiG.

dbohdan 2014-12-30: The code could not retrieve anything when I tried it, so I updated it to work with the wiki as it is today. It now uses <pagenumber>.code, which jdc has mentioned above. Other changes:

Do not use nstcl-http or exec magic.
Include page URL in the output.
Format date of retrieval consistently with dates on the wiki.
Can retrieve a specific code block from the page if you tell it to. The usage is now wiki-reaper page ?codeBlock?.
Can no longer fetch more than one page at a time due to the above.

Note that I have replaced the original code with my update. If someone wants to preserve the original code on the page I can include mine separately.

PYK 2014-12-30: It's nice to see older code on the wiki get some maintenance. I like the idea of a Previous Revisions section like the one I added above to record bigger changes.

dbohdan 2014-12-30: I like it as well. The page looks much nicer in general after your last edit.

I'm thinking of outputting the current revision number of the page being reaped along with the time when it was last updated. Is there any way to get the current revision number for a page without having to log in? Faking a login to access page history seems excessive for such a little script.

dbohdan 2014-12-30: Never mind--it's a simple matter of a plain-text cookie. I've added it in the new version of the script.

dbohdan 2015-01-21: Version 2.2.0 includes revision support, i.e., you can get the content of any code block on a page at any given revision number. Edit: updated to save the wiki some traffic.

EF 2016-08-01: Version 2.2.1 includes a new hostname variable pointing to the Web server at which to find the wiki, as tcl.wiki was not working anymore.

stevel: tcl.wiki is a backup in case the .tk domain gets hijacked again. It redirects to wiki.tcl.tk.

dbohdan 2017-08-14: wiki-reaper broke when the wiki switched from single to double quotes in the HTML markup on the history page. Version 2.6.2 fixes that.
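One way to guard against that kind of breakage is to accept either quote style when matching attributes. The sketch below is an illustration only: the proc name first-revision and the sample markup are made up and are not taken from the script or from the real history page.

```tcl
# Hypothetical helper: return the first revision number linked from a
# history page, whichever quote style the markup uses for attributes.
proc first-revision html {
    # (["']) captures the opening quote; \1 demands the same closing quote.
    if {![regexp {href=(["'])[^"']*\?V=(\d+)\1} $html _ quote revision]} {
        error {can't find a revision link}
    }
    return $revision
}

puts [first-revision {<a href='/revision/wiki-reaper?V=81'>81</a>}]
puts [first-revision {<a href="/revision/wiki-reaper?V=82">82</a>}]
```

Matching only the part that matters, as the current script does with \?V=(\d+), avoids the dependency on quoting altogether.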