''[Michael A. Cleverly] 19 Nov 2002'' -- my browser seems unable to copy and paste more than 4,096 characters at a time, which makes copying & pasting large chunks of code from the Wiki difficult. Here is [An HTTP robot in Tcl] that fetches page(s) from the Wiki and spits out just the stuff between <pre> tags (which, generally speaking, is just Tcl code). It also trims the first leading space that the Wiki requires for preformatted text.

----

 #!/bin/sh
 # -*- tcl -*- \
 exec tclsh $0 ${1+"$@"}
 
 package require Tcl 8.3
 
 if {[llength $argv] == 0} {
     puts stderr "usage: wiki-reaper page ?page ...?"
     exit 1
 }
 
 if {![catch { package require nstcl-html }] &&
     ![catch { package require nstcl-http }]} {
     namespace import nstcl::*
 } else {
     package require http
 
     proc ns_geturl {url} {
         set conn [http::geturl $url]
         set html [http::data $conn]
         http::cleanup $conn
         return $html
     }
 
     proc ns_striphtml {-tags_only html} {
         # minimal stand-in for nstcl's ns_striphtml; the -tags_only
         # argument is accepted only for call compatibility
         regsub -all -- {<[^>]+>} $html "" html
         return $html
     }
 
     proc ns_urlencode {string} {
         set allowed_chars  {[a-zA-Z0-9]}
         set encoded_string ""
 
         foreach char [split $string ""] {
             if {[string match $allowed_chars $char]} {
                 append encoded_string $char
             } else {
                 scan $char %c ascii
                 append encoded_string %[format %02x $ascii]
             }
         }
 
         return $encoded_string
     }
 }
 
 
 proc output {data} {
     # we don't want to throw an error if stdout has been closed
     catch { puts $data }
 }

 proc reap {page} {
     package require htmlparse

     set url  http://wiki.tcl.tk/[ns_urlencode $page]
     set now  [clock format [clock seconds] -format "%e %b %Y, %H:%M" -gmt 1]
     set html [ns_geturl $url]
 
     # can't imagine why these characters would be in here, but just to be safe
     set html [string map [list \x00 "" \x0d ""] $html]
     set html [string map [list <pre> \x00 </pre> \x0d] $html]
 
     if {![regexp -nocase {<title>([^<]*)</title>} $html => title]} {
         set title "(no title!?)"
     }
 
     if {![regexp -nocase {Updated on ([^G]+ GMT)} $html => updated]} {
         set updated "???"
     }
 
     output "#####"
     output "#"
     output "# \"$title\""
     output "#"
     output "# Tcl code harvested on: $now GMT"
     output "# Wiki page last updated: $updated"
     output "#"
     output "#####"
     output \n
 
     set html [ns_striphtml -tags_only $html]
 
     foreach chunk [regexp -inline -all {\x00[^\x0d]+\x0d} $html] {
         set chunk [string range $chunk 1 end-1]
         set chunk [::htmlparse::mapEscapes $chunk]
         foreach line [split $chunk \n] {
             if {[string index $line 0] == " "} {
                 set line [string range $line 1 end]
             }
             output $line
         }
     }
 
     output \n
     output "# EOF"
     output \n
 }
 
 foreach page $argv {
     reap $page
 }

----

Sample usage:

   1. First you have to get the above code into a file somehow. You have to start somewhere. ;-) So somehow save this page into a file called "wiki-reaper", and edit the contents to remove comments, etc.
   2. Make certain that the file is going to be found when you attempt to run it. On Unix-like systems, that involves putting the file into one of the directories in $PATH.
   3. ''wiki-reaper 4718'' causes wiki-reaper to fetch itself... :-)

----

'''22nov02''' [jcw] - This could be the start of something more, maybe... I've been thinking about how to make the wiki work for securely re-usable snippets of script code. Right now, doing a copy-and-paste is tedious (the above solves that), but also risky: what if someone decides to play tricks and hide some nasty change in there? That prospect is enough to make it quite tricky to re-use any substantial pieces, other than after careful review - or simply as inspiration for re-writing things. Can we do better? Maybe we could. What if a "wiki snippet repository" were added to this site - here's a bit of thinking-out-loud:

   * if verbatim text (text in "..." form) starts off with a certain marker, it gets recognized as being a "snippet"
   * snippets are stored in a separate read-only area, and remain forever accessible, even if the page changes subsequently
   * the main trick is that snippets get stored on the basis of their MD5 sum
   * each snippet also includes: the wiki page#, the IP of the submitter, a timestamp, and a tag
   * the tag is extracted from the special marker that introduces a snippet; it's a "name" for the snippet, to help deal with multiple snippets on a page

Now where does this all lead to? Well, it's rough thinking, but here's a couple of comments about it:

   * if you have an MD5, you can retrieve a snippet, without risk of it being tampered with, by an url, say http://mini.net/wikisnippet/<md5>
   * the IP stored with it is the IP of the person making the change and creating the snippet in the first place, so it is a reliable indicator of the source of the snippet
   * if you edit a page and don't touch snippet contents, nothing happens to them
   * if you do alter one, it gets a new MD5 and other info, and gets stored as a new snippet
   * if you delete one, it stops being on the page, but the old one is retrievable as before

Does this mean all authentication is solved? No. It becomes a separate issue to manage snippet-md5's, but what the author needs to do is pick a way to get that to others. I could imagine that in basic use, authors maintain an annotated list of snippets on their page - but frankly, this becomes a matter of key management. How do you tell someone that you have something for them if the channel cannot be trusted? Use a different channel: chat, email, a secured (https/ssl) website, whatever.

This approach would not solve everything. But what it would do is that *if* I have a snippet reference, and I trust it, then I can get the contents at any time. Snippets will have the same property as wiki pages that they never break, but with the added property that they never change. On top of that, who knows... a catalog? With user logins, authentication, pgp, anything can be done. A tool to grab the snippet given its md5, and a tool to locate snippets based on simple tag or content searches - it's all pretty trivial to do.

Is this a worthwhile avenue to explore further? You tell me... :o)
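
To make the storage-by-MD5 idea concrete, here is a minimal sketch in Tcl using the md5 package from [tcllib]. The proc names (snippet_store, snippet_fetch) and the flat-file "snippets" directory are made up for illustration - the proposal above doesn't prescribe any particular storage:

 package require md5
 
 # Store a snippet under its MD5 sum; the digest is the retrieval key.
 proc snippet_store {text} {
     set digest [string tolower [md5::md5 -hex $text]]
     set f [open [file join snippets $digest] w]
     fconfigure $f -translation binary ;# keep the digest platform-independent
     puts -nonewline $f $text
     close $f
     return $digest
 }
 
 # Retrieve a snippet and verify it hasn't been tampered with.
 proc snippet_fetch {digest} {
     set f [open [file join snippets $digest] r]
     fconfigure $f -translation binary
     set text [read $f]
     close $f
     if {![string equal -nocase [md5::md5 -hex $text] $digest]} {
         error "snippet $digest failed verification"
     }
     return $text
 }

The digest doubles as both the address and the integrity check: a fetched snippet that doesn't hash back to its own name has been tampered with somewhere along the way.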

----

SB 2002-11-23: If you forget for a minute about validating code integrity and think about the possibility of modifying program code independent of location, then it sounds like a very good idea. An example is to show the progress of coding. The start is a very simple piece of example code; then the example is slightly modified to show how the program can be improved. With this scheme, every improvement of the code can be backtracked to the very beginning and, hence, work as a tutorial for new programmers. If we then think about trust again, there are too many options for code fraud that I do not know about.

----

''[escargo] 23 Nov 2002'' - I have to point out that the IP address of the source is subject to a bunch of qualifications. Leaving out the possibility of the IP address being spoofed, I get different IP addresses because of the different locations I use to connect to the wiki; with subnet masking it's entirely possible that my IP addresses could look very different at different times, even when I am connected from the same system.

Aside from that issue, could such a scheme be made to work well with a version of the '''unknown''' proc and mounting the wiki, or part of the wiki, through [VFS]? (A rough sketch of this idea appears after this comment.) This gets back to the TIP dealing with metadata for a repository.

This in turn leads me to wonder: how much of a change would it be to add a page template capability to the wiki? In practice now, when we create a new page, it is always the same kind of page. What if there were a policy change that allowed each new page to be created from a specific set of page types? The new ''snippet page'' would be one of those types. Each new page would have metadata associated with it. Instead of always editing pages in a text box, maybe there would be a generated form. Is that possible? How hard would it be? This could lead from a pure wiki to a web-based application, but I don't know if that is a bad thing or not. Just a thought. ''(Tidied up 5 May 2003 by [escargo].)''
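
To illustrate the '''unknown''' idea mentioned above, here is a rough sketch. It reuses the hypothetical snippet_fetch proc from the earlier sketch and assumes a snippet_registry array mapping command names to trusted digests; neither is an existing wiki or Tcl facility:

 # Fall back to fetching a verified snippet when a command is undefined.
 rename unknown _original_unknown
 
 proc unknown {args} {
     global snippet_registry
     set cmd [lindex $args 0]
     if {[info exists snippet_registry($cmd)]} {
         # evaluate the snippet at global scope; it is expected to define $cmd
         uplevel #0 [snippet_fetch $snippet_registry($cmd)]
         if {[llength [info commands $cmd]] > 0} {
             return [uplevel 1 $args]
         }
     }
     # not a registered snippet - defer to Tcl's normal unknown handling
     uplevel 1 [linsert $args 0 _original_unknown]
 }

Combined with a [VFS] mount of the snippet area, the same trick could make snippets look like ordinary source files.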

----

[LV] May 5, 2003 - With regards to the snippet ideas above, I wonder if, with the addition of CSS support here on the wiki, some sort of specialized marking would not only enable snipping code, but would also enable some sort of special display as well - perhaps color coding to distinguish proc names from variables from data from comments, etc.

'''CJU''' March 7, 2004 - In order to do that, you would need to add quite a bit of extra markup to the HTML. I once saw somewhere that one of the unwritten "rules" of wikit development was that preformatted text should always be rendered untouched from the original wiki source (with the exception of links for URLs). I don't particularly agree with it, but as long as it's there, I'm not inclined to believe that the developer(s) are willing to change.

Now, straying away from your comment for a bit, I would rather have each preformatted text block contain a link to the plaintext within that block. This reaping is an entertaining exercise, but it's really just a work-around for the fact that getting just the code out of an HTML page is inconvenient for some people. I came to this conclusion when I saw a person suggest that ''all reapable pages on the wiki'' should have hidden markup so that the reaper could recognize whether the page was reapable or not. To me, it's a big red flag when you're talking about manually editing hundreds or thousands of pages to get a capability that should be more or less automatic. I'm looking at toying around with wikit in the near future, so I'll add this to my list of planned hacks.

----

[LV] 2007 Oct 08 - Well, I changed the mini.net reference to wiki.tcl.tk. But there is a bug that results in punctuation being encoded. I don't know why that wasn't a problem before. But I changed one string map into a call to ::[htmlparse]::mapEscapes to take care of the problem.

----

[tb] 2009 Jun 16 - Hm... I still get unmapped escape sequences when reaping from '''this''' page, using kbskit-8.6. I don't get them when reaping from a running [wikit]. Am I missing something?

----

See also:
   * [wish-reaper]
   * [wiki-runner]
   * [TWiG]
   * fetch .txt to get the Wiki markup instead of the html

----

[[ [Category Application] | [Category Internet] | [Category Wikit] ]]