TWiG

TWiG is a tool for extracting blocks of data from pages retrieved from the web.

TWiG is designed mainly for retrieving blocks of code from the "Tcler's Wiki" and presenting them as output suitable for program input to a 'tclsh' or 'wish' shell.

TWiG is easy to use for simple retrievals and also has a flexible configuration system for assembling more elaborate projects, with tools to assist in the process. A 'more elaborate configuration' could generate a program by combining (usually checksummed) pieces of code, other programs or data into a final program, in a way that enables a distributed development environment. TWiG will not generate any output until processing has completed successfully, avoiding the possibility of partially-executed programs and their consequences.

TWiG started off as a simple Wiki page retriever/scraper and grew a bit from there. Many thanks to the Tcl and Wiki people.

Directions for finding it are on the Stu page.
See also: TkTwig, Twigging and Twigs.


NEW FOR VERSION 0.7
Unbroke echo:// and anything else I broke in the last couple of versions.
Code tweak(s).
New and improved 'echo://'! An 'echo://' url will be split into
blocks on the ':' (colon) character.
For example, 'echo://set a 1:set a 2:set a 3' contains three blocks which
may be accessed and/or summed/checked as a group (block 0) or individually.

With block 0 (default):
$ twig 'echo://set a 1:set a 2:set a 3' 
set a 1
set a 2
set a 3

Selecting a block:
$ twig 'echo://set a 1:set a 2:set a 3' 2
set a 2

Summing:
$ twig -g 'echo://set a 1:set a 2:set a 3'  
{echo://set a 1:set a 2:set a 3} 0 36CA8E69C36F060AD80EDA887320FAA5
$ twig -g 'echo://set a 1:set a 2:set a 3' 2
{echo://set a 1:set a 2:set a 3} 2 C4192B48ABAE6711A7F1ADAE44F91959

Check:
$ twig -s C4192B48ABAE6711A7F1ADAE44F91959 'echo://set a 1:set a 2:set a 3' 2
set a 2
Or:
$ echo '{echo://set a 1:set a 2:set a 3} 2 C4192B48ABAE6711A7F1ADAE44F91959' | twig -c -
set a 2
Does it really work?
$ twig -s 11111111111111111111111111111111 'echo://set a 1:set a 2:set a 3' 2
twig: Bad checksum!
echo://set a 1:set a 2:set a 3 2 11111111111111111111111111111111 C4192B48ABAE6711A7F1ADAE44F91959

Note: The idea behind 'echo://' is to be able to add small snippets of code
or timestamps, extra comments ... stuff like that. I suppose it could be abused
to contain entire programs, but that's a bit nuts.

NOTE: for fetching code from the wiki, TWiG is no longer the best way: see WubWikit for why. Briefly, you can add .code to the end of a URL to grab the code blocks of that page only. This way is recommended as TWiG might not know about changes to the wiki's HTML generation in future.

Stu 2008-10-23 TWiG has been updated to use the .code method via the default scheme wiki://. Although adding .code to the URL does allow one to obtain the code blocks on a page it does nothing to split them up, combine or checksum them like TWiG does. TWiG's ability to extract data from HTML pages permits usage on other wikis, not only the English Tcl wiki (the French Tcl wiki, for example, does not have the .code functionality). TWiG can also retrieve data from other sources such as ftp or local files, and has the useful echo:// scheme as well. TWiG can also be integrated into TkChat with Herring. Finally, although less probable than changes to the HTML output, it is possible for the output of .code to change in the future.
To think of TWiG as a scraper is to do it injustice; TWiG is a distributed/collaborative development tool.


TWiG is now simply Twig.


README

Twig

Stuart Cassoff

September/October 2007
Version 0.1

October 2008
Versions 0.2, 0.3, 0.4

November 2008
Versions 0.5, 0.6, 0.7


TWiG is a tool for extracting blocks of data from pages retrieved from the web.
TWiG is designed mainly for retrieving blocks of code from the "Tcler's Wiki"
and presenting them as output suitable for program input to a 'tclsh' or 'wish' shell.
TWiG is easy to use for simple retrievals and also has a flexible configuration
system for assembling more elaborate projects, with tools to assist in the process.
A 'more elaborate configuration' could generate a program by combining
(usually checksummed) pieces of code, other programs or data
into a final program, in a way that enables a distributed development environment.
TWiG will not generate any output until processing has completed successfully,
avoiding the possibility of partially-executed programs and their consequences.
TWiG started off as a simple Wiki page retriever/scraper and grew a bit from there.
Many thanks to the Tcl and Wiki people.

TWiG can:
* Retrieve pages via http:// or ftp:// or file://.
* Retrieve pages via wiki:// (special method currently works only with wiki.tcl.tk).
* Retrieve one or more pages by using a config file.
* Retrieve one or more blocks from a page.
* Verify supplied checksums against retrieved blocks.
* Generate its own config files.
* Query a page, noting any blocks retrieved as an aid to the config file creator.
* Override default options with config file options.
* Override default and config file options with command-line options.
* Generate code or data at runtime using the echo:// uri.

TWiG uses the 'http' extension which is a part of the standard Tcl distribution and the 'uri',
'ftp', 'htmlparse', 'md5' and 'textutil' extensions which are part of the 'Tcllib' library.

Note that even though the URLs, page numbers, names, checksums, etc. were fairly
accurate at the time of writing, they should be considered as examples only. Play safe.
As an added bonus, this document is extensively loquacious!
It also doesn't keep up to date with the program as well as it should.


SECURITY CONSIDERATIONS
The act of downloading and running code from external sources is fraught with
security hazards. TWiG attempts to mitigate these hazards by the use of checksums.
Config file writers should carefully examine any code to be processed before
including it in any project.


NEW FOR VERSION 0.2
Since TWiG was originally developed, a new scheme for retrieving blocks from
wiki.tcl.tk has been developed. By adding the text '.code' to the url of a
wiki page, the Tcler's Wiki will return a text file containing all the blocks
on the page, delimited with an identifying string. The new scheme 'wiki://'
works almost identically to 'http://' in that the page will still be retrieved
using the 'http' protocol but the special '.code' text will automatically
be appended to the url and the resulting retrieved page will be processed
using the parameters described above. The results of using 'http://' or
'wiki://' on a page from wiki.tcl.tk should be identical.
The scheme 'wiki://' is now the default.
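
To make the '.code' splitting concrete, here is a minimal sketch in plain Tcl.
The sample text is hypothetical and only modelled on the default 're' pattern
shown in this README; a real '.code' response may look different. Twig's source
splits with textutil::splitx from Tcllib, whereas this sketch uses core
[regsub] and [split] as a stand-in so that it runs without Tcllib.

```tcl
# Hypothetical sample of a '.code' retrieval (an assumption, not real wiki output).
set page [join {
    {### <code_block id=1> ############}
    {puts hello}
    {### <code_block id=2> ############}
    {puts world}
} \n]

# Delimiter pattern as given in the README's list of defaults.
set re {### <code_block id=\d+> #+\n*}

# Replace every delimiter with a marker character, then split on the marker.
# The text before the first delimiter is not a block, so element 0 is skipped.
set blocks [lrange [split [regsub -all -- $re $page \x01] \x01] 1 end]

set n 0
foreach b $blocks { puts "[incr n]: [string trim $b]" }
```

The marker character \x01 is an arbitrary choice; any character that cannot
occur in the page text would do.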

NEW FOR VERSION 0.3
Version 0.3 is a bugfix release.

NEW FOR VERSION 0.4
Block ranges can be specified: 3-6 vs {3 4 5 6}.
New command line option: -r <regexp>
Originally called 'TWiG - Tcl Wiki Getter'.
For a while I considered renaming it to 'Moka'.
A moka is a type of coffeemaker (that I use daily) and also a pun on 'Teapot'.
Anyway, I ended up liking the name 'Twig' so now it's simply 'Twig', no
funny capitaliZation and it's not an acronym for anything.
Stupid terminology: Ok, so 'twigs' were going to be called 'beans' and
a group of beans a 'bean bag'. So that's another reason for staying with 'Twig':
coffee names are overused and tired. A 'twig' (for lack of a better term) is a 3-5
(usually 5) element proper Tcl list as described in detail somewhere below.
If you have a file with one or more twigs, that is called a 'bundle' ('bundle of
twigs' - get it? wonderful). The file can also have some global config info,
so it could also be called a 'config' file. Or maybe it's a 'twig' file?
I dunno - it's all about the same thing - I'll work it out eventually.

NEW FOR VERSION 0.5
Improved block ranges: 1,2-4,10 vs {1 2 3 4 10}
Name now also acquired from .code retrievals (if requested).
Blocks are now assembled in order given, no longer in ascending
numerical order.*
Query output formatted differently and as a Tcl comment so that
queried blocks may be executed without a Twig-induced error.
Wiki block cleaning has improved.*
Improved code.
* These changes may incompatibilitize your twigs. To update,
simply verify sums with older version then regen with new version.
I'm hoping not to disturb things like this too often, but Twig is
still evolving at this point in time.
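
The block range notation from versions 0.4 and 0.5 can be illustrated with a
short sketch. The proc name expandBlockSpec is hypothetical; Twig's own
expandRange proc in the source is more compact, and this is only a more
spelled-out version of the same idea.

```tcl
# Hypothetical helper: expand a block specification such as "1,2-4,10"
# into a plain list of block numbers, keeping the order given.
proc expandBlockSpec {spec} {
    set out {}
    foreach chunk [split $spec ,] {
        set ends [split $chunk -]
        if {[llength $ends] == 2} {
            # A "lo-hi" chunk becomes every number from lo to hi inclusive.
            lassign $ends lo hi
            for {set i $lo} {$i <= $hi} {incr i} { lappend out $i }
        } else {
            # A single number is kept as it is.
            lappend out $chunk
        }
    }
    return $out
}

puts [expandBlockSpec 1,2-4,10]   ;# 1 2 3 4 10
puts [expandBlockSpec 3-6]        ;# 3 4 5 6
puts [expandBlockSpec 5,2]        ;# 5 2
```

The last call shows that the order given is preserved, which is what the 0.5
change to block assembly relies on.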

NEW FOR VERSION 0.6
Fixed bug in query output.
Code tweak(s).

NEW FOR VERSION 0.7
Unbroke echo:// and anything else I broke in the last couple of versions.
Code tweak(s).
New and improved 'echo://'! An 'echo://' url will be split into
blocks on the ':' (colon) character.
For example, 'echo://set a 1:set a 2:set a 3' contains three blocks which
may be accessed and/or summed/checked as a group (block 0) or individually.

With block 0 (default):
$ twig 'echo://set a 1:set a 2:set a 3' 
set a 1
set a 2
set a 3

Selecting a block:
$ twig 'echo://set a 1:set a 2:set a 3' 2
set a 2

Summing:
$ twig -g 'echo://set a 1:set a 2:set a 3'  
{echo://set a 1:set a 2:set a 3} 0 36CA8E69C36F060AD80EDA887320FAA5
$ twig -g 'echo://set a 1:set a 2:set a 3' 2
{echo://set a 1:set a 2:set a 3} 2 C4192B48ABAE6711A7F1ADAE44F91959

Check:
$ twig -s C4192B48ABAE6711A7F1ADAE44F91959 'echo://set a 1:set a 2:set a 3' 2
set a 2
Or:
$ echo '{echo://set a 1:set a 2:set a 3} 2 C4192B48ABAE6711A7F1ADAE44F91959' | twig -c -
set a 2
Does it really work?
$ twig -s 11111111111111111111111111111111 'echo://set a 1:set a 2:set a 3' 2
twig: Bad checksum!
echo://set a 1:set a 2:set a 3 2 11111111111111111111111111111111 C4192B48ABAE6711A7F1ADAE44F91959

Note: The idea behind 'echo://' is to be able to add small snippets of code
or timestamps, extra comments ... stuff like that. I suppose it could be abused
to contain entire programs, but that's a bit nuts.
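
For the curious, the splitting of an 'echo://' url into blocks can be sketched
in a few lines. The proc name echoBlocks is hypothetical and the sketch leaves
out checksumming and [subst]; it only shows the colon splitting and the meaning
of block 0.

```tcl
# Hypothetical helper: return the requested block(s) of an echo:// url as a list.
# Block 0 means "all blocks"; any other number selects one block, counting from 1.
proc echoBlocks {url {block 0}} {
    # Drop the scheme, then split what is left on the colon character.
    set parts [split [regsub {^echo://} $url {}] :]
    if {$block == 0} { return $parts }
    return [list [lindex $parts [expr {$block - 1}]]]
}

puts [join [echoBlocks {echo://set a 1:set a 2:set a 3}] \n]
puts [join [echoBlocks {echo://set a 1:set a 2:set a 3} 2] \n]
```

The first call prints all three 'set' commands, one per line; the second
prints only 'set a 2'.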


QUICK START
Get and run 'Colliding Balls', page 8573:
$ twig 8573 | wish

Get and run clock #3 from 'An analog clock in Tk', page 1011:
$ twig 1011 3 | wish

Get and run 'bikegear.tcl' from ftp.tcl.tk:
$ twig ftp://ftp.tcl.tk/pub/tcl/misc/bikegear.tcl | wish

Get and run 'someprog.tcl' from the local filesystem:
$ twig file://~/myprogs/someprog.tcl | tclsh

Get the current date and time:
$ twig echo://'The time is: [clock format [clock seconds]]'
The time is: Mon Oct 22 16:28:37 EDT 2007

Generate and use a config file:
$ twig -g 8573 > cballs.twig
$ twig -c cballs.twig | wish

Generate a config and pipe it to the config input of another TWiG.
$ twig -g 8573 | twig -c - | wish


COMMAND-LINE
Arguments starting with '-' can be in any order.

twig.tcl [-u url] [-t tag] [-r re] [-s sum] page [blocks]
twig.tcl [-u url] [-t tag] [-r re] -c file
twig.tcl [-u url] [-t tag] [-r re] [-n] -g page [blocks] [name]
twig.tcl [-u url] [-t tag] [-r re] [-n] -q lines page [name]
twig.tcl [-u url] [-t tag] [-r re] -o [name]

Defaults (if any) are in '[]':
-u <url>   : Retrieve pages from <url>. [wiki://wiki.tcl.tk/]
-c <file>  : Read the config from <file>; '-' means standard input.
-o         : Generate config globals.
-g         : Generate config line.
-q <lines> : Query page, displaying at most <lines> lines of each block.
-t <tag>   : Use HTML tag <tag> to delimit blocks. [pre]
-s <sum>   : Verify with sum <sum>.
-n         : Retrieve the name from the '<title></title>' block of the page. []
-r <re>    : Specify the regular expression used to split '.code' blocks.  [### <code_block id=\d+> #+\n*]


GLOBAL DEFAULTS
url:  'wiki://wiki.tcl.tk/'
tag:  'pre'
name: ''
re:   '### <code_block id=\d+> #+\n*'

PAGES AND BLOCKS
A 'page' is whatever 'valid' data is returned by whatever server the retrieval request is sent to.
A 'page' retrieved via wiki:// is normally considered to be a specially delimited text file.
A 'page' retrieved via http:// is normally considered to be an HTML document.
A 'page' retrieved via ftp:// or file:// is normally considered to be a text file. 
A 'block' in an HTML document (page) is an HTML 'fragment',
                normally some text enclosed in '<pre></pre>' tags.
                The tag can be changed with the '-t' command-line option.
A 'block' in a text file is the declaration, arguments and body of a [proc].
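
As an illustration of how [proc] blocks might be picked out of a text file,
here is a minimal sketch built on [info complete], the same command Twig's
parseTcl proc uses. The proc name procBlocks and the sample script are
hypothetical.

```tcl
# Hypothetical helper: collect each complete [proc] definition from a script.
proc procBlocks {script} {
    set blocks {}; set text {}; set inproc 0
    foreach line [split $script \n] {
        # A block starts at a line whose first word is proc.
        if {!$inproc && [regexp {^\s*proc\s} $line]} { set inproc 1 }
        if {!$inproc} { continue }
        append text $line \n
        # [info complete] becomes true once braces and quotes are balanced.
        if {[info complete $text]} {
            lappend blocks $text
            set text {}; set inproc 0
        }
    }
    return $blocks
}

# Hypothetical sample script with two procs and some other commands.
set script {set x 1
proc hello {} {
    puts hello
}
puts "not part of any block"
proc add {a b} { expr {$a + $b} }
}

puts [llength [procBlocks $script]]
```

Lines outside a proc definition are simply skipped, so the sample yields two
blocks.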

A page to retrieve is specified on the command-line or in a config file.
One or more blocks to retrieve can be specified. Blocks are numbered
starting from 1. The default is 0 which will cause TWiG to retrieve all
the blocks on the specified page for http:// retrievals or the entire page
for ftp:// or file:// retrievals. Blocks can be specified by a number (37,4,0,12,etc.),
by a list of numbers separated by spaces ("7 3 10","4 9",etc.) or, since version 0.4,
by a range (3-6; 1,2-4,10 as of version 0.5). Multiple blocks
from the same page are retrieved in one operation and checksummed as a unit.
As of version 0.5, blocks are returned in the order given in the block list;
earlier versions returned them in ascending numerical order.


PAGE SYNTAX
A page specified as a simple string without any url specification
("8573", "index.html",etc.) will be appended to the server specification
(either the default or the default overridden by a config file or the -u
command-line option); the page "8573" will result in the url "wiki://wiki.tcl.tk/8573".
Pages specified in this manner will be retrieved via wiki:// and processed
as '.code' text split into blocks with the 're' pattern. This is the default behavior.
Urls can also be specified in full ("https://wiki.tcl-lang.org/8573",
"ftp://ftp.tcl.tk/pub/tcl/misc/bikegear.tcl", "file://myprog.tcl", etc).
This will cause retrieval using the mentioned protocols and processing as described above.

ECHO
When TWiG processes an echo:// uri it will take whatever text is after the '//',
checksum it, verify the checksum if necessary, then [subst] and return it.
Note that only this uri scheme will checksum the text before any processing (in this
case that means running the text through [subst]), the other uri schemes will checksum
after processing (usually the simple accumulation of data). This allows checksumming
Tcl scripts that have varying output (echo://'# Generated on [clock format [clock seconds]]',etc).
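
The order of operations for echo:// (checksum first, [subst] afterwards) can
be sketched as follows. To keep the sketch free of Tcllib, a CRC-32 from the
[zlib] command built into Tcl 8.6 stands in for the MD5 sum that Twig really
uses; the proc name sumOf is hypothetical.

```tcl
package require Tcl 8.6

# Stand-in checksum (an assumption for this sketch): CRC-32 instead of Twig's MD5.
proc sumOf {text} { return [format %08X [zlib crc32 $text]] }

# The text as it would appear after 'echo://'.
set text {# Generated on [clock format [clock seconds]]}

set sum [sumOf $text]   ;# computed on the text as written, so it is stable
set out [subst $text]   ;# substituted afterwards, so it changes with the clock

puts "sum:    $sum"
puts "output: $out"
```

Because the sum is taken before [subst] runs, the same sum verifies the text
on every run even though the output line differs each time.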

URLS
If a config file line has a url, it will override the page url.
If a url is specified on the command-line, it will override the page url.
The page url will be overridden this way whether it is specified on the
command-line or in a config file.
The last component of the page url will be added to the overridden url to
form the new url.
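
A minimal sketch of that last rule, assuming simple '/'-separated urls: the
proc name overrideUrl is hypothetical, and Twig itself works on the parts
returned by uri::split rather than on the raw string.

```tcl
# Hypothetical helper: keep only the last component of the page url
# and attach it to the overriding url.
proc overrideUrl {pageUrl newBase} {
    set last [lindex [split $pageUrl /] end]
    return "[string trimright $newBase /]/$last"
}

puts [overrideUrl wiki://wiki.tcl.tk/8573 http://wfr.tcl.tk]
```

With the arguments shown, the result is http://wfr.tcl.tk/8573.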

INFORMATION RETRIEVAL
A page is retrieved from the server using the 'scheme' taken (after some
processing) from the 'url'; the blocks are then extracted, their
checksum verified, and the result output.
TWiG will print an error message (or simply crash) and exit if there is
any problem at any point in this process.
Pages are retrieved using whatever happens to be the default encoding for
whichever method is used to retrieve pages. This should be ok most of the time.
Pages received via ftp:// use 'passive' ftp.


MANUAL VERIFICATION (-s)
If a checksum is supplied on the command-line then it will be compared
against the retrieved data's checksum. The data will still be output as
with a regular retrieval. TWiG will stop and generate an error message
if the checksums don't match. If the checksum (from command-line or config file)
is the special value 'DO-NOT-VERIFY-SUM' then no checksumming or verification
will be done.

$ twig -g 8573
8573 0 6054C886923FBC509D3EBF58864FEABD

Test and discard retrieved output:
$ twig -s 6054C886923FBC509D3EBF58864FEABD 8573 >/dev/null

With a bad sum:
The output is: "url blocks supplied-sum retrieved-sum"
$ twig -s wrongsum 8573 >/dev/null
twig: Bad checksum!
wiki://wiki.tcl.tk/ 8573 0 wrongsum 6054C886923FBC509D3EBF58864FEABD

TWiG will return appropriate success/failure codes to the shell:
$ twig -s wrongsum 8573 >/dev/null 2>&1 || echo badsum!
badsum!

Don't checksum:
$ twig -g -s DO-NOT-VERIFY-SUM 8573
8573 0 DO-NOT-VERIFY-SUM


GENERATE CONFIG LINE (-g)
TWiG's -g (generate) option will cause TWiG to generate one line of output
suitable for inclusion in a TWiG config file.
A config line is a proper Tcl list containing at least three elements.
If the default url is overridden, the supplied url will be included in the config line.
If a name is supplied (directly on the command-line or automatically retrieved
using the -n option), the name will be included in the config line. To supply
a name on the command-line requires also specifying a blocklist.

Simple:
$ twig -g 8573
8573 0 6054C886923FBC509D3EBF58864FEABD

Multi-block:
$ twig -g 1011 '5 2'
1011 {5 2} 09782D207920CD3477E4D974E18A10D1

With name supplied on the command-line:
$ twig -g 1011 '5 2' 'A couple of clocks'
1011 {5 2} 09782D207920CD3477E4D974E18A10D1 {} {A couple of clocks}

With name taken from the page's title:
$ twig -g -n 1011 '5 2'
1011 {5 2} 09782D207920CD3477E4D974E18A10D1 {} {An analog clock in Tk}

Override url:
$ twig -g -u https://wiki.tcl-lang.org -n 1011 '5 2'
1011 {5 2} 09782D207920CD3477E4D974E18A10D1 https://wiki.tcl-lang.org {An analog clock in Tk}


GENERATE CONFIG GLOBALS (-o)
Configurable globals are output along with their values
in a format suitable for inclusion in a config file.

Defaults:
$ twig -o
url wiki://wiki.tcl.tk/
name {}
tag pre
re {### <code_block id=\d+> #+\n*}

Override defaults: use -u, -t and -r to override the defaults
and put the name as the page parameter.
$ twig -o -u www.cpan.org -t script 'Perl Place' -r '<pre>.*?</pre>'
url www.cpan.org
name {Perl Place}
tag script
re <pre>.*?</pre>


CONFIG FILE (-c)
A config file contains one or more 'config line' (-g) lines and
optional 'config globals' (-o) lines. The '-c' option is used to
specify the name of the config file; TWiG will read the config
from standard input if the file name is a '-' character.
From a config file,
TWiG will process:
* config global lines  (should be [list]s).
* config lines         (should be [list]s).
TWiG will ignore:
* whitespace-only or zero-length lines.
* lines where the first non-whitespace character is a '#' character.
TWiG will cough up an error and die on:
* anything else.
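
The rules above can be sketched as a small classifier. The proc name
classifyLine is hypothetical; the element counts (2 for a config global, 3 to 5
for a config line) follow the loadConfig proc in the source. Unlike Twig, the
sketch does not check that a global's name is one of url, name, tag or re.

```tcl
# Hypothetical helper: classify one line of a config file.
# Returns skip, global or twig; raises an error for anything else.
proc classifyLine {line} {
    set line [string trim $line]
    if {$line eq {} || [string index $line 0] eq {#}} { return skip }
    # A line that is not a well-formed Tcl list makes [llength] fail.
    if {[catch {llength $line} n]} { error "not a list: $line" }
    if {$n == 2} { return global }
    if {$n >= 3 && $n <= 5} { return twig }
    error "unexpected number of elements ($n): $line"
}

foreach line {
    {# my bundle}
    {}
    {url wiki://wiki.tcl.tk/}
    {8573 0 6054C886923FBC509D3EBF58864FEABD}
    {1011 {5 2} 09782D207920CD3477E4D974E18A10D1 {} {A couple of clocks}}
} {
    puts [format {%-6s %s} [classifyLine $line] $line]
}
```

The five sample lines come out as skip, skip, global, twig and twig.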

Generate a config line and save it in a file:
$ twig -g 8573 > bballs.twig
$ cat bballs.twig
8573 0 6054C886923FBC509D3EBF58864FEABD

Use the config file:
$ twig -c bballs.twig | wish

Standard input can be specified instead of a regular file by using '-' for the file name:
$ twig -g 8573 | twig -c - | wish

If a different URL than the default is desired, use -o to
generate the necessary config globals lines.

Replace all defaults:
$ twig -t div -u wfr.tcl.tk -o 'Le wiki tcl francophone' > myconfig.twig
$ cat myconfig.twig
url wfr.tcl.tk
name {Le wiki tcl francophone}
tag div
re {### <code_block id=\d+> #+\n*}


TAG (-t)
Since the tag can be specified, it should be possible to extract other types of blocks.
This might be useful.

$ twig -t title 1011                                                                       
An analog clock in Tk

$ twig -u www.tcl.tk -t title index.html  
Tcl Developer Site


QUERYING (-q)
To help make finding blocks easier, the -q (query) option will output
and label every block found in the page. Each block is labeled with
the page number, the block number and an optional name. The name may
be supplied on the command-line or taken from the title of the web page
by using the '-n' option. The 'lines' parameter to the '-q' option determines
how many lines of each block to display (use 0 to display the entire block).

Let's start off by checking the page title:
$ twig -t title 1011                                                                       
An analog clock in Tk

Output the first line of all blocks on the page:
$ twig -q 1 1011                                         
1011:1  grid [canvas .c -width 200 -height 200]
1011:2  proc every {ms body} {
1011:3  set radius 35
1011:4  set radius 35
1011:5  set radius 10
1011:6  package require Tk
1011:7    set pi {}

That doesn't assist much in determining which block we'd like.
If the first line of the block would have been a comment describing
the block then this summary would be of greater use.
We can try looking at 2 lines per block, then 3 etc. until
we find what we want but let's simply look at all of it.

Browse all the blocks, showing entire blocks (output not shown):
$ twig -q 0 1011 | less

Browse all the blocks, showing the first five lines of each block (output not shown):
$ twig -q 5 1011 | less

It's nice to see all that code but without any comments we
still have to refer to the web page to understand everything.
It's probably a good idea to read the page first in any case.

We've decided on clock #6; let's try it out:
$ twig 1011 6 | wish

Ooh, that's pretty.

Now let's take a break and have a bit of fun:
$ twig -s 6D3E93EA90FF3F77736081A20976BA4E 8607 | wish


# EOF

Source

#! /bin/sh
# \
exec tclsh $0 ${1+"$@"}

# Twig
#
# Stuart Cassoff
#
# License: ISC. See the LICENSE file for details.
#
# September/October 2007
# Version 0.1 (TWiG)
#
# October 2008
# Versions 0.2, 0.3, 0.4
#
# November 2008
# Version 0.5, 0.6, 0.7
#

package require Tcl 8.5

namespace eval [subst [string repeat {[format %c [expr {65 + int(rand() * 26)}]]} 10]] {

proc usage {cfg} {
        upvar 1 $cfg c
        puts $c(stderr) [join [list "$c(prog) $c(ver) usage:" \
                {[-s sum] page [blocks]} {-c file} \
                {[-n] -g page [blocks] [name]} {[-n] -q lines page [name]} \
                {-o [name]}] [append {} "\n$c(me)" { [-u url] [-t tag] [-r re] }]]
}

proc errExit {cfg msg} {upvar 1 $cfg c; puts $c(stderr) "$c(me): $msg"; exit 1 }

proc setupPackages {} {
        foreach p [list uri http ftp htmlparse md5 textutil] { package require $p }
        uri::register echo {variable schemepart {//(.*)}}
        proc ::uri::SplitEcho {url} { return [list path [regsub {^//} $url {}]] }
        proc ::uri::JoinEcho {args} { array set a [list path {}]; array set a $args; return [append {} {echo://} $a(path)] }
        uri::register wiki {variable schemepart {//(.*)}}
        proc ::uri::SplitWiki {url} { return [::uri::SplitHttp $url] }
        proc ::uri::JoinWiki {args} { return [::uri::JoinHttp {*}$args] }
}

proc setup {cfg} {
        if {$::argc == 0} { usage $cfg; exit 1 }
        upvar 1 $cfg c
        set ( ); set elem {page}
        foreach arg $::argv { switch -exact -- ${(} {
                ) { switch -exact -- $arg {
                        -g                { set c(mode) {gentwig} }
                        -o                { set c(mode) {genopts} }
                        -n                { set c(gettitle) 1 }
                        -w                { set c(wc) {} }
                        -s - -t - -u - -c - -r - -q -
                        -b                { set ( $arg }
                        -? - --help        { usage $cfg; exit 0 }
                        default {        set c($elem) $arg
                                        switch $elem {        page    { set elem {blocks} }
                                                        blocks  { set elem {name} }
                                                        name    { set elem {} }
                                                        default { usage $cfg; exit 1 } }}}}
                -s { set ( ); set c(sum)   $arg }
                -t { set ( ); set c(tag)   $arg }
                -u { set ( ); set c(url)   $arg; set c(urloverride) 1 }
                -c { set ( ); set c(cfgfn) $arg; set c(mode) {getconfig} }
                -r { set ( ); set c(re)    $arg }
                -q { set ( ); set c(query) $arg }
                -b { set ( ); set c(bs)    $arg } }}
        setupPackages
        return $cfg
}

proc randvar {} { return [namespace current]::[subst [string repeat {[format %c [expr {97 + int(rand() * 26)}]]} 10]] }
proc adjUrl {url scheme} { return [expr {$url ne {} && ![regexp {^.+://} $url] ? "$scheme://" : ""}]$url }
proc openIt {fn} { return [expr {($fn eq {-} || $fn eq {stdin}) ? {stdin} : [open $fn r]}] }
proc closeIt {f} { return [expr {$f eq {stdin} ? {} : [close $f]}] }
proc inhaleIt {fn} { return [read [set f [openIt $fn]]][closeIt $f] }
proc loadHuh? {cfg l} { errExit $cfg "Huh? I don't understand this:\n$l" }

proc loadConfig {cfg} {
        upvar 1 $cfg c
        foreach l [split [inhaleIt $c(cfgfn)] \n] {
                if {[set l [string trim $l]] eq {} || [string index $l 0] eq {#}} { continue }
                if {[catch {llength $l} ll]} { loadHuh? $cfg $l }
                if {$ll == 2} {
                        lassign $l opt val
                        if {$opt ni $c(cfgopts)} { loadHuh? $cfg $l }
                        if {$opt eq {url} && $c(urloverride)} { continue }
                        set c($opt) $val
                } elseif {$ll > 2 && $ll < 6} {
                        lappend c(reaplist) $l
                } else {
                        loadHuh? $cfg $l
                }
        }
}

proc wc {data} {
        set cleaned 0
        while {[expr {([regexp -all -line {^ } $data] - [regexp -all {\n } $data]) == 1}]} {
                set data [regsub -all -line {^ } $data {}]
                set cleaned 1
        }
        if {$cleaned} { return [string trimright $data] } else { return $data }
}

proc expandRange {range} {
        set expanded {}
        foreach chunk [split $range ,] {
                foreach hi [lassign [split $chunk -] chunk] {
                        for {set lo $chunk; set chunk {}} {$lo <= $hi} {incr lo} { lappend chunk $lo }
                }
                lappend expanded {*}$chunk
        }
        return $expanded
}

proc expandBlocks {rc} {
        upvar 1 $rc r
        set r(eblocks) {}
        foreach block $r(blocks) { lappend r(eblocks) {*}[expandRange $block] }
}

proc addBlock {rc data args} {
        upvar 1 $rc r; upvar 1 $r(cfg) c
        if {[incr r(nblock)] in $r(eblocks) || $r(all)} {
                foreach zap $args { set data [$zap $data] }
                set r(acqblocks) [dict append r(acqblocks) $r(nblock) $data]
        }
}

proc postprocessEcho {rc} {
        upvar 1 $rc r
        set r(data) [subst $r(data)]
}

proc htmlParseCallback {rc tag slash parms text} {
        upvar 1 $rc r; upvar 1 $r(cfg) c
        if {$tag eq $c(tag) && $slash eq {}} {
                addBlock $rc $text htmlparse::mapEscapes {*}$c(wc)
        } elseif {$c(gettitle) && $tag eq {title} && $slash eq {}} {
                set r(name) $text
        }
}
proc parseHtml {rc} {
        htmlparse::parse -cmd [list [namespace which htmlParseCallback] $rc] [set ${rc}(rawData)]
}
proc parseWiki {rc} {
        upvar 1 $rc r; upvar 1 $r(cfg) c
        if {$c(gettitle)} { set r(name) [lindex [regexp -inline [string map [list '\[ '(\[ *' *)'] $c(re)] [string range $r(rawData) 0 [string first \n $r(rawData) 2]]] 1] }
        foreach b [lrange [textutil::splitx $r(rawData) $c(re)] 1 end] { addBlock $rc $b {*}$c(wc) }
}
proc parseTcl {rc} {
        upvar 1 $rc r; upvar 1 $r(cfg) c
        set inproc 0; set text {}
        foreach l [split $r(rawData) \n] {
                if {!$inproc} { if {[regexp {^( *proc *)(.*?)( .*)} $l]} { set inproc 1 } }
                if {!$inproc} { continue }
                if {[info complete [append text $l \n]]} { addBlock $rc $text; set inproc 0; set text {} }
        }
}
proc parseEcho {rc} {
        upvar 1 $rc r
        foreach l [split $r(rawData) :] { addBlock $rc $l }
}

proc reapHttp {rc} {
        upvar 1 $rc r
        set tok [http::geturl $r(url)]
        if {[http::ncode $tok] != 200} { errExit $r(cfg) "HTTP problem: ([http::code $tok])" }
        set r(rawData) [http::data $tok]
        http::cleanup $tok
}
proc reapFtp {rc} {
        upvar 1 $rc r
        if {[set f [ftp::Open $r(host) $r(user) $r(pwd) -port $r(port) -mode passive]] == -1} { errExit $r(cfg) "FTP problem: (Can't connect to $r(url))" }
        if {[ftp::Get $f $r(path) -variable r(rawData)] == 0} { ftp::Close $f; errExit $r(cfg) "FTP problem: (Can't retrieve $r(path))" }
        ftp::Close $f
}
proc reapFile {rc} {
        upvar 1 $rc r
        set r(rawData) [inhaleIt $r(path)]
}
proc reapEcho {rc} {
        upvar 1 $rc r
        set r(rawData) $r(path)
}

proc assembleBlocks {rc} {
        upvar 1 $rc r; upvar 1 $r(cfg) c
        if {![info exists r(acqblocks)]} { return }
        set s [dict size $r(acqblocks)]
        set rblocks [expr {!$r(all) ? $r(eblocks) : [expandRange 1-$s]}]
        if {$c(query) < 0} {
                foreach i $rblocks { lappend blob [dict get $r(acqblocks) $i] }
        } else {
                append label "#<" $r(page) : %[string length $s]d>
                if {$r(name) ne {}} { append label { } {"} $r(name) {"} }
                if {$c(query) != 1} { append label \n }
                set ind [expr {$c(query) == 0 ? "end" : ($c(query) - 1)}]
                foreach i $rblocks {
                        lappend blob [format $label $i][join [lrange [split [dict get $r(acqblocks) $i] \n] 0 $ind] \n]
                }
        }
        set r(data) [join $blob [string repeat \n $c(bs)]]
}

proc calcSum {rc} {
        upvar 1 $rc r
        set r(calc_sum) [md5::md5 -hex $r(data)]
}
proc verifySum {rc} {
        upvar 1 $rc r
        if {[string tolower $r(sum)] ne [string tolower $r(calc_sum)]} { errExit $r(cfg) "Bad checksum!\n$r(url) $r(blocks) $r(sum) $r(calc_sum)" }
}

proc reap {rc} {
        upvar 1 $rc r
        foreach p $r(process) { $p $rc }
        return $rc
}

proc setupReapContext {rc rcline} {
        upvar 1 $rc r; upvar 1 $r(cfg) c
        lassign $rcline r(page) r(blocks) r(sum) r(url) r(name)
        array set a [list host {} port {} user {} pwd {}]
        array set a [uri::split $r(page) $c(scheme)]
        if {$c(urloverride) || $r(url) ne {}} {
                set u [expr {$c(urloverride) ? $c(url) : $r(url)}]
                set path [file tail $a(path)]
                array set a [uri::split [adjUrl $u $c(scheme)] $c(scheme)]
                append a(path) $path
        }
        switch -exact -- $a(scheme) { echo {
                foreach e [list scheme path] { set r($e) $a($e) }
                set r(url) [::uri::join scheme $r(scheme) path $r(path)]
        } file {
                set a(path) [append a(host) $a(path)]
                foreach e [list scheme path] { set r($e) $a($e) }
                set r(url) [::uri::join scheme $r(scheme) path $r(path)]
        } ftp {
                if {$a(port) eq {}} { set a(port) {21} }
                if {$a(user) eq {}} { set a(user) {anonymous} }
                if {$a(pwd)  eq {}} { set a(pwd)  {[email protected]} }
                foreach e [list scheme host port path user pwd] { set r($e) $a($e) }
                set r(url) [::uri::join scheme $r(scheme) host $r(host) port $r(port) path $r(path) user $r(user) pwd $r(pwd)]
        } wiki - http {
                if {$a(host) eq {}} {
                        array set aa [uri::split $c(url) $c(scheme)]
                        foreach e [list scheme host] { set a($e) $aa($e) }
                }
                foreach e [list scheme host port path] { set r($e) $a($e) }
                set r(url) [::uri::join scheme $r(scheme) host $r(host) port $r(port) path $r(path)]
                if {$a(scheme) eq {wiki}} { append r(url) .code }
        } default { errExit $r(cfg) "Huh? ($a(scheme))" } }
        if {$r(name) eq {}} { set r(name) $c(name) }
        if {$c(sum) ne {}} { set r(sum) $c(sum) }
        array set r [list calc_sum {} data {} nblock 0 process {} parse [string map $c(parsemap) $r(scheme)]]
        lappend r(process) reap[string totitle $r(scheme)] parse[string totitle $r(parse)] assembleBlocks
        if {$r(sum) eq $c(noverifysum)} {
                set r(calc_sum) $r(sum)
        } else {
                lappend r(process) calcSum
                if {$r(sum) ne {} || $c(cfgfn) ne {}} { lappend r(process) verifySum }
        }
        if {$a(scheme) eq {echo}} { lappend r(process) postprocess[string totitle $r(parse)] }
        expandBlocks $rc
        set r(all) [expr {[llength $r(eblocks)] == 1 && [lindex $r(eblocks) 0] == 0}]
        return $rc
}
proc newReapContext {cfg} {
        upvar 1 $cfg c
        upvar 1 [set rc [randvar]] r
        array set r [list cfg $cfg page $c(page) blocks $c(blocks) sum {} url $c(url) name $c(name)]
        return [setupReapContext $rc [list $c(page) $c(blocks) {} {}]]
}

proc go {cfg} {
        upvar 1 $cfg c
        switch $c(mode) {
        getpage { puts $c(stdout) [set [reap [newReapContext $cfg]](data)] }
        getconfig {
                loadConfig $cfg
                upvar 1 [set rc [newReapContext $cfg]] r
                foreach get $c(reaplist) {
                        reap [setupReapContext $rc $get]
                        puts $c(stdout) $r(data)
                }
        }
        gentwig {
                upvar 1 [reap [newReapContext $cfg]] r
                set l [list $r(page) $r(blocks) $r(calc_sum)]
                if {$c(urloverride)} { lappend l [adjUrl $r(host) $r(scheme)] } elseif {$r(name) ne {}} { lappend l {} }
                if {$r(name) ne {}} { lappend l $r(name) }
                puts $c(stdout) $l
        }
        genopts { set c(name) $c(page); foreach opt $c(cfgopts) { puts $c(stdout) [list $opt $c($opt)] } }
        default { errExit $cfg "Huh? ($c(mode))" }
        }
}

proc init {} {
        upvar 1 [set cfg [randvar]] c
        array set c [list \
                prog {Twig} ver {0.7} me [file tail $::argv0]    \
                name {} reaplist {} page {} cfgfn {} sum {}      \
                gettitle {0} urloverride {0} query {-1}          \
                scheme {wiki} parse {wiki} mode {getpage}        \
                tag {pre} url {wiki://wiki.tcl.tk/}              \
                re {### <code_block id=\d+ title='[^']*'> #+\n*} \
                wc {wc} bs {1} blocks [list 0]                   \
                cfgopts [list url name tag re]                   \
                parsemap [list http html ftp tcl file tcl]       \
                stdin {stdin} stdout {stdout} stderr {stderr}    \
                noverifysum {DO-NOT-VERIFY-SUM}                  \
                author {Stuart Cassoff}                          \
        ]
        interp alias {} [namespace current]::reapWiki {} [namespace current]::reapHttp
        return $cfg
}

go [setup [init]]

}; #End of namespace

# EOF