Web scraping

Web scraping is the practice of extracting information from a web page and reformatting it for some other use.

Description

[Based on comp.lang.tcl postings from jooky, David Welton, and Larry Virden.]

When a service API is not available, sometimes the only recourse is to scrape the web interface instead. An alternative is to work with the web host to arrange details of a Web Service that would provide the same information programmatically.

Web scraping is often employed for small tasks where no API is available, such as sending updates to a pager or WAP phone, emailing one's account status to oneself, or moving data from a website into a local database.

For non-Tcl tools for scraping the web, see projects like http://sitescooper.org/ , http://www.plkr.org/ , or the Python-based mechanize, a descendant of Andy Lester's WWW::Mechanize for Perl.

Etiquette

NEM: To be a good web citizen, any web robot should follow some guidelines:

  • set the user-agent to some unique identifier for your program and include some contact details (email, website) so that site admins can contact you if they have any issues (you can do this with http::config -useragent);
  • fetch the /robots.txt file from the server you are scraping and check every URL accessed against it (see [L1 ]).

(Probably more - feel free to expand this list.) Checking robots.txt is relatively simple but still requires a bit of effort. As I'm currently doing some web scraping, I may package up the code I'm using for this into a webrobot package.
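A minimal sketch of both guidelines using only the http package. The bot name, URL, and email address are placeholders, and the robots.txt handling is deliberately naive: it honours only Disallow lines in the "User-agent: *" record, with no Allow lines or wildcards.

```tcl
package require http

# Identify the robot and give site admins a way to contact you.
# Name, URL, and address here are placeholders.
::http::config -useragent \
    "examplebot/1.0 (+http://example.com/bot.html; bot@example.com)"

# Fetch robots.txt and collect the Disallow prefixes that apply to
# everyone (the "User-agent: *" record).
proc disallowedPaths {base} {
    set tok [::http::geturl $base/robots.txt]
    set body [::http::data $tok]
    ::http::cleanup $tok
    set paths {}
    set applies 0
    foreach line [split $body \n] {
        set line [string trim $line]
        if {[regexp -nocase {^user-agent:\s*(.*)$} $line -> agent]} {
            set applies [expr {[string trim $agent] eq "*"}]
        } elseif {$applies &&
                  [regexp -nocase {^disallow:\s*(.*)$} $line -> path]} {
            set path [string trim $path]
            if {$path ne ""} { lappend paths $path }
        }
    }
    return $paths
}

# Check a URL path against the collected prefixes before fetching it.
proc allowed {path disallowed} {
    foreach prefix $disallowed {
        if {[string match $prefix* $path]} { return 0 }
    }
    return 1
}
```

A real robot would also want to cache the robots.txt result per host rather than refetching it for every URL.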

Examples

Amazon.de PreOrder
LV: This isn't technically web scraping, but I'm uncertain where else to reference it - it makes use of a web site's CGI functionality from a Tk application, from what I can tell.
An HTTP Robot in Tcl
How to query a CGI script with the http package and regexp.
A little rain forecaster
Daily Dilbert
Downloading pictures from Flickr
Downloading your utility usage from Pacific Gas and Electric using TCL
Getting stock quotes over the internet
Googling with SOAP
Hacker News
Scraping Hacker News to get the front page stories.
HTML to DOM via http/htmlparse/struct::tree packages
Picture of the Day, by Keith Vetter, 2023
Scrape images from Wikipedia.
Scraping timeentry.kforce.com
An example of scraping a password-protected website over ssl.
Synchronizing System Time
tcllib/demos/oscon
Uses the htmlparse module (among others) to parse the schedule pages for OSCON 2001 and convert them into reformatted HTML, an ASCII table, and CSV files (the latter usable by Excel and other applications). The demo requires some fixes, as noted on OSCON 2001.
LemonTree branch uses htmlparse module also.
Web2Desktop
Downloads the latest daily User Friendly comic strip and sets it as the desktop background on Windows.
TcLeo
Query the English <=> German web dictionary at http://dict.leo.org from the command line.
tDOM's HTML-parser + XPath expression
Web scraping example for EBay, presented at the First European Tcl/Tk Users Meeting ( http://sdf.lonestar.org/~loewerj/tdom2001.ppt and http://www.tu-harburg.de/skf/tcltk/tclum2001.pdf.gz ).
ucnetgrab
Forum post text scraping with tDOM.
TaeglicherStrahlungsBericht
Daily updated radiation map of German government's (BfS) sensor installations.
tinyurl
Web Scraping for Web Services: Why and How, Cameron Laird, 2002 (?)
Web Scraping with htmlparse
Scraping the Tcler's Wiki with htmlparse and several ways of processing the result.
wiki-reaper (and wish-reaper)
Extract code samples from the Tcler's wiki. The latest version no longer scrapes HTML due to the wiki's new features but the previous one (which can be examined via page history) does.
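Several of the examples above follow the same basic pattern: fetch a page with the http package and pull values out with regexp. A minimal sketch of that pattern; the URL and pattern here are placeholders, and a real robot should also follow the etiquette guidelines above.

```tcl
package require http

# Fetch a page and extract the contents of its <title> element
# with a regular expression.  Regexp-based scraping is fragile;
# for anything structural, prefer htmlparse or tDOM (see Tools).
proc pageTitle {url} {
    set tok [::http::geturl $url]
    set html [::http::data $tok]
    ::http::cleanup $tok
    if {[regexp -nocase {<title>(.*?)</title>} $html -> title]} {
        return [string trim $title]
    }
    return ""
}
```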

Tools

htmlparse
A Tcllib package that implements generic HTML parsing as well as conversion of HTML structure into a struct::tree. The resulting tree can be queried with treeql (a COST-like DSL for manipulating tree objects, also in Tcllib) or treeselect (CSS-like selectors).
tclwebtest
A tool to write automated tests for web applications.
TWiG
A tool for extracting blocks of data from pages retrieved from the web.
Getleft
A web site grabber.
pop3line
Fetches e-mails from the web-mailer of T-Online.
webrobot - a package for web scraping
A simple snit wrapper around the http package, adding support for authenticating proxies, following redirects, and rudimentary cache and cookie support.
A little RSS reaper
RS: loads an RSS page, renders it to HTML, plus it compacts the referenced pages into the same document with local links, while trying to avoid ads and noise. Not perfect, but I use it daily to reap news sites onto my iPaq :)
LemonTree branch
A GUI to browse an HTML document.
grabchat
A tool to grab yesterday's tcler's wiki chat room log.
AutoIt
A simple tool to simulate key presses, mouse movements, and window commands.

Non-Tcl Tools

Beautiful Soup
A web scraping library for Python.
iMacros for Firefox
spidermonkey
Kaitzschu has suggested that this deserves a Tcl binding.

See Also

screenscrape
websearch
TkChat
A Tk/Tcl web scraper for this web site's chat room.
lapecheronza
Monitor your favorite websites for updates and changes.
Tutorial for Web scraping with regexp and tdom
web crawler

Resources

  • Web scraping is easy, Cameron Laird and Kathryn Soraiz, 2003
  • Apt comments on the technical and business difficulty of Web scraping, along with mentions of WebL and NQL, appear here [L2 ].
  • Perl probably has the most current Web-scraping activity (even more than tclwebtest?), especially with WWW::Mechanize, although Curl also has its place.
  • In 2004, WWWGrab [L3 ] looks interesting for those in a position to work under Windows.
An almost perfect real-world hack
Louis Brandy describes using Python and iMacros to do some web scraping.

Example by Ian, 2007, comp.lang.tcl

LV 2007-11-01: On comp.lang.tcl, during Oct 31, 2007, in a thread [L4 ] about extracting data from an HTML page, a user by the name of Ian posted the following snippet of code as an example of how they pull data out of a page of HTML:

package require htmlparse
package require struct

proc html2data s {
    # Parse the HTML into a struct::tree and strip purely visual markup.
    ::struct::tree x
    ::htmlparse::2tree $s x
    ::htmlparse::removeVisualFluff x

    set data [list]

    # Walk the tree looking for the text node that marks the data of interest.
    x walk root q {
        if {([x get $q type] eq "PCDATA") &&
            [string match R\u00e6kke/pulje [x get $q data]]} {

            # Step up three levels from the text node to the enclosing
            # element, then process every child row after the first.
            set p $q
            for {set i 3} {$i} {incr i -1} {set p [x parent $p]}
            foreach {row} [lrange [x children $p] 1 end] {

            ......
            }
            break
        }
    }
    return $data
}

LemonTree branch uses this technique.
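The struct::tree built by htmlparse::2tree can also be queried declaratively with Tcllib's treeql instead of being walked by hand. A sketch, assuming the type and data node attributes that 2tree sets (as used in the example above):

```tcl
package require htmlparse
package require struct
package require treeql

# Return the text content of every PCDATA node in an HTML string.
proc htmlText {html} {
    ::struct::tree t
    ::htmlparse::2tree $html t
    ::htmlparse::removeVisualFluff t

    # "tree" selects all nodes, "withatt" filters on an attribute value,
    # "get" maps the node set to the values of an attribute.
    ::treeql q -tree t
    set result [q query tree withatt type PCDATA get data]

    q destroy
    t destroy
    return $result
}

puts [htmlText {<html><body><p>Hello <b>world</b></p></body></html>}]
```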

Misc

[It seems many of us do our own "home-grown" solutions for these needs.]

Also, some web hosts provide XML versions (RSS), or specially formatted versions for use with Avantgo or the plucker command, with the goal of aiding people who need some sort of specialized format for small devices, etc.

It would be great if someone making legitimate use of some of these sources would share some of their code to do this sort of thing.