[[clt postings from jooky and [David Welton] and [Larry Virden].]]

Web scraping is the practice of getting information from a web page and reformatting it. Some reasons one may do this are to send updates to a pager, WAP phone, etc., to email one's personal account status to an account, or to move data from a website into a local database.

See projects like http://sitescooper.org/ , http://www.plkr.org/ , or [Python]-based mechanize [http://wwwsearch.sourceforge.net/mechanize/] (itself a descendant of Andy Lester's '''WWW::Mechanize''' for [Perl]) for non-[Tcl] tools for scraping the web.

"Web Scraping ..." [http://cedar.intel.com/cgi-bin/ids.dll/content/content.jsp?cntKey=Generic+Editorial%3a%3aws_scraping&cntType=IDS_EDITORIAL&cat=CJA]

"Web scraping is easy" [http://www.unixreview.com/documents/s=7822/ur0302h/]

----

* [An HTTP Robot in Tcl]
* [websearch]
* [Daily Dilbert]
* [Web2Destop]
* [Getting stock quotes over the internet]
* Also see '''[tcllib]/examples/oscon''', which uses the [htmlparse] module (among others) to parse the schedule pages for OSCON 2001 and convert them into [CSV] files usable by [Excel] and other applications. [LemonTree branch] uses this technique.
* See also tDOM's HTML parser plus XPath expressions in the web scraping example for eBay presented at the [First European Tcl/Tk Users Meeting] ( http://sdf.lonestar.org/~loewerj/tdom2001.ppt and http://www.tu-harburg.de/skf/tcltk/tclum2001.pdf.gz)
* [TkChat] is one Tk/Tcl web scraper for this web site's chat room!
* [TcLeo] allows querying the English <=> German web dictionary at http://dict.leo.org from the command line.
* [Getleft]
* [wiki-reaper] (and [wish-reaper])
* [A little rain forecaster]
* [pop3line] fetches e-mails from the web-mailer of T-Online.
* [tclwebtest]
* [tinyurl]
* [lapecheronza] monitors your favorite websites for updates and changes.
* Kaitzschu has suggested that SpiderMonkey [http://www.mozilla.org/js/spidermonkey/] deserves Tcl [binding]s.
* [Downloading pictures from Flickr]
* Tutorial for Web scraping with regexp and tdom ([http://www.vogel-nest.de/wiki/Main/WebScraping1])
* [Synchronizing System Time]
* [grabchat]
* [TWiG]
* [ucnetgrab]
* [web crawler]
* [TaeglicherStrahlungsBericht] Daily updated radiation map of the German government's (BfS) sensor installations

----

* Apt comments on the technical and business difficulty of Web scraping, along with mentions of WebL and NQL, appear here [http://lambda.weblogs.com/discuss/msgReader$3567].
* Some people go in a record-and-playback direction with such tools as [AutoIt].
* [Perl] probably has the most current Web-scraping activity (even more than tclwebtest?), especially with WWW::Mechanize [http://www.perl.com/pub/a/2003/01/22/mechanize.html], although Curl [http://www.unixreview.com/documents/s=1820/uni1011713175619/0201i.htm] also has its place.
* In 2004, WWWGrab [http://www.wwwgrab.com/] looks interesting for those in a position to work under [Windows].

----

[[It seems many of us do our own "home-grown" solutions for these needs.]]

----

An alternative to web scraping is to work with the web host to work out details of a [Web Service] that would provide useful information programmatically. Also, some web hosts provide [XML] versions ([RSS]), or specially formatted versions for use with Avantgo or the plucker command, with the goal of aiding people who need some sort of specialized format for small devices, etc.

It would be great if someone making legitimate use of some of these sources would share some of their code to do this sort of thing.

[RS]: [A little RSS reaper] loads an RSS page, renders it to HTML, plus it compacts the referenced pages into the same document with local links, while trying to avoid ads and noise.
Not perfect, but I use it daily to reap news sites onto my iPaq :)

----

[LV] 2007 Nov 1: On comp.lang.tcl, on Oct 31, 2007, in a thread [http://groups.google.com/group/comp.lang.tcl/browse_thread/thread/0d2d015491a231bc/75a399da09ae569f#75a399da09ae569f] about someone wanting to extract data from an HTML page, a user by the name of Ian posted the following snippet of code as an example of what they do to deal with a page of HTML that has some data in it:

 package require htmlparse
 package require struct
 proc html2data s {
     # Parse the HTML into a tree named x and drop purely visual markup.
     ::struct::tree x
     ::htmlparse::2tree $s x
     ::htmlparse::removeVisualFluff x
     set data [list]
     x walk root q {
         # Look for the text node that marks the table of interest.
         if {([x get $q type] eq "PCDATA") &&
             [string match R\u00e6kke/pulje [x get $q data]]} {
             # Climb three levels up from the matching text node.
             set p $q
             for {set i 3} {$i} {incr i -1} {set p [x parent $p]}
             # Skip the first child (the header row) and visit the rest.
             foreach {row} [lrange [x children $p] 1 end] {
                 ......
             }
             break
         }
     }
     # Release the tree so the proc can be called more than once.
     x destroy
     return $data
 }

[LemonTree branch] uses this technique.

----

See also [screenscrape], [download files via http], [parallel geturl].

----

[LV] This isn't technically web scraping, but I'm uncertain where else to reference it - it is making use of a web site's CGI functionality, from a Tk application, from what I can tell.... [Amazon.de PreOrder]. Or even [googling with SOAP].

----

[[Mention Beautiful Soup here, or perhaps in the vicinity of [htmlparse].]]

----

[NEM] notes that to be a good web-citizen any web robots should follow some guidelines:

* set the user-agent to some unique identifier for your program and include some contact details (email, website) so that site admins can contact you if they have any issues (you can do this with http::config -useragent);
* fetch the /robots.txt file from the server you are scraping and check every URL accessed against it (see [http://www.robotstxt.org/]).

(probably more - feel free to expand this list). Checking the robots.txt is relatively simple but still requires a bit of effort.
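A minimal sketch of the two guidelines above, not a complete robots.txt implementation. The proc names robotsDisallowed and robotsAllowed, the user-agent string, and example.com are all made up for illustration. The sketch only handles plain Disallow prefixes: it ignores Allow lines and wildcards, and it obeys the "*" records in addition to any record naming the robot, which is stricter than the convention requires.

```tcl
package require http

# Identify the robot. The name and contact address are placeholders.
http::config -useragent "ExampleTclRobot/0.1 (+mailto:admin@example.com)"

# Return the Disallow path prefixes in robotsTxt that apply to agent.
# Records for "*" are treated as applying to every robot.
proc robotsDisallowed {robotsTxt agent} {
    set prefixes {}
    set applies 0   ;# does the current record apply to this robot?
    set inAgents 0  ;# still reading the User-agent lines of a record?
    foreach line [split $robotsTxt \n] {
        # Strip comments and surrounding whitespace, skip empty lines.
        regsub {#.*$} $line {} line
        set line [string trim $line]
        if {$line eq ""} continue
        if {![regexp {^([^:]+):(.*)$} $line -> field value]} continue
        set field [string tolower [string trim $field]]
        set value [string trim $value]
        switch -- $field {
            user-agent {
                if {!$inAgents} {
                    set applies 0
                    set inAgents 1
                }
                set v [string tolower $value]
                if {$v eq "*" || [string first $v [string tolower $agent]] >= 0} {
                    set applies 1
                }
            }
            disallow {
                set inAgents 0
                if {$applies && $value ne ""} {
                    lappend prefixes $value
                }
            }
            default {
                set inAgents 0
            }
        }
    }
    return $prefixes
}

# Return 1 if path does not start with any of the disallowed prefixes.
proc robotsAllowed {prefixes path} {
    foreach prefix $prefixes {
        if {[string equal -length [string length $prefix] $prefix $path]} {
            return 0
        }
    }
    return 1
}

# Usage sketch (needs network access; example.com is a placeholder):
#   set tok [http::geturl http://example.com/robots.txt]
#   set rules [robotsDisallowed [http::data $tok] ExampleTclRobot]
#   http::cleanup $tok
#   if {[robotsAllowed $rules /some/page.html]} { ...fetch the page... }
```

Fetching robots.txt once per host and caching the resulting prefix list keeps the per-URL check cheap.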
As I'm currently doing some web-scraping I may package up the code I'm using for this into a webrobot package.

----

While "An almost perfect real-world hack" [http://lbrandy.com/blog/2009/08/an-almost-perfect-hack/] rests on [Python] rather than Tcl, the mention of iMacro is equally apt for us, and Python is essentially equivalent to Tcl for these purposes, anyway.

----

!!!!!!
%| [Category Internet] |%
!!!!!!