[HJG] Someone has uploaded a lot of pictures to [Flickr], and I want to show them someplace where no internet access is available. The pages at Flickr have a lot of links, icons etc., so a simple recursive download with e.g. wget would fetch lots of unwanted stuff. Of course, I could tweak the parameters for calling wget (--accept, --reject, etc.), or get the html-pages and then filter their contents with [awk] or perl, but doing roughly the same thing in Tcl looks like more fun :-) Moreover, with a Tcl-script I can also get the titles and descriptions of the images.

So the first step is to download the html-pages from that person, extract the links to the photos from them, then download the photo-pages (containing titles and complete descriptions) and the pictures in the selected size (Thumbnail=100x75, Small=240x180, Medium=500x375, Large=1024x768, Original=as taken). Then we can make a [Flickr Offline Photoalbum] out of them, or just use a program like IrfanView [http://www.irfanview.com] to present the pictures as a slideshow.

This is the beta-version of the downloader:
----
#!/bin/sh
# Restart with tcl: -*- mode: tcl; tab-width: 4; -*- \
exec wish $0 ${1+"$@"}

# FlickrDownload.tcl - HaJo Gurt - 2006-01-20 - http://wiki.tcl.tk/15303
#: Download webpages and images for a photo-album from flickr.com
#
# 2005-11-22 First Version
# 2005-11-23 entry
# 2005-11-24 checkbuttons
# 2005-11-25 save data to file

# Todo:
# * Save infos to file for next stage (album-maker)
# * expand Analyse1 to recognize set-pages, search-result-pages etc.
# * Bug: !! End of Multiline-Descriptions not detected
# * ?? FetchImage: check status

package require Tk
package require Img
package require http

proc Init {} {
#: Initialize Values
    global Prg Const

    set Prg(Title)   "Flickr-Download"
    set Prg(Version) "v0.32"
    set Prg(Date)    "2006-01-26"
    set Prg(Author)  "Hans-Joachim Gurt"
    set Prg(Contact) [string map -nocase {: @ ! .} gurt:gmx!de]
    set Prg(About)   "Download pictures from a photo-album at Flickr.com"  ;#:  #%%

    set Const(Prefix1)  "s"
    #set Const(Prefix1) "page"  ;# page01.html
    set Const(Datafile) slides.txt
}

#########1#########2#########3#########4#########5#########6#########7#####

proc Print { Str {Tag ""} } {
#: Output to text-window
    #puts $Str
    .txt1 insert end "\n"
    .txt1 insert end "$Str" $Tag
    .txt1 see end  ;# scroll to bottom
    update
}

proc Log { Str {Tag ""} } {
##: Debug-Output
    #Print $Str $Tag  ;##
}

proc ShowOpt {} {
##: Debug: Show Options
    global Opt
    return  ;##
    Print ""
    foreach key [array names Opt] {
        Print "$key : $Opt($key)"
    }
}

#########1#########2#########3#########4#########5#########6#########7#####

proc GetPage { url } {
#: Fetch a webpage from the web
    set token [::http::geturl $url]
    set page  [::http::data $token]
    ::http::cleanup $token
    return $page
}

proc FetchImage { url fname } {
#: Fetch a picture from the web
#  See also: [Polling web images with Tk]
    #puts -nonewline "Fetch: \"$url\" "
    Print "Fetch: \"$url\" " DL

    set Stat "skip"
    ## %% Deactivate for offline-testing:
    if 1 {
        set f [open $fname w]
        fconfigure $f -translation binary
        set imgtok [http::geturl $url -binary true -channel $f]
        #set Stat [::http::error $imgtok]
        set Stat [::http::status $imgtok]
        # ?? Errorhandling ??
        flush $f
        close $f
        http::cleanup $imgtok
    }
    Print " Status: $Stat " Ok  ;# ?? true status
}
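# Possible approach for the "check status" item on the Todo-list above
# (untested sketch): ::http::status only reports ok/eof/reset, while the
# numeric HTTP result code is available via ::http::ncode, so a real
# check could look like:
#   if { [::http::ncode $imgtok] != 200 } { Print "Failed: $url" Hi }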
#########1#########2#########3#########4#########5#########6#########7#####

proc Analyse1 { url1 page } {
#: Analyse flickr album-webpage,
#  like http://www.flickr.com/photos/PERSON
#  or   http://www.flickr.com/photos/PERSON/page2
    global PicNr Const Opt Data Data2

    set filename [format "%s%02d.html" $Const(Prefix1) $page ]
    if {$page == 1} {
        set url $url1
    } else {
        set url [format "$url1/page%d" $page ]
    }

    set base $url1
    set p1 [ string first "//" $url 0   ]; incr p1 2
    set p2 [ string first "/"  $url $p1 ]; incr p2 -1
    set p1 0
    set base [ string range $url $p1 $p2 ]
    #Log "$base: $p1 $p2: '$base'"

    Log "# filename: $filename"  ;##
    Log "# url : $url"           ;##
    Log "# base: $base"          ;##

    ## %% Deactivate for offline-testing:
    if 1 {
        set page [ GetPage $url ]
        #puts "$page"  ;##
        set fileId [open $filename "w"]
        puts -nonewline $fileId $page
        close $fileId
    }

    set fileId [open $filename r]
    set page [read $fileId]
    close $fileId

    foreach line [split $page \n] {

        # <title>Flickr: Photos from ALBUM</title>
        if {[regexp -- "<title>" $line]} {
            #Log "1: $line";
            set p1 [ string first ":"       $line 0   ]; incr p1 14
            set p2 [ string first "</title" $line $p1 ]; incr p2 -1
            set sA [ string range $line $p1 $p2 ]
            Log   "Album: $p1 $p2: '$sA'"
            Print "Album: '$sA'"
            set Data(0.Album) $sA
            set Data2(Album)  $sA
        }

        # <h4>stilles Seitental</h4>
        if {[regexp -- "<h4>" $line]} {
            #Log "2: $line";
            incr PicNr
            set p1 [ string first "<h4>"  $line 0   ]; incr p1 4
            set p2 [ string first "</h4>" $line $p1 ]; incr p2 -1
            set sH [ string range $line $p1 $p2 ]
            Log "\n"
            Log   "$PicNr - Header: $p1 $p2: '$sH'" Hi
            Print "$PicNr - Header: '$sH'" Hi
            set Data($PicNr.Head) $sH
            set Data($PicNr.Desc) ""
        }

        # <p class="Photo"><a href="/photos/PERSON/87654321/">
        # <img src="http://static.flickr.com/42/87654321_8888_m.jpg" width="240" height="180" /></a></p>
        if {[regexp -- (class="Photo") $line]} {
            #Log "3: $line";  #incr n
            set p1 [ string first "href=" $line 0   ]; incr p1 6
            set p2 [ string first "img"   $line $p1 ]; incr p2 -4
            set sL [ string range $line $p1 $p2 ]
            Log "Link : $p1 $p2: '$sL'"
            #set Data($PicNr.Link) $sL

            set p1 [ string first "src=" $line 0   ]; incr p1 5
            set p2 [ string first "jpg"  $line $p1 ]; incr p2 2
            set sP [ string range $line $p1 $p2 ]
            Log "Photo: $p1 $p2: '$sP'"
            #Print "Photo: '$sP'"
            #set Data($PicNr.Photo) $sP

            set url2 $sL    ;# /photos/PERSON/87654321/
            set nr   $url2  ;# /photos/PERSON/87654321/
            #Log "#> '$nr'"
            set p2 [ string last "/" $url2     ]; incr p2 -1
            set p1 [ string last "/" $url2 $p2 ]; incr p1  1
            set nr [ string range $url2 $p1 $p2 ]
            Log "#>Nr: $p1 $p2 '$nr'"

            #set filename [format "p%04d.html" $PicNr ]
            set filename [format "%s.html" $nr ]  ;# Filename for local photo-page
            set Data($PicNr.Name) $nr
            Print "Name : $nr"

            set filename0 [format "%s_0100.jpg" $nr ]  ;# 100x75 - Thumbnail
            set sP0 [ string map {_m _t} $sP ]
            if { $Opt(100x75) }   { FetchImage $sP0 $filename0 }

            set filename1 [format "%s_0240.jpg" $nr ]  ;# 240x180 - Small
            if { $Opt(240x180) }  { FetchImage $sP  $filename1 }

            set filename2 [format "%s_0500.jpg" $nr ]  ;# 500x375 - Medium
            set sP2 [ string map {_m ""} $sP ]
            if { $Opt(500x375) }  { FetchImage $sP2 $filename2 }

            set filename3 [format "%s_1024.jpg" $nr ]  ;# 1024x768 - Large
            set sP3 [ string map {_m _b} $sP ]
            if { $Opt(1024x768) } { FetchImage $sP3 $filename3 }
            #break ;##
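            # Flickr's size-naming scheme, as used by the string maps
            # above and below (sizes from the intro): _t = Thumbnail (100x75),
            # _m = Small (240x180), no suffix = Medium (500x375),
            # _b = Large (1024x768), _o = Original size (as taken).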
            set filename4 [format "%s_2048.jpg" $nr ]  ;# Original Size, e.g. 2560x1920
            set sP4 [ string map {_m _o} $sP ]
            if { $Opt(MaxSize) }  { FetchImage $sP4 $filename4 }
        }

        # <p class="Desc">im Khao Sok</p>
        # <p class="Desc">Figuren aus dem alten China, auf<a href="/photos/PERSON/87654321/">...</a></p>
        if {[regexp -- (class="Desc") $line]} {
            #Log "4: $line";
            set p1 [ string first "Desc" $line 0   ]; incr p1 6
            #set p2 [ string first "</p>" $line $p1 ]; incr p2 -1
            set p2 [ string first "<"    $line $p1 ]; incr p2 -1
            set sD [ string range $line $p1 $p2 ]
            Log   "Descr: $p1 $p2: '$sD'"
            #Print "Descr: '$sD'"
            set Data($PicNr.Desc) $sD  ;# gets replaced again in Analyse2
        }

        # <a href="/photos/PERSON/page12/" class="end">12</a>
        # <a href="/photos/PERSON/"       class="end">1</a>
        if {[regexp -- (page.*class="end") $line]} {
            #Log "5: $line";  #incr n;
            set p1 [ string first "page" $line 0   ]; incr p1 4
            set p2 [ string first "/"    $line $p1 ]; incr p2 -1
            set s9 [ string range $line $p1 $p2 ]
            Log "End: $p1 $p2: '$s9'"
            return [incr s9 0]
            #break
        }

        # <p class="Activity">
        if {[regexp -- (class="Activity") $line]} {  ;# now get photo-page
            Analyse2 $base $sL $filename
            #break
        }

        # <!-- ### MAIN NAVIGATION ### -->
        if {[regexp -- "<!-- ### MAIN" $line]} {
            break  ;# Nothing interesting beyond this point
        }
    }
    return 0
}

#########1#########2#########3#########4#########5#########6#########7#####
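# Overview (as described in the intro): Analyse1 walks the album-pages
# and fetches the pictures; for each photo it calls Analyse2, which
# reads the single photo-page to get date, title and full description.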
proc Analyse2 { url1 url2 filename } {
#: Analyse a flickr photo-webpage (which shows a single photo),
#  like http://www.flickr.com/photos/PERSON/87654321/
#
# @url1     : first part of the url, e.g. "http://www.flickr.com/"
# @url2     : 2nd part of the url,   e.g. "/photos/PERSON/87654321/"
# @filename : filename for local copy of webpage
    global PicNr Data

    set url "$url1$url2"

    ## %% Deactivate for offline-testing:
    if 1 {
        set page [ GetPage $url ]
        #Log "$page"  ;##
        set fileId [open $filename "w"]
        puts -nonewline $fileId $page
        close $fileId
    }

    set fileId [open $filename r]
    set page [read $fileId]
    close $fileId

    foreach line [split $page \n] {

        # page_current_url
        if {[regexp -- "page_current_url" $line]} {
            #Log "1>> $line";
        }

        # <li class="Stats">
        # Taken with an Olympus C5050Z.
        if {[regexp -- "Taken with an" $line]} {
            #Log "2>> $line";
            set p1 [ string first "with"  $line 0   ]; incr p1 8
            set p2 [ string first "<br /" $line $p1 ]; incr p2 -3
            set sC [ string range $line $p1 $p2 ]
            Log ">> Camera: $p1 $p2: '$sC'"
        }

        # <p class="DateTime"> Uploaded on <a href="/photos/PERSON/archives/date-posted/2006/01/07/"
        # style="text-decoration: none;">Jan 7, 2006</a></p>
        if {[regexp -- "Uploaded on" $line]} {
            #Log "3>> $line";
            set p1 [ string first "date-posted" $line 0   ]; incr p1 12
            set p2 [ string first "style"       $line $p1 ]; incr p2 -4
            set sU [ string range $line $p1 $p2 ]
            set sU [ string map {/ -} $sU ]
            Log ">> Upload: $p1 $p2: '$sU'"
        }

        # Taken on <a href="/photos/PERSON/archives/date-taken/2006/01/10/"
        # style="text-decoration: none;">January 10, 2006</a>
        if {[regexp -- "archives/date-taken" $line]} {
            #Log "4>> $line";
            set p1 [ string first "date-taken" $line 0   ]; incr p1 11
            set p2 [ string first "style"      $line $p1 ]; incr p2 -4
            set sS [ string range $line $p1 $p2 ]
            set sS [ string map {/ -} $sS ]
            set Data($PicNr.Date) $sS
            Log   ">> Shot: $p1 $p2: '$sS'"
            Print "Date: '$sS'"
        }

        # <h1 id="title_div87654321">stilles Seitental</h1>
        if {[regexp -- "<h1" $line]} {
            #Log "H1: $line";
            set p1 [ string first ">"     $line 0   ]; incr p1 1
            set p2 [ string first "</h1>" $line $p1 ]; incr p2 -1
            set sH [ string range $line $p1 $p2 ]
            Log ">> $PicNr - Header: $p1 $p2: '$sH'"
        }

        # <div id="description_div87654321" class="photoDescription">im Khao Sok</div>
        # <div id="description_div73182923" class="photoDescription">Massiert wird überall und immer...,
        # viel Konkurrenz bedeutet kleine Preise: 1h Fußmassage = 120Bt (3€)<br />
        # Es massieren die Frauen, die tragende Säule der Gesellschaft.</div>
        #
        if {[regexp -- (class="photoDescription") $line]} {
            #Log "D: $line";
            set p1 [ string first "Desc"   $line 0   ]; incr p1 13
            set p2 [ string first "</div>" $line $p1 ]
            # !! Multiline-Descriptions: get at least the first line:
            if {$p2 > $p1} { incr p2 -1 } else { set p2 [string length $line] }
            set sD [ string range $line $p1 $p2 ]
            set Data($PicNr.Desc) $sD
            Log   ">> Descr: $p1 $p2: '$sD'"
            Print "Descr: '$sD'"
        }

        # Abort scanning of current file (nothing of interest below):
        if {[regexp -- "upload_form_container" $line]} {
            Print "-"
            Log "##> $PicNr : $Data($PicNr.Name) #\
                 $Data($PicNr.Date) #\
                 $Data($PicNr.Head) #\
                 $Data($PicNr.Desc)"

            global Data2  ;#%%
            set key $Data($PicNr.Name)
            set Data2($key.Date) $Data($PicNr.Date)
            set Data2($key.Head) $Data($PicNr.Head)
            set Data2($key.Desc) $Data($PicNr.Desc)
            break
        }
    }
}
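# Untested sketch for the Multiline-Descriptions bug from the Todo-list:
# instead of looking at single lines, collect everything from the opening
# tag up to the closing </div>, roughly like this (not wired in yet):
#
#   set inDesc 0
#   foreach line [split $page \n] {
#       if {[regexp -- (class="photoDescription") $line]} { set inDesc 1 }
#       if {$inDesc} {
#           append sD $line " "
#           if {[string first "</div>" $line] >= 0} { set inDesc 0 }
#       }
#   }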
#########1#########2#########3#########4#########5#########6#########7#####

proc Go {url} {
#: Start processing after user entered url
    global PicNr Const Opt Data

    set StartPage 1
    Print ""
    Print "Flickr-Download from $url" Hi

    set PicNr 0
    set filename [ format "%s%02d.html" $Const(Prefix1) $StartPage ]  ;# page01.html
    set MaxPage  [ Analyse1 $url $StartPage ]
    incr StartPage 1
    #set MaxPage 2  ;##

    if { $Opt(All_Pages) } {
        for {set page $StartPage} {$page <= $MaxPage} {incr page} {
            Analyse1 $url $page
        }
    }
    Print ""
    Print "Done !" Hi

    #: Show collected Data about pictures:
    Print ""
    set line -1
    #%%
    global Data2
    set line 0
    foreach key [lsort -dictionary [ array names Data2 ]] {
        Print "$key : $Data2($key)" [expr {[incr line]%3}]
    }

    arr'dump Data  $Const(Datafile)
    arr'dump Data2 data2.txt
    Print ""
}

proc arr'dump { _arr fn } {
#: Dump array to file, in a format ready to be loaded via 'source'
    upvar 1 $_arr arr
    set f [open $fn w]
    puts $f "array set $_arr \{"
    foreach key [ lsort [array names arr] ] {
        puts $f [ list $key $arr($key) ]
    }
    puts $f "\}"
    close $f
}

#########1#########2#########3#########4#########5#########6#########7#####
#: Main :

Init
#catch {console show}  ;##

pack [frame .f1]
pack [frame .f2]

label .lab1 -text "URL:"
entry .ent1 -textvar e -width 80
text  .txt1 -yscrollcommand ".scr1 set" -width 100 -height 40 -bg white -wrap word
scrollbar .scr1 -command ".txt1 yview"

button .but0 -text "Clear Log" -command { .txt1 delete 0.0 end }
button .but1 -text "Go"        -command { Go $e }
pack   .lab1 .ent1 .but0 .but1 -in .f1 -side left -padx 2

label .lab2 -text "Options:"
pack  .lab2 -in .f2 -side left

set AllPages "All Pages"
lappend Options 100x75 240x180 500x375 1024x768 MaxSize Get_from_Web All_Pages
foreach size $Options {
    set cl [label       .sz$size -text $size ]
    set cc [checkbutton .cb$size -variable Opt($size) -command ShowOpt ]
    pack $cl -in .f2 -side left -anchor e
    pack $cc -in .f2 -side left -anchor w
}

.txt1 tag configure "Hi" -background red       -foreground white
.txt1 tag configure "DL" -background lightblue -underline 1
.txt1 tag configure "Ok" -background green     -underline 0
.txt1 tag configure  1   -background cyan

Print " $Prg(Title) $Prg(Version) - $Prg(Date) " Hi
Print "$Prg(About)"
Print "(c) $Prg(Author) - $Prg(Contact)" Ok

set Opt(100x75)       0
set Opt(All_Pages)    0
set Opt(Get_from_Web) 1
ShowOpt  ;##

set Data(0.Album) "Flickr"

pack .scr1 -side right -fill y
pack .txt1 -side right

bind .ent1 <Return> { Go $e }
bind .     <Key-F1> { console show }

set e http://www.flickr.com/photos/
#set e http://www.flickr.com/photos/siegfrieden
#set e http://www.flickr.com/photos/siegfrieden/page2

wm title . $Prg(Title)
focus -force .ent1
#.
----
Now with a nice GUI: enter the '''URL''' of the first album-page, check the '''options''' you want, then press the '''GO'''-button.

The checkboxes for the image-sizes to download are self-explanatory. When "Get from Web" is not checked, no internet-access happens, and local files (from a previous download) are used. When "All Pages" is not checked, processing stops after the first page.

CJL wonders whether the Flickr-generated RSS feeds for an album might be a quicker way of getting at the required set of image URLs.

HJG: I don't think so - the RSS data lists only the most recently uploaded images, it misses some details (e.g. the date when a picture was taken), and the description-field looks messy.

Here is a more quick'n'dirty way to get just the pictures, using wget and awk:
* Visit the first page of the album with a browser [http://www.getfirefox.com]
* Save this page as s01.html (html-only is enough)
* At the bottom of the album-page, right-click each "Page X"-link, and save-link-as s02.html etc. (ok, more than about a dozen of these would get tiresome...)
* awk -f flickr.awk s*.html > links.txt
* wget -w1 -i links.txt

With this minimal flickr.awk (i.e. it does not extract titles, headers, descriptions etc.):

BEGIN  { FS="\""
         Found=0; print "# flickr-Download:" }

/class="Photo/ {
         Found++
         sub( "^.*http", "http", $0)
         sub( "_m", "_b", $1)      # _b : large picture = 1024x768
         print $1
         next }

END    { print "# Found:", Found }
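For comparison, here is roughly the same extraction done in Tcl - an untested sketch, assuming the album-pages were saved as s01.html, s02.html etc. as in the recipe above:

set out [open links.txt w]
puts $out "# flickr-Download:"
set Found 0
foreach fn [lsort [glob -nocomplain s*.html]] {
    set fp [open $fn r]
    foreach line [split [read $fp] \n] {
        # Pick the image-url out of a photo-line, switch _m (small) to _b (large):
        if {[regexp -- {class="Photo.*src="(http[^"]+)_m\.jpg"} $line -> base]} {
            incr Found
            puts $out "${base}_b.jpg"
        }
    }
    close $fp
}
puts $out "# Found: $Found"
close $out

Like the awk-version, this only collects the links to the large pictures into links.txt, ready for wget.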
Next step: [Flickr Offline Photoalbum].
----
[schlenk] wonders if using [htmlparse] or [tdom] in html mode would make the page parsing code look nicer.

HJG: Are there any examples of these tools here on the wiki (or elsewhere), with a demo of how to parse a fairly complex webpage? Of course, it is hard to see what the webpage to be parsed looked like when only the parsing code is shown. I admit that my code is more "working" than "elegant"...

2006-02-02: After the first successful use of this program, some problems showed up:
* Descriptions for photos can be longer than one line
* Flickr-pages with defined "sets" have extra title-entries (these should be filtered out)
* For some pictures, the selected size might not be available (e.g. only 640x480)
* There are no checks yet whether the download of an image was successful
* There are other types of webpages at flickr (e.g. Set, Calendar, Tags...) that cannot be parsed yet
* I have not yet decided on the design for the data to pass to the viewer

Development will continue in the near future...
----
See also:
* [http] - [Download file via HTTP] - [Polling web images with Tk]
* [A little file searcher] - [owh - a fileless tclsh]
* [Serializing an array]
* -
* [Parsing HTML] - [Regexp HTML Attribute Parsing]
* [Stephen Uhler's HTML parser in 10 lines]
* [websearch]
----
[Category Application] - [Category Internet] - [Category Multimedia] - [Category File] - [Category String Processing]