Downloading pictures from Flickr

HJG: Someone has uploaded a lot of pictures to Flickr, and I want to show them somewhere that no internet access is available.

The pages at Flickr contain a lot of links, icons etc., so a simple recursive download with e.g. wget would fetch lots of unwanted stuff. Of course, I could tweak the parameters for calling wget (--accept, --reject, etc.), or get the html-pages and then filter their contents with awk or perl, but doing roughly the same thing in Tcl looks like more fun :-) Moreover, with a Tcl script I can also get the titles and descriptions of the images.

So the plan is: download the html-pages of that person's photostream, extract the links to the photos from them, then download the photo-pages (which contain the titles and complete descriptions), and finally the pictures in the selected size (Thumbnail=100x75, Small=240x180, Medium=500x375, Large=1024x768, Original=as taken).
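
The different sizes are just different suffixes on the same static-image URL. Here is a minimal sketch of that mapping (the example URL is the placeholder one from the HTML snippets further down; the suffixes are the same ones the script applies with [string map]):

  # Sketch: derive the other sizes from the "Small" (_m) image-URL:
  set small  http://static.flickr.com/42/9999_8888_m.jpg    ;#  240x180 - Small
  set thumb  [string map {_m _t} $small]                     ;#  100x75  - Thumbnail
  set medium [string map {_m ""} $small]                     ;#  500x375 - Medium
  set large  [string map {_m _b} $small]                     ;# 1024x768 - Large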

Then we can make a Flickr Offline Photoalbum out of them, or just use a program like IrfanView [L1 ] to present the pictures as a slideshow.

This is the alpha-version of the downloader:


 #!/bin/sh
 # Restart with tcl: -*- mode: tcl; tab-width: 4; -*- \
 exec wish $0 ${1+"$@"}

 # FlickrDownload.tcl - HaJo Gurt - 2006-01-20 - http://wiki.tcl.tk/15303
 #: Download webpages and images for a photo-album from flickr.com
 #
 # 2005-11-22 First Version

 # Todo:
 # * Save infos to file for next stage (album-maker)
 # * Bug: !! End of Multiline-Descriptions not detected

  package require Tk
  package require Img
  package require http

 #########1#########2#########3#########4#########5#########6#########7#####

  proc Print { Str {tag ""} } {
  #: Output to text-window
    # puts $Str
     .txt1 insert end "\n"
     .txt1 insert end "$Str" $tag
     .txt1 see end          ;# scroll to bottom
     update
  }

 #########1#########2#########3#########4#########5#########6#########7#####

  proc getPage { url } {
  #: Fetch a webpage from the web
    set token [::http::geturl $url]
    set data [::http::data $token]
    ::http::cleanup $token
    return $data
  }

  proc FetchImage {url fname } {
  #: Fetch a picture from the web / see also: [Polling web images with Tk]
   #puts -nonewline "Fetch: \"$url\" "
    Print "Fetch: \"$url\" " DL

    set f [open $fname w]
    fconfigure $f -translation binary
    set imgtok [http::geturl $url -binary true -channel $f]
    flush $f
    close $f
    http::cleanup $imgtok

    Print " ok." Ok
  }

 #########1#########2#########3#########4#########5#########6#########7#####

  proc Analyse1 { url1 page } {
  #: Analyse flickr album-webpage, 
  # like http://www.flickr.com/photos/PERSON
  # or   http://www.flickr.com/photos/PERSON/page2

    global PicNr

    set filename [format "s%02d.html" $page ]
    if {$page == 1} {
      set url $url1
    } else {
      set url [format "$url1/page%d" $page ] 
    }

    set base $url1
    set p1   [ string first "//" $url  0  ]; incr p1  2
    set p2   [ string first "/"  $url $p1 ]; incr p2 -1
    set p1 0
    set base [ string range      $url $p1 $p2 ]
   #Print "$base: $p1 $p2: '$base'"
    Print "# filename: $filename"         ;##
    Print "# url : $url"                ;##
    Print "# base: $base"                ;##

    # Change the following "if 1" to "if 0" to skip the download, for offline-testing:

if 1 {

    set data  [ getPage $url ]
   #puts "$data"                ;##
    set fileId [open $filename "w"]
    puts -nonewline $fileId $data
    close $fileId

}

    set fileId [open $filename r]
    set data [read $fileId]
    close $fileId

    foreach line [split $data \n] {
      # <title>Flickr: Photos from ALBUM</title>
      if {[regexp -- "<title>"  $line]} { 
        #Print "1: $line";
         set p1 [ string first ":"       $line  0  ]; incr p1 14
         set p2 [ string first "</title" $line $p1 ]; incr p2 -1
         set sA [ string range $line $p1 $p2 ]
         Print "Album: $p1 $p2: '$sA'"
      }
      # <h4>stilles Seitental</h4>
      if {[regexp -- "<h4>"  $line]} { 
        #Print "2: $line";
         incr PicNr
         set p1 [ string first "<h4>"    $line  0  ]; incr p1  4
         set p2 [ string first "</h4>"   $line $p1 ]; incr p2 -1
         set sH [ string range $line $p1 $p2 ]
         Print "\n"
         Print "$PicNr - Header: $p1 $p2: '$sH'" Hi
      }
      # <p class="Photo"><a href="/photos/PERSON/9999/">
      # <img src="http://static.flickr.com/42/9999_8888_m.jpg" width="240" height="180" /></a></p>
      if {[regexp -- (class="Photo")  $line]} { 
        #Print "3: $line";
         set p1 [ string first "href=" $line  0  ]; incr p1  6
         set p2 [ string first "img"   $line $p1 ]; incr p2 -4
         set sL [ string range $line $p1 $p2 ]
         Print "Link : $p1 $p2: '$sL'"

         set p1 [ string first "src=" $line  0  ]; incr p1  5
         set p2 [ string first "jpg"  $line $p1 ]; incr p2  2
         set sP [ string range $line $p1 $p2 ]
         Print "Photo: $p1 $p2: '$sP'"

        set url2 $sL    ;# /photos/PERSON/9999/
        set nr   $url2  ;# /photos/PERSON/9999/
       #Print "#> '$nr'"
        set p2   [ string last "/"  $url2     ]; incr p2 -1
        set p1   [ string last "/"  $url2 $p2 ]; incr p1  1
        set nr   [ string range     $url2 $p1 $p2 ]
        Print "#>Nr: $p1 $p2 '$nr'"
       #set filename [format "p%04d.html" $PicNr ]
        set filename [format "%s.html" $nr ]       ;# Filename for local photo-page

        set filename0 [format "%s_0100.jpg" $nr ]  ;#  100x75 - Thumbnail
        set sP0 [ string map {_m _t} $sP ]
        FetchImage $sP0 $filename0

        set filename1 [format "%s_0240.jpg" $nr ]  ;#  240x180 - Small
       #FetchImage $sP $filename1

        set filename2 [format "%s_0500.jpg" $nr ]  ;#  500x375 - Medium
        set sP2 [ string map {_m ""} $sP ]
       #FetchImage $sP2 $filename2

        set filename3 [format "%s_1024.jpg" $nr ]  ;# 1024x768 - Large
        set sP3 [ string map {_m _b} $sP ]
        FetchImage $sP3 $filename3
       #break        ;##
      }
      # <p class="Desc">im Khao Sok</p>
      # <p class="Desc">Figuren aus dem alten China, auf<a href="/photos/PERSON/9999/">...</a></p>
      if {[regexp -- (class="Desc")   $line]} { 
        #Print "4: $line"; 
         set p1 [ string first "Desc"    $line  0  ]; incr p1  6
         set p2 [ string first "<"       $line $p1 ]; incr p2 -1
         set sD [ string range $line $p1 $p2 ]
         Print "Descr: $p1 $p2: '$sD'"
      }

      # <a href="/photos/PERSON/page12/" class="end">12</a>
      # <a href="/photos/PERSON/" class="end">1</a>
    #-if {[regexp -- (class="end")        $line]} { ... }
      if {[regexp -- (page.*class="end")  $line]} { 
        #Print "5: $line";
         set p1 [ string first "page" $line  0  ]; incr p1  4
         set p2 [ string first "/"    $line $p1 ]; incr p2 -1
         set s9 [ string range $line $p1 $p2 ]
         Print "\nEnd: $p1 $p2: '$s9'"
         return [incr s9 0]
        #break
      }

      # <p class="Activity">
      if {[regexp -- (class="Activity")    $line]} { ;# now get photo-page
         Analyse2 $base $sL $filename
        #break
      }

      # <!-- ### MAIN NAVIGATION ### -->
      if {[regexp -- "<!-- ### MAIN"    $line]} { 
        break    ;# Nothing interesting beyond this point
      }
    }
    return 0
  }

 #########1#########2#########3#########4#########5#########6#########7#####

  proc Analyse2 { url1 url2 filename } {
  #: Analyse a flickr photo-webpage,
  # like http://www.flickr.com/photos/PERSON/9999/
    global PicNr

    set url "$url1$url2"

    # Change the following "if 1" to "if 0" to skip the download, for offline-testing:

if 1 {

    set data  [ getPage $url ]
   #Print "$data"                ;##
    set fileId [open $filename "w"]
    puts -nonewline $fileId $data
    close $fileId

}

    set fileId [open $filename r]
    set data [read $fileId]
    close $fileId

    foreach line [split $data \n] {
    # strings to look for:
      # page_current_url
      if {[regexp -- "page_current_url" $line]} { 
        #Print "1>> $line";
      }

      # <li class="Stats">
      # Taken with an Olympus C5050Z.
      if {[regexp -- "Taken with an" $line]} { 
        #Print "2>> $line";
         set p1 [ string first "with"   $line  0  ]; incr p1  8
         set p2 [ string first "<br /"  $line $p1 ]; incr p2 -3
         set sC [ string range $line $p1 $p2 ]
         Print ">> Camera: $p1 $p2: '$sC'"
      }

      # <p class="DateTime"> Uploaded on <a href="/photos/PERSON/archives/date-posted/2006/01/07/"
      # style="text-decoration: none;">Jan 7, 2006</a></p>
      if {[regexp -- "Uploaded on" $line]} { 
        #Print "3>> $line";
         set p1 [ string first "date-posted"  $line  0  ]; incr p1 12
         set p2 [ string first "style"        $line $p1 ]; incr p2 -4
         set sU [ string range $line $p1 $p2 ]
         set sU [ string map {/ -} $sU ]
         Print ">> Upload: $p1 $p2: '$sU'"
      }

      # Taken on <a href="/photos/PERSON/archives/date-taken/2006/01/10/"
      # style="text-decoration: none;">January 10, 2006</a>
      if {[regexp -- "archives/date-taken" $line]} { 
        #Print "4>> $line";
         set p1 [ string first "date-taken"   $line  0  ]; incr p1 11
         set p2 [ string first "style"        $line $p1 ]; incr p2 -4
         set sS [ string range $line $p1 $p2 ]
         set sS [ string map {/ -} $sS ]
         Print ">> Shot: $p1 $p2: '$sS'"
      }

      # <h1 id="title_div87654321">stilles Seitental</h1>
      if {[regexp -- "<h1"           $line]} { 
        #Print "H1: $line";
         set p1 [ string first ">"       $line  0  ]; incr p1  1
         set p2 [ string first "</h1>"   $line $p1 ]; incr p2 -1
         set sH [ string range $line $p1 $p2 ]
         Print ">> $PicNr - Header: $p1 $p2: '$sH'"
      }

      # <div id="description_div87654321" class="photoDescription">im Khao Sok</div>
      # <div id="description_div73182923" class="photoDescription">Massiert wird überall und immer...,
      # viel Konkurrenz bedeutet kleine Preise: 1h Fußmassage = 120Bt (3€)<br />
      # Es massieren die Frauen, die tragende Säule der Gesellschaft.</div>
      #
      if {[regexp -- (class="photoDescription")   $line]} { 
        #Print "D: $line"; 
         set p1 [ string first "Desc"    $line  0  ]; incr p1 13
         set p2 [ string first "</div>"  $line $p1 ]
      # !! Multiline-Descriptions: get at least the first line:
         if {$p2 > $p1} { incr p2 -1 } else { set p2 [string length $line] }

         set sD [ string range $line $p1 $p2 ]
         Print ">> Descr: $p1 $p2: '$sD'"
      }

      # Abort scanning of current file (nothing of interest below):
      if {[regexp -- "upload_form_container"  $line]} { 
        break 
      }  
    }
  }

 #########1#########2#########3#########4#########5#########6#########7#####

  #: Main :

 #catch {console show}        ;##

  text      .txt1  -yscrollcommand ".scr1 set"  -width 100 -height 40  -bg white -wrap word
  scrollbar .scr1  -command ".txt1 yview"
  pack .scr1 -side right  -fill y
  pack .txt1 -side right 
  wm title . "Flickr-Download"
  focus -force .txt1

  .txt1 tag configure "Hi"  -background red       -foreground white
  .txt1 tag configure "DL"  -background lightblue -underline 1
  .txt1 tag configure "Ok"  -background green     -underline 0

  set url http://www.flickr.com/photos/siegfrieden
 #set url http://www.flickr.com/photos/siegfrieden/page2
  set StartPage 1

  Print "Flickr-Download from $url" Hi

  set PicNr     0
  set filename [ format "s%02d.html" $StartPage ]         ;# s01.html
  set MaxPage  [ Analyse1 $url $StartPage ]
  incr StartPage 1

 #set MaxPage  2  ;##

if 1 {

  for {set page $StartPage} {$page <= $MaxPage} {incr page} {
    Analyse1 $url $page
  }

}

  Print "Done !" Hi

Everything is hardwired, and quite a few debugging aids are left in, but it did the job...

CJL wonders whether the Flickr-generated RSS feeds for an album might be a quicker way of getting at the required set of image URLs.

HJG: I don't think so - the RSS feed lists only the most recently uploaded images, it misses some details (e.g. the date when a picture was taken), and the description field looks messy.

Here is a quicker and dirtier way to get just the pictures, using wget and awk:

  • Visit the first page of the album with a browser [L2 ]
  • Save this page as s01.html (html-only is enough)
  • At the bottom of the album-page, right-click each "Page X"-link, and save-link-as s02.html, etc. (ok, more than about a dozen of these would get tiresome...)
  • awk -f flickr.awk s*.html > links.txt
  • wget -w1 -i links.txt

With this minimal flickr.awk (it only extracts the picture-URLs; no titles, headers, descriptions etc.):

  BEGIN           { FS="\""
                    Found=0;
                    print "# flickr-Download:"
                  }
  /class="Photo/  { Found++
                    sub( "^.*http", "http", $0)
                    sub( "_m", "_b", $1)        # _b : large picture = 1024x768
                    print $1
                    next
                  }
  END             { print "# Found:", Found }
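
For comparison, roughly the same extraction could also be done in Tcl. This is only an untested sketch; it scans the same locally saved s*.html files for the class="Photo" lines and prints the large-picture URLs:

  # flickr_links.tcl - rough Tcl counterpart of the awk filter above (sketch, untested)
  # Usage: tclsh flickr_links.tcl > links.txt
  puts "# flickr-Download:"
  set found 0
  foreach fname [lsort [glob -nocomplain s*.html]] {
      set f [open $fname r]
      set data [read $f]
      close $f
      foreach line [split $data \n] {
          # Same trigger as the awk script: lines containing class="Photo"
          if {[regexp {class="Photo.*src="([^"]+_m\.jpg)"} $line -> url]} {
              incr found
              puts [string map {_m _b} $url]   ;# _b : large picture = 1024x768
          }
      }
  }
  puts "# Found: $found"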

Next step: Flickr Offline Photoalbum.


schlenk wonders if using htmlparse or tdom in html mode would make the page parsing code look nicer.
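
A minimal sketch of the tdom idea (untested; it assumes tdom is installed and that the album-page really has the <p class="Photo"> ... <img ...> structure shown in the comments above):

  package require tdom

  # Parse a locally saved album-page and list the "Small" image-URLs via XPath:
  set f    [open s01.html r]
  set html [read $f]
  close $f

  set doc  [dom parse -html $html]
  set root [$doc documentElement]
  foreach img [$root selectNodes {//p[@class="Photo"]//img}] {
      puts [$img getAttribute src]
  }
  $doc delete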


See also:


Category Application - Category Internet - Category Multimedia

Category File - Category String Processing