HJG: Someone has uploaded a lot of pictures to Flickr, and I want to show them in a place where no internet access is available.
The pages at Flickr have a lot of links, icons etc., so a simple recursive download with e.g. wget would fetch lots of unwanted stuff. Of course, I could tweak the parameters for calling wget (--accept, --reject, etc.), or get the html-pages and then filter their contents with awk or perl, but doing roughly the same thing in Tcl looks like more fun :-) Moreover, with a Tcl-script I can also get the titles and descriptions of the images.
So the plan is: download the html-pages from that person, extract the links to the photos from them, then download the photo-pages (containing titles and complete descriptions) and the pictures in the selected size (Thumbnail=100x75, Small=240x180, Medium=500x375, Large=1024x768, Original=as taken).
Then we can make a Flickr Offline Photoalbum out of them, or just use a program like IrfanView [L1 ] to present the pictures as a slideshow.
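Flickr's album page embeds each photo's Small (_m) JPEG; the other sizes live at the same URL with a different suffix, which is how the script below derives them with string map. A minimal sketch (the URL is the placeholder example from the script's own comments):

set sP http://static.flickr.com/42/87654321_8888_m.jpg   ;# as found in the album page

set sP0 [string map {_m _t} $sP]   ;# 100x75   - Thumbnail
set sP1 $sP                        ;# 240x180  - Small
set sP2 [string map {_m ""} $sP]   ;# 500x375  - Medium
set sP3 [string map {_m _b} $sP]   ;# 1024x768 - Large
set sP4 [string map {_m _o} $sP]   ;# Original - as taken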
This is the beta-version of the downloader:
#!/bin/sh
# Restart with tcl: -*- mode: tcl; tab-width: 4; -*- \
exec wish $0 ${1+"$@"}

# FlickrDownload.tcl - HaJo Gurt - 2006-01-20 - https://wiki.tcl-lang.org/15303
#: Download webpages and images for a photo-album from flickr.com
#
# 2005-11-22 First Version
# 2005-11-23 entry
# 2005-11-24 checkbuttons
# 2005-11-25 save data to file

# Todo:
# * Save infos to file for next stage (album-maker)
# * expand Analyse1 to recognize set-pages, search-result-pages etc.
# * Bug: !! End of Multiline-Descriptions not detected
# * ?? FetchImage: check status

package require Tk
package require Img
package require http

proc Init {} {
#: Initialize Values
  global Prg Const

  set Prg(Title)   "Flickr-Download"
  set Prg(Version) "v0.32"
  set Prg(Date)    "2006-01-26"
  set Prg(Author)  "Hans-Joachim Gurt"
  set Prg(Contact) [string map -nocase {: @ ! .} gurt:gmx!de]
  set Prg(About)   "Download pictures from a photo-album at Flickr.com"  ;#: #%%

  set Const(Prefix1) "s"
  #set Const(Prefix1) "page"   ;# page01.html
  set Const(Datafile) slides.txt
}

#########1#########2#########3#########4#########5#########6#########7#####

proc Print { Str {Tag ""} } {
#: Output to text-window
  #puts $Str
  .txt1 insert end "\n"
  .txt1 insert end "$Str" $Tag
  .txt1 see end   ;# scroll to bottom
  update
}

proc Log { Str {Tag ""} } {
##: Debug-Output
  #Print $Str $Tag ;##
}

proc ShowOpt {} {
##: Debug: Show Options
  global Opt
  return ;##
  Print ""
  foreach key [array names Opt] {
    Print "$key : $Opt($key)"
  }
}

#########1#########2#########3#########4#########5#########6#########7#####

proc GetPage { url } {
#: Fetch a webpage from the web
  set token [::http::geturl $url]
  set page  [::http::data $token]
  ::http::cleanup $token
  return $page
}

proc FetchImage { url fname } {
#: Fetch a picture from the web
# See also: [Polling web images with Tk]
  #puts -nonewline "Fetch: \"$url\" "
  Print "Fetch: \"$url\" " DL
  set Stat "skip"

  ## %% Deactivate for offline-testing:
  if 1 {
    set f [open $fname w]
    fconfigure $f -translation binary
    set imgtok [http::geturl $url -binary true -channel $f]
    #set Stat [::http::error  $imgtok]
    set Stat  [::http::status $imgtok]
    # ?? Errorhandling ??
    flush $f
    close $f
    http::cleanup $imgtok
  }
  Print " Status: $Stat " Ok   ;# ?? true status
}

#########1#########2#########3#########4#########5#########6#########7#####

proc Analyse1 { url1 page } {
#: Analyse flickr album-webpage,
#  like http://www.flickr.com/photos/PERSON
#  or   http://www.flickr.com/photos/PERSON/page2
  global PicNr Const Opt Data Data2   ;# Data2 gets the album-title below

  set filename [format "%s%02d.html" $Const(Prefix1) $page ]
  if {$page==1} {
    set url $url1
  } else {
    set url [format "$url1/page%d" $page ]
  }

  set base $url1
  set p1 [ string first "//" $url 0   ]; incr p1 2
  set p2 [ string first "/"  $url $p1 ]; incr p2 -1
  set p1 0
  set base [ string range $url $p1 $p2 ]
  #Log "$base: $p1 $p2: '$base'"

  Log "# filename: $filename" ;##
  Log "# url : $url"  ;##
  Log "# base: $base" ;##

  ## %% Deactivate for offline-testing:
  if 1 {
    set page [ GetPage $url ]
    #puts "$page" ;##
    set fileId [open $filename "w"]
    puts -nonewline $fileId $page
    close $fileId
  }

  set fileId [open $filename r]
  set page [read $fileId]
  close $fileId

  foreach line [split $page \n] {

    # <title>Flickr: Photos from ALBUM</title>
    if {[regexp -- "<title>" $line]} {
      #Log "1: $line";
      set p1 [ string first ":"       $line 0   ]; incr p1 14
      set p2 [ string first "</title" $line $p1 ]; incr p2 -1
      set sA [ string range $line $p1 $p2 ]
      Log   "Album: $p1 $p2: '$sA'"
      Print "Album: '$sA'"
      set Data(0.Album) $sA
      set Data2(Album)  $sA
    }

    # <h4>stilles Seitental</h4>
    if {[regexp -- "<h4>" $line]} {
      #Log "2: $line";
      incr PicNr
      set p1 [ string first "<h4>"  $line 0   ]; incr p1 4
      set p2 [ string first "</h4>" $line $p1 ]; incr p2 -1
      set sH [ string range $line $p1 $p2 ]
      Log "\n"
      Log   "$PicNr - Header: $p1 $p2: '$sH'" Hi
      Print "$PicNr - Header: '$sH'" Hi
      set Data($PicNr.Head) $sH
      set Data($PicNr.Desc) ""
    }

    # <p class="Photo"><a href="/photos/PERSON/87654321/">
    # <img src="http://static.flickr.com/42/87654321_8888_m.jpg" width="240" height="180" /></a></p>
    if {[regexp -- (class="Photo") $line]} {
      #Log "3: $line";  #incr n
      set p1 [ string first "href=" $line 0   ]; incr p1 6
      set p2 [ string first "img"   $line $p1 ]; incr p2 -4
      set sL [ string range $line $p1 $p2 ]
      Log "Link : $p1 $p2: '$sL'"
      #set Data($PicNr.Link) $sL

      set p1 [ string first "src=" $line 0   ]; incr p1 5
      set p2 [ string first "jpg"  $line $p1 ]; incr p2 2
      set sP [ string range $line $p1 $p2 ]
      Log "Photo: $p1 $p2: '$sP'"
      #Print "Photo: '$sP'"
      #set Data($PicNr.Photo) $sP

      set url2 $sL   ;# /photos/PERSON/87654321/
      set nr $url2   ;# /photos/PERSON/87654321/
      #Log "#> '$nr'"
      set p2 [ string last "/" $url2     ]; incr p2 -1
      set p1 [ string last "/" $url2 $p2 ]; incr p1 1
      set nr [ string range $url2 $p1 $p2 ]
      Log "#>Nr: $p1 $p2 '$nr'"

      #set filename [format "p%04d.html" $PicNr ]
      set filename [format "%s.html" $nr ]   ;# Filename for local photo-page
      set Data($PicNr.Name) $nr
      Print "Name : $nr"

      set filename0 [format "%s_0100.jpg" $nr ]   ;#  100x75  - Thumbnail
      set sP0 [ string map {_m _t} $sP ]
      if { $Opt(100x75) } { FetchImage $sP0 $filename0 }

      set filename1 [format "%s_0240.jpg" $nr ]   ;# 240x180  - Small
      if { $Opt(240x180) } { FetchImage $sP $filename1 }

      set filename2 [format "%s_0500.jpg" $nr ]   ;# 500x375  - Medium
      set sP2 [ string map {_m ""} $sP ]
      if { $Opt(500x375) } { FetchImage $sP2 $filename2 }

      set filename3 [format "%s_1024.jpg" $nr ]   ;# 1024x768 - Large
      set sP3 [ string map {_m _b} $sP ]
      if { $Opt(1024x768) } { FetchImage $sP3 $filename3 }
      #break ;##

      set filename4 [format "%s_2048.jpg" $nr ]   ;# Original Size, e.g. 2560x1920
      set sP4 [ string map {_m _o} $sP ]
      if { $Opt(MaxSize) } { FetchImage $sP4 $filename4 }
    }

    # <p class="Desc">im Khao Sok</p>
    # <p class="Desc">Figuren aus dem alten China, auf<a href="/photos/PERSON/87654321/">...</a></p>
    if {[regexp -- (class="Desc") $line]} {
      #Log "4: $line";
      set p1 [ string first "Desc" $line 0   ]; incr p1 6
      #set p2 [ string first "</p>" $line $p1 ]; incr p2 -1
      set p2 [ string first "<"    $line $p1 ]; incr p2 -1
      set sD [ string range $line $p1 $p2 ]
      Log "Descr: $p1 $p2: '$sD'"
      #Print "Descr: '$sD'"
      set Data($PicNr.Desc) $sD   ;# gets replaced again in Analyse2
    }

    # <a href="/photos/PERSON/page12/" class="end">12</a>
    # <a href="/photos/PERSON/"       class="end">1</a>
    if {[regexp -- (page.*class="end") $line]} {
      #Log "5: $line";  #incr n;
      set p1 [ string first "page" $line 0   ]; incr p1 4
      set p2 [ string first "/"    $line $p1 ]; incr p2 -1
      set s9 [ string range $line $p1 $p2 ]
      Log "End: $p1 $p2: '$s9'"
      return [incr s9 0]
      #break
    }

    # <p class="Activity">
    if {[regexp -- (class="Activity") $line]} {   ;# now get photo-page
      Analyse2 $base $sL $filename
      #break
    }

    # <!-- ### MAIN NAVIGATION ### -->
    if {[regexp -- "<!-- ### MAIN" $line]} {
      break   ;# Nothing interesting beyond this point
    }
  }
  return 0
}

#########1#########2#########3#########4#########5#########6#########7#####

proc Analyse2 { url1 url2 filename } {
#: Analyse a flickr photo-webpage (which shows a single photo),
#  like http://www.flickr.com/photos/PERSON/87654321/
#
# @url1 : first part of the url, e.g. "http://www.flickr.com/"
# @url2 : 2nd part of the url,   e.g. "/photos/PERSON/87654321/"
# @filename: filename for local copy of webpage
  global PicNr Data

  set url "$url1$url2"

  ## %% Deactivate for offline-testing:
  if 1 {
    set page [ GetPage $url ]
    #Log "$page" ;##
    set fileId [open $filename "w"]
    puts -nonewline $fileId $page
    close $fileId
  }

  set fileId [open $filename r]
  set page [read $fileId]
  close $fileId

  foreach line [split $page \n] {

    # page_current_url
    if {[regexp -- "page_current_url" $line]} {
      #Log "1>> $line";
    }

    # <li class="Stats">
    # Taken with an Olympus C5050Z.
    if {[regexp -- "Taken with an" $line]} {
      #Log "2>> $line";
      set p1 [ string first "with"  $line 0   ]; incr p1 8
      set p2 [ string first "<br /" $line $p1 ]; incr p2 -3
      set sC [ string range $line $p1 $p2 ]
      Log ">> Camera: $p1 $p2: '$sC'"
    }

    # <p class="DateTime"> Uploaded on <a href="/photos/PERSON/archives/date-posted/2006/01/07/"
    # style="text-decoration: none;">Jan 7, 2006</a></p>
    if {[regexp -- "Uploaded on" $line]} {
      #Log "3>> $line";
      set p1 [ string first "date-posted" $line 0   ]; incr p1 12
      set p2 [ string first "style"       $line $p1 ]; incr p2 -4
      set sU [ string range $line $p1 $p2 ]
      set sU [ string map {/ -} $sU ]
      Log ">> Upload: $p1 $p2: '$sU'"
    }

    # Taken on <a href="/photos/PERSON/archives/date-taken/2006/01/10/"
    # style="text-decoration: none;">January 10, 2006</a>
    if {[regexp -- "archives/date-taken" $line]} {
      #Log "4>> $line";
      set p1 [ string first "date-taken" $line 0   ]; incr p1 11
      set p2 [ string first "style"      $line $p1 ]; incr p2 -4
      set sS [ string range $line $p1 $p2 ]
      set sS [ string map {/ -} $sS ]
      set Data($PicNr.Date) $sS
      Log ">> Shot: $p1 $p2: '$sS'"
      Print "Date: '$sS'"
    }

    # <h1 id="title_div87654321">stilles Seitental</h1>
    if {[regexp -- "<h1" $line]} {
      #Log "H1: $line";
      set p1 [ string first ">"     $line 0   ]; incr p1 1
      set p2 [ string first "</h1>" $line $p1 ]; incr p2 -1
      set sH [ string range $line $p1 $p2 ]
      Log ">> $PicNr - Header: $p1 $p2: '$sH'"
    }

    # <div id="description_div87654321" class="photoDescription">im Khao Sok</div>
    # <div id="description_div73182923" class="photoDescription">Massiert wird überall und immer...,
    # viel Konkurrenz bedeutet kleine Preise: 1h Fußmassage = 120Bt (3€)<br />
    # Es massieren die Frauen, die tragende Säule der Gesellschaft.</div>
    #
    if {[regexp -- (class="photoDescription") $line]} {
      #Log "D: $line";
      set p1 [ string first "Desc"   $line 0   ]; incr p1 13
      set p2 [ string first "</div>" $line $p1 ]
      # !! Multiline-Descriptions: get at least the first line:
      if {$p2 > $p1} { incr p2 -1 } else { set p2 [string length $line] }
      set sD [ string range $line $p1 $p2 ]
      set Data($PicNr.Desc) $sD
      Log ">> Descr: $p1 $p2: '$sD'"
      Print "Descr: '$sD'"
    }

    # Abort scanning of current file (nothing of interest below):
    if {[regexp -- "upload_form_container" $line]} {
      Print "-"
      Log "##> $PicNr : $Data($PicNr.Name) #\
           $Data($PicNr.Date) #\
           $Data($PicNr.Head) #\
           $Data($PicNr.Desc)"

      global Data2
      #%%
      set key $Data($PicNr.Name)
      set Data2($key.Date) $Data($PicNr.Date)
      set Data2($key.Head) $Data($PicNr.Head)
      set Data2($key.Desc) $Data($PicNr.Desc)
      break
    }
  }
}

#########1#########2#########3#########4#########5#########6#########7#####

proc Go {url} {
#: Start processing after user entered url
  global PicNr Const Opt Data

  set StartPage 1
  Print ""
  Print "Flickr-Download from $url" Hi
  set PicNr 0

  set filename [ format "%s%02d.html" $Const(Prefix1) $StartPage ]   ;# page01.html
  set MaxPage  [ Analyse1 $url $StartPage ]
  incr StartPage 1
  #set MaxPage 2 ;##

  if { $Opt(All_Pages) } {
    for {set page $StartPage} {$page <= $MaxPage} {incr page} {
      Analyse1 $url $page
    }
  }
  Print ""
  Print "Done !" Hi

  #: Show collected Data about pictures:
  Print ""
  set line -1
  #%%
  global Data2
  set line 0
  foreach key [lsort -dictionary [ array names Data2 ]] {
    Print "$key : $Data2($key)" [expr [incr line]%3]
  }

  arr'dump Data  $Const(Datafile)
  arr'dump Data2 data2.txt
  Print ""
}

proc arr'dump { _arr fn } {
#: Dump array to file, in a format ready to be loaded via 'source'
  upvar 1 $_arr arr
  set f [open $fn w]
  puts $f "array set $_arr \{"
  foreach key [ lsort [array names arr] ] {
    puts $f [ list $key $arr($key) ]
  }
  puts $f "\}"
  close $f
}

#########1#########2#########3#########4#########5#########6#########7#####
#: Main :

Init
#catch {console show} ;##

pack [frame .f1]
pack [frame .f2]

label  .lab1 -text "URL:"
entry  .ent1 -textvar e -width 80
text   .txt1 -yscrollcommand ".scr1 set" -width 100 -height 40 -bg white -wrap word
scrollbar .scr1 -command ".txt1 yview"
button .but0 -text "Clear Log" -command { .txt1 delete 0.0 end }
button .but1 -text "Go"        -command { Go $e }
pack .lab1 .ent1 .but0 .but1 -in .f1 -side left -padx 2

label .lab2 -text "Options:"
pack  .lab2 -in .f2 -side left

set AllPages "All Pages"
lappend Options 100x75 240x180 500x375 1024x768 MaxSize Get_from_Web All_Pages
foreach size $Options {
  set cl [label       .sz$size -text $size ]
  set cc [checkbutton .cb$size -variable Opt($size) -command ShowOpt ]
  pack $cl -in .f2 -side left -anchor e
  pack $cc -in .f2 -side left -anchor w
}

.txt1 tag configure "Hi" -background red       -foreground white
.txt1 tag configure "DL" -background lightblue -underline 1
.txt1 tag configure "Ok" -background green     -underline 0
.txt1 tag configure 1    -background cyan

Print " $Prg(Title) $Prg(Version) - $Prg(Date) " Hi
Print "$Prg(About)"
Print "(c) $Prg(Author) - $Prg(Contact)" Ok

set Opt(100x75)       0
set Opt(All_Pages)    0
set Opt(Get_from_Web) 1
ShowOpt ;##

set Data(0.Album) "Flickr"

pack .scr1 -side right -fill y
pack .txt1 -side right

bind .ent1 <Return> { Go $e }
bind .     <Key-F1> { console show }

set e http://www.flickr.com/photos/
#set e http://www.flickr.com/photos/siegfrieden
#set e http://www.flickr.com/photos/siegfrieden/page2

wm title . $Prg(Title)
focus -force .ent1
#.
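The Todo items "?? Errorhandling ??" and "FetchImage: check status" could be addressed with the standard http package, which reports both the transport status and the HTTP result code. A sketch (FetchChecked is a hypothetical replacement for FetchImage, not part of the program above):

proc FetchChecked { url fname } {
#: Like FetchImage, but verify transport status and HTTP code
  set f [open $fname w]
  fconfigure $f -translation binary
  set tok [http::geturl $url -binary true -channel $f]
  set status [http::status $tok]   ;# ok / eof / error
  set ncode  [http::ncode  $tok]   ;# numeric HTTP code, e.g. 200 or 404
  http::cleanup $tok
  close $f
  if { $status ne "ok" || $ncode != 200 } {
    file delete $fname             ;# don't keep a broken/partial file
    return 0
  }
  return 1
}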
Now with a nice GUI: enter the URL of the first album-page, check the options you want, then press the Go-button.
The checkboxes for the image-sizes to download are self-explanatory. When "Get from Web" is not checked, no internet access happens and local files (from a previous download) are used. When "All Pages" is not checked, processing stops after the first page.
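Note that in the beta listing the page- and image-fetches are still guarded by a hardcoded if 1 (marked "## %% Deactivate for offline-testing"), so the "Get from Web" checkbutton is not actually consulted yet. Wiring it up would presumably look like this sketch (Opt must then also appear in FetchImage's global list):

## %% fetch from the web only when the checkbutton is set:
if { $Opt(Get_from_Web) } {
  set page [ GetPage $url ]
  set fileId [open $filename "w"]   ;# keep a local copy for offline re-runs
  puts -nonewline $fileId $page
  close $fileId
}

# either way, continue from the local file:
set fileId [open $filename r]
set page [read $fileId]
close $fileId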
CJL wonders whether the Flickr-generated RSS feeds for an album might be a quicker way of getting at the required set of image URLs.
HJG: I don't think so - the RSS feed lists only the most recently uploaded images, it misses some details (e.g. the date when a picture was taken), and the description-field looks messy.
Here is a more quick'n'dirty way to get just the pictures, using wget and awk:
With this minimal flickr.awk (minimal in that it does not extract titles, headers, descriptions etc.):
BEGIN { FS="\""
  Found=0; print "# flickr-Download:"
}

/class="Photo/ {
  Found++
  sub( "^.*http", "http", $0)
  sub( "_m", "_b", $1)   # _b : large picture = 1024x768
  print $1
  next
}

END { print "# Found:", Found }
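One way to drive it might be (a sketch; the grep drops the "# ..." log lines that flickr.awk prints, so wget only sees picture URLs):

wget -q -O - http://www.flickr.com/photos/PERSON | awk -f flickr.awk | grep '^http' > pics.txt
wget -i pics.txt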
Next step: Flickr Offline Photoalbum.
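Since arr'dump writes the arrays as ready-to-source "array set" commands, the album-maker stage can start by simply reloading them. A sketch (file names as set in Const(Datafile) and Go above, assuming the dump ran to completion):

source slides.txt   ;# runs: array set Data  {...}
source data2.txt    ;# runs: array set Data2 {...}

# e.g. list every picture with header and date:
foreach name [lsort -dictionary [array names Data2 *.Head]] {
  set key [lindex [split $name .] 0]
  puts "$key: $Data2($key.Head) ($Data2($key.Date))"
}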
schlenk wonders if using htmlparse or tdom in html mode would make the page parsing code look nicer.
HJG: Are there any examples of these tools here on the wiki (or elsewhere), with a demo of how to parse a fairly complex webpage? Of course, it is hard to see what the webpage being parsed looked like when only the parsing code is shown.
I admit that my code is more "working" than "elegant"...
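For comparison, a rough sketch of how tcllib's htmlparse might tackle the same album page - untested against real Flickr markup, and PhotoCB plus the href pattern are made up for illustration (per the htmlparse documentation, the handler is called with tag, slash, parameter string, and trailing text):

package require htmlparse

set Links {}
proc PhotoCB {tag slash param text} {
  # collect photo-page links like <a href="/photos/PERSON/87654321/">
  global Links
  if {[string tolower $tag] eq "a" && $slash eq ""} {
    if {[regexp {href="(/photos/[^"]+/)"} $param -> href]} {
      lappend Links $href
    }
  }
}
::htmlparse::parse -cmd PhotoCB $page   ;# $page as returned by GetPage
puts [join [lsort -unique $Links] \n]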
2006-02-02: After the first successful use of this program, some problems showed up.
Development will continue in the near future...
See also: