Downloading pictures from Flickr

Summary

HJG: Someone has uploaded a lot of pictures to Flickr, and I want to show them someplace where no internet is available.

The pages at Flickr have a lot of links, icons etc., so a simple recursive download with e.g. wget would fetch lots of unwanted stuff. Of course, I could tweak the parameters for calling wget (-accept, -reject, etc.), or get the html-pages, then filter their contents with awk or perl, but doing roughly the same thing in Tcl looks like more fun :-) Moreover, with a Tcl-script I can also get the titles and descriptions of the images.

So the plan is to download the html-pages of that person's photostream, extract the links to the individual photos, then download the photo-pages (which contain the titles and complete descriptions) as well as the pictures in the selected size (Thumbnail=100x75, Small=240x180, Medium=500x375, Large=1024x768, Original=as taken).
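
Flickr encodes the picture-size in a suffix of the image-filename: the album-page links to the Small version (suffix _m), and the other sizes are obtained by swapping that suffix, which is what the script below does. A minimal sketch of the convention (the proc SizeUrl is only for illustration, it is not part of the script):

  # Derive the URL of another size from the Small (_m) URL found on the album-page.
  # Suffixes: _t = Thumbnail 100x75, _m = Small 240x180, none = Medium 500x375,
  #           _b = Large 1024x768,   _o = Original size
  proc SizeUrl { smallUrl suffix } {
    return [string map [list _m $suffix] $smallUrl]
  }
  # e.g.: SizeUrl http://static.flickr.com/42/87654321_8888_m.jpg _b
  #       -> http://static.flickr.com/42/87654321_8888_b.jpg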

Then we can make a Flickr Offline Photoalbum out of them, or just use a program like IrfanView [L1 ] to present the pictures as a slideshow.

Code

This is the beta-version of the downloader:


 #!/bin/sh
 # Restart with tcl: -*- mode: tcl; tab-width: 4; -*- \
 exec wish $0 ${1+"$@"}

 # FlickrDownload.tcl - HaJo Gurt - 2006-01-20 - https://wiki.tcl-lang.org/15303
 #: Download webpages and images for a photo-album from flickr.com
 #
 # 2005-11-22 First Version
 # 2005-11-23 entry
 # 2005-11-24 checkbuttons
 # 2005-11-25 save data to file

 # Todo:
 # * Save infos to file for next stage (album-maker)
 # * expand Analyze1 to recognize set-pages, search-result-pages etc. 
 # * Bug: !! End of Multiline-Descriptions not detected
 # * ?? FetchImage: check status

  package require Tk
  package require Img
  package require http


  proc Init {} {
  #: Initialize Values

    global Prg Const

    set Prg(Title)    "Flickr-Download"
    set Prg(Version)  "v0.32"
    set Prg(Date)     "2006-01-26"
    set Prg(Author)   "Hans-Joachim Gurt"
    set Prg(Contact)  [string map -nocase {: @ ! .} gurt:gmx!de]
    set Prg(About)    "Download pictures from a photo-album at Flickr.com" ;#:
  #%%
    set Const(Prefix1)  "s"
   #set Const(Prefix1)  "page"   ;# page01.html
    set Const(Datafile)  slides.txt
  }

 #########1#########2#########3#########4#########5#########6#########7#####

  proc Print { Str {Tag ""} } {
  #: Output to text-window
    #puts $Str
     .txt1 insert end "\n"
     .txt1 insert end "$Str" $Tag
     .txt1 see end          ;# scroll to bottom
     update
  }

  proc Log { Str {Tag ""} } {
  ##: Debug-Output
   #Print $Str $Tag        ;##
  }

  proc ShowOpt {} {
  ##: Debug: Show Options
    global Opt
    return        ;##

    Print ""
    foreach key [array names Opt] {
      Print "$key : $Opt($key)"
    }
  }

 #########1#########2#########3#########4#########5#########6#########7#####

  proc GetPage { url } {
  #: Fetch a webpage from the web
    set token [::http::geturl $url]
    set page  [::http::data $token]
    ::http::cleanup $token
    return $page
  }

  proc FetchImage { url fname } {
  #: Fetch a picture from the web 
  #  See also: [Polling web images with Tk]
   #puts -nonewline "Fetch: \"$url\" "
    Print "Fetch: \"$url\" " DL
    set Stat "skip"

    ## %% Deactivate for offline-testing:
if 1 {
    set f [open $fname w]
    fconfigure $f -translation binary
    set imgtok [http::geturl $url -binary true -channel $f]
   #set Stat [::http::error  $imgtok]
    set Stat [::http::status $imgtok]
   # ?? Errorhandling ??
    flush $f
    close $f
    http::cleanup $imgtok
}
    Print " Status: $Stat " Ok   ;# ?? true status
  }

 #########1#########2#########3#########4#########5#########6#########7#####

  proc Analyse1 { url1 page } {
  #: Analyse flickr album-webpage, 
  # like http://www.flickr.com/photos/PERSON
  # or   http://www.flickr.com/photos/PERSON/page2

    global PicNr Const Opt Data Data2

    set filename [format "%s%02d.html" $Const(Prefix1) $page ]
    if { $page == 1 } {
      set url $url1
    } else {
      set url [format "$url1/page%d" $page ] 
    }

    set base $url1
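    # Extract the server-part of the url (e.g. "http://www.flickr.com"):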
    set p1   [ string first "//" $url  0  ]; incr p1  2
    set p2   [ string first "/"  $url $p1 ]; incr p2 -1
    set p1 0
    set base [ string range      $url $p1 $p2 ]
   #Log "$base: $p1 $p2: '$base'"
    Log "# filename: $filename"         ;##
    Log "# url : $url"                 ;##
    Log "# base: $base"                ;##

    ## Web-access only when the "Get from Web" option is checked:
if { $Opt(Get_from_Web) } {
    set page  [ GetPage $url ]
   #puts "$page"                ;##
    set fileId [open $filename "w"]
    puts -nonewline $fileId $page
    close $fileId
}
    set fileId [open $filename r]
    set page [read $fileId]
    close $fileId

    foreach line [split $page \n] {
      # <title>Flickr: Photos from ALBUM</title>
      if {[regexp -- "<title>"  $line]} { 
        #Log "1: $line";
         set p1 [ string first ":"       $line  0  ]; incr p1 14
         set p2 [ string first "</title" $line $p1 ]; incr p2 -1
         set sA [ string range $line $p1 $p2 ]
         Log   "Album: $p1 $p2: '$sA'"
         Print "Album: '$sA'"
         set Data(0.Album) $sA
         set Data2(Album)  $sA
      }

      # <h4>stilles Seitental</h4>
      if {[regexp -- "<h4>"  $line]} { 
        #Log "2: $line";
         incr PicNr
         set p1 [ string first "<h4>"    $line  0  ]; incr p1  4
         set p2 [ string first "</h4>"   $line $p1 ]; incr p2 -1
         set sH [ string range $line $p1 $p2 ]
         Log "\n"
         Log   "$PicNr - Header: $p1 $p2: '$sH'" Hi
         Print "$PicNr - Header: '$sH'" Hi
         set Data($PicNr.Head) $sH
         set Data($PicNr.Desc) ""
      }

      # <p class="Photo"><a href="/photos/PERSON/87654321/">
      # <img src="http://static.flickr.com/42/87654321_8888_m.jpg" width="240" height="180" /></a></p>
      if {[regexp -- (class="Photo")  $line]} { 
        #Log "3: $line";
        #incr n
         set p1 [ string first "href=" $line  0  ]; incr p1  6
         set p2 [ string first "img"   $line $p1 ]; incr p2 -4
         set sL [ string range $line $p1 $p2 ]
         Log "Link : $p1 $p2: '$sL'"
        #set Data($PicNr.Link) $sL

         set p1 [ string first "src=" $line  0  ]; incr p1  5
         set p2 [ string first "jpg"  $line $p1 ]; incr p2  2
         set sP [ string range $line $p1 $p2 ]
         Log   "Photo: $p1 $p2: '$sP'"
        #Print "Photo: '$sP'"
        #set Data($PicNr.Photo) $sP

        set url2 $sL    ;# /photos/PERSON/87654321/
        set nr   $url2  ;# /photos/PERSON/87654321/
       #Log "#> '$nr'"
        set p2   [ string last "/"  $url2     ]; incr p2 -1
        set p1   [ string last "/"  $url2 $p2 ]; incr p1  1
        set nr   [ string range     $url2 $p1 $p2 ]
        Log "#>Nr: $p1 $p2 '$nr'"
       #set filename [format "p%04d.html" $PicNr ]
        set filename [format "%s.html" $nr ]       ;# Filename for local photo-page
         set Data($PicNr.Name) $nr
         Print "Name : $nr"

        set filename0 [format "%s_0100.jpg" $nr ]  ;#  100x75 - Thumbnail
        set sP0 [ string map {_m _t} $sP ]
        if { $Opt(100x75) } { FetchImage $sP0 $filename0 }

        set filename1 [format "%s_0240.jpg" $nr ]  ;#  240x180 - Small
        if { $Opt(240x180) } { FetchImage $sP $filename1 }

        set filename2 [format "%s_0500.jpg" $nr ]  ;#  500x375 - Medium
        set sP2 [ string map {_m ""} $sP ]
        if { $Opt(500x375) } { FetchImage $sP2 $filename2 }

        set filename3 [format "%s_1024.jpg" $nr ]  ;# 1024x768 - Large
        set sP3 [ string map {_m _b} $sP ]
        if { $Opt(1024x768) } { FetchImage $sP3 $filename3 }
       #break        ;##

        set filename4 [format "%s_2048.jpg" $nr ]  ;# Original Size, e.g. 2560x1920
        set sP4 [ string map {_m _o} $sP ]
        if { $Opt(MaxSize) } { FetchImage $sP4 $filename4 }
      }

      # <p class="Desc">im Khao Sok</p>
      # <p class="Desc">Figuren aus dem alten China, auf<a href="/photos/PERSON/87654321/">...</a></p>
      if {[regexp -- (class="Desc")   $line]} { 
        #Log "4: $line"; 
         set p1 [ string first "Desc"    $line  0  ]; incr p1  6
        #set p2 [ string first "</p>"    $line $p1 ]; incr p2 -1
         set p2 [ string first "<"       $line $p1 ]; incr p2 -1
         set sD [ string range $line $p1 $p2 ]
         Log   "Descr: $p1 $p2: '$sD'"
        #Print "Descr: '$sD'"
         set Data($PicNr.Desc) $sD   ;# gets replaced again in Analyse2
      }

      # <a href="/photos/PERSON/page12/" class="end">12</a>
      # <a href="/photos/PERSON/" class="end">1</a>
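      # The last "page"-link (below the photos) gives the total number of album-pages;
      # Analyse1 returns that number, which Go uses as MaxPage: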
      if {[regexp -- (page.*class="end")    $line]} { 
        #Log "5: $line";
        #incr n; 
         set p1 [ string first "page" $line  0  ]; incr p1  4
         set p2 [ string first "/"    $line $p1 ]; incr p2 -1
         set s9 [ string range $line $p1 $p2 ]
         Log "End: $p1 $p2: '$s9'"
         return [incr s9 0]
        #break
      }

      # <p class="Activity">
      if {[regexp -- (class="Activity") $line]} { ;# now get photo-page
         Analyse2 $base $sL $filename
        #break
      }

      # <!-- ### MAIN NAVIGATION ### -->
      if {[regexp -- "<!-- ### MAIN" $line]} { 
        break    ;# Nothing interesting beyond this point
      }
    }
    return 0
  }

 #########1#########2#########3#########4#########5#########6#########7#####

  proc Analyse2 { url1 url2 filename } {
  #: Analyse a flickr photo-webpage (which shows a single photo),
  # like http://www.flickr.com/photos/PERSON/87654321/
  #
  # @url1    : first part of the url, e.g. "http://www.flickr.com/"
  # @url2    : 2nd   part of the url, e.g. "/photos/PERSON/87654321/"
  # @filename: filename for local copy of webpage

    global PicNr Opt Data

    set url "$url1$url2"

    ## Web-access only when the "Get from Web" option is checked:
if { $Opt(Get_from_Web) } {
    set page  [ GetPage $url ]
   #Log "$page"                ;##
    set fileId [open $filename "w"]
    puts -nonewline $fileId $page
    close $fileId
}
    set fileId [open $filename r]
    set page [read $fileId]
    close $fileId

    foreach line [split $page \n] {
      # page_current_url
      if {[regexp -- "page_current_url" $line]} { 
        #Log "1>> $line";
      }

      # <li class="Stats">
      # Taken with an Olympus C5050Z.
      if {[regexp -- "Taken with an" $line]} { 
        #Log "2>> $line";
         set p1 [ string first "with"   $line  0  ]; incr p1  8
         set p2 [ string first "<br /"  $line $p1 ]; incr p2 -3
         set sC [ string range $line $p1 $p2 ]
         Log ">> Camera: $p1 $p2: '$sC'"
      }

      # <p class="DateTime"> Uploaded on <a href="/photos/PERSON/archives/date-posted/2006/01/07/"
      # style="text-decoration: none;">Jan 7, 2006</a></p>
      if {[regexp -- "Uploaded on" $line]} { 
        #Log "3>> $line";
         set p1 [ string first "date-posted"  $line  0  ]; incr p1 12
         set p2 [ string first "style"        $line $p1 ]; incr p2 -4
         set sU [ string range $line $p1 $p2 ]
         set sU [ string map {/ -} $sU ]
         Log ">> Upload: $p1 $p2: '$sU'"
      }

      # Taken on <a href="/photos/PERSON/archives/date-taken/2006/01/10/"
      # style="text-decoration: none;">January 10, 2006</a>
      if {[regexp -- "archives/date-taken" $line]} { 
        #Log "4>> $line";
         set p1 [ string first "date-taken"   $line  0  ]; incr p1 11
         set p2 [ string first "style"        $line $p1 ]; incr p2 -4
         set sS [ string range $line $p1 $p2 ]
         set sS [ string map {/ -} $sS ]
         set Data($PicNr.Date) $sS
         Log ">> Shot: $p1 $p2: '$sS'"
         Print  "Date: '$sS'"
      }

      # <h1 id="title_div87654321">stilles Seitental</h1>
      if {[regexp -- "<h1"           $line]} { 
        #Log "H1: $line";
         set p1 [ string first ">"       $line  0  ]; incr p1  1
         set p2 [ string first "</h1>"   $line $p1 ]; incr p2 -1
         set sH [ string range $line $p1 $p2 ]
         Log ">> $PicNr - Header: $p1 $p2: '$sH'"
      }

      # <div id="description_div87654321" class="photoDescription">im Khao Sok</div>
      # <div id="description_div73182923" class="photoDescription">Massiert wird überall und immer...,
      # viel Konkurrenz bedeutet kleine Preise: 1h Fußmassage = 120Bt (3€)<br />
      # Es massieren die Frauen, die tragende Säule der Gesellschaft.</div>
      #
      if {[regexp -- (class="photoDescription")   $line]} { 
        #Log "D: $line"; 
         set p1 [ string first "Desc"    $line  0  ]; incr p1 13
         set p2 [ string first "</div>"  $line $p1 ]
      # !! Multiline-Descriptions: get at least the first line:
         if {$p2 > $p1} { incr p2 -1 } else { set p2 [string length $line] }

         set sD [ string range $line $p1 $p2 ]
         set Data($PicNr.Desc) $sD
         Log ">> Descr: $p1 $p2: '$sD'"
         Print  "Descr: '$sD'"
      }

      # Abort scanning of current file (nothing of interest below):
      if {[regexp -- "upload_form_container"  $line]} { 
        Print "-"

        Log "##> $PicNr : $Data($PicNr.Name) #\
          $Data($PicNr.Date) #\
          $Data($PicNr.Head) #\
          $Data($PicNr.Desc)"

        global Data2
       #%%
        set key $Data($PicNr.Name)
        set Data2($key.Date) $Data($PicNr.Date)
        set Data2($key.Head) $Data($PicNr.Head)
        set Data2($key.Desc) $Data($PicNr.Desc)

        break 
      }  
    }
  }

 #########1#########2#########3#########4#########5#########6#########7#####

 proc Go {url} {
 #: Start processing after user entered url

    global PicNr Const Opt Data
    set StartPage 1

    Print ""
    Print "Flickr-Download from $url" Hi

    set PicNr     0
    set filename [ format "%s%02d.html" $Const(Prefix1) $StartPage ]         ;# page01.html
    set MaxPage  [ Analyse1 $url $StartPage ]
    incr StartPage 1

   #set MaxPage  2  ;##
    if { $Opt(All_Pages) } {
      for {set page $StartPage} {$page <= $MaxPage} {incr page} {
        Analyse1 $url $page
      }
    }
    Print ""
    Print "Done !" Hi

  #: Show collected Data about pictures:
    Print ""
    set line -1
 #%%
    global Data2
    set line 0
    foreach key [lsort -dictionary [ array names Data2 ]] {
      Print "$key : $Data2($key)"  [expr [incr line]%3]
    }

    arr'dump Data  $Const(Datafile)
    arr'dump Data2 data2.txt
    Print ""

  }

  proc arr'dump { _arr fn } {
  #: Dump array to file, in a format ready to be loaded via 'source'
    upvar 1 $_arr arr
    set f [open $fn w]
    puts $f "array set $_arr \{"
    foreach key [ lsort [array names arr] ] {
      puts $f [ list $key $arr($key) ]
    }
    puts $f "\}"
    close $f
  }


 #########1#########2#########3#########4#########5#########6#########7#####

  #: Main :
  Init

 #catch {console show}        ;##
  pack [frame .f1]
  pack [frame .f2]
  label     .lab1   -text "URL:"
  entry     .ent1   -textvar e -width 80
  text      .txt1   -yscrollcommand ".scr1 set"  -width 100 -height 40  -bg white -wrap word
  scrollbar .scr1   -command ".txt1 yview"
  button    .but0   -text "Clear Log" -command { .txt1 delete 0.0 end }
  button    .but1   -text "Go"        -command { Go $e }
  pack .lab1 .ent1 .but0 .but1 -in .f1 -side left -padx 2

  label .lab2 -text "Options:"
  pack  .lab2 -in .f2 -side left

  set AllPages "All Pages"
  lappend Options 100x75 240x180 500x375 1024x768 MaxSize  Get_from_Web All_Pages
  foreach size $Options {
         set cl [label       .sz$size -text $size ]
         set cc [checkbutton .cb$size -variable Opt($size) -command ShowOpt ]
         pack $cl -in .f2 -side left -anchor e
         pack $cc -in .f2 -side left -anchor w
  }
  .txt1 tag configure "Hi"  -background red       -foreground white
  .txt1 tag configure "DL"  -background lightblue -underline 1
  .txt1 tag configure "Ok"  -background green     -underline 0
  .txt1 tag configure 1     -background cyan

  Print " $Prg(Title) $Prg(Version) - $Prg(Date) " Hi
  Print "$Prg(About)"
  Print "(c) $Prg(Author) - $Prg(Contact)" Ok

  set Opt(100x75)       0
  set Opt(All_Pages)    0
  set Opt(Get_from_Web) 1
  ShowOpt   ;##
  set Data(0.Album) "Flickr"

  pack .scr1 -side right  -fill y
  pack .txt1 -side right 

  bind .ent1 <Return> { Go $e }
  bind .     <Key-F1> { console show }

  set e http://www.flickr.com/photos/
 #set e http://www.flickr.com/photos/siegfrieden
 #set e http://www.flickr.com/photos/siegfrieden/page2

  wm title . $Prg(Title)
  focus -force .ent1

#.


Comments

Now with a nice GUI: enter the URL of the first album-page, check the options you want, then press the Go-button.

The checkboxes for the image-sizes to download should be self-explanatory. When "Get from Web" is not checked, no internet access happens and the local files from a previous download are used instead. When "All Pages" is not checked, processing stops after the first page.

CJL wonders whether the Flickr-generated RSS feeds for an album might be a quicker way of getting at the required set of image URLs.

HJG: I don't think so - the RSS data lists only the most recently uploaded images, it misses some details (e.g. the date when a picture was taken), and the description-field looks messy.

Here is a quicker and dirtier way to get just the pictures, using wget and awk:

  • Visit the first page of the album with a browser [L2 ]
  • Save this page as s01.html (html-only is enough)
  • At the bottom of the album-page, right-click each "Page X"-link, and save-link-as s02.html, etc. (ok, more than about a dozen of these would get tiresome...)
  • awk -f flickr.awk s*.html > links.txt
  • wget -w1 -i links.txt

With this minimal flickr.awk (it only extracts the picture links - no titles, headers, descriptions etc.):

  BEGIN           { FS="\""
                    Found=0;
                    print "# flickr-Download:"
                  }
  /class="Photo/  { Found++
                    sub( "^.*http", "http", $0)
                    sub( "_m", "_b", $1)        # _b : large picture = 1024x768
                    print $1
                    next
                  }
  END             { print "# Found:", Found }

Next step: Flickr Offline Photoalbum.


schlenk wonders if using htmlparse or tdom in html mode would make the page parsing code look nicer.

HJG: Are there any examples of these tools here on the wiki (or elsewhere), with a demo of how to parse a fairly complex webpage? Of course, it is hard to tell what the webpage being parsed looked like when only the parsing code is shown.

I admit that my code is more "working" than "elegant"...
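
For comparison, here is a rough sketch of how the link-extraction could look with tdom (untested against Flickr; it assumes the tdom package is installed and that the album-page still uses the <p class="Photo"> markup shown in the comments of Analyse1):

  package require tdom
  package require http

  proc PhotoLinks { url } {
  #: Return the image-URLs of one album-page, using an HTML-DOM
  #  instead of line-by-line regexp/string matching
    set tok [::http::geturl $url]
    set doc [dom parse -html [::http::data $tok]]
    ::http::cleanup $tok
    set urls {}
    foreach img [[$doc documentElement] selectNodes {//p[@class="Photo"]//img}] {
      lappend urls [$img getAttribute src]
    }
    $doc delete
    return $urls
  }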

2006-02-02: After the first successful use of this program, some problems showed up:

  • Descriptions for photos can be longer than one line
  • Flickr-pages with defined "sets" have extra title-entries (should be filtered out)
  • For some pictures, the selected size might not be available (e.g. only 640x480).
  • No checks yet whether the download of an image was successful (see the sketch below this list).
  • There are other types of webpages at flickr (e.g. Set, Calendar, Tags...) that cannot be parsed yet.
  • I have not yet decided on the design for the data to pass to the viewer.
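
For the missing download-check, the middle part of FetchImage could be extended roughly like this (a sketch only, not tested against Flickr: ::http::status reports the state of the transaction and ::http::ncode the numeric HTTP code, so a failed or 404-answer can be detected and the half-written file removed):

    set imgtok [http::geturl $url -binary true -channel $f]
    set Stat   [::http::status $imgtok]   ;# ok / error / timeout / eof / reset
    set Code   [::http::ncode  $imgtok]   ;# numeric HTTP code, e.g. 200 or 404
    http::cleanup $imgtok
    close $f
    if { $Stat ne "ok" || $Code ne "200" } {
      Print " Download failed ($Stat $Code) - removing $fname " Hi
      file delete $fname
    }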

Development will continue in the near future...


See also: