Parsing PDF

Keith Vetter 2007-10-20 : I've been playing with PDF and generating it on the fly. My first effort resulted in Text2PDF, but I needed more formatting control so the next step was to delve into the file format [1 ]. PDF can be a VERY complex format with support for incremental updates and generations. I find that the only way for me to understand a new technology is to get my hands dirty, so I decided to write a parser to pull apart a PDF file.

So here is code to parse a PDF file. It understands the lexemes comprising the file, and it understands the document structure and the page tree data structure. It lets you see, but does NOT understand, the low-level stuff which makes up pages or defines fonts, etc.

If anybody else wants to delve into the PDF deeper, this code would make a good starting point.

The code here contains both the PDF parsing code plus two different ways to run it, both a command line interface and a GUI interface. In either case, it displays a PDF's catalog and root of the page layout tree, plus the meta-data for a specified page.


##+##########################################################################
#
# Parsing PDF -- decodes PDF document
# by Keith Vetter. October 2007
#
# http://partners.adobe.com/asn/tech/pdf/specifications.jsp
# http://www.adobe.com/devnet/pdf/pdfs/PDFReference13.pdf

proc ParsePDF {} {
   global D CATALOG PAGES PAGE CONTENT INFO

   unset -nocomplain D
   unset -nocomplain CATALOG
   unset -nocomplain PAGES
   unset -nocomplain PAGE
   unset -nocomplain CONTENT
   unset -nocomplain INFO

   CheckValidPDF
   GetTrailer
   ReadXREF
   ReadObject $D(root) CATALOG
   ReadObject $CATALOG(/Pages) PAGES
   
   if {[info exists D(info)]} { ReadObject $D(info) INFO }
}

proc ParsePDF.File {fname} {
   set fin [open $fname r]
   fconfigure $fin -translation binary
   SetText [read $fin]
   close $fin

   ParsePDF
}
##+##########################################################################
#
# CheckValidPDF -- does some simple sanity checks then reads in the
# cross-reference location.
#
proc CheckValidPDF {} {
   global data D

   if {[string range $data 0 4] ne "%PDF-"} { error "Mal-formed PDF 1" }
   set tail [string range $data end-1024 end] ; list
   if {! [regexp {%%EOF[\r\n]*$} $tail]} { error "Mal-formed PDF 2" }
   set n [regexp {[\n\r]+startxref *[\n\r]+(\d+)} $tail -> fpos]
   if {! $n} { error "Mal-formed PDF 3" }
   set D(xref,fpos) $fpos
}
##+##########################################################################
#
# GetTrailer -- reads key info out of the trailer section.
# Not sure this is correct with incrementally updated pdf files.
#
proc GetTrailer {} {
   global data D

   # Is this the best way to locate the trailer??? especially with
   # multiple trailers with incremental updates
   set offset [string first trailer $data $D(xref,fpos)]
   SetPosition $offset

   set token [GetToken]                        ;# Trailer keyword
   set token [GetToken]                        ;# Start of dict
   set dict [ReadDict]
   unset -nocomplain tmp
   array set tmp $dict

   if {! [info exists tmp(/Size)]} { error "trailer missing /Size" }
   if {! [info exists tmp(/Root)]} { error "trailer missing /Root" }
   set D(xref,cnt) $tmp(/Size)
   set D(root) $tmp(/Root)
   if {[info exists tmp(/Prev)]} { set D(prev) $tmp(/Prev) }
   if {[info exists tmp(/Info)]} { set D(info) $tmp(/Info) }
}
##+##########################################################################
#
# ReadXREF -- reads in the main and possibly prev XREF section.
# NB. can there be more than 2?
#
proc ReadXREF {} {
   global D

   set offsets $D(xref,fpos)
   if {[info exists D(prev)]} { lappend offsets $D(prev) }
   foreach offset $offsets {
       SetPosition $offset
       GetToken
       set token [GetToken]
       foreach {start cnt} [split $token " "] break

       for {set i 0} {$i < $cnt} {incr i} {
           set id [expr {$start + $i}]
           set token [GetToken]
           scan $token "%ld %d %s" offset generation inuse
           if {$inuse eq "f"} continue
           set D(xref,$id,offset) $offset
       }
   }
}
##+##########################################################################
#
# ReadObject -- assumes an object is just a dictionary, stores the
# dictionary into specified array
#
proc ReadObject {objID arrName} {
   upvar 1 $arrName arr
   unset -nocomplain arr

   PositionAtObject $objID
   set token [GetToken]
   if {$token ne "<<"} { error "Object $objID not a dictionary" }
   set dict [ReadDict]
   array set arr $dict
}
proc ShowObject {id title} {
   ReadObject $id tmp
   myParray tmp $title
}
##+##########################################################################
#
# GetPage -- Searches and extracts a given document page
#
proc GetPage {num} {
   global PAGE CONTENT

   set id [FindPage $num]
   if {$id eq ""} {return ""}                  ;# No page
   ReadObject $id PAGE

   # Content can be quite complex, just try for the simple case
   if {[catch {ReadObject $PAGE(/Contents) CONTENT}]} {return $id}
   if {! [string is integer -strict $CONTENT(/Length)]} { ;# Deref length
       PositionAtObject $CONTENT(/Length)
       set len [GetToken]
       append CONTENT(/Length) " => $len"
   }
   return $id
}
##+##########################################################################
#
# FindPage -- walks the page tree to find a given page
#
proc FindPage {pageNum} {
   global PAGES

   if {$pageNum < 1 || $pageNum > $PAGES(/Count)} {
       return ""
       error "Page number too big"
   }

   set thisPage 1
   set kids [eval concat $PAGES(/Kids)]
   while {1} {
       foreach {id gen R} $kids {
           ReadObject $id child
           if {$child(/Type) eq "/Page"} {
               if {$thisPage == $pageNum} { return $id }
               incr thisPage
               continue
           }

           set lastPage [expr {$thisPage + $child(/Count)}]
           if {$pageNum < $lastPage} {         ;# Go down a level
               set kids [eval concat $child(/Kids)]
               break
           } else {
               set thisPage $lastPage
           }
       }
   }
}
##+##########################################################################
#
# PositionAtObject -- "seeks" to position of a direct or indirect reference
#
proc PositionAtObject {ref} {
   set id [lindex $ref 0]
   SetPosition $::D(xref,$id,offset)
   set token [GetToken]                        ;# Object define
}
##+##########################################################################
#
# myParray -- specialized version of parray pretty printing an array
#
proc myParray {arrName {title ""} {pattern *}} {
   upvar 1 $arrName arr

   if {$title eq ""} { set title $arrName }
   puts $title

   set max 0
   foreach name [array names arr $pattern] {
       if {[string length $name] > $max} {
           set max [string length $name]
       }
   }
   incr max 1
   if {[info exists arr(/Type)]} {
       puts [format "  %-*s = %s" $max /Type $arr(/Type)]
   }

   foreach name [lsort -dictionary [array names arr $pattern]] {
       if {$name eq "/Type"} continue
       puts [format "  %-*s = %s" $max $name $arr($name)]
   }
   puts ""
}
##+##########################################################################
#
# ReadArray -- returns a list of everything up to
# the matching closing array character.
#
# NB. can be nested w/ arrays and dictionaries
#
proc ReadArray {} {
   set arr {}
   while {1} {
       set token [GetToken]
       if {$token eq "]"} break
       if {$token eq "\["} {
           set token [ReadArray]
       } elseif {$token eq "<<"} {
           set token [ReadDict]
       }
       lappend arr $token
   }
   return $arr
}
##+##########################################################################
#
# ReadDict -- returns a list of everything up to
# the matching closing dictionary character.
#
# NB. can be nested w/ arrays and dictionaries
#
proc ReadDict {} {
   set dict {}
   while {1} {
       set name [GetToken]
       if {$name eq ">>"} break

       set value [GetToken]
       if {$value eq "<<"} {
           set value [ReadDict]
       } elseif {$value eq "\["} {
           set value [ReadArray]
       }
       lappend dict $name $value
   }
   return $dict
}
##+##########################################################################
#
# GetToken -- breaks input stream into PDF tokens, worrying about
# strings, boolean, hex sequences, names, arrays, dictionaries and
# atoms.
#
proc GetToken {} {
   set eotoken "()<>\[]{}/%"

   # Past white space and comments
   set inComment false
   while {1} {
       set ch [GetChar]
       if {$inComment} {
           if {$ch eq "\n" || $ch eq "\r"} {
               set inComment false
           }
       } elseif {$ch eq "%"} {
           set inComment true
       } elseif {! [string is space $ch]} break
   }

   set start $::D(pos)
   while {1} {
       if {$ch eq "(" || $ch eq "\{"} {        ;# In a string or boolean
           set backslash false
           set match [expr {$ch eq "(" ? ")" : "\}"}]
           while {1} {
               set ch [GetChar]
               if {$ch eq "\\"} {
                   set backslash true
               } elseif {! $backslash && $ch eq $match} {
                   break
               } else {
                   set backslash false
               }
           }
           break
       }
       if {$ch eq "\[" || $ch eq "]"} break
       if {$ch eq "<"} {                       ;# Hex or dictionary
           set ch [GetChar]
           if {$ch eq "<"} break               ;# Dictionary
           while {$ch ne ">"} {
               set ch [GetChar]
           }
           break
       }
       if {$ch eq ">"} {
           GetChar
           break
       }

       set name [expr {$ch eq "/"}]
       while {1} {
           set ch [GetChar]
           if {$ch eq "\n" || $ch eq "\r"} break
           if {$name && [string is space $ch]} break
           if {[string first $ch $eotoken] != -1} break
       }
       UnGetChar
       break
   }
   set ::D(token,start) $start
   set ::D(token,end) $::D(pos)
   set token [string range $::data $start $::D(pos)]
   if {[regexp {\((.[^\)]*)\)} $token -> string]} {
    set token $string
   } elseif {[regexp {^<([^<]?[^>]*)>[^>]?$} $token -> hexstring]} {
     set token [decodeHexstring $hexstring]
   }
   return $token
}
proc GetChar {} {
   set ch [string index $::data [incr ::D(pos)]]
   return $ch
}
proc UnGetChar {} {
   incr ::D(pos) -1
}
proc SetPosition {pos} {
   set ::D(pos) [expr {$pos-1}]
}
proc SetText {txt} {
   set ::data $txt
   set ::D(pos) -1
}
proc decodeHexstring {hexstring} {
  if {[string tolower [string range $hexstring 0 3]] eq "feff"} {
    set string [string range $hexstring 4 end]
    set result {}
    for {set i 0} {$i < [string length $string]} {incr i 4} {
      set hex [string range $string $i [expr $i + 3]]
      append result [encoding convertfrom unicode [binary format s 0x$hex]]
    }
    return $result
  } else {
    # don't know what to do
    return $hexstring
  }
}
################################################################
#
# DEMO CODE
#

proc NoGUI {} {
   global argc argv
   global CATALOG PAGES PAGE CONTENT INFO

   if {$argc == 0 || $argc > 2} {
       puts stderr "usage: parsePDF <pdf file> ?page number?"
       exit
   } elseif {$argc == 1} {
       lappend argv 1
   }
   foreach {fname pageNum} $argv break

   puts "Parsing [file tail $fname]\n"
   ParsePDF.File $fname

   myParray CATALOG Catalog
   myParray PAGES "Page Tree"
   GetPage $pageNum
   myParray PAGE "Page $pageNum Metadata"
   myParray CONTENT "Page $pageNum Contents"
   
   myParray INFO "Info"
}

if {[lindex $argv 0] eq "-gui"} {
  catch {package require Tk}
  set argv [lrange $argv 1 end]
}
if {! [info exists tk_version]} {
   NoGUI
   exit
}

################################################################
#
# GUI DEMO CODE
#

if {! [catch {package require tile}]} {
   namespace import -force ::ttk::button
}

set S(pageNum) 1
set S(havePDF) 0
set S(fname) ""

proc DoDisplay {} {
   global S argc argv

  set S(fname) ""
  if {$argc > 0} {
    set S(fname) [lindex $argv 0]
  }
  
   wm title . "Parsing PDF"
   image create photo ::img::question -data {
       R0lGODlhCQAMALMAAODf49PS1oeHiURDRQAAAJOSlVtbXG1sbre2uczLz3t7fa6usf//////////
       /////yH5BAEAAAAALAAAAAAJAAwAAwQmEAQxyCggGxwKwcAxZABBkChhoBlCIGwywKyysNmJ79lx
       7D4eKwIAOw==
   }
   image create bitmap ::img::dots -data {
       #define dots_width 16
       #define dots_height 9
       static char dots_bits = {
           0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
           0x00, 0x00, 0x00, 0x84, 0x10, 0xce, 0x39, 0x84, 0x10}}
   image create bitmap ::img::star -data {
       #define plus_width  11
       #define plus_height 9
       static char plus_bits[] = {
           0x00,0x00, 0x24,0x01, 0xa8,0x00, 0x70,0x00, 0xfc,0x01,
           0x70,0x00, 0xa8,0x00, 0x24,0x01, 0x00,0x00 }}

   frame .file -bd 2 -relief ridge
   frame .finfo -bd 2 -relief ridge
   frame .fcat -bd 2 -relief ridge
   frame .fpages -bd 2 -relief ridge
   frame .fpage -bd 2 -relief ridge

   # File area
   label .file.l -text "PDF File"
   entry .file.fname -textvariable S(fname)
   button .file.choose -image ::img::dots -command ChooseFile
   button .file.go -text Parse -command Go
   button .file.about -image ::img::question -command About

   grid .file.l .file.fname .file.choose -sticky ns -pady 5
   grid .file.fname -sticky nswe
   grid .file.go - - -pady 5
   grid columnconfigure .file 1 -weight 1
   place .file.about -anchor se -relx 1 -rely 1 -x -5 -y -5

   # Info area
   label .finfo.title -text "Info"
   scrollbar .finfo.sb -orient vertical -command {.finfo.info yview}
   listbox .finfo.info -width 40 -listvariable S(info) -yscroll {.finfo.sb set}

   # Catalog and data area
   label .fcat.title -text "Catalog"
   scrollbar .fcat.sb -orient vertical -command {.fcat.catalog yview}
   listbox .fcat.catalog -listvariable S(catalog) -yscroll {.fcat.sb set}
   label .fpages.title -text "Page Tree Root"
   scrollbar .fpages.sb -orient vertical -command {.fpages.pages yview}
   listbox .fpages.pages -listvariable S(pages) -yscroll {.fpages.sb set}

   frame .fpage.btns
   label .fpage.l -text "Page"
   entry .fpage.num -width 3 -textvariable S(pageNum) -justify center \
       -validate key -vcmd {string is integer -strict %P}
   button .fpage.go -image ::img::star -state disabled -command GoPage
   scrollbar .fpage.sb -orient vertical -command {.fpage.page yview}
   listbox .fpage.page -listvariable S(page) -yscroll {.fpage.sb set}

   grid .file - - - -sticky ew
   grid .finfo .fcat .fpages .fpage -sticky news
   grid columnconfigure . {1 2 3} -weight 1
   grid columnconfigure . 0 -weight 2
   grid rowconfigure . {1} -weight 1

   pack .finfo.title -side top -fill x
   pack .finfo.sb -side right -fill y
   pack .finfo.info -side top -fill both -expand 1
   pack .fcat.title -side top -fill x
   pack .fcat.sb -side right -fill y
   pack .fcat.catalog -side top -fill both -expand 1
   pack .fpages.title -side top -fill x
   pack .fpages.sb -side right -fill y
   pack .fpages.pages -side top -fill both -expand 1

   pack .fpage.btns -side top
   pack .fpage.l .fpage.num .fpage.go -in .fpage.btns -side left
   pack .fpage.sb -side right -fill y
   pack .fpage.page -side top -fill both -expand 1
   #grid x .fpage.l .fpage.num .fpage.go x
   #grid configure .fpage.go -padx 5 -sticky ns
   #grid .fpage.page - - - - .fpage.sb -sticky news
   #grid columnconfigure .fpage {0 4} -weight 1
   #grid rowconfigure .fpage {1} -weight 1

   foreach var {fname pageNum havePDF} {
       foreach t [trace info variable S($var)] {
           eval trace remove variable S($var) $t
       }
       trace add variable S($var) write Tracer
   }
   set pdfs [glob -nocomplain *.pdf]
   if {$S(fname) eq "" && [llength $pdfs] > 0} {
        set S(fname) [lindex $pdfs [expr {int(rand()*[llength $pdfs])}]]
   }
   if {$S(fname) ne ""} {
        Go
   }
}
proc Tracer {var1 var2 op} {
   global S

   if {$var1 ne "S"} return
   if {$var2 eq "fname"} {
       .file.go config -state \
           [expr {[file isfile $S(fname)] ? "normal" : "disabled"}]
   } elseif {$var2 eq "pageNum" || $var2 eq "havePDF"} {
       set how disabled
       if {$S(havePDF) && [string is integer -strict $S(pageNum)]} {
           set how normal
       }
       .fpage.go config -state $how
   }
}
proc ChooseFile {} {
   global S

   set fname [tk_getOpenFile -parent . -defaultextension ".pdf" \
                  -initialfile $S(fname) -filetypes {{{PDF Files} .pdf}} \
                  -title "PDF Parser Open File"]
   if {$fname eq ""} return
   set S(fname) $fname
}
proc ArrayToListBox {arrName var} {
   global S
   upvar 1 $arrName arr

   set S($var) {}
   set max 0
   foreach name [array names arr] {
       if {[string length $name] > $max} {
           set max [string length $name]
       }
   }
   incr max 1
   if {[info exists arr(/Type)]} {
       lappend S($var) [format "  %-*s = %s" $max /Type $arr(/Type)]
   }

   foreach name [lsort -dictionary [array names arr]] {
       if {$name eq "/Type"} continue
       lappend S($var) [format "  %-*s = %s" $max $name $arr($name)]
   }
}
proc Go {} {
   global S CATALOG PAGES INFO

   set S(catalog) [set S(pages) [set S(page) {}]]
   if {! [file isfile $S(fname)]} return

   ParsePDF.File $S(fname)
   ArrayToListBox INFO info
   ArrayToListBox CATALOG catalog
   ArrayToListBox PAGES pages
   set S(havePDF) 1
   GoPage
}
proc About {} {
   set txt "Parsing PDF\nby Keith Vetter, Oct 2007\n\n"
   append txt "This program demos code that can parse PDF files. The\n"
   append txt "code displays a pdf's document structure including its\n"
   append txt "catalog and the root of its page tree. It will also show\n"
   append txt "the page object for any page.\n\n"

   append txt "For more info on PDF format see\n"
   append txt "http://partners.adobe.com/asn/tech/pdf/specifications.jsp"

   tk_messageBox -message $txt
}
proc GoPage {} {
   global S PAGES PAGE

   if {! $S(havePDF)} return
   if {! [string is integer -strict $S(pageNum)]} return
   if {$S(pageNum) < 1} {
       set S(page) {{Page number too small}}
       return
   } elseif {$S(pageNum) > $PAGES(/Count)} {
       set a [expr {$PAGES(/Count) == 1 ? "page" : "pages"}]
       set S(page) [list [list Only $PAGES(/Count) $a]]
       return
   }

   set id [GetPage $S(pageNum)]
   if {$id eq ""} {
       set S(page) "{bad page number}"
   } else {
       ArrayToListBox PAGE page
       set S(page) [concat [list "Object id: $id"] {---------} $S(page)]
   }

}
DoDisplay
return

UKo 2008-05-31: I have added some code to get the info block from the pdf-file and cleaned up the code a bit, so that it can be started with a recent tclkit (tested tclkit 8.4 and kbskit 8.5). Also the strings from the info block can be decoded.

Invocation: parsePDF.tcl ?-gui? ?file.pdf?

Without the -gui option the information will be displayed on the commandline. With -gui a given pdf-file will be analysed, without a file a random file from the current directory will be chosen.


Andreas otto 2007-10-21 : other software solution pdflib

KPV: Note, pdflib is a commercial, non-TCL (ANSI C) product


EE 2008-03-05: NOTE that the above URL pointing at the PDF specification points into a subscriber-only area on the Adobe website. The PDF spec is publicly available at http://www.adobe.com/devnet/pdf/pdf_reference.html


Scott Beasley 2008-06-01: I started on a PDF Page splitter in tcl, but dropped it in favor of the iText Java Class library [2 ] and TclBlend. The iText code does so much and has worked so well, it was hard to pass up. I know in the long run for many people though, a tcl only solution would be the best for them.

ABU 2008-06-03: I agree with Scott. iText is far more complete and it is both for reading and generating PDF. iText + TclBlend is a good solution for tclers. Just two little things are missing :

  1. TclBlend is paranoic about the tcltk interpreter it is intended to run with - if you change the interpreter, you should rebuild the tclblend binaries.
  2. iText Java API is really complex and huge, it exposes an intricated mixin of low-level functions (..methods) and high-level functions. It should be useful to redesign a more smart, high-level API in.

CMcC 3Jun08 I would prefer a tcl solution - after all, there's already a PDF parser in C without going to Java, if that were the only dimension of solution.

RKZn 2019-07-20: In 2019 this is still a nice module to learn about PDF structure, though it is outdated. For starters it only handles plain text XREFs (cross-references) while nowadays some get encoded in binary; and not all indirect objects are PDF dictionaries. I've been using it to learn and hopefully make a small module to make a few simple edits. Maybe... So far I noticed a couple of bugs here, that would show up when the document has a empty hexadecimal encoded string, i.e. a <>. So I changed the code above (at GetToken) from

while {1} {
    set ch [GetChar]
    if {$ch eq ">"} break
}

to

while {$ch ne ">"} {
    set ch [GetChar]
}

and from

} elseif {[regexp {^<([^<][^>]*)>[^>]?$} $token -> hexstring]} {

to (added a ?)

} elseif {[regexp {^<([^<]?[^>]*)>[^>]?$} $token -> hexstring]} {

while at it, I updated (at DoDisplay) the deprecated call

trace variable S($var) w Tracer

to

trace add variable S($var) write Tracer