[Keith Vetter] 2007-10-20 : I've been playing with [PDF] and
generating it on the fly. My first effort resulted in [Text2PDF], but
I needed more formatting control so the next step was to delve into the
file format
[http://partners.adobe.com/asn/tech/pdf/specifications.jsp].
PDF can be a VERY complex format with support
for incremental updates and generations.
I find that the only way for me to understand a new technology is to
get my hands dirty, so I decided to write a parser to pull apart a PDF
file.
So here is code to parse a PDF file. It understands the
lexemes comprising the file, and it understands the document
structure and the page tree data structure. It lets you see, but
does NOT understand, the low-level stuff which makes up pages or
defines fonts, etc.
If anybody else wants to delve into the PDF deeper, this
code would make a good starting point.
The code here contains both the PDF parsing code plus two
different ways to run it, both a command line interface
and a GUI interface. In either case, it displays a PDF's
catalog and root of the page layout tree, plus the meta-data
for a specified page.
----
======
##+##########################################################################
#
# Parsing PDF -- decodes PDF document
# by Keith Vetter. October 2007
#
# http://partners.adobe.com/asn/tech/pdf/specifications.jsp
# http://www.adobe.com/devnet/pdf/pdfs/PDFReference13.pdf
proc ParsePDF {} {
global D CATALOG PAGES PAGE CONTENT INFO
unset -nocomplain D
unset -nocomplain CATALOG
unset -nocomplain PAGES
unset -nocomplain PAGE
unset -nocomplain CONTENT
unset -nocomplain INFO
CheckValidPDF
GetTrailer
ReadXREF
ReadObject $D(root) CATALOG
ReadObject $CATALOG(/Pages) PAGES
if {[info exists D(info)]} { ReadObject $D(info) INFO }
}
proc ParsePDF.File {fname} {
set fin [open $fname r]
fconfigure $fin -translation binary
SetText [read $fin]
close $fin
ParsePDF
}
##+##########################################################################
#
# CheckValidPDF -- does some simple sanity checks then reads in the
# cross-reference location.
#
proc CheckValidPDF {} {
global data D
if {[string range $data 0 4] ne "%PDF-"} { error "Mal-formed PDF 1" }
set tail [string range $data end-1024 end] ; list
if {! [regexp {%%EOF[\r\n]*$} $tail]} { error "Mal-formed PDF 2" }
set n [regexp {[\n\r]+startxref *[\n\r]+(\d+)} $tail -> fpos]
if {! $n} { error "Mal-formed PDF 3" }
set D(xref,fpos) $fpos
}
##+##########################################################################
#
# GetTrailer -- reads key info out of the trailer section.
# Not sure this is correct with incrementally updated pdf files.
#
proc GetTrailer {} {
global data D
# Is this the best way to locate the trailer??? especially with
# multiple trailers with incremental updates
set offset [string first trailer $data $D(xref,fpos)]
SetPosition $offset
set token [GetToken] ;# Trailer keyword
set token [GetToken] ;# Start of dict
set dict [ReadDict]
unset -nocomplain tmp
array set tmp $dict
if {! [info exists tmp(/Size)]} { error "trailer missing /Size" }
if {! [info exists tmp(/Root)]} { error "trailer missing /Root" }
set D(xref,cnt) $tmp(/Size)
set D(root) $tmp(/Root)
if {[info exists tmp(/Prev)]} { set D(prev) $tmp(/Prev) }
if {[info exists tmp(/Info)]} { set D(info) $tmp(/Info) }
}
##+##########################################################################
#
# ReadXREF -- reads in the main and possibly prev XREF section.
# NB. can there be more than 2?
#
proc ReadXREF {} {
global D
set offsets $D(xref,fpos)
if {[info exists D(prev)]} { lappend offsets $D(prev) }
foreach offset $offsets {
SetPosition $offset
GetToken
set token [GetToken]
foreach {start cnt} [split $token " "] break
for {set i 0} {$i < $cnt} {incr i} {
set id [expr {$start + $i}]
set token [GetToken]
scan $token "%ld %d %s" offset generation inuse
if {$inuse eq "f"} continue
set D(xref,$id,offset) $offset
}
}
}
##+##########################################################################
#
# ReadObject -- assumes an object is just a dictionary, stores the
# dictionary into specified array
#
proc ReadObject {objID arrName} {
upvar 1 $arrName arr
unset -nocomplain arr
PositionAtObject $objID
set token [GetToken]
if {$token ne "<<"} { error "Object $objID not a dictionary" }
set dict [ReadDict]
array set arr $dict
}
proc ShowObject {id title} {
ReadObject $id tmp
myParray tmp $title
}
##+##########################################################################
#
# GetPage -- Searches and extracts a given document page
#
proc GetPage {num} {
global PAGE CONTENT
set id [FindPage $num]
if {$id eq ""} {return ""} ;# No page
ReadObject $id PAGE
# Content can be quite complex, just try for the simple case
if {[catch {ReadObject $PAGE(/Contents) CONTENT}]} {return $id}
if {! [string is integer -strict $CONTENT(/Length)]} { ;# Deref length
PositionAtObject $CONTENT(/Length)
set len [GetToken]
append CONTENT(/Length) " => $len"
}
return $id
}
##+##########################################################################
#
# FindPage -- walks the page tree to find a given page
#
proc FindPage {pageNum} {
global PAGES
if {$pageNum < 1 || $pageNum > $PAGES(/Count)} {
return ""
error "Page number too big"
}
set thisPage 1
set kids [eval concat $PAGES(/Kids)]
while {1} {
foreach {id gen R} $kids {
ReadObject $id child
if {$child(/Type) eq "/Page"} {
if {$thisPage == $pageNum} { return $id }
incr thisPage
continue
}
set lastPage [expr {$thisPage + $child(/Count)}]
if {$pageNum < $lastPage} { ;# Go down a level
set kids [eval concat $child(/Kids)]
break
} else {
set thisPage $lastPage
}
}
}
}
##+##########################################################################
#
# PositionAtObject -- "seeks" to position of a direct or indirect reference
#
proc PositionAtObject {ref} {
set id [lindex $ref 0]
SetPosition $::D(xref,$id,offset)
set token [GetToken] ;# Object define
}
##+##########################################################################
#
# myParray -- specialized version of parray pretty printing an array
#
proc myParray {arrName {title ""} {pattern *}} {
upvar 1 $arrName arr
if {$title eq ""} { set title $arrName }
puts $title
set max 0
foreach name [array names arr $pattern] {
if {[string length $name] > $max} {
set max [string length $name]
}
}
incr max 1
if {[info exists arr(/Type)]} {
puts [format " %-*s = %s" $max /Type $arr(/Type)]
}
foreach name [lsort -dictionary [array names arr $pattern]] {
if {$name eq "/Type"} continue
puts [format " %-*s = %s" $max $name $arr($name)]
}
puts ""
}
##+##########################################################################
#
# ReadArray -- returns a list of everything up to
# the matching closing array character.
#
# NB. can be nested w/ arrays and dictionaries
#
proc ReadArray {} {
set arr {}
while {1} {
set token [GetToken]
if {$token eq "]"} break
if {$token eq "\["} {
set token [ReadArray]
} elseif {$token eq "<<"} {
set token [ReadDict]
}
lappend arr $token
}
return $arr
}
##+##########################################################################
#
# ReadDict -- returns a list of everything up to
# the matching closing dictionary character.
#
# NB. can be nested w/ arrays and dictionaries
#
proc ReadDict {} {
set dict {}
while {1} {
set name [GetToken]
if {$name eq ">>"} break
set value [GetToken]
if {$value eq "<<"} {
set value [ReadDict]
} elseif {$value eq "\["} {
set value [ReadArray]
}
lappend dict $name $value
}
return $dict
}
##+##########################################################################
#
# GetToken -- breaks input stream into PDF tokens, worrying about
# strings, boolean, hex sequences, names, arrays, dictionaries and
# atoms.
#
proc GetToken {} {
set eotoken "()<>\[]{}/%"
# Past white space and comments
set inComment false
while {1} {
set ch [GetChar]
if {$inComment} {
if {$ch eq "\n" || $ch eq "\r"} {
set inComment false
}
} elseif {$ch eq "%"} {
set inComment true
} elseif {! [string is space $ch]} break
}
set start $::D(pos)
while {1} {
if {$ch eq "(" || $ch eq "\{"} { ;# In a string or boolean
set backslash false
set match [expr {$ch eq "(" ? ")" : "\}"}]
while {1} {
set ch [GetChar]
if {$ch eq "\\"} {
set backslash true
} elseif {! $backslash && $ch eq $match} {
break
} else {
set backslash false
}
}
break
}
if {$ch eq "\[" || $ch eq "]"} break
if {$ch eq "<"} { ;# Hex or dictionary
set ch [GetChar]
if {$ch eq "<"} break ;# Dictionary while {1$ch ne ">"} {
set ch [GetChar] if {$ch eq ">"} break
}
break
}
if {$ch eq ">"} {
GetChar
break
}
set name [expr {$ch eq "/"}]
while {1} {
set ch [GetChar]
if {$ch eq "\n" || $ch eq "\r"} break
if {$name && [string is space $ch]} break
if {[string first $ch $eotoken] != -1} break
}
UnGetChar
break
}
set ::D(token,start) $start
set ::D(token,end) $::D(pos)
set token [string range $::data $start $::D(pos)]
if {[regexp {\((.[^\)]*)\)} $token -> string]} {
set token $string } elseif {[regexp {^<([^<]?[^>]*)>[^>]?$} $token -> hexstring]} {
set token [decodeHexstring $hexstring]
}
return $token
}
proc GetChar {} {
set ch [string index $::data [incr ::D(pos)]]
return $ch
}
proc UnGetChar {} {
incr ::D(pos) -1
}
proc SetPosition {pos} {
set ::D(pos) [expr {$pos-1}]
}
proc SetText {txt} {
set ::data $txt
set ::D(pos) -1
}
proc decodeHexstring {hexstring} {
if {[string tolower [string range $hexstring 0 3]] eq "feff"} {
set string [string range $hexstring 4 end]
set result {}
for {set i 0} {$i < [string length $string]} {incr i 4} {
set hex [string range $string $i [expr $i + 3]]
append result [encoding convertfrom unicode [binary format s 0x$hex]]
}
return $result
} else {
# don't know what to do
return $hexstring
}
}
################################################################
#
# DEMO CODE
#
proc NoGUI {} {
global argc argv
global CATALOG PAGES PAGE CONTENT INFO
if {$argc == 0 || $argc > 2} {
puts stderr "usage: parsePDF <pdf file> ?page number?"
exit
} elseif {$argc == 1} {
lappend argv 1
}
foreach {fname pageNum} $argv break
puts "Parsing [file tail $fname]\n"
ParsePDF.File $fname
myParray CATALOG Catalog
myParray PAGES "Page Tree"
GetPage $pageNum
myParray PAGE "Page $pageNum Metadata"
myParray CONTENT "Page $pageNum Contents"
myParray INFO "Info"
}
if {[lindex $argv 0] eq "-gui"} {
catch {package require Tk}
set argv [lrange $argv 1 end]
}
if {! [info exists tk_version]} {
NoGUI
exit
}
################################################################
#
# GUI DEMO CODE
#
if {! [catch {package require tile}]} {
namespace import -force ::ttk::button
}
set S(pageNum) 1
set S(havePDF) 0
set S(fname) ""
proc DoDisplay {} {
global S argc argv
set S(fname) ""
if {$argc > 0} {
set S(fname) [lindex $argv 0]
}
wm title . "Parsing PDF"
image create photo ::img::question -data {
R0lGODlhCQAMALMAAODf49PS1oeHiURDRQAAAJOSlVtbXG1sbre2uczLz3t7fa6usf//////////
/////yH5BAEAAAAALAAAAAAJAAwAAwQmEAQxyCggGxwKwcAxZABBkChhoBlCIGwywKyysNmJ79lx
7D4eKwIAOw==
}
image create bitmap ::img::dots -data {
#define dots_width 16
#define dots_height 9
static char dots_bits = {
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x84, 0x10, 0xce, 0x39, 0x84, 0x10}}
image create bitmap ::img::star -data {
#define plus_width 11
#define plus_height 9
static char plus_bits[] = {
0x00,0x00, 0x24,0x01, 0xa8,0x00, 0x70,0x00, 0xfc,0x01,
0x70,0x00, 0xa8,0x00, 0x24,0x01, 0x00,0x00 }}
frame .file -bd 2 -relief ridge
frame .finfo -bd 2 -relief ridge
frame .fcat -bd 2 -relief ridge
frame .fpages -bd 2 -relief ridge
frame .fpage -bd 2 -relief ridge
# File area
label .file.l -text "PDF File"
entry .file.fname -textvariable S(fname)
button .file.choose -image ::img::dots -command ChooseFile
button .file.go -text Parse -command Go
button .file.about -image ::img::question -command About
grid .file.l .file.fname .file.choose -sticky ns -pady 5
grid .file.fname -sticky nswe
grid .file.go - - -pady 5
grid columnconfigure .file 1 -weight 1
place .file.about -anchor se -relx 1 -rely 1 -x -5 -y -5
# Info area
label .finfo.title -text "Info"
scrollbar .finfo.sb -orient vertical -command {.finfo.info yview}
listbox .finfo.info -width 40 -listvariable S(info) -yscroll {.finfo.sb set}
# Catalog and data area
label .fcat.title -text "Catalog"
scrollbar .fcat.sb -orient vertical -command {.fcat.catalog yview}
listbox .fcat.catalog -listvariable S(catalog) -yscroll {.fcat.sb set}
label .fpages.title -text "Page Tree Root"
scrollbar .fpages.sb -orient vertical -command {.fpages.pages yview}
listbox .fpages.pages -listvariable S(pages) -yscroll {.fpages.sb set}
frame .fpage.btns
label .fpage.l -text "Page"
entry .fpage.num -width 3 -textvariable S(pageNum) -justify center \
-validate key -vcmd {string is integer -strict %P}
button .fpage.go -image ::img::star -state disabled -command GoPage
scrollbar .fpage.sb -orient vertical -command {.fpage.page yview}
listbox .fpage.page -listvariable S(page) -yscroll {.fpage.sb set}
grid .file - - - -sticky ew
grid .finfo .fcat .fpages .fpage -sticky news
grid columnconfigure . {1 2 3} -weight 1
grid columnconfigure . 0 -weight 2
grid rowconfigure . {1} -weight 1
pack .finfo.title -side top -fill x
pack .finfo.sb -side right -fill y
pack .finfo.info -side top -fill both -expand 1
pack .fcat.title -side top -fill x
pack .fcat.sb -side right -fill y
pack .fcat.catalog -side top -fill both -expand 1
pack .fpages.title -side top -fill x
pack .fpages.sb -side right -fill y
pack .fpages.pages -side top -fill both -expand 1
pack .fpage.btns -side top
pack .fpage.l .fpage.num .fpage.go -in .fpage.btns -side left
pack .fpage.sb -side right -fill y
pack .fpage.page -side top -fill both -expand 1
#grid x .fpage.l .fpage.num .fpage.go x
#grid configure .fpage.go -padx 5 -sticky ns
#grid .fpage.page - - - - .fpage.sb -sticky news
#grid columnconfigure .fpage {0 4} -weight 1
#grid rowconfigure .fpage {1} -weight 1
foreach var {fname pageNum havePDF} {
foreach t [trace info variable S($var)] {
eval trace remove variable S($var) $t
} trace add variable S($var) write Tracer
}
set pdfs [glob -nocomplain *.pdf]
if {$S(fname) eq "" && [llength $pdfs] > 0} {
set S(fname) [lindex $pdfs [expr {int(rand()*[llength $pdfs])}]]
}
if {$S(fname) ne ""} {
Go
}
}
proc Tracer {var1 var2 op} {
global S
if {$var1 ne "S"} return
if {$var2 eq "fname"} {
.file.go config -state \
[expr {[file isfile $S(fname)] ? "normal" : "disabled"}]
} elseif {$var2 eq "pageNum" || $var2 eq "havePDF"} {
set how disabled
if {$S(havePDF) && [string is integer -strict $S(pageNum)]} {
set how normal
}
.fpage.go config -state $how
}
}
proc ChooseFile {} {
global S
set fname [tk_getOpenFile -parent . -defaultextension ".pdf" \
-initialfile $S(fname) -filetypes {{{PDF Files} .pdf}} \
-title "PDF Parser Open File"]
if {$fname eq ""} return
set S(fname) $fname
}
proc ArrayToListBox {arrName var} {
global S
upvar 1 $arrName arr
set S($var) {}
set max 0
foreach name [array names arr] {
if {[string length $name] > $max} {
set max [string length $name]
}
}
incr max 1
if {[info exists arr(/Type)]} {
lappend S($var) [format " %-*s = %s" $max /Type $arr(/Type)]
}
foreach name [lsort -dictionary [array names arr]] {
if {$name eq "/Type"} continue
lappend S($var) [format " %-*s = %s" $max $name $arr($name)]
}
}
proc Go {} {
global S CATALOG PAGES INFO
set S(catalog) [set S(pages) [set S(page) {}]]
if {! [file isfile $S(fname)]} return
ParsePDF.File $S(fname)
ArrayToListBox INFO info
ArrayToListBox CATALOG catalog
ArrayToListBox PAGES pages
set S(havePDF) 1
GoPage
}
proc About {} {
set txt "Parsing PDF\nby Keith Vetter, Oct 2007\n\n"
append txt "This program demos code that can parse PDF files. The\n"
append txt "code displays a pdf's document structure including its\n"
append txt "catalog and the root of its page tree. It will also show\n"
append txt "the page object for any page.\n\n"
append txt "For more info on PDF format see\n"
append txt "http://partners.adobe.com/asn/tech/pdf/specifications.jsp"
tk_messageBox -message $txt
}
proc GoPage {} {
global S PAGES PAGE
if {! $S(havePDF)} return
if {! [string is integer -strict $S(pageNum)]} return
if {$S(pageNum) < 1} {
set S(page) {{Page number too small}}
return
} elseif {$S(pageNum) > $PAGES(/Count)} {
set a [expr {$PAGES(/Count) == 1 ? "page" : "pages"}]
set S(page) [list [list Only $PAGES(/Count) $a]]
return
}
set id [GetPage $S(pageNum)]
if {$id eq ""} {
set S(page) "{bad page number}"
} else {
ArrayToListBox PAGE page
set S(page) [concat [list "Object id: $id"] {---------} $S(page)]
}
}
DoDisplay
return
======
[UKo] 2008-05-31: I have added some code to get the info block from the pdf-file and cleaned up the code a bit, so that it can be started with a recent tclkit (tested tclkit 8.4 and kbskit 8.5). Also the strings from the info block can be decoded.
Invocation: parsePDF.tcl ?-gui? ?file.pdf?
Without the ''-gui'' option the information will be displayed on the commandline. With ''-gui'' a given pdf-file will be analysed, without a file a random file from the current directory will be chosen.
----
[Andreas otto] 2007-10-21 : other software solution [pdflib]
[KPV]: Note, [pdflib] is a commercial, non-TCL (ANSI C) product
----
EE 2008-03-05: NOTE that the above URL pointing at the PDF
specification points into a subscriber-only area on the Adobe
website. The PDF spec is publicly available at
http://www.adobe.com/devnet/pdf/pdf_reference.html
----
[Scott Beasley] 2008-06-01: I started on a PDF Page splitter in tcl, but dropped it in favor of the iText Java Class library [http://itextpdf.sourceforge.net/] and [TclBlend]. The iText code does so much and has worked so well, it was hard to pass up. I know in the long run for many people though, a tcl only solution would be the best for them.
[ABU] 2008-06-03: I agree with Scott. iText is far more complete and it is both for reading and generating PDF. iText + TclBlend is a good solution for tclers. Just two little things are missing :
1. TclBlend is paranoic about the tcltk interpreter it is intended to run with - if you change the interpreter, you should rebuild the tclblend binaries.
1. iText Java API is really complex and huge, it exposes an intricated mixin of low-level functions (..methods) and high-level functions. It should be useful to redesign a more smart, high-level API in.
[CMcC] 3Jun08 I would prefer a tcl solution - after all, there's already a PDF parser in C without going to Java, if that were the only dimension of solution.
[RKZn] 2019-07-20: In 2019 this is still a nice module to learn about PDF structure, though it is outdated. For starters it only handles plain text XREFs (cross-references) while nowadays some get encoded in binary; and not all indirect objects are PDF dictionaries. I've been using it to learn and hopefully make a small module to make a few simple edits. Maybe... So far I noticed a couple of bugs here, that would show up when the document has a empty hexadecimal encoded string, i.e. a `<>`. So I changed the code above (at GetToken) from
======tcl
while {1} {
set ch [GetChar]
if {$ch eq ">"} break
}
======
to
======tcl
while {$ch ne ">"} {
set ch [GetChar]
}
======
and from
======tcl
} elseif {[regexp {^<([^<][^>]*)>[^>]?$} $token -> hexstring]} {
======
to (added a `?`)
======tcl
} elseif {[regexp {^<([^<]?[^>]*)>[^>]?$} $token -> hexstring]} {
======
while at it, I updated (at DoDisplay) the deprecated call
======tcl
trace variable S($var) w Tracer
======
to
======tcl
trace add variable S($var) write Tracer
======
<<categories>> Printing | Parsing | PDF