PDF

'''[https://en.wikipedia.org/wiki/Portable_Document_Format%|%PDF]''', or '''Portable Document Format''', by [http://www.adobe.com/%|%Adobe], is a [https://en.wikipedia.org/wiki/Page_description_language%|%page description language] derived from [Postscript].  In 2008, PDF became [ISO] standard 32000-1:2008, based on PDF version 1.7.



** Documentation **

   [http://www.adobe.com/devnet/pdf/pdf_reference.html%|%PDF Reference and Adobe Extensions to the PDF Specification]:   Official reference collection.


   [http://phaseit.net/claird/comp.text.pdf/PDF.html%|%Personal Notes on PDF], [Cameron Laird]:   



** Tcl and PDF, an Overview **

[VI]: There are two problems when using PDFs.  One is that PDF is
page based, like [PostScript], so pagination is the application's headache.
Add headers, footers, tables that span pages, and this soon gets
to be like a mini typesetting program.  Doing a canvas on one page
(modulo the next para) or a text widget onto one page seems doable.

The other is that I haven't figured out how to wrap text consistently.
e.g. if a text item uses the `-width` option in the canvas, then I
don't know how to decide where to break the text.  I have read about
this, and I've heard that this is possible in 8.4.5 or 8.5, but I
still don't know how.
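
One way to approximate the wrapping is a greedy word wrap driven by a width-measuring command. The following is only a sketch: `wrapText` and its arguments are made-up names, and the canvas's own line-breaking rules may differ in detail. In Tk the measuring command could be `[list font measure $font]`; any command prefix that returns a width for a string will do.

======
# Greedy word wrap (requires Tcl 8.5 or later for {*}).
# measureCmd is a command prefix that returns the width of a string,
# e.g. [list font measure $font] when Tk is available.
proc wrapText {text width measureCmd} {
    set lines {}
    foreach para [split $text \n] {
        set line {}
        foreach word [regexp -all -inline {\S+} $para] {
            if {$line eq {}} {
                # First word on a line is always accepted.
                set line $word
                continue
            }
            set candidate "$line $word"
            if {[{*}$measureCmd $candidate] > $width} {
                # Word does not fit: close the line, start a new one.
                lappend lines $line
                set line $word
            } else {
                set line $candidate
            }
        }
        lappend lines $line
    }
    return $lines
}
======

Each element of the result could then be written to the PDF page as its own line of text. Without Tk, `{string length}` can stand in as the measuring command to wrap by character count.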

From the packages below it is clear that the only really missing thing is the conversion of the text widget to PDF. This has not yet been done and working with fonts, type, and glyphs is perhaps also one of the most difficult parts of a PDF file. Further, it seems that the Tk text widget lacks some [introspection] features that are needed when converting a text widget to PDF.



** Tools: Converters **

   [images2pdf]:   A simple utility to bundle images into a PDF document without transcoding the images. Uses [pdf4tcl].

   [Trampoline!], by [Mac Cody]:   Aims to provide a means to export the contents of a canvas widget to a PDF document (see [http://trampoline.sourceforge.net]).  The internal procedures within Trampoline! can also be used to generate PDF documents without Tk being present.  The plan is to eventually split Trampoline! into two separate packages.  Trampoline! will contain the Tk-dependent components that will call another package, called TclPDF, that will contain Tk-independent functionalities for generating PDF documents.

   [Text2PDF], by [KPV]:   Produces PDF from some [ASCII] text.

   [http://www.assembla.com/wiki/show/tktopdf%|%tktopdf]:   A [Project Ideas for Google Summer of Code 2008%|%GSoC 2008] project to create PDF from Tk windows (excluding ttk and the text widget). It is based on a library from [Clif Flynt], who also worked on generating PDFs from Tcl.

   [http://noucorp.com/Tcl/applications/postcard/index.htm%|%Postcard] ([starkit]), by [Clif Flynt]:   Contains a subset of a canvas->PDF script which produces lines, text, and nested canvases.


   [http://ftp.ctan.org/tex-archive/macros/latex/contrib/tclldoc/examples/%|%writepdf], by [Lars H]:   A [pure-Tcl] package that can generate well-formed PDF files.  The API is roughly at the same level of abstraction as writing raw [postscript] with [puts], so you need to be familiar with the PDF specification to use it.  '''anonymous''': ''It would be great to have something sitting on top of writepdf which can generate correct PDF file from a text widget and a canvas...It seems like this ought to be possible in pure Tcl.'' [Lars H]: It is certainly possible (at least to the same extent as the analogous generation of PS files). I suspect the main difficulty would be how to handle text, or more precisely, select a font that matches that used in the original widget. My more immediate interest was otherwise to be able to use PDF as an "export format" for graphics not really having anything to do with [Tk], but I don't know when I can return to even that part of the project.



** Tools: Editors **

   [pdf4tcl], by Frank Richter, maintained by [Peter Spjuth]:   A [BSD license%|%BSD-licensed] [pure-Tcl] package which started as a port of pdf4php.  It has a friendly Tclish interface and supports a fair number of PDF features.

   [pdflib]:   An [ANSI] C library for creating new PDF files. Bindings for a number of languages, including Tcl, are available.  This package is not designed to access and update existing PDF files.  PDFLib is not entirely free. Notice at the heart of PDFLib's [http://www.pdflib.com/purchase/howtobuy.html%|%licensing] this summary:  "You may test and fully include our software in your projects, but you need a valid license key to remove the 'www.pdflib.com' demo stamp on all generated pages."



** Tools:  Rendering **

   [tclMuPdf]:   



** Other Tools **

   http://spivey.oriel.ox.ac.uk/corner/Obfuscated_PDF?195#Scramble%|%Obfuscated PDF%|%:   Provides a Tcl script that scrambles the encodings of a PDF file so that it is legible when displayed, but delivers gibberish when attempts are made to copy and paste text from the document.



** To Sort: Other Utilities **

*** Tclhpdf ***

 What: Tclhpdf
 Where: http://reddog.s35.xrea.com/wiki/tclhpdf.html
 Description: Tcl interface to HARU Free PDF Library
 Updated: 2007-09  (Added by Roy Terry see comp.lang.tcl posting: 2007-09-22)
 Contact: [email protected]

*** TEItools ***

 What: TEItools
 Where: http://xtalk.price.ru/SGML/TEItools/
 Description: Collection of Tcl and other tool scripts used for
        transforming SGML documents to various other formats.  Currently
        supports HTML, LaTeX2e, RTF, PS, and PDF.  Uses CoST.
 Updated: 04/1998
 Contact: mailto:[email protected]

*** WELD ***

 What:  WELD
 Where: http://www.javafoundry.com/javapdf
 Description: Web Enabled Logic Documents (WELD) make use of Jacl/Tcl to
        provide the ability to hold WWW applications together.  WELD is
        a part of JavaPDF (the open source servlet utilities project).
 Updated: 2001-08
 Contact: mailto:[email protected] (Lee T  Hall)


*** Epspdf ***

http://tex.aanhet.net/epspdf/%|%Epspdf%|% is a utility for converting between PDF, [eps] and [PostScript] formats.  It can be run via command line or with a Tk GUI.  Packaged as a [Starpack].



** Other ways of producing PDF Documents **

[Sven Sass%|%Sven Sass's] employer generates PDF by way of [Java] classes
converting [Docbook] to [FO] to PDF (!).

[Perl] and [Python] do PDF.  For example, see [https://web.archive.org/web/20060830180714/http://www.unixreview.com/documents/s=7822/ur0304g/%|%Regular Expressions: Low-cost PDF], [Cameron Laird] and Kathryn Soraiz, [https://en.wikipedia.org/wiki/UNIX_Review%|%Unix Review], 2003-04. Tcl, of course, has ways to exploit anything available through those languages, including [Tclperl] and [Tclpython].

Another well-worn pathway involves [transforming HTML to PDF].

[GS] 2003-11-29: [https://web.archive.org/web/20091025004408/http://wwwvms.mppmu.mpg.de/vmssig/src/C/txt2pdf.c%|%txt2pdf] is the [C] source of a simple program that converts text files to PDF.

[Googie]:  [Image Magick] contains the '''import''' tool, which can be run via [exec] to capture a [wish] window and save it as PDF. [TclMagick] can be used instead of [exec].


[tonytraductor]:  In [http://www.linguasos.org/tcltext.html%|%TickleText] I just did:

======
proc pdfout {} {
    # Ask for a file name if none has been chosen yet.
    if {[string trim $::filename] eq {}} {
        set ::filename [tk_getSaveFile -filetypes $::file_types]
        if {$::filename eq {}} return
        wm title . "Now Tickling: $::filename"
    }
    # Save the text widget contents.
    set fileid [open $::filename w]
    puts -nonewline $fileid [.txt.txt get 1.0 {end -1c}]
    close $fileid
    # Run the steps one after another (no &): each step needs the
    # output of the previous one.
    exec enscript -q -B -p $::filename.ps $::filename
    exec ps2pdf $::filename.ps $::filename.pdf
    file delete $::filename.ps
}
======

but, honestly, I don't know if enscript works on Windows.
Or for exporting .tex files to pdf I just used pdflatex in a similar manner.

======
proc texpdf {} {
    # Ask for a file name if none has been chosen yet.
    if {[string trim $::filename] eq {}} {
        set ::filename [tk_getSaveFile -filetypes $::file_types]
        if {$::filename eq {}} return
        wm title . "Now Tickling: $::filename"
    }
    # Save the text widget contents, then run pdflatex on the file.
    set fileid [open $::filename w]
    puts -nonewline $fileid [.txt.txt get 1.0 {end -1c}]
    close $fileid
    exec pdflatex $::filename
}
======

Of course, I'm a total newbie, so, perhaps I'm missing something here, or maybe
there is a more efficient way to do this.  I don't know.
It does work, though.
The [TeX] to PDF procedure complains about the period in the filename, but, 
a friend gave me a perl script to shave off the extension, so, I think
I can fix that...but with a perl script.
Either way, despite the complaint, it generates a pdf file every time.
I have posted samples to the Tickle Text site.
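
The extension can also be removed in Tcl itself with the built-in `file rootname`, so no external script is needed for that step. A small sketch (the proc name `pdfNameFor` and the example path are made up for illustration):

======
# Derive the name of the PDF that corresponds to a source file
# by replacing its last extension with .pdf.
proc pdfNameFor {path} {
    return [file rootname $path].pdf
}

# Hypothetical example path:
puts [pdfNameFor /home/user/notes.tex]
======

`file rootname` only strips the last extension, so a name such as `my.notes.tex` becomes `my.notes.pdf`.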



** Reading PDF Documents with Tcl **

The page [Parsing PDF] has some code for reading PDF files.

http://www.pstoedit.net%|%pstoedit%|% is a C++ program that translates PostScript and PDF graphics into other vector formats. Tk canvas is an output option.



** Format Details **

The content of a PDF file has four sections: header, body, cross references, and trailer. In essence, a PDF file consists of so-called objects which describe what the document should look like and how they are inter-related. Besides the objects, a PDF file contains some information on how and where to find these objects. The small examples below show a very simple PDF file. Apart from that, PDF files can contain binary data, can be encrypted, may be "linearized" (for quick access over the web), and can even contain different versions of the same document in an incremental way.


*** Header ***

The header consists of one line like this:
======
%PDF-1.4
======
The number denotes the version of the file format. If the version specified is 1.4 or higher, the version specification can be overridden by a '''Version''' entry in the document's catalog dictionary.

The "1.4" in this example represents version 1.4 of the Adobe Portable Document Format, which is the version used by Acrobat 5. Within the 1.x series the highest valid value is "1.7" (Acrobat 8), the version that became an ISO standard (ISO 32000-1:2008). After version 1.7, additional feature sets used by Acrobat are indicated by an '''Extensions''' dictionary in the catalog dictionary, which carries an '''ExtensionLevel''' key. PDF 2.0 (ISO 32000-2), published in 2017, uses the header `%PDF-2.0`.
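
As an illustration, the version can be pulled out of the header with a regular expression. This is a minimal sketch, not a validator: the proc name `pdfHeaderVersion` is made up, it looks only at the first 1024 bytes (an assumption, made because some viewers tolerate leading junk before the header), and it ignores any '''Version''' entry in the catalog.

======
# Return {major minor} from the %PDF-x.y header, or an empty list
# if no header is found in the first 1024 bytes of the data.
proc pdfHeaderVersion {data} {
    set head [string range $data 0 1023]
    if {[regexp {%PDF-(\d+)\.(\d+)} $head -> major minor]} {
        return [list $major $minor]
    }
    return {}
}

lassign [pdfHeaderVersion "%PDF-1.4\n"] major minor
puts "PDF version $major.$minor"
======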

*** Body ***

The body is a collection of so-called PDF objects. These carry the content which is displayed in a PDF viewer application. An object can look like this:
======
3 0 obj
<<
/Type /Page
/Parent 2 0 R
/Resources 4 0 R
/MediaBox [0 0 595 842]
/Contents [6 0 R]
>>
endobj
======
An object starts with a line “`<object number> <generation number> obj`” and ends with “`endobj`”. In between there can be nearly anything, depending on the type of the object. The example above shows a typical object containing a PDF dictionary (which is like a Tcl [array] or [dict%|%dictionary%|%]) with keys and corresponding values. The values can be arrays, numbers, strings, and so on, and even other dictionaries. PDF dictionaries can be said to contain metadata about the document and the content of the file. The real data are located in so-called streams:
======
8 0 obj
<<
/Length 32
>>
stream
10 10 m
10 100 l
100 100 l S
endstream
endobj 
======
The above object contains a stream that paints two lines on a page.
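
A sketch of how such a stream object could be assembled in Tcl, with the `/Length` value computed from the data instead of being typed in by hand. The proc name `pdfStreamObject` is made up, the generation number is fixed at 0, and the data is assumed to be plain ASCII or a byte string, so that `string length` counts bytes. The end-of-line written before `endstream` is not counted in `/Length`.

======
# Build an uncompressed stream object with object number num.
# content is the stream data without a trailing newline.
proc pdfStreamObject {num content} {
    set length [string length $content]
    set obj "$num 0 obj\n<<\n/Length $length\n>>\nstream\n"
    append obj $content "\nendstream\nendobj\n"
    return $obj
}

# Two connected line segments, as in the example above.
puts -nonewline [pdfStreamObject 8 "10 10 m\n10 100 l\n100 100 l S"]
======

The exact length depends on the line endings actually written inside the stream data, so it is safer to compute it than to hard-code it.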

*** Cross-references ***

This section contains the byte offsets for all objects in a PDF file. Like this:
======
xref
0 10
0000000005 65535 f 
0000000010 00000 n 
0000000060 00000 n 
0000000124 00000 n 
0000000231 00000 n 
0000000000 00001 f 
0000000299 00000 n 
0000000424 00000 n 
0000000542 00000 n 
0000000625 00000 n 
======
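In this example the table has 10 entries, for object numbers 0 through 9. Entries ending in `n` give the byte offset of an object that is in use; entries ending in `f` are free (here objects 0 and 5; object 0 is always the head of the free list). This section of the PDF file makes it possible for viewer applications to quickly locate the objects needed to paint a page's content to the screen without searching through a possibly huge file. You can think of it as a poor man's database lookup table.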
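
Each entry in the table is exactly 20 bytes long: a 10-digit offset, a space, a 5-digit generation number, a space, the letter `n` (in use) or `f` (free), and a two-character end-of-line. The following sketch formats and parses single entries; the proc names are made up for illustration, and a space followed by a linefeed is used as the end-of-line.

======
# Format one 20-byte cross-reference entry.
proc xrefEntry {offset generation inUse} {
    if {$inUse} {set type n} else {set type f}
    return [format "%010d %05d %s \n" $offset $generation $type]
}

# Parse one entry into {offset generation type}.
proc parseXrefEntry {entry} {
    if {![regexp {^(\d{10}) (\d{5}) ([nf])} $entry -> offset gen type]} {
        error "malformed xref entry: $entry"
    }
    # scan %d reads the digits as decimal, so leading zeros are harmless.
    scan $offset %d offset
    scan $gen %d gen
    return [list $offset $gen $type]
}

puts -nonewline [xrefEntry 124 0 1]
puts [parseXrefEntry "0000000005 65535 f "]
======

The fixed entry size is what allows a reader to jump straight to the entry for a given object number without scanning the table.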

*** Trailer ***

The trailer is normally the first thing read by the viewer application:
======
trailer
<<
/Size 10
/Root 1 0 R
/Info 9 0 R
>>
startxref
811
%%EOF
======

The second-to-last line contains the byte offset of the start of the cross-reference section. The PDF dictionary before this offset number contains some information about the PDF. The “`/Root`” entry is a reference to the root object, which is the starting point for the viewer. The root object refers to the Pages object, which in turn refers to each Page object, and each Page object refers to the list of things to paint.
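
Because the trailer sits at the end of the file, a reader typically starts by looking at the last bytes. A minimal sketch, assuming the `startxref` keyword falls within the last 1024 bytes of the file (the proc name `pdfStartXref` and the window size are assumptions, and `max()` in `expr` needs Tcl 8.5 or later):

======
# Return the byte offset that follows the last "startxref" keyword.
proc pdfStartXref {path} {
    set f [open $path r]
    fconfigure $f -translation binary
    seek $f [expr {max(0, [file size $path] - 1024)}]
    set tail [read $f]
    close $f
    # Use the last occurrence: incremental updates append new trailers.
    set idx [string last startxref $tail]
    if {$idx < 0 || ![regexp {^startxref\s+(\d+)\s+%%EOF} \
            [string range $tail $idx end] -> offset]} {
        error "no startxref found in $path"
    }
    return $offset
}
======

With the offset in hand, the reader can `seek` to the cross-reference section and from there locate the root object.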

----

Version 1.5 of the PDF specification introduced "object streams" and "cross-reference streams". Object streams are used to reduce the size of the PDF file: most of the objects in the file are stored as the content of a stream object, and that stream can then be compressed. Objects stored in an object stream are called "compressed objects"; the PDF specification calls them that even when the stream is not actually compressed. Cross-reference streams make it possible to provide cross-reference information for compressed objects: for each such object, the cross-reference stream gives the number of the object stream that contains it and the object's index within that stream.

A PDF file that uses object and cross-reference streams need not have a cross-reference table, in which case it cannot be read by viewers that predate PDF 1.5. A PDF file that uses both a cross-reference stream and a cross-reference table is called a "hybrid" file. Such a document could be structured to contain different content depending on whether it is viewed by an older or newer PDF viewer.



** Discussions **

[The Mathematical Orchid]: Does anybody know of a pure Tcl tool to ''read'' PDF files? (For diagnostic purposes. I have a PDF file here that Acrobat can't open; even the newest version. But the author claims "it works fine on my PC".)

As far as I know, Tcl is able to handle binary files, so it should in principle be possible to do something like this. The only real difficulty is that parts of the file can be encrypted and/or compressed. (Anybody ever wrote a pure Tcl DEFLATE decompressor?)

Orchid, [CL] works often with PDF (and the people who work with PDF), and I suspect there's a 
problem "upstream".  Briefly, no, Tcl doesn't read PDF in the sense I believe you mean.  You're
welcome to e-mail or join the [Chat], for more detailed advice.

[Lars H]: One piece of practical advice (if you can't fix the problem upstream) is to try a different PDF viewer (e.g. ghostscript can handle not only [postscript] but also PDF, to mention the most cross-platform one). In my early experiments with generating PDF the files used to be viewable in everything I tried ''except'' Acrobat, and the error messages gave no information. (Later PDF specifications more properly described the issue that Acrobat got hung up on, though, so I managed to fix it.)

[DKF]: Check that the PDF file has been downloaded correctly. If it was downloaded as text, there can be corruption of the contents (which, unlike [PostScript], are binary).

[The Mathematical Orchid]: Already tried Ghostscript. Gives me some error about "JavaScript is required". The problem is a PDF file with some diagrams which refuse to display. (Half the diagram shows up, then some cryptic error message.) Downloaded several times, still won't display. Author insists they display with his copy of Acrobat. I'm guessing they have some extra software or something that they've forgotten they have. Well anyway, I've read the PDF spec (and the PS one, actually), so maybe I'll have a go at writing something myself. (As I say, I can read the file in Notepad, but the FLATE compressed parts are a tad difficult to follow. Most of a PDF file is actually plain ASCII until you compress it...)

[IDG]: Recent versions of acrobat have javascript built in, but it can be turned off in the preferences menu. Do you have an old version, or have you got javascript turned off?

[schlenk]: Decoding the pdf stream isn't very hard, but you're right that the compression is one thing that needs a bit more work. The rest is straightforward, unless you have edited pdfs with newer dicts. I have an example that reads PDF metadata with Tcl somewhere (need to search for it...); it was two hours' work or so.

[DKF]: BTW, [Changes in Tcl/Tk 8.6%|%8.6] has a built-in [zlib deflate] command.
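
A minimal sketch of that for Tcl 8.6, assuming the stream uses the FlateDecode filter without a predictor and that the raw bytes between `stream` and `endstream` have already been extracted. The proc name `inflatePdfStream` is made up; `zlib decompress` expects zlib-format data, which is what FlateDecode streams normally contain.

======
# Requires Tcl 8.6 for the built-in zlib command.
# data: the raw bytes of a FlateDecode stream.
proc inflatePdfStream {data} {
    return [zlib decompress $data]
}

# Round trip on a small content stream to show the idea.
set original "10 10 m\n10 100 l\n100 100 l S\n"
set compressed [zlib compress $original]
puts -nonewline [inflatePdfStream $compressed]
======

Streams that name other filters, or FlateDecode with predictor parameters, need additional decoding steps that this sketch does not cover.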



** Not PDF, but Anyway ... **

"[http://webpages.charter.net/jtlien/pdf.html%|%pdf417_encode] ([https://sourceforge.net/projects/pdf417encode/%|%sourceforge]) converts ASCII strings into pdf417 [barcode]". NOTE: PDF417 barcodes have nothing whatsoever to do with Adobe's Portable Document Format.



** PDF Forms **

[AMG]: Any advice on how to script filling in existing PDF forms?

<<categories>> Glossary | Documentation | Printing | Word and Text Processing | PDF