tclMuPdf

ABU 18-Aug-2024 - tclMuPDF 2.4.1 released. Added compatibility with Tcl9

ABU 12-Oct-2024 - tclMuPDF 2.4.3 released. BUGFIX: in some corner cases, crash when saving a PDF modified with the graft/embed methods. Now FIXED.

tclMuPDF is a Tcl integration of the MuPDF framework (see mupdf.com) for fast and high-quality rendering of PDF pages.

See also

  • muPdfWidget: a fully customizable widget for displaying PDF files
  • Highlighter: a small app for highlighting text in PDF documents
  • PiX: a small app for extracting images from PDFs
  • miniMu: a very simple PDF viewer built using tclMuPdf

saito - 2024-04-24

Recently I tried to use this package to extract images from a pdf file. I am on Windows and using mupdf version 2.1.1. I can get the list of images using the "$page images list" command and it seems to be correct. But when it comes to saving them ($page images extract ...), the images are not extracted. I have tried the different options available, including with and without image IDs, output directory, output image name, etc. In a few cases, the package actually segfaults. This was a single-page pdf with a couple of images.

Is this an intended use case of the package? I wonder if I am completely off the mark here or if I am just not using the commands correctly.

ABU Please, open a ticket at https://sourceforge.net/p/irrational-numbers/tickets/ , attaching that specific PDF, or you can try PiX; it is a small app with an (old) embedded tclMuPdf just for extracting all the images from PDF files.


sample
On this PDF page, the yellow annotations and the red seal were added with tclMuPDF.

Image tclmupdf-stamp

 History
  • 13-dec-2016 - Version 1.0b1 (beta) released. No support for MacOS
  • 15-dec-2016 - Version 1.0 - Support for MacOS. API unchanged, but big internal optimization for reusing opened pages (read
  • 28-jan-2017 - Version 1.1 (Withdrawn) - new commands: fields, field, anchor, mupdf::libinfo . Added package mupdf-notk for tcl-only usage. (read the documentation). Aligned with core library MuPDF v.1.10a
  • 23-feb-2017 - Version 1.2 (Withdrawn) with a lot of new features:
    • commands for extracting and searching text
    • commands for extracting images from PDF pages (experimental)
    • first steps towards PDF manipulations: you can add new signature fields, then save your changes.
    • many minor auxiliary commands
  • 26-feb-2017 - Version 1.2.1 - BugFixing - replaces Version 1.1 and 1.2
    • a bug related to the saveImage command, introduced in 1.1 was fixed here.
    • Versions 1.1 and 1.2 were withdrawn
  • 29-sep-2017 - Version 1.3 - image extraction
    • subcommands for extracting images, previously released as experimental, are now official, more powerful and thoroughly tested with many different kinds of images.
    • Aligned with core library MuPDF v.1.11
  • 29-oct-2017 - Version 1.4 - portfolio & password-protected files
    • commands for working with PDF-portfolio (embedded files)
    • ability to work with password-protected PDFs.
    • BUG-FIX removed limit to 130 char for extracting images (full path name length)
  • 20-Dec-2018 - Version 1.5 - field get&set, new export options, graft & embed pages.
    • Working with fields: now you can also change the fields values (and bug-fixed support for Unicode characters)
    • Saving/exporting PDF: added -decrypt and -flatten option
    • Grafting & embedding pages: you can graft a page taken from a PDF, and put it over/below another pages. (watermarks, stamps,.. )
  • 8-Jul-2020 - Version 1.6 - passwords, forms & fields, annotations, blocks & lines ...
    • Password protected files
      • added commands for adding/changing passwords on a PDF. See methods [pdfHandle upwd] and [pdfHandle opwd]
    • Forms and Fields
      • added commands for changing field attributes. See method [pdfHandle fieldattrib]
      • added commands for flattening single fields. See method [pdfHandle flattenfield]
    • Annotations
      • added methods for listing, adding, changing, removing annotations. Currently only a limited set of annotation types are supported: highlight, underline, strikeout, squiggly. See methods [pageHandle annots] and [pageHandle annot ...]
    • Page and text structure
      • added methods for extracting the image/text "blocks" and the bounding box of each text line. These methods are the basic building blocks - along with the new Annotation methods - for building an interactive PDF editor. See methods [pageHandle blocks] and [pageHandle lines]
  • 11-mar-2021 - Version 2.0 New Major release (broken compatibility with 1.x)
  • 10-aug-2021 - Version 2.1 (using MuPDF-core 1.18 features as of 2021-08-08)
    • text extraction
      • added options for extracting text and block/lines bounding-boxes.
    • BUG-FIX - fixed PDFColorToTkColor()
  • 20-aug-2021 - Version 2.1.1
    • Better control on MUPDF warning messages and repaired PDF
    • BUGFIX: flattening annotations may produce cropped appearances. Now FIXED
    • BUGFIX: (documentation) FIXED some wrong info about the "embed" and the "annot create stamp" methods
    • BUGFIX: saving the same file (incrementally) more than once in a session, may produce malformed PDF . Now FIXED
  • 28-jan-2023 - Version 2.2 - based on MuPDF 1.21.2
    • Option -overlay for rendering pages with a transparent background.
    • BUGFIX: Crash when closing some malformed PDF. Now FIXED
  • 8-feb-2023 - Version 2.3 - based on MuPDF 1.21.2
    • Extended methods for text-searching and text-extraction.
      • added option '-mode' to 'text' and 'newsearch' for fine control over the text-extraction logic (ligatures, de-hyphenate, whitespaces ...)
  • 14-jul-2023 - Version 2.4 - based on MuPDF 1.22.2
    • New command mupdf::new for creating a new empty PDF
    • New methods for adding/deleting/moving pages
    • New method addimage: insert a jpg/png image
    • BUGFIX: critical typo error in "graft" method (causing missing sync when a related doc is closed). NOW FIXED.
    • BUGFIX: saveImage : CRASH if "-from" is greater than the page size. Now FIXED
  • 18-aug-2024 - Version 2.4.1 - based on MuPDF 1.24.8
    • Adapted for Tcl9 compatibility.
      • Binaries for a given platform are delivered with two DLLs (one for Tcl 8.6+ and one for Tcl 9.0+)
    • Mupdf-core library upgraded to 1.24.8
    • Small internal changes for alignment with the new MuPdf-core API.
    • No changes in the tclMuPdf API (the reference manual for tclMuPdf 2.4 is still valid)
  • 31-aug-2024 - Version 2.4.2 - based on MuPDF 1.24.8
    • BUGFIX: interactive check of password protected PDFs, does not work with Tcl9. Now FIXED.
      • Added more tests for such cases
    • Code cleaning in XColors.c
  • 12-oct-2024 - Version 2.4.3 - based on MuPDF 1.24.8
    • BUGFIX: in some corner cases, crash when saving a PDF modified with the graft/embed methods. Now FIXED.
      • Added more tests for such cases
    • Small changes in the reference manual.

  Download

Starting from version 1.2.1, you can download tclMuPdf in a pre-built package with multi-platform support or, if you only need support for a single platform, you can download a lighter package.

Version 2.4.3 (12-Oct-2024)

  • [L1 ] FULL (Win x64, Linux x64, MacOS) (Warning: 39 MB)
  • [L2 ] Windows x64 (12 MB)
  • [L3 ] Linux x64 (13 MB)
  • [L4 ] Mac OS (13 MB)

Version 1.6 (July 2020) - Versions 1.x are no longer maintained -

Note that the platform label (e.g. "Linux 32") refers to the architecture of the Tcl/Tk interpreter, not to that of the host operating system. E.g. if you run a 32-bit Tcl/Tk interpreter on a 64-bit Linux, you need the tclMuPdf package for Linux 32.

  • [L5 ] FULL (Win 32/64, Linux 32/64, MacOS) (Warning: 33MB)
  • [L6 ](Win 32/64) (12MB)
  • [L7 ] (Win 32 only) (6 MB)
  • [L8 ] (Win 64 only) (6 MB)
  • [L9 ] (Linux 32/64) (15MB)
  • [L10 ] (Linux 32 only) (8 MB)
  • [L11 ] (Linux 64 only) (7MB)
  • [L12 ] (MacOS 64 only) (7 MB)
  • ---
  • [L13 ] tclMuPdf version 1.x Development-Kit. For developers/maintainers.
  • [L14 ] tclMuPdf version 2.x Development-Kit. For developers/maintainers.

Examples

 set PDFfile /tmp/xxx.pdf
 set pdfObj [mupdf::open $PDFfile]
 set pageObj [$pdfObj getpage 0]  ;# page 0 is the first page
 $pageObj savePNG /tmp/img0.png -zoom 0.5
  # ...
 $pdfObj quit

Image tclmupdf-thumb

  # rendering just a part of a page (zoomed) 
 set PDFfile /tmp/zzz.pdf
 set pdfObj [mupdf::open $PDFfile]
 set pageObj [$pdfObj getpage 1]
 $pageObj savePNG /tmp/img1.png -zoom 2.75 -from 50 450 200 700

Image tclmupdf-particular

  Reference manual - tclMuPDF 1.6

tclMuPDF 1.6 Tcl meets MuPDF

SYNOPSIS

package require Tcl 8.5

package require mupdf ?1.6?

  • mupdf::open filename ?-password password?
  • pdfHandle upwd user-password
  • pdfHandle opwd owner-password
  • pdfHandle quit
  • pdfHandle close ?-flatten bool? ?-decrypt bool?
  • pdfHandle fullname
  • pdfHandle authentication
  • pdfHandle version
  • pdfHandle npages
  • pdfHandle getpage n
  • pdfHandle anchor anchorName
  • pdfHandle openedpages
  • pdfHandle closeallpages
  • pdfHandle fields
  • pdfHandle field fieldname
  • pdfHandle field fieldname value
  • pdfHandle fieldattrib fieldname
  • pdfHandle fieldattrib fieldname ?options?
  • pdfHandle flattenfield fieldname ?...?
  • pdfHandle signatures
  • pdfHandle haschanges
  • pdfHandle canbesavedincrementally
  • pdfHandle export filename ?-flatten bool? ?-decrypt bool?
  • pdfHandle portfolio list
  • pdfHandle portfolio extract i ?-dir pathname? ?-as filename?
  • pdfHandle search needle ?-startpage pagenum? ?-max hits?
  • pdfHandle search..more ?-max hits?
  • pdfHandle graft pageHandle
  • pageHandle embed graftID ?...options...?
  • pageHandle annots
  • pageHandle annot annotID
  • pageHandle annot annotID attribute
  • pageHandle annot annotID attribute value ?attribute value ...?
  • pageHandle annot annotID delete
  • pageHandle annot create type ?attribute value ...?
  • pageHandle blocks
  • pageHandle close
  • pageHandle size
  • pageHandle docref
  • pageHandle lines
  • pageHandle pagenumber
  • pageHandle savePNG filename ?-zoom zoom? ?-from x0 y0 x1 y1?
  • pageHandle saveImage image ?-zoom zoom? ?-from x0 y0 x1 y1? ?-to x0 y0?
  • pageHandle addsigfield fieldname x0 y0 x1 y1
  • pageHandle search needle ?-fromtop true/false? ?-max hits?
  • pageHandle images list ?-id imageID?
  • pageHandle images extract ?-id imageID? ?-dir pathname? ?-as pattern? ?-transparency boolean?
  • mupdf::imagenamesformat ?pattern?
  • mupdf::close handle (**DEPRECATED**)
  • mupdf::quit pdfHandle (**DEPRECATED**)
  • mupdf::isobject handle
  • mupdf::type handle
  • mupdf::documents
  • mupdf::documentnames
  • mupdf::isopen filename
  • mupdf::cli_passwordhelper ?helperProc?
  • mupdf::tk_passwordhelper ?helperProc?
  • mupdf::libinfo

DESCRIPTION

Package mupdf integrates the MuPDF framework in Tcl. The focus of MuPDF is on speed, small code size, and high-quality anti-aliased rendering. The main goal of this integration is to render PDF pages as images, either as .png files or directly into a Tk photo image. Thanks to its speed, mupdf can be used for building interactive PDF viewers with high-quality, real-time zooming. mupdf is a binary package, distributed in a multi-platform bundle, i.e. it can be used on

  • Windows 32/64 bit
  • Linux 32/64 bit
  • MacOS 64 bit

Just an example to get the flavor of how to use mupdf:

    # open a file and save 1st page as a .png file

    package require mupdf
    set pdf [mupdf::open /mydir/sample.pdf]
    set page [$pdf getpage 0]   ;# 0 is the 1st page
    $page savePNG /mydir/page0.png
    $pdf quit

MuPDF with and without Tk

Starting from version 1.1, you can also run mupdf from a tclsh interpreter, without loading Tk. The following command

package require mupdf-notk

can be used in a tclsh interpreter to load the package without requiring Tk support. You will still be able to save images as PNG files, but of course some subcommands related to Tk won't be available (e.g. saveImage). The command

package require mupdf

loads the full package (and requires Tk).
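
A script meant to run both under wish and under plain tclsh can pick the variant at load time. This is only a sketch: it assumes that "Tk already loaded" is the right criterion for your application.

```tcl
# Load the full package only when Tk is already present in this interpreter;
# otherwise fall back to the Tk-free variant (PNG export only, no saveImage).
if {[package provide Tk] ne ""} {
    package require mupdf
} else {
    package require mupdf-notk
}
```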

mupdf Commands

mupdf supports the following commands:

mupdf::open filename ?-password password?
This is the main command: it opens the PDF-file filename and returns a pdfHandle to be used in subsequent operations. If filename is password-protected, you may specify a password adding the option -password; if option -password is not specified, mupdf will ask for a password. Read more about it in the section Working with Password Protected Files.
pdfHandle upwd user-password
pdfHandle opwd owner-password
add or change, respectively, the user-password or the owner-password. To reset a password, set it to "". Read more about it in the section Working with Password Protected Files.
pdfHandle quit
close and destroy the pdfHandle without saving the changes. All the related resources (e.g. opened pages) will be closed.
pdfHandle close ?-flatten bool? ?-decrypt bool?
close and destroy the pdfHandle and save the changes. All the related resources (e.g. opened pages) will be closed. See export for the meaning of the various options.
pdfHandle fullname
return the fully normalized pathname of the pdf-file.
pdfHandle authentication
return the current authentication mode for pdfHandle. It may be none (no auth required), user (opened with user's password), owner (opened with owner's password).
pdfHandle version
return the document's internal PDF-version.
pdfHandle npages
return the number of pages.
pdfHandle getpage n
return a pageHandle to be used in subsequent operations. Note that first page is page 0. Note that if the requested page is currently opened, getpage reuses the handle of the opened page.
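A minimal sketch of a page loop built on npages and getpage, rendering every page to a PNG file. The paths are hypothetical; pages are closed explicitly so that only one page is open at a time.

```tcl
package require mupdf

set pdf [mupdf::open /mydir/sample.pdf]      ;# hypothetical input file
set n [$pdf npages]
for {set i 0} {$i < $n} {incr i} {           ;# pages are numbered from 0
    set page [$pdf getpage $i]
    $page savePNG [format "/mydir/page%03d.png" $i]
    $page close                              ;# free the page resources
}
$pdf quit                                    ;# nothing was changed, nothing to save
```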
pdfHandle anchor anchorName
return the location of anchorName as list of 3 numbers:
  • a page number ( -1 if anchorName is not found )
  • x displacement
  • y displacement
x and y are hints for locating the anchor: they represent the displacement from the upper-left corner of the page (0,0).
pdfHandle openedpages
return the list of all pageHandles currently opened related to pdfHandle
pdfHandle closeallpages
close all currently opened pages related to pdfHandle
pdfHandle fields
return a list of field-records. Each field-record is a list of three elements:
  • the field-name
  • the field-type (button, radiobutton, checkbox, text, combobox, listbox, signature or unknown)
  • the field-value
Note that for a signature field, if a signature is present, the returned field-value is simply the fixed string <<signature>>.
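As an illustration, the field-records can be unpacked with lassign. This sketch assumes a hypothetical form file and simply prints one line per field.

```tcl
package require mupdf

set pdf [mupdf::open /mydir/form.pdf]        ;# hypothetical form file
foreach rec [$pdf fields] {
    # each record is {field-name field-type field-value}
    lassign $rec name type value
    puts [format "%-30s %-12s %s" $name $type $value]
}
$pdf quit
```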
pdfHandle field fieldname
return the field's value, or raise an error if fieldname is not a valid field. Warning: field-names with accented characters, or in general characters from Latin alphabets, are accepted (though not encouraged). Field-names with characters from non-Latin alphabets (Greek, Russian, Chinese, ...) are not supported.
pdfHandle field fieldname value
set the field's value, or raise an error if fieldname is not a valid field. Warning: see the section Notes and limitations about field names and values.
pdfHandle fieldattrib fieldname
get all the field's attributes, as a list of options/values, or raise an error if fieldname is not a valid field. Currently supported options are:
  • -readonly
pdfHandle fieldattrib fieldname ?options?
set the field's attributes, or raise an error if fieldname is not a valid field. Currently supported options are:
  • -readonly bool
pdfHandle flattenfield fieldname ?...?
flatten all the listed fields' appearances (and removes all the listed fields from the field-table). Note that flattening N fields with a single command is far more efficient than running N flattening commands; this is especially true for documents with many pages.
pdfHandle signatures
return a list of signature field-records. Each field-record is a list of two elements:
  • the field-name
  • the field-value
Empty signature fields (blank signatures) have a field-value equal to "" (empty string). Currently, filled signature fields are simply denoted with the fixed string <<signature>>.
pdfHandle haschanges
return 1 if pdfHandle has been changed, otherwise 0.
pdfHandle canbesavedincrementally
return 1 if pdfHandle can be saved in incremental-mode, otherwise 0.
pdfHandle export filename ?-flatten bool? ?-decrypt bool?
save the current document and its changes in an alternative filename. Valid options are:
  • -flatten bool - (**DEPRECATED**) flatten all form fields and annotations (default is false)
  • -decrypt bool - (**DEPRECATED**) if bool is true, remove the password protection from an encrypted pdf. If bool is false, keep the original password protection.
Note that the target filename should be different from the original pdf-file related to pdfHandle and, in general, different from the name of any pdf-file currently opened in this process. When a pdfHandle is closed (see mupdf::close or "pdfHandle close"), all the changes are saved to its original pdf-file (see also mupdf::quit or "pdfHandle quit" for closing without saving changes). Saving with the -flatten option is now (**DEPRECATED**): use "pdfHandle flattenfield fieldname1 ..." and then save/export the document. Saving with the -decrypt option is now (**DEPRECATED**): the default is to remove the password protection, unless the opwd or upwd commands explicitly set new passwords.
pdfHandle portfolio list
return the list of the embedded files
pdfHandle portfolio extract i ?-dir pathname? ?-as filename?
extract the i-th embedded file (i >= 0). If option -dir is not specified, the embedded file is saved in the current directory. If option -as is not specified, the embedded file is saved with its original name.
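A sketch of extracting every embedded file. Only the length of the list returned by "portfolio list" is used here, since the exact content of its elements is not described above; the file and directory names are hypothetical.

```tcl
package require mupdf

set pdf [mupdf::open /mydir/portfolio.pdf]   ;# hypothetical file
set outDir /tmp/attachments                  ;# hypothetical target directory
file mkdir $outDir

set count [llength [$pdf portfolio list]]
for {set i 0} {$i < $count} {incr i} {       ;# embedded files are indexed from 0
    $pdf portfolio extract $i -dir $outDir
}
$pdf quit
```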
pdfHandle search needle ?-startpage pagenum? ?-max hits?
search the string needle starting from page-number pagenum (default is page 0) and return up to hits results (default is 10). The result of the search subcommand is a list of page-positions. Each page-position is a list of two elements:
  • the page-number
  • a list with the 4 coords of the box enclosing the searched needle
Further results can be retrieved with the search..more subcommand.
pdfHandle search..more ?-max hits?
return a list of the next hits elements matching the last given needle. The result of search..more subcommand is a list of page-positions similar to the list returned by search.
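The two subcommands can be combined in a loop that walks through all the hits. This sketch assumes that search..more returns an empty list once there are no further results, which is not stated above; the file name and the needle are hypothetical.

```tcl
package require mupdf

set pdf [mupdf::open /mydir/sample.pdf]      ;# hypothetical file
set hits [$pdf search "invoice" -max 20]
while {[llength $hits] > 0} {
    foreach hit $hits {
        # each page-position is {page-number {x0 y0 x1 y1}}
        lassign $hit pageNum box
        puts "page $pageNum: $box"
    }
    # ASSUMPTION: an empty list signals that the search is exhausted
    set hits [$pdf search..more -max 20]
}
$pdf quit
```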
pdfHandle graft pageHandle
import in the currently open document referred by pdfHandle the full content of the page referred by pageHandle. The page contents are just imported, not displayed, and should be controlled via the next embed command. The result of graft is a graftID that should be used by the next embed ... command for displaying the image of the source page within one or more pages of the document. Note: this page-image is not a raster-image; it's vector based and hence remains accurate across zooming levels. Note: graftID is unique for each opened document pdfHandle and it's valid until pdfHandle is closed.
pageHandle embed graftID ?...options...?
put a copy of a grafted page in the page referred by pageHandle. pageHandle should be a page of the same opened document that returned that graftID. Valid options are:
  • -over bool - display the contents of the grafted page over (or below) the contents of the current page (default is true).
  • -angle theta - rotate the grafted image by theta degrees (-360.0 .. +360.0). Default is 0.0.
  • -zoom factor - enlarge/shrink the grafted image by a factor of factor (default is 1.0).
  • -from x0 y0 ?x1 y1? - just copy a rectangular sub-region of the grafted image. (x0,y0) and (x1,y1) specify diagonally opposite corners of the rectangle; (0,0) is the upper-left corner of the (visible portion of the) grafted page. If x1 and y1 are not specified, the default value is the bottom-right corner of the grafted page.
  • -to x0 y0 ?x1 y1? - place the grafted page in a rectangular subregion of the destination page. If x0 and y0 are not specified, the grafted page is placed at (0,0), i.e. at the upper-left corner of the destination page. If x1 and y1 are not specified, the default value is the bottom-right corner of the destination page.
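A sketch of the graft/embed sequence used as a watermark: the first page of one PDF is grafted into another document and embedded below the contents of every page. File names are hypothetical, and the source document is deliberately kept open until the result has been exported.

```tcl
package require mupdf

set doc   [mupdf::open /mydir/report.pdf]      ;# hypothetical target document
set stamp [mupdf::open /mydir/watermark.pdf]   ;# hypothetical source document

# import the source page once; the graftID is valid until $doc is closed
set graftID [$doc graft [$stamp getpage 0]]

set n [$doc npages]
for {set i 0} {$i < $n} {incr i} {
    set page [$doc getpage $i]
    $page embed $graftID -over false           ;# place it below the page contents
}

$doc export /mydir/report-watermarked.pdf      ;# save under a different name
$doc quit
$stamp quit
```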
pageHandle annots
return the list of annotation IDs on pageHandle.
pageHandle annot annotID
return the list of the (so far recognized) attributes of annotID. Two special read-only attributes, -type and -rect, are provided for each annotation.
pageHandle annot annotID attribute
get the value of attribute for the annotation annotID.
pageHandle annot annotID attribute value ?attribute value ...?
set the value of one or more attributes of the annotation annotID.
pageHandle annot annotID delete
delete the annotation annotID
pageHandle annot create type ?attribute value ...?
create a new annotation. Currently supported types are: highlight, underline, strikeout, squiggly. Common attributes for each supported type are:
  • -color { R G B } - R,G,B must be decimal numbers between 0.0 and 1.0
Specific options for highlight, underline, strikeout, squiggly are:
  • -vertices {x0 y0 x1 y1 ... } - a list of 4x integers denoting the bounding boxes of the text to be marked. Note: these coordinates are not necessarily related to the text on a page.
A new annotationID is returned.
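A sketch that creates a yellow highlight and then lists the annotations of the page through the read-only -type and -rect attributes. The file name and the box coordinates are hypothetical.

```tcl
package require mupdf

set pdf  [mupdf::open /mydir/sample.pdf]     ;# hypothetical file
set page [$pdf getpage 0]

# one box (x0 y0 x1 y1), highlighted in yellow; coordinates are made up
set newID [$page annot create highlight \
    -color {1.0 1.0 0.0} \
    -vertices {72 100 300 120}]

foreach id [$page annots] {
    puts "$id: type=[$page annot $id -type] rect=[$page annot $id -rect]"
}
$pdf quit                                    ;# discard the change
```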
pageHandle blocks
return the list of image/text blocks in pageHandle. A block is a list of 5 elements with the following format: textblock|imageblock x0 y0 x1 y1
pageHandle close
close and free all the resources of the page referred by pageHandle. The handle pageHandle is destroyed.
pageHandle size
return the physical size of the page as a list of two decimal numbers. Note that page size is expressed in points, i.e. 1/72 inch.
pageHandle docref
return a reference to the related pdf-document as a pdfHandle
pageHandle lines
return the list of the bbox of the text lines in pageHandle. A bbox is a list of 4 elements with the following format: x0 y0 x1 y1
pageHandle pagenumber
return the page number of pageHandle
pageHandle savePNG filename ?-zoom zoom? ?-from x0 y0 x1 y1?
render the page to a .png file named filename. With the default -zoom factor of 1.0, a page whose size is W x H points is rendered as a raster image of W x H pixels. If -zoom is specified, the resulting image size is scaled by a factor of zoom. By default the whole page is rendered; the -from option allows you to render only a given rectangular area of the page. x0 y0 are the coords of the top-left corner and x1 y1 those of the bottom-right corner. These coords must be expressed in terms of the physical size of the page, i.e. in points. Note that if these coords lie outside the page, only the intersection of this area with the page area is rendered.
  ...
  set page [$pdf getpage 0]   ;# 0 is the 1st page
  lassign [$page size] dx dy
   # save just the upper half of the page
  $page savePNG /mydir/page0.png -zoom 2.25 -from 0 0 $dx [expr {$dy/2}]
  ...
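Since a zoom of 1.0 maps one point to one pixel, the zoom needed for a given pixel width follows directly from the page width. The helper below is a hypothetical convenience procedure, not part of the package.

```tcl
# Return the -zoom factor that renders a page of width pageWidthPt (points)
# as an image targetPx pixels wide (zoom 1.0 means 1 point = 1 pixel).
proc zoomForWidth {pageWidthPt targetPx} {
    return [expr {double($targetPx) / $pageWidthPt}]
}
```

For example, a 200-pixel-wide thumbnail could be requested with

  $page savePNG /mydir/thumb.png -zoom [zoomForWidth [lindex [$page size] 0] 200]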
pageHandle saveImage image ?-zoom zoom? ?-from x0 y0 x1 y1? ?-to x0 y0?
render the page in an existing Tk photo image. The width and/or height of image are unchanged if the user has set an explicit image width or height on it (with the -width and/or -height configuration options, respectively). For the -zoom and -from options, the same rules as for savePNG apply. Option -to allows you to place the resulting raster image at the x0 y0 coords of the destination image. By default, -to is 0.0 0.0. NOTE: this command is not available with the package mupdf-notk.
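A minimal sketch of a viewer: the first page is rendered into a photo image shown in a label. It must run under Tk (e.g. with wish); the file name and the widget name are hypothetical.

```tcl
package require Tk
package require mupdf

set pdf  [mupdf::open /mydir/sample.pdf]     ;# hypothetical file
set page [$pdf getpage 0]

set img [image create photo]                 ;# no explicit -width/-height set
$page saveImage $img -zoom 1.5
pack [label .view -image $img]
```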
pageHandle addsigfield fieldname x0 y0 x1 y1
add a blank signature field in a rectangular box at coords x0 y0 x1 y1. fieldname must be unique among the existing field names.
pageHandle search needle ?-fromtop true/false? ?-max hits?
search the string needle in the current page and return up to hits positions (default is 10). By default the search starts from the top of the page. If you need to find the next hits, use option -fromtop false. The result of the search subcommand is a list of positions. Each position is a list with the 4 coords of the box enclosing the searched needle.
pageHandle images list ?-id imageID?
return a list of all the images contained in the page referred by pageHandle. The result of images list subcommand is a list of image-records. Each image-record is a list of six elements:
  • image-ID (unique for each page)
  • image's width (in pixel)
  • image's height (in pixel)
  • image's colorspace (e.g. DeviceRGB, ICCBased, ...)
  • image's bit per component (number of color components may be inferred from colorspace)
  • mask-flag : 1 means that the image has a pixel-mask (i.e. some transparent pixels)
If option -id is present, the resulting list is limited to the image-record for imageID.
pageHandle images extract ?-id imageID? ?-dir pathname? ?-as pattern? ?-transparency boolean?
if option -id is specified, extract and save the image referred by imageID (see images list). If option -id is missing, all the images contained in the page are extracted and saved. If option -dir is not specified, images are saved in the current directory. If option -as is not specified, images are saved with a name derived from the default mupdf::imagenamesformat (see below), otherwise images are saved as pattern (for pattern rules see below for mupdf::imagenamesformat). If option -transparency is true, save the (semi)transparent pixels (if any), otherwise transparent pixels (if any) are rendered as white pixels. extract returns a list of extracted-records. Each extracted-record is a list of three elements:
  • image-ID (unique for each page)
  • image-name (pdf page's internal name)
  • saved filename (empty string if the image was skipped (unknown format...))
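A sketch that first lists the images of a page and then extracts them all, reporting the skipped ones. Paths and the naming pattern are hypothetical; the target directory is created beforehand as a precaution.

```tcl
package require mupdf

set pdf  [mupdf::open /mydir/sample.pdf]     ;# hypothetical file
set page [$pdf getpage 0]

foreach rec [$page images list] {
    lassign $rec id width height colorspace bpc hasMask
    puts "image $id: ${width}x${height} $colorspace, $bpc bpc, mask=$hasMask"
}

set outDir /tmp/extracted                    ;# hypothetical target directory
file mkdir $outDir
foreach rec [$page images extract -dir $outDir -as "img_%3p_%2i"] {
    lassign $rec id name savedAs
    if {$savedAs eq ""} {
        puts "image $id ($name) was skipped"
    } else {
        puts "image $id ($name) saved as $savedAs"
    }
}
$pdf quit
```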
mupdf::imagenamesformat ?pattern?
a pattern is a parametric filename specification similar to the format specification used by the C printf function. If pattern is not specified, return the currently defined default pattern. If pattern is specified, set the default pattern. When the "pageHandle images extract ..." command is called, all the images are saved with a filename based on a pattern (this pattern can be explicit, if option -as is present, or implicit, based on the default mupdf::imagenamesformat). A pattern is a simple filename specification (just the base-name, since the extension of the extracted images (png, jpg, ...) is automatically determined) with zero or more special symbols like the following ones:
  • %p : page number (first page is 1)
  • %P : total number of pages
  • %i : image number - images in a page are numbered starting from 1
  • %I : total number of images (in the current page)
Special symbols may also be written with a padding-specification like %5P; this notation means that the symbol %P should be padded as a 5-character string with leading '0's. If more than one image is extracted in a single operation, and pattern does not contain the %i symbol (the image number), then -%i is implicitly appended in order to avoid a filename collision. Assuming that the current page is page 123 (i.e. the 124th page), the following command
  $pageHandle images extract -as "IMG_%5p"

will generate IMG_00124-1.jpg, IMG_00124-2.jpg ..... (note: the file extension may be different)

  $pageHandle images extract -as "Z%5p(%2i)"

will generate Z00124(01).jpg, Z00124(02).jpg ..... (note: the file extension may be different)

mupdf::close handle (**DEPRECATED**)
This command has been deprecated since mupdf 1.5. Use "pdfHandle close ..." or "pageHandle close". If handle refers to a pageHandle, close the page. If handle refers to a pdfHandle, close the pdf (updating its original pdf-file) and all its opened pages.
mupdf::quit pdfHandle (**DEPRECATED**)
close and destroy the pdfHandle without saving the changes. This command has been deprecated since mupdf 1.5. Use "pdfHandle quit".
mupdf::isobject handle
return 1 if handle is a valid reference to a pdf or a page.
mupdf::type handle
return document or page if handle is a valid reference, else raise an error.
mupdf::documents
return a list of the currently opened pdfHandles
mupdf::documentnames
return a list of pdf-filenames currently opened (fully normalized filenames).
mupdf::isopen filename
check if filename is among the currently opened pdf-files.
mupdf::cli_passwordhelper ?helperProc?
get/set the helper procedure for shell-like applications. If helperProc is "" (the empty string), the default helper is restored. See section Working with Password Protected Files.
mupdf::tk_passwordhelper ?helperProc?
get/set the helper procedure for applications with a Tk graphical interface. If helperProc is "" (the empty string), the default helper is restored. See section Working with Password Protected Files.
mupdf::libinfo
return specific attributes of the underlying MuPDF library as a list of keywords and their values. The provided keywords are:
version
The version of the underlying MuPDF library
..more to come ..

Working with Password Protected Files

You can open a password-protected PDF in a non-interactive or in an interactive way. In the first, non-interactive way, you must provide a password in advance

  set pdf [mupdf::open book.pdf -password "123open"]

If password is wrong then an error is raised; look at the error message and check for the errorcode: it should be MUPDF WRONGPASSWORD. Note that there's no distinction between owner's password and user's password; if you provide the right owner's password, the PDF is opened in owner-mode, if you provide the right user's password, the PDF is opened in user-mode. You can check if PDF has been opened in user-mode or owner-mode by calling the authentication command.

  set pdf [mupdf::open book.pdf -password "123open"]
   # if we are here, it means that password was OK
  set mode [$pdf authentication]
   # "none" means that book.pdf was not password-protected
   # "user" means that the supplied password was the user's password
   # "owner" means ...

The second, interactive way does not require you to provide an explicit password; if the PDF is password-protected, mupdf will ask for a password. Depending on the nature of your application (with or without Tk), mupdf will select and call a predefined (yet customizable) helper procedure. These predefined, built-in procedures can be changed with the mupdf::cli_passwordhelper or mupdf::tk_passwordhelper command.

   # template for a (shell-like) password-helper procedure
   proc mypswdhelper {filename} {
       ...  ask for a password ....
       .... get the password ....
       ....
       return $password    
   }
  mupdf::cli_passwordhelper mypswdhelper

   # template for a (GUI) password-helper procedure
   proc myGUIpswdhelper {filename} {
       ...  raise some popup ...
       ...  ask for a password ....
       .... get the password ...
       .... close the popup
       return $password    
   }
  mupdf::tk_passwordhelper myGUIpswdhelper
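
As a concrete instance of the shell-like template, the following sketch prompts on the terminal and reads the password from standard input. The procedure name is hypothetical, and the typed password is echoed, since no terminal handling is attempted.

```tcl
package require mupdf-notk

# hypothetical helper: prompt on stdout, read one line from stdin
proc askPasswordOnTerminal {filename} {
    puts -nonewline "Password for $filename: "
    flush stdout
    return [gets stdin]
}
mupdf::cli_passwordhelper askPasswordOnTerminal
```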

Removing protection from password-protected files

If you open a password-protected file and export it, by default the resulting PDF is saved without a password. You can explicitly control this behaviour by appending the -decrypt option to the export or close commands (default is -decrypt true).

Adding or changing a password

After opening a PDF, you can add or change the user/owner password with the opwd or the upwd commands, as in the following example

  set pdf [mupdf::open mybook.pdf]
   # if mybook.pdf is protected, insert the password

  $pdf upwd "my user-password"
  $pdf opwd "my owner-password"
  $pdf close  ;# ..or $pdf export ...

Notes and limitations about field names and values

Due to current limitations of the underlying MuPDF library, field names and values should be limited to Latin alphabets only. See the note enclosed with MuPdf's BUG 696478 (Nov 2018):

     ... "There is, however, a limitation with our form filling support that still needs to be addressed: it only supports PDFDocEncoding, not Unicode. ... We can currently fill out the form fields in any language, but the display code still only handles the Latin alphabet."

This means that you can enter text in a form field in any alphabet (tested with Russian, Chinese, Greek, ...) but the rendered text for the non-Latin alphabets will appear wrong. Another major limitation is the rendering of fields other than 'text' widgets, i.e. checkboxes, radiobuttons... For these kinds of widgets the support of the underlying library is still incomplete, in particular for values expressed in non-ASCII characters (e.g. accented letters ...)

Limitations

Support for portfolio management is still incomplete: commands for adding/removing/reordering embedded files will require a more robust mupdf-core implementation.

KEYWORDS

pdf, photo

CATEGORY

pdf parsing and rendering

COPYRIGHT

 Copyright (c) 2020, by A.Buratti

  Reference manual - tclMuPDF 2.3

tclMuPDF 2.4 - Tcl meets MuPDF

SYNOPSIS

package require Tcl 8.6

package require MuPDF ?2.4?

  • mupdf::open filename ?-password password?
  • mupdf::new filename
  • pdfObj authentication
  • pdfObj upwd user-password
  • pdfObj opwd owner-password
  • pdfObj removepassword
  • pdfObj quit
  • pdfObj close
  • pdfObj fullname
  • pdfObj version
  • pdfObj addpage pageNumber|end ?-size dx dy?
  • pdfObj deletepage pageNumber
  • pdfObj deletepages pageNumA pageNumB
  • pdfObj movepage from to
  • pdfObj anchor anchorName
  • pdfObj npages
  • pdfObj getpage n
  • pdfObj openedpages
  • pdfObj ispageopened pageNum
  • pdfObj closepage pageNum
  • pdfObj closeallpages
  • pdfObj fields
  • pdfObj field fieldname
  • pdfObj field fieldname value
  • pdfObj fieldattrib fieldname
  • pdfObj fieldattrib fieldname ?options?
  • pdfObj flattenfield fieldname ?...?
  • pdfObj signatures
  • pdfObj addsigfield fieldname pageNumber x0 y0 x1 y1
  • pdfObj haschanges
  • pdfObj export filename
  • pdfObj portfolio list
  • pdfObj portfolio extract i ?-dir pathname? ?-as filename?
  • pdfObj newsearch ?-mode listOfFlags?
  • pdfObj graft pageObj
  • pdfObj grafts
  • pdfObj embed graftID pageNumber ?...options...?
  • mupdf::printwarnings ?boolValue?
  • pdfObj warnings
  • pdfObj resetwarnings
  • pdfObj wasrepaired
  • searchObj destroy
  • searchObj find searchStr ?-max hits? ?-currpageonly boolean?
  • searchObj currpage
  • searchObj currpage n
  • searchObj docref
  • pageObj annots
  • pageObj annot get annotID
  • pageObj annot annotID
  • pageObj annot get annotID attribute
  • pageObj annot annotID attribute
  • pageObj annot set annotID attribute value ?attribute value ...?
  • pageObj annot annotID attribute value ?attribute value ...?
  • pageObj annot delete annotID ?annotID ...?
  • pageObj annot flatten annotID ?annotID ...?
  • pageObj annot create type ?attribute value ...?
  • pageObj blocks
  • pageObj close
  • pageObj destroy
  • pageObj size
  • pageObj docref
  • pageObj pagenumber
  • pageObj lines
  • pageObj text ?-bbox none|blocks|lines? ?-mode listOfFlags?
  • pageObj savePNG filename ?-zoom zoom? ?-from x0 y0 x1 y1? ?-overlay true/false?
  • pageObj saveImage image ?-zoom zoom? ?-from x0 y0 x1 y1? ?-to x0 y0? ?-overlay true/false?
  • pageObj images list ?-id imageID?
  • pageObj images extract ?-id imageID? ?-dir pathname? ?-as pattern? ?-transparency boolean?
  • mupdf::imagenamesformat ?pattern?
  • pageObj addimage -file filename|-reuse ID ?other options?
  • mupdf::isobject obj
  • mupdf::classes
  • mupdf::classinfo obj
  • mupdf::documents
  • mupdf::documentnames
  • mupdf::isopen filename
  • mupdf::Doc names
  • mupdf::Page names
  • mupdf::TextSearch names
  • mupdf::cli_passwordhelper ?helperProc?
  • mupdf::tk_passwordhelper ?helperProc?
  • mupdf::libinfo

DESCRIPTION

Package MuPDF integrates the MuPDF framework in Tcl. The focus of MuPDF is on speed, small code size, and high-quality anti-aliased rendering. The main goal of this integration is to generate images of PDF pages, either as .png files or directly as Tk photo images. MuPDF also provides partial support for editing PDFs, allowing you to fill form fields and work with annotations. Thanks to its speed, MuPDF can be used for building interactive PDF viewers with high-quality, real-time zooming. MuPDF is a binary package, distributed in a multi-platform bundle, i.e. it can be used on

  • Windows 64 bit
  • Linux 64 bit
  • MacOS 64 bit

Just an example to get the flavor of how to use MuPDF:

    # open a file and save 1st page as a .png file

    package require tclMuPDF
    set pdf [mupdf::open /mydir/sample.pdf]
    set page [$pdf getpage 0]   ;# 0 is the 1st page
    $page savePNG /mydir/page0.png
    $pdf quit

MuPDF 2.x and mupdf 1.x

Please note that MuPDF-2.x is a new major release that breaks backward compatibility with 1.x. In order to support its new features, and those planned next, MuPDF-2.x has been totally redesigned. The Appendix provides a quick summary of the changes and how to adapt your old code; see From mupdf 1.x to MuPDF 2.x.

MuPDF with and without Tk

MuPDF is provided in two variants:

  • tkMuPDF (or simply MuPDF) is the full package, it requires Tk
  • tclMuPDF does not require Tk. You will still be able to save images as PNG files but, of course, some subcommands related to Tk won't be available (e.g. saveImage).

MuPDF Commands

The MuPDF package is based on 2 main classes, mupdf::Doc and mupdf::Page, whose instances correspond to a PDF-document and its pages, plus an auxiliary class, mupdf::TextSearch, whose purpose is to manage progressive searches across all the pages of a document.

mupdf::Doc methods

mupdf::open filename ?-password password?
This is the main command: it opens the PDF-file filename and returns a pdfObj to be used in subsequent operations.

If filename is password-protected, you may specify a password adding the option -password; if option -password is not specified, MuPDF will ask for a password. Read more about it in the section Working with Password Protected Files.

mupdf::new filename
This command is for creating a new, empty document. It returns a pdfObj to be used in subsequent operations. Note that this document has no pages, so the first thing to do is to call addpage and add some contents

Read more about it in the section Adding/removing pages.

pdfObj authentication
return the current authentication mode for pdfObj. It may be none (no auth required), user (opened with user's password), owner (opened with owner's password).
pdfObj upwd user-password
pdfObj opwd owner-password
add or change, respectively, the user-password or the owner-password. In order to reset the passwords, set them to "".

Read more about it in the section Working with Password Protected Files.

pdfObj removepassword
reset both the user-password and the owner-password.
pdfObj quit
close and destroy the pdfObj without saving the changes. All the related resources (e.g. opened pages ..) will be destroyed.
pdfObj close
save the changes (if any) and then close and destroy the pdfObj. All the related resources (e.g. opened pages ..) will be destroyed.
pdfObj fullname
return the fully normalized pathname of the pdf-file.
pdfObj version
return the document's internal PDF-version.
pdfObj addpage pageNumber|end ?-size dx dy?
add a new empty page before the page pageNumber, or at end, and return a new pageObj to be used in subsequent operations. If -size is not specified, the new page is created with an A4-size (595.0x842.0).

See mupdf::Page methods. All the currently opened pages are properly renumbered.
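
As a minimal sketch of how mupdf::new and addpage fit together, the following helper creates a document made of empty pages. The name newBlankDoc is hypothetical (it is not part of the package); the sketch assumes that the package is already loaded and that close writes the new document to filename.

```tcl
# Hypothetical helper (not part of tclMuPDF): create a PDF with n empty pages.
# Assumes "package require tclMuPDF" has already been executed.
proc newBlankDoc {filename n {dx 595.0} {dy 842.0}} {
    set pdf [mupdf::new $filename]
    for {set i 0} {$i < $n} {incr i} {
        # "end" appends the page; the returned pageObj is not needed here
        $pdf addpage end -size $dx $dy
    }
    set count [$pdf npages]
    $pdf close   ;# save the changes, then destroy pdfObj
    return $count
}

# usage:
#   newBlankDoc /mydir/blank.pdf 3
```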

pdfObj deletepage pageNumber
delete the page at pageNumber. All the currently opened pages are properly renumbered.
pdfObj deletepages pageNumA pageNumB
delete all the pages from pageNumA to pageNumB. If pageNumA is greater than pageNumB, then nothing is done and no error is raised. All the currently opened pages are properly renumbered.
pdfObj movepage from to
move the page from position from to position to. All the currently opened pages are properly renumbered.
pdfObj anchor anchorName
return the location of anchorName as list of 3 numbers:
  • a page number ( -1 if anchorName is not found )
  • x displacement
  • y displacement
x and y are hints for locating the anchor: they represent the displacement from the upper-left corner of the page (0,0).
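
A short sketch of how the 3-element result of anchor might be consumed. The helper name pageOfAnchor is hypothetical; only the anchor and getpage methods described here are used.

```tcl
# Hypothetical helper: return the pageObj holding anchorName,
# or an empty string when the anchor is not found.
proc pageOfAnchor {pdf anchorName} {
    lassign [$pdf anchor $anchorName] pageNum x y
    if {$pageNum < 0} {
        return ""   ;# -1 means "anchor not found"
    }
    # x and y are only hints; they are ignored in this sketch
    return [$pdf getpage $pageNum]
}
```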
pdfObj npages
return the number of pages.
pdfObj getpage n
return a pageObj to be used in subsequent operations. See mupdf::Page methods. Note that first page is page 0. Note that if the requested page is currently opened, getpage reuses the handle of the opened page.
pdfObj openedpages
return the list of all currently opened pages (as page-numbers) related to pdfObj
pdfObj ispageopened pageNum
return 1 if page pageNum is currently opened, else 0.
pdfObj closepage pageNum
close the page referred to by pageNum. No error is returned if pageNum does not refer to an opened page.
pdfObj closeallpages
close all currently opened pages related to pdfObj
pdfObj fields
return a list of field-records. Each field-record is a list of three elements:
  • the field-name
  • the field-type (button, radiobutton, checkbox, text, combobox, listbox, signature or unknown)
  • the field-value
Note that for a signature field, if a signature is present, the returned field-value is simply the fixed string <<signature>>.
pdfObj field fieldname
return the field's value, or raise an error if fieldname is not a valid field.

Warning: field-names with accented characters, or in general with characters from Latin alphabets, are accepted (though not encouraged). Field-names with characters from non-Latin alphabets (Greek, Russian, Chinese, ...) are not supported.

pdfObj field fieldname value
set the field's value, or raise an error if fieldname is not a valid field.

Warning: see limitations in Notes and limitations about field names and values for details.
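
The field-records returned by fields can be combined with the field setter. The following is only a sketch: fillEmptyTextFields is a hypothetical name, and the sketch assumes that an unset text field reports an empty string as its value.

```tcl
# Hypothetical helper: set every empty text field to a default value.
# Returns the names of the fields that were changed.
proc fillEmptyTextFields {pdf value} {
    set changed {}
    foreach rec [$pdf fields] {
        # each field-record is {name type value}
        lassign $rec name type current
        if {$type eq "text" && $current eq ""} {
            $pdf field $name $value
            lappend changed $name
        }
    }
    return $changed
}
```

Keep value within the Latin alphabet, as explained in Notes and limitations about field names and values.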

pdfObj fieldattrib fieldname
get all the field's attributes, as a list of options/values, or raise an error if fieldname is not a valid field.

Currently supported options are:

  • -readonly
pdfObj fieldattrib fieldname ?options?
set the field's attributes, or raise an error if fieldname is not a valid field.

Currently supported options are:

  • -readonly bool
pdfObj flattenfield fieldname ?...?
flatten all the listed fields' appearances (and removes all the listed fields from the field-table).

Note that flattening N fields with a single command is far more efficient than running N flattening commands; this is especially true for documents with many pages.

pdfObj signatures
return a list of signature field-records. Each field-record is a list of two elements:
  • the field-name
  • the field-value
Empty signature fields (blank signatures) have a field-value equal to "" (empty string). Currently, filled signature fields are simply denoted with the fixed string <<signature>>.
pdfObj addsigfield fieldname pageNumber x0 y0 x1 y1
add a blank signature field on the page number pageNumber in a rectangular box at coords x0 y0 x1 y1.

fieldname must be unique among the existing field names.

pdfObj haschanges
return 1 if pdfObj has been changed, otherwise 0.
pdfObj export filename
save the current document and its changes under an alternative filename. Note that, in general, filename should be different from the name of any pdf-file currently opened in this process. The only exception is when filename is the same as the filename of pdfObj, but in this case you MUST quit pdfObj just after export.
pdfObj portfolio list
return the list of the embedded files
pdfObj portfolio extract i ?-dir pathname? ?-as filename?
extract the i-th embedded file (i >= 0). If option -dir is not specified, the embedded file is saved in the current directory.

If option -as is not specified, the embedded file is saved with its original name.
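
A minimal sketch for saving every embedded file. It assumes that portfolio list returns one element per embedded file, so that the list length gives the valid range of i; the helper name extractAllEmbedded is hypothetical.

```tcl
# Hypothetical helper: extract all the embedded files into outDir.
# Returns the number of extracted files.
proc extractAllEmbedded {pdf outDir} {
    file mkdir $outDir   ;# no-op if the directory already exists
    set n [llength [$pdf portfolio list]]
    for {set i 0} {$i < $n} {incr i} {
        # without -as, each file keeps its original name
        $pdf portfolio extract $i -dir $outDir
    }
    return $n
}
```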

pdfObj newsearch ?-mode listOfFlags?
return a new searchObj object for performing text-search on the pdfObj document. See mupdf::TextSearch methods
pdfObj graft pageObj
import into the currently open document referred to by pdfObj the full content of the page referred to by pageObj. The page contents are just imported, not displayed; they should then be inserted in pdfObj via the embed command. The result of graft is a graftID to be passed to embed.

Note: this page-image is not a raster image; it is vector-based and hence remains accurate across zooming levels.

Note: graftID is unique for each opened document pdfObj and it is valid until pdfObj is closed.

pdfObj grafts
return the list of the currently grafted pages
pdfObj embed graftID pageNumber ?...options...?
embed a copy of a grafted page in the page referred by pageNumber.

Valid options are:

  • -over bool - displays the contents of the grafted page over (or below) the contents of the current page (default is true).
  • -angle theta - rotate the grafted image by theta degrees (-360.0 .. +360.0). Default is 0.0.
  • -zoom factor - enlarge/shrink the grafted image by a factor of factor (default is 1.0).
  • -crop x0 y0 ?x1 y1? - just copy a rectangular sub-region of the grafted image. (x0,y0) and (x1,y1) specify diagonally opposite corners of the rectangle; (0,0) is the upper-left corner of the (visible portion of the) grafted page. If x1 and y1 are not specified, the default value is the bottom-right corner of the grafted page.
  • -to x0 y0 - place the center of the grafted page at (x0, y0). If this option is not specified then the grafted page is placed in the center of the destination page.
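
The graft/embed pair can be sketched as follows: one page is grafted once, then a copy is embedded on every page of the destination document. The helper name stampAllPages is hypothetical; stampPage is assumed to be a pageObj (typically taken from another opened document) that the caller keeps alive until the destination document has been saved.

```tcl
# Hypothetical helper: embed a scaled copy of stampPage on every page of pdf.
# Returns the number of pages that were stamped.
proc stampAllPages {pdf stampPage {zoom 0.5}} {
    # graft once: the returned graftID can be embedded many times
    set graftID [$pdf graft $stampPage]
    set n [$pdf npages]
    for {set i 0} {$i < $n} {incr i} {
        # no -to option: the copy is centered on the destination page
        $pdf embed $graftID $i -zoom $zoom -over true
    }
    return $n
}

# usage (file names are placeholders):
#   set doc   [mupdf::open /mydir/report.pdf]
#   set seal  [mupdf::open /mydir/seal.pdf]
#   stampAllPages $doc [$seal getpage 0] 0.3
#   $doc export /mydir/report-stamped.pdf
#   $doc quit ; $seal quit
```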

Control on warning-messages

See the rationale for these commands in the section Better control for MuPDF internal errors/warnings and repaired PDFs

mupdf::printwarnings ?boolValue?
If boolValue is true, then all subsequent MuPDF warning-messages will be printed on stderr. If boolValue is false, printing on stderr is suppressed (but you can still inspect the messages with the "pdfObj warnings" command). If boolValue is missing, the current setting is returned (true/false). The default setting is false.
pdfObj warnings
return a list with all the recorded MUPDF warning-messages.
pdfObj resetwarnings
reset the list of the warning-messages
pdfObj wasrepaired
returns true if the PDF was automatically repaired (and in this case you should find some details with the warnings method.)

mupdf::TextSearch methods

TextSearch objects allow you to perform text searches on a given pdfObj document. A TextSearch object can be created in two ways:

  set searchObj [mupdf::TextSearch new pdfObj ?-mode listOfFlags?]
or
  set searchObj [pdfObj newsearch ?-mode listOfFlags?]

The -mode option can be used to control the quality of extracted text. Read more about it in the section Text-Extraction & Text-Search Flags. Note that when pdfObj is closed, all its TextSearch instances are automatically destroyed. TextSearch methods:

searchObj destroy
destroy the searchObj object.
searchObj find searchStr ?-max hits? ?-currpageonly boolean?
search for the string searchStr starting from the current search-position. Return a list of up to hits matches (default is 10); the current position is then advanced accordingly. The result of the find subcommand is a list of page-positions. Each page-position is a list of two elements:
  • the page-number
  • a list with the 4 coords of the box enclosing the matched searchStr
If option -currpageonly is true, the search is limited to the page holding the current search-position.
searchObj currpage
return the current search-position as a page-number
searchObj currpage n
set the current search-position at the beginning of page n.
searchObj docref
return a reference to the related pdf-document as a pdfObj
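
Because find returns at most hits matches and then advances the search-position, a whole-document search is a loop. The sketch below assumes that a call returning fewer than the requested number of matches means the end of the document has been reached; findAll is a hypothetical name.

```tcl
# Hypothetical helper: collect every match of str in the whole document.
# Returns a list of page-positions, i.e. {pageNumber {x0 y0 x1 y1}} pairs.
proc findAll {pdf str {chunk 10}} {
    set search [$pdf newsearch]
    $search currpage 0          ;# start from the first page
    set all {}
    while 1 {
        set hits [$search find $str -max $chunk]
        lappend all {*}$hits
        # assumption: a short (or empty) result means no more matches
        if {[llength $hits] < $chunk} break
    }
    $search destroy
    return $all
}
```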

mupdf::Page methods

A Page object can be created starting from a pdfObj document with the following command:

  set pageObj [$pdfObj getpage n]

If the requested page is currently opened, getpage reuses the handle of the opened page. Note that when pdfObj is closed, all its opened pages are automatically closed/destroyed.

pageObj annots
return the list of annotation IDs on pageObj. Annotation IDs are unique within the whole PDF.
pageObj annot get annotID
pageObj annot annotID
return the list of annotID's attributes. The special read-only attribute -type is provided for each annotation.
pageObj annot get annotID attribute
pageObj annot annotID attribute
get the value of attribute for the annotation annotID.
pageObj annot set annotID attribute value ?attribute value ...?
pageObj annot annotID attribute value ?attribute value ...?
set the value of one or more attributes of the annotation annotID.
pageObj annot delete annotID ?annotID ...?
delete one or more annotations
pageObj annot flatten annotID ?annotID ...?
flatten one or more annotations
pageObj annot create type ?attribute value ...?
create a new annotation and return a new annotationID.

Currently supported types are: highlight, underline, strikeout, squiggly and stamp. Common attributes for each supported type are:

  • -color tkcolor - the color must be a symbolic name (e.g. lightblue) or #RRGGBB

Specific options for highlight, underline, strikeout, squiggly are:

  • -vertices {x0 y0 x1 y1 ... } - a list of numbers, four for each box, denoting the bounding boxes of the text to be marked. Note: these coordinates are not necessarily related to the text on a page.

Specific options for stamp are:

  • -xobject xObjectID - the ID of a grafted page (see graft method)
  • -rotate angle - the rotation angle (in degrees)
  • -scale s - the zoom factor
  • -to {x0 y0} - a list of 2 numbers. Note that the center of the stamp will be placed at the coords specified with the -to option. If this option is missing, the center of the stamp will be placed in the center of the page.
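
As an illustrative sketch, the boxes returned by a TextSearch find can be passed directly as -vertices of a highlight annotation. The helper name highlightMatches is hypothetical; hits is assumed to be a list of page-positions in the {pageNumber {x0 y0 x1 y1}} form returned by find.

```tcl
# Hypothetical helper: create one highlight annotation for each search hit.
# Returns the list of the new annotation IDs.
proc highlightMatches {pdf hits {color yellow}} {
    set ids {}
    foreach hit $hits {
        lassign $hit pageNum box
        # getpage reuses the handle if the page is already opened
        set page [$pdf getpage $pageNum]
        lappend ids [$page annot create highlight -color $color -vertices $box]
    }
    return $ids
}
```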
pageObj blocks
return the list of image/text blocks in pageObj. A block is a list of 5 elements with the following format:

textblock|imageblock x0 y0 x1 y1

pageObj close
pageObj destroy
close and free all the resources of the page referred to by pageObj. The object pageObj is destroyed.
pageObj size
return the physical size of the page as a list of two decimal numbers. Note that page size is expressed in points, i.e. 1/72 inch.
pageObj docref
return a reference to the related pdf-document as a pdfObj
pageObj pagenumber
return the pagenumber of pageObj
pageObj lines
return the list of the bbox of the text lines in pageObj. A bbox is a list of 4 elements with the following format:

x0 y0 x1 y1

pageObj text ?-bbox none|blocks|lines? ?-mode listOfFlags?
extract the plain text from the page pageObj with the (optional) bounding-box of each block or line. The -mode option can be used to control the amount and the quality of extracted text. Read more about it in the section Text-Extraction & Text-Search Flags.
  • If -bbox is missing or equal to none, this method returns just the plain text.
  • If -bbox is equal to blocks, this method returns a list like the following
  {bbox of 1st block} {plain text of the 1st block}
   ... 
  {bbox of the Nth block} {plain text of the Nth block}
  • If -bbox is equal to lines, this method returns a list like the following
{bbox of the 1st block} {
  {bbox of the 1st line} {plain text of the 1st line} 
  ... 
  {bbox of the Nth line} {plain text of the Nth line}
} 
...
{bbox of the Nth block} {
  {bbox of the 1st line} {plain text of the 1st line} 
  ... 
  {bbox of the Nth line} {plain text of the Nth line}
} 
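
The flat {bbox} {text} list returned with -bbox blocks can be walked two elements at a time. The following sketch keeps only the blocks lying entirely inside a given rectangle; textInRegion is a hypothetical name, and each bbox is assumed to be a 4-element list x0 y0 x1 y1, as for the lines method.

```tcl
# Hypothetical helper: return the text of the blocks fully contained
# in the rectangle (x0,y0)-(x1,y1), expressed in points.
proc textInRegion {page x0 y0 x1 y1} {
    set result {}
    foreach {bbox txt} [$page text -bbox blocks] {
        lassign $bbox bx0 by0 bx1 by1
        if {$bx0 >= $x0 && $by0 >= $y0 && $bx1 <= $x1 && $by1 <= $y1} {
            lappend result $txt
        }
    }
    return $result
}
```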
pageObj savePNG filename ?-zoom zoom? ?-from x0 y0 x1 y1? ?-overlay true/false?
render the page in a .png file named filename.

With a default -zoom factor equal to 1.0, a page whose size is W x H points is rendered as a raster image of W x H pixels. If -zoom is specified, the resulting image size is scaled by a factor of zoom.

By default the whole page is rendered; the -from option allows you to render only a given rectangular area of the page. x0 y0 are the coords of the top-left corner and x1 y1 are the coords of the bottom-right corner. These coords must be expressed in terms of the physical size of the page, i.e. in points. Note that if these coords lie outside of the page, only the intersection of this area with the page area is rendered.

If the -overlay option is set to true, the page is rendered with a transparent background; if it is set to false (default), the page is rendered with a white background.

  ...
  set page [$pdf getpage 0]   ;# 0 is the 1st page
  lassign [$page size] dx dy
   # save just the upper half of the page
  $page savePNG /mydir/page0.png -zoom 2.25 -from 0 0 $dx [expr {$dy/2}]
  ...
pageObj saveImage image ?-zoom zoom? ?-from x0 y0 x1 y1? ?-to x0 y0? ?-overlay true/false?
render the page in an existing Tk's photo image.

The width and/or height of image are unchanged if the user has set an explicit image width or height on it (with the -width and/or -height configuration options, respectively). For the -zoom, -from and -overlay options, the same rules as for savePNG apply. Option -to allows you to place the resulting raster image at the x0 y0 coords of the destination image; the default is -to 0.0 0.0.

NOTE: this command is not available with the package tclMuPDF.

pageObj images list ?-id imageID?
return a list of all the images contained in the page referred by pageObj.

The result of images list subcommand is a list of image-records. Each image-record is a list of six elements:

  • image-ID (unique for each page)
  • image's width (in pixels)
  • image's height (in pixels)
  • image's colorspace (e.g. DeviceRGB, ICCBased, ...)
  • image's bits per component (the number of color components may be inferred from colorspace)
  • mask-flag : 1 means that the image has a pixel-mask (i.e. some transparent pixels)
If option -id is present, the resulting list is limited to the image-record for imageID.
pageObj images extract ?-id imageID? ?-dir pathname? ?-as pattern? ?-transparency boolean?
if option -id is specified, extract and save the image referred by imageID (see images list). If option -id is missing, all the images contained in a page are extracted and saved.

If option -dir is not specified, images are saved in the current directory. If option -as is not specified, images are saved with a name derived from the default mupdf::imagenamesformat (see below); otherwise images are saved as pattern (for pattern rules see mupdf::imagenamesformat below). If option -transparency is true, the (semi)transparent pixels (if any) are saved; otherwise transparent pixels (if any) are rendered as white pixels.

extract returns a list of extracted-records. Each extracted-record is a list of three elements:

  • image-ID (unique for each page)
  • image-name (pdf page's internal name)
  • saved filename (empty string if the image was skipped, e.g. because of an unknown format)
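
A sketch of how the extracted-records might be checked, so that skipped images do not go unnoticed. The helper name extractPageImages is hypothetical; outDir is assumed to exist already.

```tcl
# Hypothetical helper: extract all the images of a page into outDir.
# Returns the list of saved filenames; skipped images are reported on stderr.
proc extractPageImages {page outDir} {
    set saved {}
    foreach rec [$page images extract -dir $outDir -transparency true] {
        # each extracted-record is {image-ID image-name filename}
        lassign $rec id name file
        if {$file eq ""} {
            puts stderr "image $id ($name) was skipped"
        } else {
            lappend saved $file
        }
    }
    return $saved
}
```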
mupdf::imagenamesformat ?pattern?
a pattern is a parametric filename specification similar to the format specification used by the C printf function.

If pattern is not specified, return the currently defined default pattern. If pattern is specified, set the default pattern. When the "pageObj images extract ..." command is called, all the images are saved with a filename based on a pattern. This pattern can be explicit, if option -as is present, or implicit, based on the default mupdf::imagenamesformat. A pattern is a simple filename specification (just the base-name, since the extension of the extracted images (png, jpg, ...) is automatically determined) with zero or more special symbols like the following ones:

  • %p : page number (first page is 1)
  • %P : total number of pages
  • %i : image number - images in a page are numbered starting from 1
  • %I : total number of images (in the current page)
Special symbols may also be written with a padding-specification like %5P; this notation means that the symbol %P should be padded as a 5-character string with leading '0's.

If more than one image is extracted in a single operation, and pattern does not contain the %i symbol (the image number), then -%i is implicitly appended in order to avoid a filename collision. Assuming that the current page is page 123 (i.e. the 124th page), the following command

  $pageObj images extract -as "IMG_%5p"

will generate IMG_00124-1.jpg, IMG_00124-2.jpg ..... (note: the file extension may be different)

  $pageObj images extract -as "Z%5p(%2i)"

will generate Z00124(01).jpg, Z00124(02).jpg ..... (note: the file extension may be different)
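
To make the substitution rules concrete, here is a pure-Tcl illustration of how such a pattern could be expanded. This is not the package's implementation: expandPattern is a hypothetical re-implementation of the documented rules only, and it handles neither the implicit -%i suffix nor the file extension.

```tcl
# Hypothetical illustration of the documented pattern rules
# (%p %P %i %I, with an optional zero-padding width such as %5p).
proc expandPattern {pattern page npages img nimgs} {
    set values [dict create p $page P $npages i $img I $nimgs]
    set out ""
    set pos 0
    foreach {whole width sym} \
            [regexp -all -inline -indices {%(\d*)([pPiI])} $pattern] {
        lassign $whole start end
        # copy the literal text preceding this symbol
        append out [string range $pattern $pos [expr {$start - 1}]]
        set w [string range $pattern {*}$width]
        set v [dict get $values [string range $pattern {*}$sym]]
        if {$w ne ""} {
            set v [format %0${w}d $v]   ;# pad with leading zeros
        }
        append out $v
        set pos [expr {$end + 1}]
    }
    append out [string range $pattern $pos end]
    return $out
}

# page numbers passed here are 1-based, as for %p
puts [expandPattern "Z%5p(%2i)" 124 300 1 2]   ;# prints Z00124(01)
```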

pageObj addimage -file filename|-reuse ID ?other options?
add a new or a reused image; return the ID of the loaded image.
  • -file filename - specify a PNG or JPEG filename.
  • -reuse ID - ID must be 0 (this means that option -file should be considered), or an ID returned by a previous call of addimage.
At least one of -reuse, -file must be present. If -reuse is present and its value ID is greater than 0, then option -file is simply ignored.

Other options are:

  • -to x0 y0 x1 y1 - defines the rectangular area where to put the image. By default this area is the full page. x0 y0 are the coords of the top-left corner and x1 y1 are the coords of the bottom-right corner. These coords must be expressed in terms of the physical size of the page, i.e. in points.
  • -resize boolean - If value is true, then the image will be resized in order to fill the whole destination area (preserving the initial proportions). (Default is true).
  • -over boolean - if this option is set to true, then the image is displayed over the current contents of the page, else it is displayed under the current contents. Default value is true.
  • -marginx dx1 ?dx2? - specifies how much horizontal spacing should be reserved for left,right margins within the destination area. If dx2 is missing, it is assumed to be equal to dx1. Default value is 0.
  • -marginy dy1 ?dy2? - specifies how much vertical spacing should be reserved for top,bottom margins within the destination area. If dy2 is missing, it is assumed to be equal to dy1. Default value is 0.
  • -anchor where - where specifies how to align the image within the page's destination area. Must be one of the values n, ne, e, se, s, sw, w, nw, or c. Default value is "c" (center), meaning that the image is centered in the destination area.
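
The -file/-reuse pair lends itself to a simple loop: start with ID 0, so that the first call loads the file, then feed the returned ID back in. The sketch below places a logo in the top-right corner of every page; addLogoToAllPages is a hypothetical name, and the sketch assumes that an ID returned for one page can be reused on other pages of the same document.

```tcl
# Hypothetical helper: put the same image in a strip at the top of each page.
# Returns the image ID returned by the last addimage call.
proc addLogoToAllPages {pdf logoFile {stripHeight 60}} {
    set id 0   ;# 0 means "consider -file"
    set n [$pdf npages]
    for {set i 0} {$i < $n} {incr i} {
        set page [$pdf getpage $i]
        lassign [$page size] dx dy
        # when id > 0, -file is ignored and the loaded image is reused
        set id [$page addimage -file $logoFile -reuse $id \
                    -to 0 0 $dx $stripHeight -anchor ne -marginx 20 -marginy 10]
    }
    return $id
}
```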

MuPDF introspection commands

mupdf::isobject obj
return 1 if obj is an oo-object (not limited to MuPDF objects).
mupdf::classes
returns the list of the oo-classes of MuPDF
mupdf::classinfo obj
return the oo::class of obj (not limited to MuPDF objects)
mupdf::documents
return a list of the currently opened pdfObjs
mupdf::documentnames
return a list of pdf-filenames currently opened (fully normalized filenames).
mupdf::isopen filename
check if filename is among the currently opened pdf-files.
mupdf::Doc names
mupdf::Page names
mupdf::TextSearch names
return a list of all the instances of Doc, Page, TextSearch. Note that when a Doc is destroyed, all its related Pages and TextSearch are destroyed, too.
mupdf::cli_passwordhelper ?helperProc?
get/set the helper procedure for shell-like applications. If helperProc is "" (empty string), then the default helper is restored. See section Working with Password Protected Files.
mupdf::tk_passwordhelper ?helperProc?
get/set the helper procedure for applications with a Tk graphical interface. If helperProc is "" (empty string), then the default helper is restored. See section Working with Password Protected Files.
mupdf::libinfo
return specific attributes of the underlying MuPDF library as a list of keywords and their values. The provided keywords are:
version
The version of the underlying MuPDF library
..more to come ..

Working with Password Protected Files

You can open a password-protected PDF in a non-interactive or in an interactive way. In non-interactive mode, you must provide a password in advance:

  set pdf [mupdf::open book.pdf -password "123open"]

If password is wrong then an error is raised; look at the error message and check for the errorcode: it should be MUPDF WRONGPASSWORD. Note that there's no distinction between owner's password and user's password; if you provide the right owner's password, the PDF is opened in owner-mode, if you provide the right user's password, the PDF is opened in user-mode. You can check if PDF has been opened in user-mode or owner-mode by calling the authentication command.

  set pdf [mupdf::open book.pdf -password "123open"]
   # if we are here, it means that password was OK
  set mode [$pdf authentication]
   # "none" means that book.pdf was not password-protected
   # "user" means that the supplied password was the user's password
   # "owner" means ...
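
The MUPDF WRONGPASSWORD errorcode can be intercepted with the try/trap command of Tcl 8.6. The following sketch tries a list of candidate passwords in turn; openWithPasswords is a hypothetical name, and any other error raised by mupdf::open is deliberately left to propagate.

```tcl
# Hypothetical helper: open filename using the first valid password.
# Assumes "package require tclMuPDF" has already been executed.
proc openWithPasswords {filename passwords} {
    foreach pw $passwords {
        try {
            return [mupdf::open $filename -password $pw]
        } trap {MUPDF WRONGPASSWORD} {} {
            continue   ;# wrong password: try the next one
        }
    }
    error "no valid password for $filename"
}
```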

In interactive mode, you should not provide an explicit password; if the PDF is password-protected, MuPDF will ask for a password. Depending on the nature of your application (with or without Tk), mupdf will select and call a predefined (yet customizable) helper procedure. These predefined, built-in procedures can be changed with the mupdf::cli_passwordhelper or mupdf::tk_passwordhelper command.

   # template for a (shell-like) password-helper procedure
   proc mypswdhelper {filename} {
       ...  ask for a password ....
       .... get the password ....
       ....
       return $password    
   }
  mupdf::cli_passwordhelper mypswdhelper

   # template for a (GUI) password-helper procedure
   proc myGUIpswdhelper {filename} {
       ...  raise some popup ...
       ...  ask for a password ....
       .... get the password ...
       .... close the popup
       return $password    
   }
  mupdf::tk_passwordhelper myGUIpswdhelper
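
A possible way to fill in the shell-like template is sketched below. promptPassword is a hypothetical name; the sketch reads one line from stdin and does not hide the typed characters.

```tcl
# Hypothetical password-helper for shell-like applications.
proc promptPassword {filename} {
    puts -nonewline stderr "Password for [file tail $filename]: "
    flush stderr
    gets stdin password
    return $password
}

# register it once the package is loaded:
#   mupdf::cli_passwordhelper promptPassword
```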

Removing protection from password-protected files

If you open a password-protected file and export it, by default the resulting PDF is saved without a password. You can explicitly control this behavior by appending the -decrypt option to the export or close commands (default is -decrypt true).

Adding or changing a password

After opening a PDF, you can add or change the user/owner password with the opwd or the upwd commands, as in the following example:

  set pdf [mupdf::open mybook.pdf]
   # if mybook.pdf is protected, insert the password

  $pdf upwd "my user-password"
  $pdf opwd "my owner-password"
  $pdf close  ;# ..or $pdf export ...

Better control for MuPDF internal errors/warnings and repaired PDFs

In many cases MuPDF (the core library) is able to work with bad/corrupted PDFs and, if the errors are not critical, the 'repaired' PDF is opened as a good PDF. By default MuPDF prints all these errors/warnings on stderr; if the error is critical MuPDF raises an error (and so does tclMuPDF), but if the malformed PDF can be automatically repaired, no error is raised and just some messages are printed on stderr.

With the previous tclMuPDF release (2.1) we completely suppressed all these (non-critical) warning-messages generated by the core MuPDF. This was a useful feature because many of those messages were annoying and rarely useful, but some cases should not be ignored, because the automatically repaired PDF may still be inconsistent or damaged in some other parts. Starting with tclMuPDF 2.1.1 we added some new commands and methods for better control of these warning-messages:

  • mupdf::printwarnings
  • pdfObj warnings
  • pdfObj resetwarnings
  • pdfObj wasrepaired
By using "mupdf::printwarnings" you can enable/disable the printing of warning-messages. By default printing on stderr is disabled, but if you set "mupdf::printwarnings true" then all subsequent errors/warnings will be printed on stderr.

Even if printing is globally disabled, all the MUPDF warning-messages are still recorded and they can be inspected with the command "pdfObj warnings". Note that each opened document pdfObj has its own list of warnings. The command "pdfObj resetwarnings" resets all the warnings bound to the pdfObj document; of course, when the document is closed/deleted, all its related warnings are deleted. Finally, the "pdfObj wasrepaired" command can be used for checking if pdfObj was automatically repaired, so you can decide to quit or keep on working with the repaired pdfObj.
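
As a sketch of the decision described above, the following helper reports whether a document was opened without repairs and, if not, dumps the recorded warnings. The name checkRepaired is hypothetical; only the wasrepaired, warnings and resetwarnings methods are used.

```tcl
# Hypothetical helper: return 1 if pdf needed no repair, else print
# the recorded warnings on stderr, reset them and return 0.
proc checkRepaired {pdf} {
    if {![$pdf wasrepaired]} {
        return 1
    }
    foreach w [$pdf warnings] {
        puts stderr "  $w"
    }
    $pdf resetwarnings
    return 0
}

# usage:
#   set pdf [mupdf::open /mydir/sample.pdf]
#   if {![checkRepaired $pdf]} { $pdf quit }
```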

Notes and limitations about field names and values

Due to current limitations of the underlying MuPDF core library, field names and values should be limited to the Latin alphabet. See the note attached to MuPDF's bug 696478 (Nov 2018):

     ... "There is, however, a limitation with our form filling support that still needs to be addressed: it only supports PDFDocEncoding, not Unicode.
     ... "We can currently fill out the form fields in any language, but the display code still only handles the Latin alphabet."

This means that you can enter text in a form field in any alphabet (tested with Russian, Chinese, Greek, ...) but the rendered text for non-Latin alphabets will appear wrong.

Another major limitation is the rendering of fields other than 'text' widgets, i.e. checkboxes, radio buttons, ... For these kinds of widgets the support in the underlying library is still incomplete, in particular for values containing non-ASCII characters (e.g. accented letters).
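
Given this limitation, it can be useful to check a value before writing it into a field. The helper below is a plain-Tcl sketch, not part of tclMuPDF; its name isLatin1Text is made up for this example. It only tests whether every character fits in the Latin-1 range (code points up to 255), which is a rough approximation of PDFDocEncoding and not an exact match.

```tcl
# Returns 1 if every character of $text has a code point <= 255
# (Latin-1), 0 otherwise. This only approximates PDFDocEncoding.
proc isLatin1Text {text} {
    foreach ch [split $text ""] {
        if {[scan $ch %c] > 255} {
            return 0
        }
    }
    return 1
}

# Possible use before filling a field:
#   if {![isLatin1Text $value]} {
#       puts stderr "value may be rendered incorrectly"
#   }
```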

Text-Extraction & Text-Search Flags

Extracting text from a PDF is still something of a black art. Different PDF tools give different results, based on their own internal rules and assumptions. MuPDF's text extraction is far from perfect (MuPDF itself calls it an experimental feature), but starting with release 2.3 tclMuPDF provides explicit flags for tuning the text-extraction logic. They control how text is handled with respect to whitespace, hyphenation and ligatures. The following basic text-extraction command

        $pageObj text ..other options.. 

is equivalent to

        $pageObj text -mode {preserve-ligatures preserve-whitespace} ..other options..

Similarly, the following command for creating a text-search object

        set searchObj [$pdfObj newsearch]

is equivalent to

        set searchObj [$pdfObj newsearch -mode {preserve-whitespace dehyphenate}]

In general, text-extraction/search flags are combined in a list passed to the -mode option. Be aware that if you specify the -mode option, you must list all the flags you want. These are the currently available flags, for both text-extraction and text-search:

  • preserve-ligatures If set, ligatures are passed through to the application in their original form. Otherwise, ligatures are expanded into their constituent parts, e.g. the ligature ffi is expanded into three separate characters f, f and i. MuPDF supports the following 7 ligatures: ff, fi, fl, ffi, ffl, ft, st.
  • preserve-whitespace If set, whitespace is passed through in its original form. Otherwise, any type of horizontal whitespace (including horizontal tabs) is replaced with space characters of variable width.
  • inhibit-spaces If set, MuPDF will not try to add missing space characters where there are large gaps between characters. PDF producers often do not emit a space to advance to the next character; they give the position of the next character directly.
  • dehyphenate Ignore hyphens at line ends and join with the next line. If set, text extraction returns joined text lines (or spans) with the ending hyphen of the first line removed. So two separate spans "first meth-" and "od leads to wrong results" on different lines are joined into one span "first method leads to wrong results", with correspondingly updated bboxes: the characters of the resulting span will no longer have identical y-coordinates.
  • preserve-spans If set, spans on the same line are not merged. Each line will thus be a span of text with the same font, color, and size.
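
Since -mode requires the full list of flags, adding one flag to the default behaviour of "pageObj text" means repeating the defaults as well. The helper below is a sketch of that idea; the name textWithExtraFlags is made up for this example, and the default flags are the ones given above for the plain "pageObj text" command.

```tcl
# Extracts the text of a page using the default flags of "pageObj text"
# plus the flags listed in $extraFlags (duplicates are skipped).
proc textWithExtraFlags {pageObj extraFlags} {
    # defaults of "$pageObj text", as documented above
    set mode {preserve-ligatures preserve-whitespace}
    foreach flag $extraFlags {
        if {$flag ni $mode} {
            lappend mode $flag
        }
    }
    return [$pageObj text -mode $mode]
}

# Possible use (pageObj obtained from an opened document):
#   set txt [textWithExtraFlags $pageObj {dehyphenate}]
```

Writing "$pageObj text -mode dehyphenate" instead would, by the rule above, drop preserve-ligatures and preserve-whitespace.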

Adding/removing pages

tclMuPdf 2.4 introduced a few basic commands for adding/deleting/moving pages in a PDF document. These commands are:

  • pdfObj addpage ..
  • pdfObj deletepage ..
  • pdfObj deletepages ..
  • pdfObj movepage ..

These commands can be used both on existing documents and on new documents created with mupdf::new.

Please don't confuse the new deletepage method with the old closepage method: closepage simply 'closes' an opened page, whilst deletepage 'removes' a page from a loaded document. New pages (those created with the addpage method) are initially empty (blank pages). You can add images, annotations, empty fields for digital signatures, or you can graft a page from another PDF and embed it in the newly added page.

Example of how to add a 'cover' to an existing document:

        set doc [mupdf::open "mydoc.pdf"]
        set page [$doc addpage 0] ;# A4 size
        $page addimage "mycover.jpg"
        $doc close ;#  -- document saved
        
        # if you want to do a "save as..", don't call "$doc close", but rather
        # $doc export "mydoc-with-cover.pdf"
        # $doc quit ;# -- quit without saving 

Example of how to copy a page into a new PDF document:

  proc savePageAsPdf { pageObj filename } {
        # create a new empty pdf
        set newPdf [mupdf::new $filename]
        # add a new (blank) page - its size should be equal to the size of the original page
        $newPdf addpage 0 -size {*}[$pageObj size]
        # now newPdf has just 1 (blank) page

        # graft and embed in page 0
        set graftId [$newPdf graft $pageObj]
        $newPdf embed $graftId 0
        $newPdf close    ;# close & save !
  }

  set doc [mupdf::open "myFullDoc.pdf"]
  # get the 4th page from doc and save it in a new pdf named "page-4.pdf"
  #  --  note: since the first page is page 0, the 4th page is page #3
  savePageAsPdf [$doc getpage 3] "page-4.pdf"

Limitations

Full support for portfolio management is still incomplete; commands for adding/removing/reordering embedded files will require a more robust mupdf-core implementation.

From mupdf 1.x to MuPDF 2.x

New commands and methods

pdfObj removepassword
pdfObj ispageopened
pdfObj closepage
pdfObj flatten
pdfObj grafts
mupdf::classes
mupdf::classinfo obj

Changed commands and methods

package require mupdf
package require mupdf-notk
With release 2.x these package names are no longer valid. Use:
  • package require tclMuPDF ;# replacement for "mupdf-notk"
  • package require tkMuPDF ;# replacement for "mupdf"
  • package require MuPDF ;# replacement for "mupdf"
pdfObj openedpages
now returns a list of page-numbers. Formerly it returned a list of pageObj
pdfObj close options
options -flatten and -decrypt are no longer supported. If you need to flatten the fields or remove the password, call the new "pdfObj flatten" or "pdfObj removepassword" methods before pdfObj close.
pdfObj flattenfield ...
has been replaced by "pdfObj flatten ..."
pdfObj addsigfield fieldname pageNum x0 y0 x1 y1
the parameter order and signature have changed
pdfObj embed ...
the parameter order and signature have changed
pdfObj export ...
options -flatten and -decrypt have been removed. If you need to flatten the fields or remove the password, call the new "pdfObj flatten" or "pdfObj removepassword" methods before pdfObj export.
pageObj annot delete annotID ...
replaces "pageObj annot annotID delete"
pageObj annot ...
Formerly colors were expressed as a triple {R G B} (0 <= R,G,B <= 1.0). Now colors should be expressed as tk-colors (e.g. "lightblue" or #223312).
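
Scripts written for 1.x that still hold colors as {R G B} triples can convert them with a few lines of plain Tcl. The helper below is a sketch, not part of tclMuPDF; the name rgbTripleToTkColor is made up for this example. It maps each component from the range 0..1 to 0..255 and builds a #RRGGBB string, which is one of the forms accepted for tk-colors.

```tcl
# Converts an old-style {R G B} triple (each component in 0..1)
# into a "#rrggbb" tk-color string.
proc rgbTripleToTkColor {rgb} {
    lassign $rgb r g b
    foreach c [list $r $g $b] {
        if {![string is double -strict $c] || $c < 0.0 || $c > 1.0} {
            error "component \"$c\" is not a number in the range 0..1"
        }
    }
    return [format "#%02x%02x%02x" \
        [expr {round($r * 255)}] \
        [expr {round($g * 255)}] \
        [expr {round($b * 255)}]]
}

# Example:
#   rgbTripleToTkColor {1.0 0 0}    ;# red
```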

Removed commands and methods

32 bit
Removed support for Linux-32bit and Windows-32 bit
package require mupdf
package require mupdf-notk
pdfObj flattenfield ...
Renamed as "pdfObj flatten ..."
pdfObj canbesavedincrementally
Still present but deprecated.
pdfObj search ...
Use the "pdfObj newsearch" command and the mupdf::TextSearch methods.
pageObj embed ...
Replaced by "pdfObj embed ..."
mupdf::quit pdfObj
mupdf::close pdfObj
mupdf::close pageObj
mupdf::type obj
Use "mupdf::classinfo obj"

KEYWORDS

PDF

CATEGORY

pdf parsing and rendering

COPYRIGHT

 Copyright (c) 2024, by A.Buratti