Version 3 of pandoc

Updated 2021-04-23 20:42:15 by TR

(TR 2021-04-22)

Pandoc is a universal document converter that converts between a large number of document formats. In particular, it can convert markdown to any of the supported output formats. Pandoc has its own extended markdown syntax (see also commonmark).

One of its strengths is the use of external filters. A document in one format is first converted to a native Pandoc format, forming a so-called 'abstract syntax tree' (AST). This AST can be output in JSON format. A filter (some external program) may read this JSON output and change it as desired. Then the filter returns the modified JSON representation, and Pandoc takes over again and converts it to the output format. E.g., the following call of Pandoc will read a markdown file (here called input.md), apply the filter myfilter and finally write the document in MS Word format (docx, i.e. Office Open XML) to output.docx:

pandoc -s -t docx -o output.docx --filter myfilter input.md

The filter program can be written in any language. It just has to be an executable reading from stdin and writing to stdout. So, we can use Tcl to manipulate the AST. The use cases are manifold. E.g., such a filter could evaluate the Tcl code included in code sections of a markdown document and add the results of the evaluations (much like tmdoc::tmdoc). Or, a code section could contain an image described in pikchr, which the filter renders and inserts in place of the actual code. Or, a filter could extract the hierarchy of section headers and produce an outline from it, listing the length of each section in words and chars, and include that as an annex to the document being converted.

This is a minimal example of a myfilter.tcl which just passes the AST through unchanged and can serve as a skeleton (note that the file must be executable):

#! /usr/bin/env tclsh

# read the JSON AST from stdin
set jsonData {}
# gets returns -1 at end of input; an empty line returns 0 and must not stop the loop
while {[gets stdin line] >= 0} {
   append jsonData $line
}

# and give it back as is (unchanged)
puts $jsonData

When we want to do something with the AST, it is easiest to parse it with the json package from tcllib, change the resulting dict representation, and convert it back to JSON. However, the last step is not trivial: a Tcl dict does not record whether a value was a JSON string, array or object, so the json::write package cannot automagically restore the original JSON types:

#! /usr/bin/env tclsh

package require json
package require json::write

# read the JSON AST from stdin
set jsonData {}
# gets returns -1 at end of input; an empty line returns 0 and must not stop the loop
while {[gets stdin line] >= 0} {
   append jsonData $line
}

set astDict [::json::json2dict $jsonData]

# do some processing of the data and then ...

# give it back as json again (the following is not trivial and needs extensive coding :-()
puts [::json::write object ... ...]
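Since the type information is gone after json2dict, every value has to be written back with an explicit json::write call that states its JSON type. The following is only a minimal sketch of that idea, assuming the json and json::write packages from tcllib. The helper names strNode and paraNode are made up for this illustration, and only the two node types Str and Para are covered:

```tcl
package require json
package require json::write

# compact output: no indentation, no key alignment
::json::write indented 0
::json::write aligned 0

# hypothetical helper: a Str inline, i.e. an object with "t" and a string "c"
proc strNode {text} {
    return [::json::write object \
        t [::json::write string Str] \
        c [::json::write string $text]]
}

# hypothetical helper: a Para block whose "c" is an array of inlines,
# each of which must already be formatted JSON
proc paraNode {inlines} {
    return [::json::write object \
        t [::json::write string Para] \
        c [::json::write array {*}$inlines]]
}

set para [paraNode [list [strNode Hello]]]
puts $para
```

A complete filter would need one such writer per node type it touches, which is where the extensive coding mentioned above comes from.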

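This is what a minimal AST looks like (the numbers in pandoc-api-version depend on the installed Pandoc; the ones shown here are only an example):

{"pandoc-api-version":[1,22],"meta":{},"blocks":[{"t":"Para","c":[]}]}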

The AST JSON is one object with the three elements (key-value pairs) *pandoc-api-version*, *meta* and *blocks*. The main part is the *blocks* element, which itself contains an array of objects where each object is one part of the document (in this case just an empty paragraph, "Para").
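Once the JSON has been parsed, the three top-level elements can be reached with ordinary dict commands. The following sketch assumes the json package from tcllib; the sample AST string, including its version numbers, is made up for illustration and not taken from a real Pandoc run:

```tcl
package require json

# illustrative AST: one paragraph holding the single word "Hello"
set jsonData {{"pandoc-api-version":[1,22],"meta":{},"blocks":[{"t":"Para","c":[{"t":"Str","c":"Hello"}]}]}}

set astDict [::json::json2dict $jsonData]

# the top-level keys of the AST object
puts [dict keys $astDict]

# each block is an object with a type "t" and, for most types, a content "c"
foreach block [dict get $astDict blocks] {
    puts "block type: [dict get $block t]"
}
```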

The complete documentation of the AST format is part of the Haskell documentation of the pandoc-types package (Haskell is the programming language Pandoc is written in). The most interesting element for us is the "blocks" element. It can contain the following block elements:

Block element   Description
Plain           Plain text, not a paragraph
Para            Paragraph
LineBlock       Multiple non-breaking lines
CodeBlock       Code block (literal) with attributes
RawBlock        Raw block
BlockQuote      Block quote (list of blocks)
OrderedList     Ordered list (attributes and a list of items, each a list of blocks)
BulletList      Bullet list (list of items, each a list of blocks)
DefinitionList  Definition list. Each list item is a pair consisting of a term (a list of inlines) and one or more definitions (each a list of blocks)
Header          Header - level (integer) and text (inlines)
HorizontalRule  Horizontal rule
Table           Table, with attributes, caption, optional short caption, column alignments and widths (required), table head, table bodies, and table foot
Div             Generic block container with attributes
Null            Nothing
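As an illustration of working with one of these block types, the outline use case mentioned earlier can be sketched by picking the Header blocks out of the parsed AST. This is only a sketch under several assumptions: the json package from tcllib is available, the helper names inlinesToText and outline are invented for this example, the sample document is made up, only the inline types Str and Space are turned into text, and headers nested inside containers such as Div or BlockQuote are not visited:

```tcl
package require json

# hypothetical helper: flatten a list of inlines to plain text,
# handling only Str and Space and skipping every other inline type
proc inlinesToText {inlines} {
    set text {}
    foreach inline $inlines {
        switch -- [dict get $inline t] {
            Str   { append text [dict get $inline c] }
            Space { append text " " }
        }
    }
    return $text
}

# hypothetical helper: return one indented line per top-level Header block
proc outline {astDict} {
    set lines {}
    foreach block [dict get $astDict blocks] {
        if {[dict get $block t] ne "Header"} continue
        # Header content: level, attributes, list of inlines
        lassign [dict get $block c] level attr inlines
        set indent [string repeat {  } [expr {$level - 1}]]
        lappend lines "$indent[inlinesToText $inlines]"
    }
    return $lines
}

# made-up sample document with two headers and one paragraph
set jsonData {{"pandoc-api-version":[1,22],"meta":{},"blocks":[
  {"t":"Header","c":[1,["intro",[],[]],[{"t":"Str","c":"Intro"}]]},
  {"t":"Para","c":[{"t":"Str","c":"Some"},{"t":"Space"},{"t":"Str","c":"text"}]},
  {"t":"Header","c":[2,["first-steps",[],[]],[{"t":"Str","c":"First"},{"t":"Space"},{"t":"Str","c":"steps"}]]}
]}}

puts [join [outline [::json::json2dict $jsonData]] \n]
```

Because this filter only reads the AST, it sidesteps the problem of writing the JSON back; a real filter would still have to return the (unchanged) input on stdout and write the outline elsewhere.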