Version 1 of pandoc

Updated 2021-04-22 20:37:22 by SEH

(TR 2021-04-22)

Pandoc is a universal document converter that converts between a large number of document formats. In particular, it can convert markdown to any of the supported output formats. Pandoc has its own extended markdown syntax (see also commonmark).

One of its strengths is the use of external filters. A document in the input format is first converted to Pandoc's native representation, a so-called 'abstract syntax tree' (AST), which can be output in JSON format. A filter (some external program) reads this JSON, changes it as desired, and returns the modified JSON representation; Pandoc then takes over again and converts it to the output format. E.g., the following call of Pandoc reads the markdown file input.md, applies the filter myfilter and finally writes the document in MS Word format (docx, i.e. OOXML) as output.docx:

pandoc -s input.md -t docx -o output.docx --filter myfilter
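For debugging a filter it can help to run the three stages by hand. According to the Pandoc manual, a --filter call is equivalent to piping the JSON AST through the filter, with the target format passed to the filter as its first argument. The following is only a sketch of that idea; the proc name filterPipeline is made up for this page, and ./myfilter.tcl is assumed to be an executable script in the current directory:

```tcl
# Build the three-stage pipeline that corresponds to
#   pandoc -s INPUT -t FORMAT -o OUTPUT --filter FILTER
# (filterPipeline is a hypothetical helper, not part of pandoc or Tcl)
proc filterPipeline {input filter format output} {
    return [list \
        pandoc $input -t json \
        | $filter $format \
        | pandoc -s -f json -t $format -o $output]
}

# To actually run it (needs pandoc on the PATH):
#   exec {*}[filterPipeline input.md ./myfilter.tcl docx output.docx]
```

Running only the first stage (pandoc input.md -t json) shows the JSON that the filter will receive on stdin.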

The filter program can be written in any language. It just has to be an executable that reads from stdin and writes to stdout. So, we can use Tcl to manipulate the AST. This is a minimal example of a myfilter.tcl which just passes the AST through unchanged, serving as a skeleton (note that the file must be executable, and that the name given to --filter must match the file name, e.g. --filter ./myfilter.tcl):

#! /usr/bin/env tclsh

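# read the JSON AST from stdin (pandoc exchanges JSON as UTF-8)
fconfigure stdin -encoding utf-8
fconfigure stdout -encoding utf-8
# read everything at once: a "gets ... > 0" loop would stop at the first empty line
set jsonData [read stdin]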


# and give it back as is (unchanged)
puts $jsonData

When we want to do something with the AST, it is easiest to use the json package from tcllib, change the resulting dict representation, and convert it back to JSON. However, the last step is not trivial: Tcl values carry no type information, so the json::write package cannot automagically restore the original JSON types (arrays, objects, strings, numbers):

#! /usr/bin/env tclsh

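package require json
# json::write is a separate tcllib package, needed for the output step below
package require json::write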

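# read the JSON AST from stdin (pandoc exchanges JSON as UTF-8)
fconfigure stdin -encoding utf-8
fconfigure stdout -encoding utf-8
# read everything at once: a "gets ... > 0" loop would stop at the first empty line
set jsonData [read stdin]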

set astDict [::json::json2dict $jsonData]

# do some processing of the data and then ...

# give it back as json again (the following is not trivial and needs extensive coding :-()
puts [::json::write object ... ...]
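One possible way around the type problem is to skip json2dict and keep the type next to every value. The following is a minimal sketch of that idea in plain Tcl, not the tcllib API: all proc names (jsonTokens, jsonParse, toJson, paraToPlain, filterJson) are made up for this page. It assumes well-formed JSON as produced by Pandoc, keeps strings in their raw quoted form so that no escaping has to be redone, and, as an example transformation, turns every Para element into a Plain element.

```tcl
# Split JSON text into tokens. String tokens keep their quotes and escapes.
proc jsonTokens {json} {
    set re {[][{}:,]|"(?:[^"\\]|\\.)*"|-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?|true|false|null}
    return [regexp -all -inline -- $re $json]
}

# Recursive descent parser. Returns a tagged node:
#   {array {node ...}}, {object {key node ...}} or {scalar rawToken}
proc jsonParse {tokens idxVar} {
    upvar 1 $idxVar i
    if {$i >= [llength $tokens]} { error "unexpected end of JSON" }
    set tok [lindex $tokens $i]
    incr i
    if {$tok eq "\["} {
        set items {}
        while {[lindex $tokens $i] ne "\]"} {
            lappend items [jsonParse $tokens i]
            if {[lindex $tokens $i] eq ","} { incr i }
        }
        incr i ;# skip the closing bracket
        return [list array $items]
    }
    if {$tok eq "\{"} {
        set members {}
        while {[lindex $tokens $i] ne "\}"} {
            set key [lindex $tokens $i]
            incr i 2 ;# skip the key and the colon
            lappend members $key [jsonParse $tokens i]
            if {[lindex $tokens $i] eq ","} { incr i }
        }
        incr i ;# skip the closing brace
        return [list object $members]
    }
    return [list scalar $tok]
}

# Serialize a tagged node back to compact JSON text.
proc toJson {node} {
    lassign $node type value
    switch -- $type {
        array {
            set parts {}
            foreach item $value { lappend parts [toJson $item] }
            return "\[[join $parts ,]\]"
        }
        object {
            set parts {}
            foreach {k v} $value { lappend parts "$k:[toJson $v]" }
            return "\{[join $parts ,]\}"
        }
        default { return $value }
    }
}

# Example transformation: every "t":"Para" becomes "t":"Plain".
proc paraToPlain {node} {
    lassign $node type value
    switch -- $type {
        array {
            set out {}
            foreach item $value { lappend out [paraToPlain $item] }
            return [list array $out]
        }
        object {
            set out {}
            foreach {k v} $value {
                if {$k eq {"t"} && [lindex $v 0] eq "scalar"
                        && [lindex $v 1] eq {"Para"}} {
                    set v [list scalar {"Plain"}]
                }
                lappend out $k [paraToPlain $v]
            }
            return [list object $out]
        }
        default { return $node }
    }
}

# Parse, transform and serialize in one step.
proc filterJson {json} {
    set i 0
    set tree [jsonParse [jsonTokens $json] i]
    return [toJson [paraToPlain $tree]]
}
```

To use this as a filter, the script could end with puts -nonewline [filterJson [read stdin]], after both channels have been set to UTF-8. Because empty arrays and empty objects keep their tags, they survive the round trip, which is exactly the information that gets lost in the plain dict representation.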

This is what a minimal AST looks like:

{
   "pandoc-api-version":[1,22],
   "meta":{},
   "blocks":[
      {
         "t":"Para",
         "c":[]
      }
   ]
}

The AST JSON is one object with three elements (key-value pairs): *pandoc-api-version*, *meta* and *blocks*. The main part is the *blocks* element, which contains an array of objects where each object is one part of the document (in this case just one empty paragraph, "Para").
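As long as a filter only inspects the AST, the dict representation is sufficient. The following sketch counts the top-level blocks by type; the proc name countBlockTypes is made up for this page, and astDict is assumed to be the result of ::json::json2dict applied to the filter input:

```tcl
# Count the top-level block elements of a pandoc AST by type.
# astDict is assumed to come from ::json::json2dict (tcllib json package).
proc countBlockTypes {astDict} {
    set counts [dict create]
    foreach block [dict get $astDict blocks] {
        # every block is an object with a "t" (type) element
        dict incr counts [dict get $block t]
    }
    return $counts
}
```

For the minimal AST shown above this yields a dict with the single entry Para 1. A filter that reports such numbers should print them to stderr, since stdout has to carry the JSON for Pandoc.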

... to be continued ...