pandoc

(TR 2021-04-22)

Pandoc is a universal document converter that converts between a wide range of formats. In particular, it can convert markdown to any of the supported output formats. Pandoc has its own extended markdown syntax (see also commonmark).

One of its strengths is the use of external filters. A document in one format will first be converted to a native Pandoc format, forming a so-called 'abstract syntax tree' (AST). This can be output in JSON format. A filter (some external program) may read this JSON output and change it to its desired form. Then, the filter returns the modified JSON representation and Pandoc will take over and convert it to another output format. E.g., the following call of Pandoc will read the markdown file input.md, apply the filter myfilter and finally output the document in MS Word format as output.docx (ooxml):

pandoc -s input.md -t docx -o output.docx --filter myfilter

The filter program can be written in any language. It just has to be an executable that reads from stdin and writes to stdout. So, we can use Tcl to manipulate the AST. The use cases are manifold. E.g., such a filter could evaluate the Tcl code included in code sections of a markdown document and add the results of the evaluations (much like tmdoc::tmdoc). Or, a code section could contain an image coded in pikchr, and the rendered image gets inserted instead of the actual code. Or, a filter could extract the hierarchy of section headers and produce an outline from it, listing the length of each section in words and characters, and include that as an annex to the document being converted.

This is a minimal example of a myfilter.tcl which just passes the AST through unchanged, serving as a skeleton (note that the file must be executable):

#! /usr/bin/env tclsh

# read the JSON AST from stdin
# (gets returns -1 at end of file; testing ">= 0" keeps empty lines from ending the loop early)
set jsonData {}
while {[gets stdin line] >= 0} {
   append jsonData $line
}


# and give it back as is (unchanged)
puts $jsonData

When we want to do something with the AST, the obvious approach is to use the json package from tcllib, change the resulting dict representation, and convert it back to JSON. However, the last step is not trivial, as the json::write package cannot automatically restore the original JSON types (arrays, objects):

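#! /usr/bin/env tclsh

package require json
package require json::write

# read the JSON AST from stdin
# (gets returns -1 at end of file; testing ">= 0" keeps empty lines from ending the loop early)
set jsonData {}
while {[gets stdin line] >= 0} {
   append jsonData $line
}

# convert the JSON text into nested Tcl dicts and lists
set astDict [::json::json2dict $jsonData]

# do some processing of the data and then ...

# give it back as json again (the following is not trivial and needs extensive coding :-()
puts [::json::write object ... ...]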

This is what a minimal AST looks like:

{
   "pandoc-api-version":[1,22],
   "meta":{},
   "blocks":[
      {
         "t":"Para",
         "c":[]
      }
   ]
}

The AST JSON is one object with three elements (key-value pairs): *pandoc-api-version*, *meta* and *blocks*. The main part is the *blocks* element, which contains an array of objects where each object is one part of the document (in this case just an empty paragraph, "Para").

The complete documentation of the AST format is on the Haskell documentation page of the pandoc-types package, module Text.Pandoc.Definition (Haskell is the programming language Pandoc is written in). The most interesting element for us is the "blocks" element. It can contain the following block elements:

Block element    Description
Plain            Plain text, not a paragraph
Para             Paragraph
LineBlock        Multiple non-breaking lines
CodeBlock        Code block (literal) with attributes
RawBlock         Raw block
BlockQuote       Block quote (list of blocks)
OrderedList      Ordered list (attributes and a list of items, each a list of blocks)
BulletList       Bullet list (list of items, each a list of blocks)
DefinitionList   Definition list. Each list item is a pair consisting of a term (a list of inlines) and one or more definitions (each a list of blocks)
Header           Header - level (integer) and text (inlines)
HorizontalRule   Horizontal rule
Table            Table, with attributes, caption, optional short caption, column alignments and widths (required), table head, table bodies, and table foot
Div              Generic block container with attributes
Null             Nothing

Let's explore the "Header block". It is used when a header is written in the document. This simple markdown file ...

# Simple header

... translates to this json document:

{
   "pandoc-api-version":[1,22],
   "meta":{},
   "blocks":[
      {
       "t":"Header",
       "c":[
            1,
            ["simple-header",[],[]],
            [
             {"t":"Str","c":"Simple"},
             {"t":"Space"},
             {"t":"Str","c":"header"}
            ]
           ]
      }
   ]
}

So, the header block contains three elements in an array:

  1. a number which defines the header level (1 in this case since we were using one #-sign)
  2. some attributes (empty in this case, apart from a name generated from the section title, which serves as an anchor identifying this specific header, e.g. for cross-referencing)
  3. the text of the section title as an array of 'inline' elements (in this case a string "Str", followed by a "Space", followed by another string)

Now, what does this look like after parsing into Tcl using [::json::json2dict]?

% # read Pandoc json document:
% set astDict [::json::json2dict $jsonData]
pandoc-api-version {1 22} meta {} blocks {{t Header c {1 {simple-header {} {}} {{t Str c Simple} {t Space} {t Str c header}}}}}
% # see that we have three elements in there:
% dict keys $astDict
pandoc-api-version meta blocks
% # extract the first block in the list of blocks:
% set blockList [dict get $astDict blocks]
{t Header c {1 {simple-header {} {}} {{t Str c Simple} {t Space} {t Str c header}}}}
% set firstBlock [lindex [dict get $astDict blocks] 0]
t Header c {1 {simple-header {} {}} {{t Str c Simple} {t Space} {t Str c header}}}
% foreach item $firstBlock {puts $item}
t
Header
c
1 {simple-header {} {}} {{t Str c Simple} {t Space} {t Str c header}}
% # The type of the first block:
% dict get $firstBlock t
Header
% # the content of the first block:
% set firstBlockContent [dict get $firstBlock c]
1 {simple-header {} {}} {{t Str c Simple} {t Space} {t Str c header}}
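Reading values out of this structure needs nothing beyond the usual dict and list commands. The following is only a minimal sketch: the proc name inlinesToText is made up for illustration, and it assumes that the title consists of "Str" and "Space" inlines only (all other inline types are silently skipped).

```tcl
# Hypothetical helper: flatten a list of inline elements into plain text.
# Assumption: only "Str" and "Space" are handled; other inline types are skipped.
proc inlinesToText {inlines} {
    set text ""
    foreach inline $inlines {
        switch -- [dict get $inline t] {
            Str     { append text [dict get $inline c] }
            Space   { append text " " }
            default { }
        }
    }
    return $text
}

# the content of the Header block as shown in the session above
set firstBlockContent {1 {simple-header {} {}} {{t Str c Simple} {t Space} {t Str c header}}}

# split it into its three parts: level, attributes, inline elements
lassign $firstBlockContent level attr inlines
puts "level $level: [inlinesToText $inlines] (id: [lindex $attr 0])"
```

Something along these lines could serve as a starting point for the outline filter mentioned at the beginning.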

OK, this is deeply nested in the JSON, but now we know how we could, for example, manipulate the AST and change the heading level from 1 to 2:

% lset firstBlockContent 0 2
2 {simple-header {} {}} {{t Str c Simple} {t Space} {t Str c header}}
% dict set firstBlock c $firstBlockContent
t Header c {2 {simple-header {} {}} {{t Str c Simple} {t Space} {t Str c header}}}
% lset blockList 0 $firstBlock
{t Header c {2 {simple-header {} {}} {{t Str c Simple} {t Space} {t Str c header}}}}
% dict set astDict blocks $blockList
pandoc-api-version {1 22} meta {} blocks {{t Header c {2 {simple-header {} {}} {{t Str c Simple} {t Space} {t Str c header}}}}}

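That was not too hard, although it took quite some effort to change just one number. Now we only need to convert the Tcl dict back into proper JSON. This is the non-trivial part and would require substantial effort: the dict no longer records which values were JSON arrays, objects, strings or numbers, so that knowledge has to be supplied by hand for every element type.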
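To give an impression of that effort, here is a rough sketch that writes back just the single Header block from above using json::write from tcllib. The proc names headerToJson and inlineToJson are invented for this illustration, and the sketch assumes that the inlines carry plain string content and that the key-value attributes are empty. Every other block and inline type would need its own hand-written conversion.

```tcl
package require json::write

# compact output on one line, as Pandoc does not need pretty-printing
json::write indented 0
json::write aligned 0

# Hypothetical helper: one inline element, e.g. {t Str c Simple} or {t Space}.
# Assumption: the content "c", if present, is a plain string.
proc inlineToJson {inline} {
    set pairs [list t [json::write string [dict get $inline t]]]
    if {[dict exists $inline c]} {
        lappend pairs c [json::write string [dict get $inline c]]
    }
    return [json::write object {*}$pairs]
}

# Hypothetical helper: a Header block in dict form back to JSON.
proc headerToJson {block} {
    lassign [dict get $block c] level attr inlines
    lassign $attr id classes keyvals
    set classJson  [lmap cls $classes {json::write string $cls}]
    set inlineJson [lmap i $inlines {inlineToJson $i}]
    # key-value attributes are assumed to be empty and written as []
    set attrJson [json::write array \
        [json::write string $id] \
        [json::write array {*}$classJson] \
        [json::write array]]
    return [json::write object \
        t [json::write string Header] \
        c [json::write array $level $attrJson [json::write array {*}$inlineJson]]]
}

set firstBlock {t Header c {2 {simple-header {} {}} {{t Str c Simple} {t Space} {t Str c header}}}}
puts [headerToJson $firstBlock]
```

Note how the code has to know at every position whether a value is a number, a string, an array or an object.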

But we can have this much easier using the rl_json package!

Using rl_json

The rl_json package is able to read and manipulate json data without converting to a Tcl structure. We can keep the json object and manipulate it directly. So, the above code for reading and changing the Pandoc AST will get much shorter, easier to read and very effective:

% package require rl_json
0.11.0
% namespace import rl_json::json

% # read Pandoc json document:
% set AST {{
   "pandoc-api-version":[1,22],
   "meta":{},
   "blocks":[{"t":"Header","c":[1,["simple-header",[],[]],[{"t":"Str","c":"Simple"},{"t":"Space"},{"t":"Str","c":"header"}]]}]
}}
% # see that we have three elements in there:
% json keys $AST
pandoc-api-version meta blocks
% # extract the first block in the list of blocks:
% json get $AST blocks 0
t Header c {1 {simple-header {} {}} {{t Str c Simple} {t Space} {t Str c header}}}
% # The type of the first block:
% json get $AST blocks 0 t
Header
% # the content of the first block:
% json get $AST blocks 0 c
1 {simple-header {} {}} {{t Str c Simple} {t Space} {t Str c header}}

% # now change the header level from 1 to 2
% json set AST blocks 0 c 0 2
{"pandoc-api-version":[1,22],"meta":{},"blocks":[{"t":"Header","c":[2,["simple-header",[],[]],[{"t":"Str","c":"Simple"},{"t":"Space"},{"t":"Str","c":"header"}]]}]}

Wow, that was easy, and we get a full Pandoc AST back in correct JSON format that we can output directly as the result of a filter. So, this could be a complete (yet overly simple) filter for Pandoc:

#! /usr/bin/env tclsh

package require rl_json
namespace import ::rl_json::json

# read the JSON AST from stdin
# (gets returns -1 at end of file; testing ">= 0" keeps empty lines from ending the loop early)
set jsonData {}
while {[gets stdin line] >= 0} {
   append jsonData $line
}

# change the header level of the header from 1 to 2
# (assuming the first block element in the document is a header and has level 1 as in the minimal example markdown document above);
# json set takes the name of the variable and modifies the JSON value in place
json set jsonData blocks 0 c 0 2

# give it back as json again:
puts $jsonData
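One detail of json set is worth keeping in mind: the new value has to be a JSON value itself. The number 2 is valid JSON as it stands, but a Tcl string has to be turned into a JSON string first, for instance with json string. The sketch below illustrates this on the example header and also contrasts json get (which converts to Tcl values) with json extract (which returns the JSON fragment). The new identifier "introduction" is just a made-up example value.

```tcl
package require rl_json
namespace import ::rl_json::json

set AST {{
   "pandoc-api-version":[1,22],
   "meta":{},
   "blocks":[{"t":"Header","c":[1,["simple-header",[],[]],[{"t":"Str","c":"Simple"},{"t":"Space"},{"t":"Str","c":"header"}]]}]
}}

# json get converts the addressed part into Tcl values ...
puts [json get $AST blocks 0 c 1]
# ... while json extract keeps it as a JSON fragment
puts [json extract $AST blocks 0 c 1]

# replace the header identifier (hypothetical new name);
# json string wraps the Tcl string as a JSON string value
json set AST blocks 0 c 1 0 [json string "introduction"]
puts [json get $AST blocks 0 c 1 0]
```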

Now a filter that is a bit more useful. It takes all top-level headers in a Pandoc document and increases their header level by 1:

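#! /usr/bin/env tclsh

package require rl_json
namespace import ::rl_json::json

# read the JSON AST from stdin
# (gets returns -1 at end of file; testing ">= 0" keeps empty lines from ending the loop early)
set jsonData {}
while {[gets stdin line] >= 0} {
   append jsonData $line
}

# walk over the top-level blocks; headers nested inside other blocks are not visited
set blockCount [llength [json get $jsonData blocks]]
for {set i 0} {$i < $blockCount} {incr i} {
   set blockType [json get $jsonData blocks $i t]
   if {$blockType eq "Header"} {
      set headerLevel [json get $jsonData blocks $i c 0]
      # json set modifies the variable jsonData in place
      json set jsonData blocks $i c 0 [expr {$headerLevel + 1}]
   }
}

# give the modified document back to Pandoc again:
puts $jsonData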

... to be continued ...