A transformable markup document format

Ashley Noel Hinton
ahin017@aucklanduni.ac.nz

Department of Statistics, The University of Auckland

28 July 2016

Many technologies now exist for writing a document in a format that can be transformed into various output formats for sharing. This report proposes that using markup, rather than one of the Markdown languages, is a good fit for writing flexible human- and machine-readable transformable documents. We propose a transformable markup document format written in XML. We also provide several examples of how the document format could be transformed.

Why a markup format

When writing reports and articles for publication in journals, books, online or otherwise, an author will employ one of many document formats to describe the typographical and structural details of her document. Common formats for research publication include the LaTeX system for producing printed documents, and HTML for producing online documents. These and other formats allow an author to have low-level control over a certain type of output, but usually do not offer control over multiple outputs.

The Pandoc document converter can be used to allow an author to write a document in one format and then convert to another. For example an author could create an HTML document and use Pandoc to generate a LaTeX version of this document. An author will generally write a document using the Pandoc Markdown format, and then use Pandoc to convert the document to one (or several) formats for publication.

Pandoc Markdown is a variant of Markdown, a lightweight markup language written as a simpler way to author HTML documents. Markdown is intended to be readable without conversion, and as such is written so as not to appear to contain markup tags or formatting. While this makes it easy for an author to write a document with common formatting, it makes it very difficult to have fine-grained control over the eventual output document(s).

As well as needing control over the appearance and structure of a document, an author may also need to embed software code to be executed to produce the final version of the document. Various tools exist for executing code embedded in a literate document to produce the desired output. For example, an author can embed R code in her document and use the Knitr package to execute the R code and produce a final document. An author embeds R code in another document format, for example HTML or LaTeX, and processes the document to produce an output document in the same format.

The R Markdown package combines the dynamic document processing of Knitr with the common authoring format of Pandoc Markdown. An author can write a document in Pandoc Markdown which contains embedded R code to be executed. The author can then execute the code and produce an output document for publication in one of the many formats available through Pandoc document conversion.

It is clear that a document author has many options, and the implementation of R Markdown in the RStudio IDE is an indication that many developers in the R world at least see Markdown as a very useful authoring tool. The following use cases will serve to highlight some of the drawbacks of authoring in Markdown. These drawbacks will inform the design of the transformable document solution described in this report.

Limitations of Markdown

While Markdown can be very useful for authoring simple documents quickly (Markdown's creators took inspiration from the formatting conventions of plain text emails), an author also makes several sacrifices in choosing Markdown over a markup language like HTML. Two examples of the limitations of Markdown are the creation of lists and tables. List creation in Pandoc Markdown is described below. Table creation is left as an exercise for the reader.

In Pandoc Markdown a simple list is created by prepending a “*”, “+”, or “-” character to the beginning of each list item, as in the following example:

  * one
  * two
  * three

If an author wishes to create an embedded list, she must use the four space rule to indent each embedded list.

  * outer list 1
      - inner list 1
      - inner list 2
  * outer list 2

At even two levels this process is already quite awkward. As list membership depends on whitespace it can be a frustrating exercise trying to make changes to a complex list, let alone author a list in the first place.

In contrast, an author using HTML markup can indicate an unordered list using a <ul> element, with <li> elements for each item, as in the following code:

<ul>
  <li>one</li>
  <li>two</li>
  <li>three</li>
</ul>

Similarly, an embedded list simply nests the same structure inside a list structure, as in the following code:

<ul>
  <li>outer list 1
    <ul class="subList">
      <li>inner list 1</li>
      <li>inner list 2</li>
    </ul>
  </li>
  <li>outer list 2</li>
</ul>

The author does not have to count white space, and it is trivial to make changes to this list structure.

Markup like HTML also gives a document author more control over the output than Markdown. The author of the HTML list example above has used the class attribute to indicate the inner list belongs to the “subList” class. Using class an author can apply output styles or perform other actions on subsets of elements. When a document authored in Markdown is transformed to HTML its lists will be marked up using the same HTML list tags as above, but an author of a Markdown document does not have access to these class attributes for customising output. She could include raw HTML in her Pandoc Markdown document, but this limits the output types to those using HTML.
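
As a rough illustration of how a class attribute lets a tool act on a subset of elements, the following sketch uses Python's standard xml.etree.ElementTree module to pick out only the list marked with the "subList" class. The choice of Python is ours and is not part of the original examples; any XML-aware tool could do the same.

```python
import xml.etree.ElementTree as ET

# The nested list from the example above, held as a string.
HTML_LIST = """<ul>
  <li>outer list 1
    <ul class="subList">
      <li>inner list 1</li>
      <li>inner list 2</li>
    </ul>
  </li>
  <li>outer list 2</li>
</ul>"""

root = ET.fromstring(HTML_LIST)

# Select only the lists marked with class="subList".
sub_lists = root.findall(".//ul[@class='subList']")

# Collect the text of every item inside those lists.
sub_items = [li.text.strip() for ul in sub_lists for li in ul.findall("li")]

print(sub_items)  # ['inner list 1', 'inner list 2']
```

The selection depends only on element names and attributes, so re-indenting the source has no effect on the result.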

Another example of the limitations of Markdown is demonstrated by embedded code chunks. A document author has various methods for embedding chunks of code in her document. For example, an author of a Knitr HTML or LaTeX document can enclose code chunks to be executed in specially formatted comments in the respective document languages. A document author can embed R code in a Knitr HTML document as in the following code:

<!-- begin.rcode
x <- rnorm(n = 10)
plot(x)
end.rcode-->

Similarly an author can embed R code in a Knitr LaTeX document as in the following code:

%% begin.rcode
% x <- rnorm(n = 10)
% plot(x)
%% end.rcode

A document author using the R Markdown package can enclose R code chunks in special “fenced code” blocks as in the following code:

```{r}
x <- rnorm(n = 10)
plot(x)
```

While these methods for including code make it quick and easy to write a document, they make it more difficult for an author to do extra processing on chunks of code before producing the final document. In contrast, an author creating an HTML document might wrap R code in <code> elements as in the following code:

<code class="R">
  x <- rnorm(n = 10)
  plot(x)
</code>

If an author marked up code chunks in this fashion she could, for example, make use of tools which employ the XPath query language to locate <code> elements and perform transformations. If the author gives R code chunks the class “R”, as in the above example, she could perform transformations on just the R code chunks.
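
A minimal sketch of such a query follows, using the limited XPath support in Python's standard xml.etree.ElementTree module. The surrounding document and the "shell" class are made up for illustration, and the "<" of the R assignment operator is written as &lt; so that the fragment is well-formed XML.

```python
import xml.etree.ElementTree as ET

# A small stand-in document (hypothetical content).
DOC = """<body>
  <p>Some R code:</p>
  <code class="R">x &lt;- rnorm(n = 10)
plot(x)</code>
  <p>Some shell code:</p>
  <code class="shell">ls -l</code>
</body>"""

root = ET.fromstring(DOC)

# Every <code> element, regardless of class.
all_chunks = root.findall(".//code")

# Only the <code> elements whose class attribute is exactly "R".
r_chunks = root.findall(".//code[@class='R']")

# An example "transformation" applied to just the R chunks:
# collect their source text, e.g. to write it to a script file.
r_source = [chunk.text for chunk in r_chunks]
print(r_source)
```

The same path expression, .//code[@class='R'], could be used as the match pattern of an XSLT template or passed to another XPath-aware tool.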

If an author uses Pandoc Markdown to write a document she can include raw HTML or raw TeX language elements to control the document output. These raw code sections are only processed by Pandoc when creating the associated output formats, and would otherwise be ignored. There is no simple method for creating custom sections or formats within Pandoc Markdown.

While an author using HTML is also unable to expect new and custom elements to be recognised by a web browser, the fact that HTML can be written as well-formed XML (as in XHTML) means an author can invent her own XML elements for document writing. These custom elements could then be processed using an XML transformation tool like XSL Transformations to convert the custom elements to valid HTML code. A markup document format has the benefit of providing a simplified authoring format without sacrificing fine control when required.

Markdown has proven itself to be very useful for document authors, and it is not the suggestion of this report that a markup format replace Markdown entirely. Rather, this report proposes that in situations where Markdown is not powerful enough for a document author, a markup format like the one described in the next section might provide the solution. Importantly, a well designed markup document should allow an author to recover a Markdown document as output, thus providing a readable plain text document. While a format like Markdown is designed to satisfy a set of known transformations, a format based on markup can also satisfy future unknown transformations, e.g. extracting subsets of elements.

The idea of using markup as the basis for authoring documents has been championed before by Deb Nolan and Duncan Temple Lang (e.g., in their book XML and Web Technologies for Data Sciences with R and the XDynDocs package for R). The proposal made in this report is essentially a much simplified approach that aims to provide a lower barrier to entry.

A `document` markup format

In the previous section we described some of the limitations in the Markdown document authoring format. It is our proposal that an authoring format based in XML provides more control and flexibility when authoring a document. When an author uses a Markdown format she is limited to the formatting tags and transformations found in Markdown; similarly an author using HTML markup is limited to HTML tags. Authoring a document in XML, however, permits an author not only to include all of the tags and transformations afforded by HTML, but also any customised tags or transformations she may require. A document author is free to invent new personalised tags to suit her current document transformation needs. In this section we describe such a custom document markup format.

The transformable document format described in this report is an XML file with document as the root element. The document element has two child elements: metadata and body.

The metadata element contains the document metadata, with elements for the document title and subtitle, author information, date of publication, and a description section. An example metadata element follows:

<metadata>
  <title>Today should be a holiday</title>
  <author>
    <name>Ashley Noel Hinton</name>
    <email>ahin017@aucklanduni.ac.nz</email>
  </author>
  <date>25 December 2015</date>
</metadata>
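
The structure described above can be checked mechanically. The following is a minimal sketch in Python (standard library only); the function name check_document and the placeholder body text are ours, and the check covers only the root element and its two required children.

```python
import xml.etree.ElementTree as ET

DOC = """<document>
  <metadata>
    <title>Today should be a holiday</title>
    <author>
      <name>Ashley Noel Hinton</name>
      <email>ahin017@aucklanduni.ac.nz</email>
    </author>
    <date>25 December 2015</date>
  </metadata>
  <body>
    <p>Body text goes here.</p>
  </body>
</document>"""


def check_document(root):
    """Return a list of structural problems; empty if none were found."""
    problems = []
    if root.tag != "document":
        problems.append("root element is not <document>")
    for child in ("metadata", "body"):
        if root.find(child) is None:
            problems.append("missing <%s> child" % child)
    return problems


root = ET.fromstring(DOC)
problems = check_document(root)

# Metadata values are reached by simple paths from the root element.
title = root.findtext("metadata/title")
author = root.findtext("metadata/author/name")

print(problems, title, author)
```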

The body element contains the document's main content. The following elements are used in the same way as they are used in HTML (https://www.w3.org/TR/html-markup/elements.html):

a – hyperlink
code – code fragment
em – emphatic stress
figcaption – figure caption
figure – figure with optional caption
h1 – heading
h2 – heading
h3 – heading
img – image
li – list item
ol – ordered list
p – paragraph
pre – preformatted text
q – quoted text
section – section
strong – strong importance
ul – unordered list

The <url> element is introduced in the document format to indicate a hyperlink where the enclosed URL is both the href and the value. The following code block demonstrates the use of the url element:

<ul>
  <li>modular</li>
  <li>reusable</li>
  <li>shareable</li>
  <li><url>https://github.com/anhinton/conduit</url></li>
</ul>

The resulting output:

modular
reusable
shareable
https://github.com/anhinton/conduit
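
To make the meaning of the url element concrete, here is a small sketch of the rewrite it implies, written in Python with the standard xml.etree.ElementTree module. The function name expand_urls is ours; the report's own transformation is done with XSLT, shown in a later section.

```python
import xml.etree.ElementTree as ET

SOURCE = """<ul>
  <li>modular</li>
  <li>reusable</li>
  <li>shareable</li>
  <li><url>https://github.com/anhinton/conduit</url></li>
</ul>"""


def expand_urls(root):
    """Rewrite every <url> element in place as an HTML <a> hyperlink."""
    # Take a snapshot first, because the tags are changed while looping.
    for url in list(root.iter("url")):
        address = (url.text or "").strip()
        url.tag = "a"              # <url> becomes <a>
        url.set("href", address)   # the enclosed URL is the link target
        url.text = address         # and also the visible link text
    return root


root = expand_urls(ET.fromstring(SOURCE))
html = ET.tostring(root, encoding="unicode")
print(html)
```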

The document XML format uses <code> elements to indicate blocks of computer code, just as in HTML. Dynamic code chunks which are to be executed are marked using the class attribute of the <code> element. For example, chunks of R code which are to be executed using the Knitr package are wrapped in a <code> element with class="knitr". An author can also provide a name attribute for the knitr code chunk, as well as knitr options in an options attribute. A document author can also use CDATA sections to wrap code that contains reserved XML characters. The following code demonstrates how to include an R code chunk to be executed with Knitr:

<code class="knitr" name="knitrDemo" options="tidy=FALSE"><![CDATA[x <- rnorm(n = 10)
mean(x)]]></code>

And the following is the result of executing this code chunk:

x <- rnorm(n = 10)
mean(x)

## [1] 0.3505599
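
The reason for the CDATA section can be seen by trying to parse the same chunk with and without it. The sketch below uses Python's standard XML parser purely as a convenient well-formedness check; it does not run the R code.

```python
import xml.etree.ElementTree as ET

R_CODE = "x <- rnorm(n = 10)\nmean(x)"

# Without CDATA (or escaping), the "<" in "<-" is read as the start of a tag,
# so the document is not well-formed XML.
try:
    ET.fromstring("<code>" + R_CODE + "</code>")
    bare_parses = True
except ET.ParseError:
    bare_parses = False

# Wrapped in a CDATA section, the same code parses and is returned unchanged.
chunk = ET.fromstring(
    '<code class="knitr" name="knitrDemo" options="tidy=FALSE">'
    "<![CDATA[" + R_CODE + "]]></code>"
)

print(bare_parses)                               # False
print(chunk.text == R_CODE)                      # True
print(chunk.get("name"), chunk.get("options"))   # knitrDemo tidy=FALSE
```

Writing the operator as &lt;- would also work, but CDATA keeps the R code readable in the source document.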

The document format also makes use of the include element from the XInclude namespace (http://www.w3.org/2001/XInclude) to include XML content from external files. This allows document authors to embed other documents which may be authored separately from the main document. There is no simple method of doing this directly in either HTML or Pandoc Markdown.

The next section describes some simple transformations which can be performed on the document markup format using freely available open source tools. This report was itself written in the document markup format; the source code is available at report.xml.

Transforming the `document` markup format

This section describes how freely available open source tools can be used to transform the document markup format in different ways. The examples include creating an HTML document, preparing R code chunks for execution using Knitr, processing XInclude elements, and some steps towards creating a PDF document. The command line tool xsltproc is used in the following examples, but many other tools are available to do the transformations described.

1. Transforming to HTML

The document markup format can be easily transformed into HTML using XSL Transformations (https://www.w3.org/TR/1999/REC-xslt-19991116). XSLT stylesheets are XML documents which describe how another XML document can be transformed. They can be used to produce new XML documents, such as HTML, or other plain text formats. The command line XSLT processor xsltproc (http://www.xmlsoft.org/) can be used to apply an XSLT stylesheet to an XML document to produce an HTML output document.

As a large part of the document format is based on HTML already, we do not have to describe very many transformations in our XSLT stylesheet. We will need to transform the metadata section of the document to HTML head elements and an appropriate title section. We also need to transform our custom <url> elements to HTML hyperlinks.

The full XSLT stylesheet used in this example can be found at examples/documentToHtml.xsl. The XSLT code used to transform url elements to HTML hyperlinks is as follows:

<xsl:template match="url">
  <xsl:element name="a">      
    <xsl:attribute name="href">
      <xsl:value-of select="node()"/>
    </xsl:attribute>
    <xsl:value-of select="node()"/>
  </xsl:element>
</xsl:template>

The source document examples/toHtml.xml contains the following XML:

<?xml version="1.0" encoding="UTF-8"  ?>
<document>
  <metadata>
    <title>Today should be a holiday</title>
    <author>
      <name>Ashley Noel Hinton</name>
      <email>ahin017@aucklanduni.ac.nz</email>
    </author>
    <date>25 December 2015</date>
  </metadata>

  <body>
    <p>For many years I have believed that 25 December should be a
    public holiday, and I am now prepared to provide evidence for
    this.</p>

    <ol>
      <li>There aren't any other holidays in December.</li>
      <li>Schools are usually closed anyway.</li>
    </ol>

    <p>More information can be found at
    <url>https://en.wikipedia.org/wiki/December_25</url>.</p>
  </body>
</document>

This document can be transformed to HTML using the following call to xsltproc:

xsltproc -o examples/toHtml.html examples/documentToHtml.xsl examples/toHtml.xml

The resulting HTML document can be viewed at examples/toHtml.html.
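
The metadata part of this transformation can be pictured with the following Python sketch. It is not a port of examples/documentToHtml.xsl: the particular mapping (a title element, an author meta element, and an h1 heading followed by author and date paragraphs) is our assumption about what "head elements and an appropriate title section" might look like, and the source is a shortened copy of the example document without its url element.

```python
import xml.etree.ElementTree as ET

SOURCE = """<document>
  <metadata>
    <title>Today should be a holiday</title>
    <author>
      <name>Ashley Noel Hinton</name>
      <email>ahin017@aucklanduni.ac.nz</email>
    </author>
    <date>25 December 2015</date>
  </metadata>
  <body>
    <p>For many years I have believed that 25 December should be a
    public holiday, and I am now prepared to provide evidence for
    this.</p>
    <ol>
      <li>There aren't any other holidays in December.</li>
      <li>Schools are usually closed anyway.</li>
    </ol>
  </body>
</document>"""


def to_html(document):
    """Build an <html> element from a <document> element (assumed mapping)."""
    title = document.findtext("metadata/title", default="")
    author = document.findtext("metadata/author/name", default="")
    date = document.findtext("metadata/date", default="")

    html = ET.Element("html")
    head = ET.SubElement(html, "head")
    ET.SubElement(head, "title").text = title
    ET.SubElement(head, "meta", {"name": "author", "content": author})

    body = ET.SubElement(html, "body")
    ET.SubElement(body, "h1").text = title
    ET.SubElement(body, "p").text = author
    ET.SubElement(body, "p").text = date

    # The body content is already HTML-like, so it is carried over unchanged.
    body.extend(list(document.find("body")))
    return html


page = to_html(ET.fromstring(SOURCE))
print(ET.tostring(page, encoding="unicode"))
```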

2. Process `xi:include` elements

The document markup format uses XInclude elements (http://www.w3.org/2001/XInclude) to embed text from external documents. The documents referenced in these elements can be processed and embedded directly into the output document using the command line tool xsltproc (http://www.xmlsoft.org/).

We will use the same XSLT stylesheet we used in the previous example. The source document examples/processXinclude.xml contains the following XML:

<?xml version="1.0" encoding="UTF-8"  ?>
<document xmlns:xi="http://www.w3.org/2001/XInclude">
  <metadata>
    <title>Today should be a holiday</title>
    <author>
      <name>Ashley Noel Hinton</name>
      <email>ahin017@aucklanduni.ac.nz</email>
    </author>
    <date>25 December 2015</date>
  </metadata>

  <body>
    <p>For many years I have believed that 25 December should be a
    public holiday, and I am now prepared to provide evidence for
    this.</p>

    <xi:include href="evidenceList.xml" parse="xml"/>

    <p>More information can be found at
    <url>https://en.wikipedia.org/wiki/December_25</url>.</p>
  </body>
</document>

The element <xi:include href="evidenceList.xml" parse="xml"/> indicates that the XML contained in examples/evidenceList.xml is to be included in the output document.

We will add the --xinclude option to our call to xsltproc to process the XInclude elements when we do our transformation:

xsltproc --xinclude -o examples/processXinclude.html examples/documentToHtml.xsl  examples/processXinclude.xml

The resulting HTML document can be viewed at examples/processXinclude.html.
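
The inclusion step on its own can be sketched with Python's standard xml.etree.ElementInclude module, as an alternative to xsltproc for readers who want to see what XInclude processing does to the tree. The contents given for evidenceList.xml are a hypothetical stand-in (the real file is not reproduced in this report), and the loader reads from an in-memory dictionary rather than the file system so that the example is self-contained.

```python
import xml.etree.ElementTree as ET
from xml.etree import ElementInclude

# Hypothetical stand-in for examples/evidenceList.xml.
FILES = {
    "evidenceList.xml": """<ol>
  <li>There aren't any other holidays in December.</li>
  <li>Schools are usually closed anyway.</li>
</ol>""",
}

SOURCE = """<document xmlns:xi="http://www.w3.org/2001/XInclude">
  <body>
    <p>Evidence follows.</p>
    <xi:include href="evidenceList.xml" parse="xml"/>
  </body>
</document>"""


def load_from_memory(href, parse, encoding=None):
    """Loader that resolves hrefs against FILES instead of the file system."""
    if parse != "xml":
        raise ValueError("only parse='xml' is handled in this sketch")
    return ET.fromstring(FILES[href])


root = ET.fromstring(SOURCE)

# Replace each xi:include element, in place, with the XML it refers to.
ElementInclude.include(root, loader=load_from_memory)

items = [li.text for li in root.findall("body/ol/li")]
print(items)
```

Omitting the loader argument makes ElementInclude read the referenced files from disk instead.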

3. Subsetting elements: prepare R code chunks for Knitr

The Knitr package lets document authors embed chunks of R code in special comment code and execute these chunks to produce an output document. The use of comments to indicate code makes it difficult to perform custom actions on R code marked up in this way. The document markup format wraps chunks of R code to be executed by Knitr in <code class="knitr"> elements. This allows an author using the document markup format to perform any operations she likes on chunks of R code. An XSLT stylesheet can be used to transform chunks of code marked up in this fashion into Knitr R code chunks in a Knitr HTML document.

The full XSLT stylesheet used in this example can be found at examples/documentToRhtml.xsl. The XSLT code used to transform <code class="knitr"> elements to Knitr R code chunks is as follows:

<xsl:template match="code[@class='knitr']">
  <xsl:comment><xsl:text>begin.rcode </xsl:text><xsl:value-of select="@name"/><xsl:if test="@options"><xsl:text>, </xsl:text><xsl:value-of select="@options"/></xsl:if>
  <xsl:text>&#xA;</xsl:text>
  <xsl:value-of select="node()"/>
  <xsl:text>&#xA;end.rcode</xsl:text></xsl:comment>
</xsl:template>
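
For readers less familiar with XSLT, the following Python sketch shows the string this template is intended to build for one chunk. It is our reading of the template rather than a tested equivalent of examples/documentToRhtml.xsl, and the function name to_rhtml_chunk is hypothetical.

```python
import xml.etree.ElementTree as ET


def to_rhtml_chunk(code):
    """Return the Knitr HTML comment for one <code class="knitr"> element."""
    # Chunk header: "begin.rcode", the chunk name, then any options.
    header = "begin.rcode " + code.get("name", "")
    options = code.get("options")
    if options:
        header += ", " + options
    # The R code sits on its own lines between the header and "end.rcode".
    return "<!--" + header + "\n" + (code.text or "") + "\nend.rcode-->"


chunk = ET.fromstring(
    '<code class="knitr" name="knitrDemo" options="tidy=FALSE">'
    "<![CDATA[x <- rnorm(n = 10)\nmean(x)]]></code>"
)

rhtml = to_rhtml_chunk(chunk)
print(rhtml)
```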

The source document examples/knitrChunk.xml contains the following XML:

<?xml version="1.0" encoding="UTF-8"  ?>
<document>
  <metadata>
    <title>Plotting in R</title>
    <author>
      <name>Ashley Noel Hinton</name>
      <email>ahin017@aucklanduni.ac.nz</email>
    </author>
    <date>25 December 2015</date>
  </metadata>

  <body>
    <p>A plot to celebrate 25 December:</p>

    <code class="knitr"><![CDATA[x <- rnorm(n = 10)
plot(x)]]></code>

    <p>More information can be found at
    <url>https://en.wikipedia.org/wiki/December_25</url>.</p>
  </body>
</document>

This document can be transformed to Knitr HTML using the following call to xsltproc:

xsltproc -o examples/knitrChunk.Rhtml examples/documentToRhtml.xsl examples/knitrChunk.xml

The resulting Knitr HTML document can be viewed at examples/knitrChunk.Rhtml.

4. Extended transformations

The previous three examples have shown single transformations on documents written in the document markup format. Authors of document files are not limited to just one transformation, however. In the previous example we demonstrated how a document author can convert a document with marked up chunks of R code into a document which can then be processed using the Knitr package in R. The Knitr HTML document produced in the previous example can be converted to HTML using the following code in R:

library(knitr)
oldwd <- setwd("examples")
knit(input = "knitrChunk.Rhtml")
setwd(oldwd)

The resulting HTML document can be viewed at examples/knitrChunk.html.

This output document is the result of two transformation steps, using two different tools (xsltproc, and the Knitr package in R), each producing an output document. The document format can be used in this way to author documents that require several transformation steps, and several transformation tools. The result of each transformation can be provided as a source document for the following transformation. In the following discussion we explore how a document author might manage multiple transformations on a document.

Discussion

The document markup format provides a reasonably simple authoring format in which an author can write documents for one or several output formats. Like Markdown, the document format allows an author to target HTML and PDF as output formats, among others. Unlike Markdown the document format is not limited to a known set of transformations. Adding a custom transformation to the document format is as easy as using a custom XML element while authoring—the transformation of this custom element could then be defined in an XSLT stylesheet, for example.

The kinds of transformations available to an author using the document format are as many and diverse as those available when using XML. This report has described two basic transformations:

Transformation to HTML using an XSLT stylesheet and the xsltproc command line tool.
Incorporating external XML files using XInclude's <xi:include> elements, processed with the xsltproc command line tool.

This report has also demonstrated how transformations can be applied to subsets of elements, as in the example where chunks of R code wrapped in <code class="knitr"> elements were transformed into Knitr R code chunks using an XSLT stylesheet and xsltproc. The ability to perform transformations on subsets of elements using the document markup format allows an author much finer control over the output produced than when using Markdown. For instance, an author may wish to produce a teacher's and a student's copy of a document, with some sections only visible to the teacher. Subsetting elements would allow an author to produce a student's and a teacher's output from the same source document; this would likely require separate source documents if done using Markdown.
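
The teacher's and student's copies could be produced along the following lines. This is only a sketch in Python: the audience="teacher" attribute is a hypothetical convention that is not part of the document format described in this report, and the sample content is invented.

```python
import copy
import xml.etree.ElementTree as ET

SOURCE = """<document>
  <body>
    <p>Question: what is the mean of 1, 2 and 3?</p>
    <p audience="teacher">Answer: 2.</p>
    <section audience="teacher"><p>Marking notes.</p></section>
    <p>Hand in your answer on Friday.</p>
  </body>
</document>"""


def student_copy(root):
    """Return a deep copy of root without elements marked audience="teacher"."""
    result = copy.deepcopy(root)
    # Snapshot the elements first, because children are removed while looping.
    for parent in list(result.iter()):
        for child in list(parent):
            if child.get("audience") == "teacher":
                parent.remove(child)
    return result


teacher = ET.fromstring(SOURCE)   # the full document is the teacher's copy
student = student_copy(teacher)   # the subset is the student's copy

student_text = [p.text for p in student.iter("p")]
print(student_text)
```

Both copies come from one source document, so a correction to a question only has to be made once.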

It is of course also possible to subset and to perform similar transformations on Markdown documents using regular expressions—an author using a Markdown format could, for example, include custom tags in her document indicating custom transformations or subsets. Finding and transforming specific XML elements is made easy by using existing XML query tools like XPath. While custom tags and subsets in a Markdown document could be found and transformed using regular expressions this would place a greater burden on transformation authors, and would be much easier to get wrong.

It is worth noting that an author using a markup format like the document format described in this report is making some sacrifices in terms of simplicity of authoring in order to gain greater control over transformations. Authoring in Markdown allows an author to “format” her source document in such a way that the source can also be read as an output document. Authoring in XML makes a source document less immediately readable. For example, though Markdown list formatting can be fiddly to manage as an author it has the advantage of appearing like list output. A markup list format, like that found in HTML, consists of many list tags which are intended to tell a computer that the content is a list, not to be read by a human. In this respect an author has to know more “code” to author using a markup format than when using a Markdown format.

While only HTML and Knitr HTML output were demonstrated in this report, transformation to other formats is of course possible. One popular format for sharing articles and reports is the PDF format. One method for producing a PDF from the document format is to use the Pandoc document converter on the HTML output produced in the first example. The transformation can be performed using the following call to Pandoc:

pandoc -s -o examples/toHtml.pdf examples/toHtml.html

The resulting PDF document can be viewed at examples/toHtml.pdf. If a document author wanted to have greater control over the PDF produced she might instead use an XSLT stylesheet and xsltproc to transform the document to the LaTeX format. This could then be transformed to PDF using a tool like pdflatex.

The production of PDF and other formats, and the secondary transformation example in this report, where a Knitr HTML file produced from a document is transformed into HTML, demonstrate how a document author may want to perform multiple transformations, employing multiple tools. For example, an author may wish to do all of the following to a document:

Merge XML from external documents indicated by <xi:include> elements.
Convert the document to Knitr HTML.
Process the code chunks in the Knitr HTML to produce an HTML output.

A potential method for handling a pipeline of such transformations is the OpenAPI architecture (Introducing OpenAPI, OpenAPI version 0.6). A document author could describe each of the transformation steps as an OpenAPI module, and use these modules to describe the entire transformation in an OpenAPI pipeline.

Using an OpenAPI pipeline to describe a document transformation doesn't just provide a quick method of performing all transformations at once—wrapping transformation code in OpenAPI modules also provides a means by which the author of a transformation pipeline, or anyone else, can modify and extend the transformation. For example, if a user of a transformation pipeline following the steps listed above wished to perform another transformation between step 1 and step 2, she would only need to create a module which took the output produced in step 1 as input, and produced the input required by step 2 as output—this new module could then be placed into a copy of the pipeline. The production of a teacher's and student's version of the HTML output could be produced by branching the pipeline to process a subset of elements in the appropriate way. Similarly, a different output format—PDF, for example—could be produced by adding the appropriate transformation modules to the pipeline while still producing the original HTML output.
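
The idea of slotting a new step between two existing ones can be illustrated without any pipeline software at all. The sketch below is plain Python and does not use OpenAPI or Conduit; the step functions, the class="draft" convention and the sample document are all hypothetical, and each step simply takes the previous step's output as its input.

```python
import xml.etree.ElementTree as ET
from functools import reduce

SOURCE = """<document>
  <metadata><title>Today should be a holiday</title></metadata>
  <body>
    <p>Finished paragraph.</p>
    <p class="draft">Unfinished paragraph.</p>
  </body>
</document>"""


def parse(text):
    """Step: XML text in, element tree out."""
    return ET.fromstring(text)


def title_to_heading(root):
    """Step: copy the metadata title into an <h1> at the top of the body."""
    heading = ET.Element("h1")
    heading.text = root.findtext("metadata/title", default="")
    root.find("body").insert(0, heading)
    return root


def drop_drafts(root):
    """Step: remove elements marked class="draft" (a hypothetical convention)."""
    for parent in list(root.iter()):
        for child in list(parent):
            if child.get("class") == "draft":
                parent.remove(child)
    return root


def serialise(root):
    """Step: element tree in, XML text out."""
    return ET.tostring(root, encoding="unicode")


def run_pipeline(steps, source):
    """Pass the output of each step to the next one, in order."""
    return reduce(lambda value, step: step(value), steps, source)


pipeline = [parse, title_to_heading, serialise]

# A new step is placed between two existing ones in a copy of the pipeline.
extended = pipeline[:2] + [drop_drafts] + pipeline[2:]

original_output = run_pipeline(pipeline, SOURCE)
extended_output = run_pipeline(extended, SOURCE)
print(extended_output)
```

The new step only has to accept what the step before it produces and return what the step after it expects, which is the property the modular pipeline relies on.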

Summary

In this report we have described how an XML authoring format provides document authors with greater control over document transformations than popular Markdown formats allow. We have described a simple XML document format that includes features of standard HTML formatting as well as custom elements for document transformation. We have shown several examples of how the document format can be used in common document transformations. The document format described is a good candidate for documents which require multiple transformations, allowing an author to employ various tools and to produce multiple output documents in multiple formats.

Technical requirements

Conduit version 0.6-3, a prototype OpenAPI glue system R package, was used to produce the final version of this report (https://github.com/anhinton/conduit/releases/tag/v0.6-3).
Knitr version 1.12.3, an R package, was used for the transformations in this report (http://yihui.name/knitr/).
Pandoc version 1.16.02 was used for the transformations in this report (http://pandoc.org).
R version 3.3.1 was used for the transformations in this report (https://www.r-project.org/).
All of the transformations described in this report were produced on a machine running Ubuntu 16.04 LTS 64-bit (http://www.ubuntu.com/).
xmllint using libxml version 20903 was used in the transformations which produced this report (http://www.xmlsoft.org/).
xsltproc using libxml 20903, libxslt 10128 and libexslt 817 was used for the transformations in this report (http://www.xmlsoft.org/).

Resources

The transformation to HTML example uses the source document examples/toHtml.xml, and the XSLT stylesheet examples/documentToHtml.xsl.
The processing XInclude elements example uses the source document examples/processXinclude.xml, and the XML file examples/evidenceList.xml.
The subsetting elements example uses the source document examples/knitrChunk.xml, and the XSLT stylesheet examples/documentToRhtml.xsl.
The extended transformation example uses the output document examples/knitrChunk.Rhtml produced by the third example as its source document.
This report was produced using an OpenAPI pipeline executed with Conduit version 0.6-3. The source document is available at report.xml. The transformation pipeline can be found at transform/toHtml/pipeline.xml—the pipeline's modules are at transform/toHtml/convertToRhtml.xml, transform/toHtml/xinclude.xml, transform/toHtml/knitToHtml.xml, and transform/toHtml/substituteEntities.xml. The pipeline result object can be found at toHtml.tar.gz. The R script used to execute this pipeline is available at transform.R.

A transformable markup document format by Ashley Noel Hinton and Paul Murrell is licensed under a Creative Commons Attribution 4.0 International License.