A transformable markup document format
28 July 2016
Many technologies now exist for writing a document in a format
that can be transformed into various output formats for
sharing. This report proposes that using markup, rather than one
of the Markdown languages, is a good fit for writing flexible
human- and machine-readable transformable documents. We propose
a transformable markup document
format written in
XML. We also provide several examples of how the
document
format could be transformed.
Why a markup format
When writing reports and articles for publication in journals, books, online or otherwise, an author will employ one of many document formats to describe the typographical and structural details of her document. Common formats for research publication include the LaTeX system for producing printed documents, and HTML for producing online documents. These and other formats allow an author to have low-level control over a certain type of output, but usually do not offer control over multiple outputs.
The Pandoc document converter can be used to allow an author to write a document in one format and then convert to another. For example an author could create an HTML document and use Pandoc to generate a LaTeX version of this document. An author will generally write a document using the Pandoc Markdown format, and then use Pandoc to convert the document to one (or several) formats for publication.
Pandoc Markdown is a variant of Markdown, a lightweight markup language written as a simpler way to author HTML documents. Markdown is intended to be readable without conversion, and as such is written so as not to appear to contain markup tags or formatting. While this makes it easy for an author to write a document with common formatting, it makes it very difficult to have fine-grained control over the eventual output document(s).
As well as needing control over the appearance and structure of a document, an author may also need to embed software code to be executed to produce the final version of the document. Various tools exist for executing code embedded in a literate document to produce the desired output. For example, an author can embed R code in her document and use the Knitr package to execute the R code and produce a final document. An author embeds R code in another document format, for example HTML or LaTeX, and processes the document to produce an output document in the same format.
The R Markdown package combines the dynamic document processing of Knitr with the common authoring format of Pandoc Markdown. An author can write a document in Pandoc Markdown which contains embedded R code to be executed. The author can then execute the code and produce an output document for publication in one of the many formats available through Pandoc document conversion.
It is clear that a document author has many options, and the implementation of R Markdown in the RStudio IDE is an indication that many developers, in the R world at least, see Markdown as a very useful authoring tool. The following use cases will serve to highlight some of the drawbacks to authoring in Markdown. These drawbacks will inform the design of the transformable document solution described in this report.
Limitations of Markdown
While Markdown can be very useful for authoring simple documents quickly (Markdown's creators took inspiration from the formatting conventions of plain text emails), an author also makes several sacrifices in choosing Markdown over a markup language like HTML. Two examples of the limitations of Markdown are the creation of lists and of tables. List creation in Pandoc Markdown is described below. Table creation is left as an exercise for the reader.
In Pandoc Markdown a simple list is created by prepending a “*”, “+”, or “-” character to the beginning of each list item, as in the following example:
* one
* two
* three
If an author wishes to create an embedded list, she must use the four space rule to indent each embedded list.
* outer list 1
    - inner list 1
    - inner list 2
* outer list 2
At even two levels this process is already quite awkward. As list membership depends on whitespace it can be a frustrating exercise trying to make changes to a complex list, let alone author a list in the first place.
In contrast, an author using HTML markup can indicate an
unordered list using a <ul> element, with <li> elements for
each item, as in the following code:
<ul>
<li>one</li>
<li>two</li>
<li>three</li>
</ul>
Similarly, an embedded list simply nests the same structure inside a list structure, as in the following code:
<ul>
  <li>outer list 1
    <ul class="subList">
      <li>inner list 1</li>
      <li>inner list 2</li>
    </ul>
  </li>
  <li>outer list 2</li>
</ul>
The author does not have to count white space, and it is trivial to make changes to this list structure.
Markup like HTML also gives a document author more control
over the output than Markdown. The author of the HTML list
example above has used the class
attribute to
indicate the inner list belongs to the “subList”
class. Using class
an author can apply output
styles or perform other actions on subsets of elements. When a
document authored in Markdown is transformed to HTML its lists
will be marked up using the same HTML list tags as above, but an
author of a Markdown document does not have access to these
class
attributes for customising output. She could
include raw HTML in her Pandoc Markdown document, but this
limits the output types to those using HTML.
Another example of the limitations of Markdown is demonstrated by embedded code chunks. A document author has various methods for embedding chunks of code in her document. An author of a Knitr HTML or LaTeX document can enclose code chunks to be executed in specially formatted comments in the respective document languages. For example, a document author can embed R code in a Knitr HTML document as in the following code:
<!-- begin.rcode
x <- rnorm(n = 10)
plot(x)
end.rcode-->
Similarly an author can embed R code in a Knitr LaTeX document as in the following code:
%% begin.rcode
% x <- rnorm(n = 10)
% plot(x)
%% end.rcode
A document author using the R Markdown package can enclose R code chunks in special “fenced code” blocks as in the following code:
```{r}
x <- rnorm(n = 10)
plot(x)
```
While these methods for including code make it quick and easy
to write a document, they make it more difficult for an author
to do extra processing on chunks of code before producing a
final document. In contrast, an author creating an HTML document
might wrap R code in <code> elements as in the following code:
<code class="R">
x <- rnorm(n = 10)
plot(x)
</code>
If an author marked up code chunks in this fashion she could,
for example, make use of tools which employ the XPath query
language to locate <code> elements and perform transformations.
If the author gives R code chunks the class “R”, as in the above
example, she could perform transformations on just the R code
chunks.
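As a rough illustration of locating code chunks by class, the following Python sketch uses the standard library's ElementTree module, which supports a subset of XPath. The sample fragment and the second `class="python"` chunk are assumptions made up for this example; they do not come from the report's source files.

```python
import xml.etree.ElementTree as ET

# A small HTML-like fragment, assumed for illustration only.
SOURCE = """<body>
<p>Some text.</p>
<code class="R">x &lt;- rnorm(n = 10)
plot(x)</code>
<code class="python">print("not R")</code>
</body>"""

root = ET.fromstring(SOURCE)

# ElementTree supports a subset of XPath; this predicate selects
# only <code> elements whose class attribute equals "R".
r_chunks = root.findall(".//code[@class='R']")

for chunk in r_chunks:
    print(chunk.text)
```

Any XPath-aware tool could perform the same selection; the point is only that the class attribute gives the query something precise to match.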
If an author uses Pandoc Markdown to write a document she can include raw HTML or raw TeX language elements to control the document output. These raw code sections are only processed by Pandoc when creating the associated output formats, and would otherwise be ignored. There is no simple method for creating custom sections or formats within Pandoc Markdown.
While an author using HTML is also unable to expect new and custom elements to be recognised by a web browser, the fact that HTML can be written as well-formed XML (as in XHTML) means an author can invent her own XML elements for document writing. These custom elements could then be processed using an XML transformation tool like XSL Transformations to convert the custom elements to valid HTML code. A markup document format has the benefit of providing a simplified authoring format without sacrificing fine control when required.
Markdown has proven itself to be very useful for document authors, and it is not the suggestion of this report that a markup format replace Markdown entirely. Rather, this report proposes that in situations where Markdown is not powerful enough for a document author, a markup format like the one described in the next section might provide the solution. Importantly, a well designed markup document should allow an author to recover a Markdown document as output, thus providing a readable plain text document. While a format like Markdown is designed to satisfy a set of known transformations, a format based on markup can also satisfy future unknown transformations, e.g. extracting subsets of elements.
The idea of using markup as the basis for authoring documents has been championed before by Deb Nolan and Duncan Temple Lang (e.g., in their book XML and Web Technologies for Data Sciences with R and the XDynDocs package for R). The proposal made in this report is essentially a much simplified approach that aims to provide a lower barrier to entry.
A document markup format
In the previous section we described some of the limitations
in the Markdown document authoring format. It is our proposal
that an authoring format based in XML provides more control and
flexibility when authoring a document. When an author uses a
Markdown format she is limited to the formatting tags and
transformations found in Markdown; similarly an author using
HTML markup is limited to HTML tags. Authoring a document in
XML, however, permits an author not only to include all of the
tags and transformations afforded by HTML, but also any
customised tags or transformations she may require. A
document
author is free to invent new personalised
tags to suit her current document transformation needs. In this
section we describe such a custom document
markup
format.
The transformable document format described in this report is
an XML file with document as the root element. This
document has two child elements: metadata and body.
The metadata element contains the document metadata,
with elements for the document title and subtitle, author
information, date of publication, and a description
section. An example metadata element follows:
<metadata>
<title>Today should be a holiday</title>
<author>
<name>Ashley Noel Hinton</name>
<email>ahin017@aucklanduni.ac.nz</email>
</author>
<date>25 December 2015</date>
</metadata>
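To make the structure concrete, here is a minimal Python sketch that reads the metadata fields from a document of this shape using the standard library. The function name `read_metadata` and the returned dictionary keys are hypothetical choices for this example, not part of the format.

```python
import xml.etree.ElementTree as ET

DOC = """<document>
  <metadata>
    <title>Today should be a holiday</title>
    <author>
      <name>Ashley Noel Hinton</name>
      <email>ahin017@aucklanduni.ac.nz</email>
    </author>
    <date>25 December 2015</date>
  </metadata>
  <body/>
</document>"""

def read_metadata(xml_text):
    """Return the metadata fields of a document as a plain dict."""
    root = ET.fromstring(xml_text)
    if root.tag != "document":
        raise ValueError("root element must be <document>")
    meta = root.find("metadata")
    if meta is None:
        raise ValueError("<document> has no <metadata> child")
    return {
        "title": meta.findtext("title"),
        "subtitle": meta.findtext("subtitle"),  # None when absent
        "author": meta.findtext("author/name"),
        "email": meta.findtext("author/email"),
        "date": meta.findtext("date"),
    }

metadata = read_metadata(DOC)
print(metadata["title"])
```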
The body
element contains the document's main
content. The following elements are used in the same way as they
are used in HTML
(https://www.w3.org/TR/html-markup/elements.html):
- a – hyperlink
- code – code fragment
- em – emphatic stress
- figcaption – figure caption
- figure – figure with optional caption
- h1 – heading
- h2 – heading
- h3 – heading
- img – image
- li – list item
- ol – ordered list
- p – paragraph
- pre – preformatted text
- q – quoted text
- section – section
- strong – strong importance
- ul – unordered list
The <url>
element is introduced in the
document
format to indicate a hyperlink where the
enclosed URL is both the href and the value. The following code
block demonstrates the use of the url
element:
<ul>
<li>modular</li>
<li>reusable</li>
<li>shareable</li>
<li><url>https://github.com/anhinton/conduit</url></li>
</ul>
The resulting output:
- modular
- reusable
- shareable
- https://github.com/anhinton/conduit
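The behaviour of the url element can be sketched in Python as a small tree rewrite: each url element becomes an HTML hyperlink whose href and visible text are both the enclosed URL. This is only an illustration using the standard library; the helper name `urls_to_links` is hypothetical.

```python
import xml.etree.ElementTree as ET

SOURCE = """<ul>
<li>modular</li>
<li><url>https://github.com/anhinton/conduit</url></li>
</ul>"""

def urls_to_links(root):
    """Rewrite every <url> element in place as an HTML <a> element."""
    # Collect matches first so renaming tags cannot disturb the iteration.
    for url in list(root.iter("url")):
        target = (url.text or "").strip()
        url.tag = "a"             # rename the element
        url.set("href", target)   # the URL becomes the link target
        url.text = target         # and also the visible link text
    return root

html = ET.tostring(urls_to_links(ET.fromstring(SOURCE)), encoding="unicode")
print(html)
```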
The document XML format uses <code> elements to indicate blocks
of computer code, just as in HTML. Dynamic code chunks which are
to be executed are marked using the class attribute of code. For
example, chunks of R code which are to be executed using the
Knitr package are wrapped in a <code> element with
class="knitr". An author can also provide a name attribute for
the knitr code chunk, as well as knitr options in an options
attribute. A document author can also use CDATA sections to wrap
code containing reserved XML characters. The following code
demonstrates how to include an R code chunk to be executed with
Knitr:
<code class="knitr" name="knitrDemo" options="tidy=FALSE"><![CDATA[x <- rnorm(n = 10)
mean(x)]]></code>
And the following is the result of executing this code chunk:
x <- rnorm(n = 10)
mean(x)
## [1] 0.3505599
The document
format also makes use of the
include
element from XInclude
(http://www.w3.org/2001/XInclude) namespace to
include XML content from external files. This allows
document
authors to embed other documents which may
be authored separately from the main document. There is no
simple method of doing this directly in either HTML or Pandoc
Markdown.
The next section describes some simple transformations which
can be performed on the document markup format using freely
available open source tools. This report was itself written in
the document markup format; the source code is available at
report.xml.
Transforming the document markup format
This section describes how freely available open source tools
can be used to transform the document markup format in different
ways. The examples include creating an HTML document, preparing
R code chunks for execution using Knitr, processing XInclude
elements, and some steps towards creating a PDF document. The
command line tool xsltproc is used in the following examples,
but many other tools are available to do the transformations
described.
1. Transforming to HTML
The document markup format can be easily transformed into HTML
using XSL Transformations
(https://www.w3.org/TR/1999/REC-xslt-19991116). XSLT stylesheets
are XML documents which describe how another XML document can be
transformed. They can be used to produce new XML documents, HTML
documents, or other plain text formats. The command line XSLT
processor xsltproc (http://www.xmlsoft.org/) can be used to
apply an XSLT stylesheet to an XML document to produce an HTML
output document.
As a large part of the document format is based on HTML
already, we do not have to describe very many transformations in
our XSLT stylesheet. We will need to transform the metadata
section of the document to HTML head elements and an appropriate
title section. We also need to transform our custom <url>
elements to HTML hyperlinks.
The full XSLT stylesheet used in this example can be found at
examples/documentToHtml.xsl. The XSLT code used to
transform url
elements to HTML hyperlinks is as
follows:
<xsl:template match="url">
<xsl:element name="a">
<xsl:attribute name="href">
<xsl:value-of select="node()"/>
</xsl:attribute>
<xsl:value-of select="node()"/>
</xsl:element>
</xsl:template>
The source document
examples/toHtml.xml contains the following XML:
<?xml version="1.0" encoding="UTF-8" ?>
<document>
<metadata>
<title>Today should be a holiday</title>
<author>
<name>Ashley Noel Hinton</name>
<email>ahin017@aucklanduni.ac.nz</email>
</author>
<date>25 December 2015</date>
</metadata>
<body>
<p>For many years I have believed that 25 December should be a
public holiday, and I am now prepared to provide evidence for
this.</p>
<ol>
<li>There aren't any other holidays in December.</li>
<li>Schools are usually closed anyway.</li>
</ol>
<p>More information can be found at
<url>https://en.wikipedia.org/wiki/December_25</url>.</p>
</body>
</document>
This document can be transformed to HTML using the following
call to xsltproc:
xsltproc -o examples/toHtml.html examples/documentToHtml.xsl examples/toHtml.xml
The resulting HTML document can be viewed at examples/toHtml.html.
2. Process xi:include elements
The document markup format uses XInclude elements
(http://www.w3.org/2001/XInclude) to embed text from external
documents. The documents referenced in these elements can be
processed and embedded directly into the output document using
the command line tool xsltproc (http://www.xmlsoft.org/).
We will use the same XSLT stylesheet we used in the previous
example. The source document
examples/processXinclude.xml contains the following
XML:
<?xml version="1.0" encoding="UTF-8" ?>
<document xmlns:xi="http://www.w3.org/2001/XInclude">
<metadata>
<title>Today should be a holiday</title>
<author>
<name>Ashley Noel Hinton</name>
<email>ahin017@aucklanduni.ac.nz</email>
</author>
<date>25 December 2015</date>
</metadata>
<body>
<p>For many years I have believed that 25 December should be a
public holiday, and I am now prepared to provide evidence for
this.</p>
<xi:include href="evidenceList.xml" parse="xml"/>
<p>More information can be found at
<url>https://en.wikipedia.org/wiki/December_25</url>.</p>
</body>
</document>
The element <xi:include href="evidenceList.xml" parse="xml"/>
indicates that the XML contained in examples/evidenceList.xml is
to be included in the output document.
We will add the --xinclude option to our call to xsltproc to
process the XInclude elements when we do our transformation:
xsltproc --xinclude -o examples/processXinclude.html examples/documentToHtml.xsl examples/processXinclude.xml
The resulting HTML document can be viewed at examples/processXinclude.html.
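The inclusion step on its own can also be sketched with Python's standard library, which provides a basic XInclude processor in `xml.etree.ElementInclude`. In this sketch the included file is held in an in-memory dictionary so the example is self-contained; the list content shown for evidenceList.xml is an assumption, since that file is not reproduced in this report, and the loader name `load_from_memory` is hypothetical.

```python
import xml.etree.ElementTree as ET
from xml.etree import ElementInclude

MAIN = """<document xmlns:xi="http://www.w3.org/2001/XInclude">
<body>
<p>Evidence follows.</p>
<xi:include href="evidenceList.xml" parse="xml"/>
</body>
</document>"""

# Assumed content for the included file; the real evidenceList.xml
# is not reproduced in this report.
FILES = {
    "evidenceList.xml": "<ol><li>First point.</li><li>Second point.</li></ol>",
}

def load_from_memory(href, parse, encoding=None):
    """Loader that resolves hrefs against the FILES dict, not the disk."""
    if parse == "xml":
        return ET.fromstring(FILES[href])
    return FILES[href]

root = ET.fromstring(MAIN)
# Replace each xi:include element with the element returned by the loader.
ElementInclude.include(root, loader=load_from_memory)
print(ET.tostring(root, encoding="unicode"))
```

Without the `loader` argument, `ElementInclude.include` reads the referenced files from disk instead.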
3. Subsetting elements: prepare R code chunks for Knitr
The Knitr package lets document authors embed chunks of R
code in special comment code and execute these chunks to produce
an output document. The use of comments to indicate code makes
it difficult to perform custom actions on R code marked up in
this way. The document
markup format wraps chunks
of R code to be executed by Knitr in <code
class="knitr">
elements. This allows an author using
the document
markup format to perform any
operations she likes on chunks of R code. An XSLT stylesheet can
be used to transform chunks of code marked up in this fashion
into Knitr R code chunks in a Knitr HTML document.
The full XSLT stylesheet used in this example can be found at
examples/documentToRhtml.xsl. The XSLT code used to
transform <code class="knitr">
elements to
Knitr R code chunks is as follows:
<xsl:template match="code[@class='knitr']">
<xsl:comment><xsl:text>begin.rcode </xsl:text><xsl:value-of select="@name"/><xsl:if test="@options"><xsl:text>, </xsl:text><xsl:value-of select="@options"/></xsl:if>
<xsl:text>
</xsl:text>
<xsl:value-of select="node()"/>
<xsl:text>
end.rcode</xsl:text></xsl:comment>
</xsl:template>
The source document
examples/knitrChunk.xml contains the following
XML:
<?xml version="1.0" encoding="UTF-8" ?>
<document>
<metadata>
<title>Plotting in R</title>
<author>
<name>Ashley Noel Hinton</name>
<email>ahin017@aucklanduni.ac.nz</email>
</author>
<date>25 December 2015</date>
</metadata>
<body>
<p>A plot to celebrate 25 December:</p>
<code class="knitr"><![CDATA[x <- rnorm(n = 10)
plot(x)]]></code>
<p>More information can be found at
<url>https://en.wikipedia.org/wiki/December_25</url>.</p>
</body>
</document>
This document can be transformed to Knitr HTML using the
following call to xsltproc:
xsltproc -o examples/knitrChunk.Rhtml examples/documentToRhtml.xsl examples/knitrChunk.xml
The resulting Knitr HTML document can be viewed at examples/knitrChunk.Rhtml.
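For readers less familiar with XSLT, the same chunk conversion can be approximated in Python. The sketch below replaces each code element of class knitr with a comment in the begin.rcode / end.rcode form shown earlier. It is an illustration only, not the stylesheet used in this report, and the function name `knitr_chunks_to_comments` is hypothetical.

```python
import xml.etree.ElementTree as ET

SOURCE = """<body>
<p>A plot to celebrate 25 December:</p>
<code class="knitr" name="knitrDemo" options="tidy=FALSE"><![CDATA[x <- rnorm(n = 10)
plot(x)]]></code>
</body>"""

def knitr_chunks_to_comments(root):
    """Replace each <code class="knitr"> element with a Knitr HTML comment."""
    # Element objects do not know their parent, so walk parents instead.
    for parent in list(root.iter()):
        for index, child in enumerate(list(parent)):
            if child.tag == "code" and child.get("class") == "knitr":
                header = "begin.rcode " + child.get("name", "")
                if child.get("options"):
                    header += ", " + child.get("options")
                comment = ET.Comment(header + "\n" + (child.text or "") + "\nend.rcode")
                comment.tail = child.tail  # keep text that followed the chunk
                parent[index] = comment
    return root

rhtml = ET.tostring(knitr_chunks_to_comments(ET.fromstring(SOURCE)), encoding="unicode")
print(rhtml)
```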
4. Extended transformations
The previous three examples have shown single transformations
on documents written in the document markup format. Authors of
document files are not limited to just one transformation,
however. In the previous example we demonstrated how a document
author can convert a document with marked up chunks of R code
into a document which can then be processed using the Knitr
package in R. The Knitr HTML document produced in the previous
example can be converted to HTML using the following code in R:
library(knitr)
oldwd <- setwd("examples")
knit(input = "knitrChunk.Rhtml")
setwd(oldwd)
The resulting HTML document can be viewed at examples/knitrChunk.html.
This output document is the result of two transformation
steps, using two different tools (xsltproc, and the Knitr
package in R), each producing an output document. The document
format can be used in this way to author documents that require
several transformation steps, and several transformation tools.
The result of each transformation can be provided as a source
document for the following transformation. In the following
discussion we explore how a document author might manage
multiple transformations on a document.
Discussion
The document
markup format provides a reasonably
simple authoring format in which an author can write documents
for one or several output formats. Like Markdown, the
document
format allows an author to target HTML and
PDF as output formats, among others. Unlike Markdown the
document
format is not limited to a known set of
transformations. Adding a custom transformation to the
document
format is as easy as using a custom XML
element while authoring—the transformation of this custom
element could then be defined in an XSLT stylesheet, for
example.
The kinds of transformations available to an author using the
document
format are as many and diverse as those
available when using XML. This report has described two basic
transformations:
- Transformation to HTML using an XSLT stylesheet and the xsltproc command line tool.
- Incorporating external XML files using XInclude's <xi:include> elements, processed with the xsltproc command line tool.
This report has also demonstrated how transformations can be
applied to subsets of elements, as in the example where chunks
of R code wrapped in <code class="knitr"> elements were
transformed into Knitr R code chunks using an XSLT stylesheet
and xsltproc. The ability to perform transformations on subsets
of elements using the document markup format allows an author
much finer control over the output produced than when using
Markdown. For instance, an author may wish to produce a
teacher's and a student's copy of a document, with some sections
only visible to the teacher. Subsetting elements would allow an
author to produce both a student's and a teacher's output from
the same source document; this would likely require separate
source documents if done using Markdown.
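The teacher and student scenario can be sketched as follows. The `audience="teacher"` attribute is a hypothetical convention invented for this example, as is the function name `student_copy`; the document format described in this report does not define either. The sketch simply removes the marked elements from a copy of the tree.

```python
import copy
import xml.etree.ElementTree as ET

SOURCE = """<body>
<p>Question: what is 2 + 2?</p>
<p audience="teacher">Answer: 4.</p>
</body>"""

def student_copy(root):
    """Return a deep copy of the tree without teacher-only elements."""
    result = copy.deepcopy(root)
    for parent in list(result.iter()):
        for child in list(parent):
            # "audience" is a hypothetical attribute used only in this sketch.
            if child.get("audience") == "teacher":
                parent.remove(child)
    return result

teacher = ET.fromstring(SOURCE)   # the full document is the teacher's copy
student = student_copy(teacher)
print(len(teacher.findall("p")), len(student.findall("p")))
```

The same selection could be written as an XSLT template matching the marked elements, which is closer to the tools used elsewhere in this report.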
It is of course also possible to subset and to perform similar transformations on Markdown documents using regular expressions—an author using a Markdown format could, for example, include custom tags in her document indicating custom transformations or subsets. Finding and transforming specific XML elements is made easy by using existing XML query tools like XPath. While custom tags and subsets in a Markdown document could be found and transformed using regular expressions this would place a greater burden on transformation authors, and would be much easier to get wrong.
It is worth noting that an author using a markup format like
the document
format described in this report is
making some sacrifices in terms of simplicity of authoring in
order to gain greater control over transformations. Authoring in
Markdown allows an author to “format”
her source document in such a way that the source can also be
read as an output document. Authoring in XML makes a source
document less immediately readable. For example, though Markdown
list formatting can be fiddly to manage as an author it has the
advantage of appearing like list output. A markup list format,
like that found in HTML, consists of many list tags which are
intended to tell a computer that the content is a list, not to
be read by a human. In this respect an author has to know more
“code” to author using a markup format than when using a
Markdown format.
While only HTML and Knitr HTML output were demonstrated in
this report, transformation to other formats is of course
possible. One popular format for sharing articles and reports is
the PDF format. One method for producing a PDF from the document
format is to use the Pandoc document converter on the HTML
output produced in the first example. The transformation can be
performed using the following call to Pandoc:
pandoc -s -o examples/toHtml.pdf examples/toHtml.html
The resulting PDF document can be viewed at examples/toHtml.pdf.
If a document author wanted to have greater control over the PDF
produced she might instead use an XSLT stylesheet and xsltproc
to transform the document to the LaTeX format. This could then
be transformed to PDF using a tool like pdflatex.
The production of PDF and other formats, and the secondary
transformation example in this report, where a Knitr HTML file
produced from a document is transformed into HTML, demonstrate
how a document author may want to perform multiple
transformations, employing multiple tools. For example an author
may wish to do all of the following to a document:
- Merge XML from external documents indicated by <xi:include> elements.
- Convert the document to Knitr HTML.
- Process the code chunks in the Knitr HTML to produce an HTML output.
A potential method for handling a pipeline of such
transformations is the OpenAPI architecture (Introducing
OpenAPI, OpenAPI version 0.6). A document author could describe
each of the transformation steps as an OpenAPI module, and use
these modules to describe the entire transformation in an
OpenAPI pipeline.
Using an OpenAPI pipeline to describe a document
transformation doesn't just provide a quick method of performing
all transformations at once. Wrapping transformation code in
OpenAPI modules also provides a means by which the author of a
transformation pipeline, or anyone else, can modify and extend
the transformation. For example, if a user of a transformation
pipeline following the steps listed above wished to perform
another transformation between step 1 and step 2, she would only
need to create a module which took the output produced in step 1
as input, and produced the input required by step 2 as output.
This new module could then be placed into a copy of the
pipeline. A teacher's and a student's version of the HTML output
could be produced by branching the pipeline to process a subset
of elements in the appropriate way. Similarly, a different
output format (PDF, for example) could be produced by adding the
appropriate transformation modules to the pipeline while still
producing the original HTML output.
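The idea of chaining steps, and of inserting a new step into a copy of a pipeline, can be sketched in a few lines of Python. This is not the OpenAPI or Conduit interface: every function here is a hypothetical placeholder that stands in for a real transformation, and each step simply maps one string to another.

```python
def run_pipeline(steps, document):
    """Apply each step in order; every step maps a string to a string."""
    for step in steps:
        document = step(document)
    return document

# Placeholder steps standing in for the three real transformations.
def merge_includes(doc):
    return doc.replace("<include/>", "<p>included</p>")

def to_rhtml(doc):
    return doc.replace("<document>", "<html>").replace("</document>", "</html>")

def knit(doc):
    return doc + "\n<!-- knitted -->"

pipeline = [merge_includes, to_rhtml, knit]

# A new step only has to accept step 1's output and produce step 2's input.
def uppercase_paragraphs(doc):
    return doc.replace("<p>included</p>", "<p>INCLUDED</p>")

# Build a modified copy; the original pipeline is left untouched.
extended = pipeline[:1] + [uppercase_paragraphs] + pipeline[1:]

source = "<document><include/></document>"
print(run_pipeline(pipeline, source))
print(run_pipeline(extended, source))
```

A branching pipeline, such as one producing both teacher and student output, would correspond to running two different step lists over the same intermediate result.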
Summary
In this report we have described how an XML authoring format
provides document authors with greater control over document
transformations than popular Markdown formats allow. We have
described a simple XML document
format that
includes features of standard HTML formatting as well as custom
elements for document transformation. We have shown several
examples of how the document
format can be used in
common document transformations. The document
format described is a good candidate for documents which require
multiple transformations, allowing an author to employ various
tools and to produce multiple output documents in multiple
formats.
Technical requirements
- Conduit version 0.6-3, a prototype OpenAPI glue system R package, was used to produce the final version of this report (https://github.com/anhinton/conduit/releases/tag/v0.6-3).
- Knitr version 1.12.3, an R package, was used for the transformations in this report (http://yihui.name/knitr/).
- Pandoc version 1.16.02 was used for the transformations in this report (http://pandoc.org).
- R version 3.3.1 was used for the transformations in this report (https://www.r-project.org/).
- All of the transformations described in this report were produced on a machine running Ubuntu 16.04 LTS 64-bit (http://www.ubuntu.com/).
- xmllint using libxml version 20903 was used in the transformations which produced this report (http://www.xmlsoft.org/).
- xsltproc using libxml 20903, libxslt 10128 and libexslt 817 was used for the transformations in this report (http://www.xmlsoft.org/).
Resources
- The transformation to HTML example uses the source document examples/toHtml.xml, and the XSLT stylesheet examples/documentToHtml.xsl.
- The processing XInclude elements example uses the source document examples/processXinclude.xml, and the XML file examples/evidenceList.xml.
- The subsetting elements example uses the source document examples/knitrChunk.xml, and the XSLT stylesheet examples/documentToRhtml.xsl.
- The extended transformation example uses the output document examples/knitrChunk.Rhtml produced by the third example as its source document.
- This report was produced using an OpenAPI pipeline executed with Conduit version 0.6-3. The source document is available at report.xml. The transformation pipeline can be found at transform/toHtml/pipeline.xml; the pipeline's modules are at transform/toHtml/convertToRhtml.xml, transform/toHtml/xinclude.xml, transform/toHtml/knitToHtml.xml, and transform/toHtml/substituteEntities.xml. The pipeline result object can be found at toHtml.tar.gz. The R script used to execute this pipeline is available at transform.R.
A transformable markup document format by Ashley Noel Hinton and Paul Murrell is licensed under a Creative Commons Attribution 4.0 International License.