10 May 2016
The OpenAPI architecture is designed to help people connect with data. The architecture specifies an XML schema for wrapping pieces of data analysis code in modules, and combining modules in pipelines. The architecture also specifies the requirements of OpenAPI glue systems—software which can interpret and execute modules and pipelines (Introducing OpenAPI). In this report we describe the changes to the OpenAPI architecture implemented in version 0.5.
Version 0.3 of OpenAPI introduced a host attribute to the
module specification as a method for guaranteeing that module
source script requirements would be met
(OpenAPI
version 0.3). This report describes how the module host
attribute has been replaced with host
elements in
version 0.5. This implementation permits more control over the
host types supported. This report describes
how vagrant
and docker
hosts are
supported in version 0.5. Version 0.5 of OpenAPI also introduces
a persistent result format for saving and sharing the results of
executing modules and pipelines.
The improvements described in this report have been incorporated into Conduit version 0.5. The Conduit package is a prototype OpenAPI glue system implemented as an R package.
Version 0.5 of the OpenAPI architecture introduces a persistent result format for returning the results of executing a module or a pipeline. An OpenAPI result object provides any glue system with persistent access to a module's or pipeline's output objects. This provides a possible means for a glue system to cache module and pipeline results. The result format also allows the result of pipeline execution to be consumed by the glue system at a future time, or for the pipeline outputs to be consumed on another glue system or machine entirely.
The main mechanisms for producing persistent module and pipeline results are module and pipeline XML files. For a module result a glue system should produce a module XML file which, on execution, produces the same named output objects as the original module, with the same vessel types and formats. Similarly, for a pipeline a glue system should produce a pipeline XML file which, on execution, produces the same named output objects as the component modules in the original pipeline.
A glue system must be able to produce a gzipped tar archive
which contains the module result for each module it has
executed. This archive should have a single directory, named for
the module, at its top level. This directory should contain a
module XML file whose name echoes the name of the top directory. For
example, the result of a module named ‘blockdata’
should contain a directory named blockdata which
contains the file blockdata.xml. This directory can
also contain any other file resources required by the module
result XML to produce the module's outputs.
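The archive layout described above can be sketched in base R. This is a minimal illustration rather than Conduit's own code: the helper name packModuleResult() and the placeholder XML content are assumptions made for the example, and only the directory and file naming follow the text.

```r
# Hypothetical helper (not part of Conduit): pack a module result directory
# into <moduleName>.tar.gz with a single top-level directory named for the
# module, containing <moduleName>.xml.
packModuleResult <- function(moduleName, resultDir, outDir = tempdir()) {
    xmlFile <- file.path(resultDir, moduleName, paste0(moduleName, ".xml"))
    if (!file.exists(xmlFile)) {
        stop("expected module result XML at ", xmlFile)
    }
    tarfile <- file.path(normalizePath(outDir), paste0(moduleName, ".tar.gz"))
    # Archive paths must be relative to the parent of the module directory
    oldwd <- setwd(resultDir)
    on.exit(setwd(oldwd))
    tar(tarfile, files = moduleName, compression = "gzip", tar = "internal")
    tarfile
}

# Usage with a throwaway 'blockdata' module result; the XML written here is
# a placeholder, not a valid OpenAPI module.
resultDir <- tempfile("result")
dir.create(file.path(resultDir, "blockdata"), recursive = TRUE)
writeLines("<module/>", file.path(resultDir, "blockdata", "blockdata.xml"))
archive <- packModuleResult("blockdata", resultDir)
contents <- untar(archive, list = TRUE, tar = "internal")
print(contents)
```

Changing into the parent directory before archiving is what keeps the module directory at the top level of the archive.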
A glue system must also be able to produce a gzipped tar
archive which contains the pipeline result for each pipeline it
has executed. As with module result archives, the pipeline result
archive should contain at its top level a directory named for the
executed pipeline. This directory should contain a pipeline result
XML file named pipeline.xml. This directory can also
contain any other file resources required by the pipeline result
XML to produce its components' outputs. Only pipeline.xml
is required by the
pipeline result archive specification, but it is expected that an
archive will also contain the pipeline's module XML files, and the
files these modules require to produce their outputs. For example,
Conduit v0.5 includes a named directory for each module in the
pipeline result archive; each module directory has the same
structure as a module result archive.
A glue system should also be able to unpack module and pipeline results from the gzipped tar archives described above. Once an archive is unpacked, a glue system can read and execute its module and pipeline XML files in the usual way.
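Unpacking can be sketched in the same way. The helper unpackResult() below is hypothetical: it relies only on the naming conventions given above (pipeline.xml for a pipeline result, a file named for the top-level directory for a module result), and the archive it unpacks is a placeholder built on the spot.

```r
# Hypothetical helper (not part of Conduit): unpack a result archive and
# locate its main XML file from the naming conventions alone.
unpackResult <- function(tarfile, exdir = tempfile("unpacked")) {
    entries <- untar(tarfile, list = TRUE, tar = "internal")
    topDir <- unique(sub("/.*$", "", entries))
    if (length(topDir) != 1) {
        stop("archive should contain exactly one top-level directory")
    }
    dir.create(exdir, recursive = TRUE, showWarnings = FALSE)
    untar(tarfile, exdir = exdir, tar = "internal")
    pipelineXML <- file.path(exdir, topDir, "pipeline.xml")
    moduleXML <- file.path(exdir, topDir, paste0(topDir, ".xml"))
    if (file.exists(pipelineXML)) {
        list(type = "pipeline", xml = pipelineXML)
    } else if (file.exists(moduleXML)) {
        list(type = "module", xml = moduleXML)
    } else {
        stop("no pipeline or module XML file found in ", topDir)
    }
}

# Build a placeholder pipeline result archive, then recover it
src <- tempfile("src")
dir.create(file.path(src, "pipeline1"), recursive = TRUE)
writeLines("<pipeline/>", file.path(src, "pipeline1", "pipeline.xml"))
archive <- file.path(tempdir(), "pipeline1.tar.gz")
oldwd <- setwd(src)
tar(archive, files = "pipeline1", compression = "gzip", tar = "internal")
setwd(oldwd)
recovered <- unpackResult(archive)
print(recovered$type)
```

A real glue system would pass the recovered XML path to its usual module or pipeline loader.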
A glue system should at the very least be able to recover modules and pipelines from the pipeline and module result archives it produces. Ideally a glue system should produce result archives from which any simple glue system can recover pipeline and module results. To achieve this it is recommended that pipeline and module result XML files have minimal system requirements. In general this means that a module or pipeline archive should contain everything that is required to produce the module outputs, and should invoke as little processing as possible to produce these outputs.
One method for creating lightweight module result archives is for the glue system to generate a ‘dummy’ module XML file, which simply names its outputs directly. For example, a module might produce the following output:
<output name="birdPicture">
<file ref="birdpicture.pdf"/>
<format formatType="text">PDF file</format>
</output>
The file produced by this module output could be included in the module result archive alongside the module result XML file. The module result XML could produce this file output by including the output XML above.
This technique can also be used for URL vessel types. A module might contain the following output:
<output name="episodeTable">
<url ref="http://openapi.org/raw/episodeTable.html"/>
<format formatType="text">HTML file</format>
</output>
The module result XML file could include this output XML directly to provide this output.
Internal vessel types are less straightforward, and require some work by the glue system. Consider an R-language module with the following output:
<output name="suburbs">
<internal symbol="suburbNames"/>
<format formatType="text">R list</format>
</output>
While a glue system will have a built-in method for passing
internal outputs to the inputs of subsequent modules in normal
operation, there is no guarantee that the glue system which
recovers a module result will use the same mechanism. In this case
the glue system must act as a module author and write a
module source script which produces an internal language object to
be named as an output. For example, a glue system could serialize
the output named above to a file
called suburbNames.rds with the R
function saveRDS(). This file could be placed in the
module result archive alongside the module result XML file. The
glue system could wrap the following source script in the module
result XML:
suburbNames <- readRDS("suburbNames.rds")
The module result XML can now name this output using the same output code as above, as this internal object will now be created upon execution of the module result.
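The serialize-and-restore step can be tried out in a few lines of R. This is only a sketch of the idea: the suburb values are made up for illustration, and the file name script.R and the use of a fresh environment to stand in for a recovering glue system are assumptions.

```r
# Directory standing in for the unpacked module result
resultDir <- tempfile("result")
dir.create(resultDir, recursive = TRUE)

# Stand-in for the object produced by the original module (made-up values)
suburbNames <- list("Grafton", "Newmarket", "Parnell")
saveRDS(suburbNames, file.path(resultDir, "suburbNames.rds"))

# Source script to be wrapped in the module result XML
writeLines('suburbNames <- readRDS("suburbNames.rds")',
           file.path(resultDir, "script.R"))

# Simulate recovery: run the script in a clean environment, with the
# result directory as the working directory
recoveredEnv <- new.env()
oldwd <- setwd(resultDir)
source("script.R", local = recoveredEnv)
setwd(oldwd)
roundTrips <- identical(recoveredEnv$suburbNames, suburbNames)
print(roundTrips)
```

Because the script refers to suburbNames.rds by a relative path, the recovering glue system has to execute it in the directory that holds the archive's files.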
Where a module names a host machine (see later
section) for execution of its source scripts, it is recommended
that the subsequent module result should not require use
of a host to produce its outputs. This should improve the
portability and reduce the ‘weight’ of the result archive. On the
other hand, if the code authored by the glue system to represent
the result does have significant system requirements, it may make
sense to specify a (different) host for the module result.
The following examples demonstrate how persistent pipeline and module results have been implemented in Conduit v0.5.
The result of executing a module is a moduleResult
object:
mod1 <- loadModule(name = "mod1",
                   ref = system.file("extdata", "test_pipeline",
                                     "module1.xml", package = "conduit"))
modRes1 <- runModule(mod1)
class(modRes1)
## [1] "moduleResult" "componentResult"
A moduleResult
object can be exported to a module
result tar archive using the export()
function:
modExport1 <- export(modRes1)
basename(modExport1)
## [1] "mod1.tar.gz"
The resulting tar archive—mod1.tar.gz—contains the module result XML file (mod1.xml) and the files it requires to reproduce the module outputs:
## [1] "mod1/mod1.xml" "mod1/script.R" "mod1/x.rds"
Exported module result archives can be recovered using
the importModule()
function:
recoveredMod1 <- importModule(tarfile = modExport1, name = "recoveredMod1")
class(recoveredMod1)
## [1] "module"
The module result produced by running this recovered module within the same glue system should be identical to the original module result, i.e. module results should ‘round trip.’
The result of running a pipeline is
a pipelineResult
object:
pipeline1 <- loadPipeline(name = "pipeline1",
                          ref = system.file("extdata", "test_pipeline",
                                            "pipeline.xml", package = "conduit"))
pipelineRes1 <- runPipeline(pipeline1)
class(pipelineRes1)
## [1] "pipelineResult" "componentResult"
A pipelineResult
object can be exported to a
pipeline result tar archive using the export()
function:
pplExport1 <- export(pipelineRes1)
basename(pplExport1)
## [1] "pipeline1.tar.gz"
The resulting tar archive—pipeline1.tar.gz—contains the pipeline result XML file (pipeline.xml) and the files it requires to reproduce the pipeline outputs, including the module XML files required to produce the result of each module in the pipeline:
## [1] "pipeline1/module1/"            "pipeline1/module1/module1.xml"
## [3] "pipeline1/module1/script.R"    "pipeline1/module1/x.rds"
## [5] "pipeline1/module2/"            "pipeline1/module2/module2.xml"
## [7] "pipeline1/module2/numbers.rds" "pipeline1/module2/Rplots.pdf"
## [9] "pipeline1/module2/script.R"    "pipeline1/pipeline.xml"
Exported pipeline result archives can be recovered using
the importPipeline()
function:
recoveredPpl1 <- importPipeline(tarfile = pplExport1, name = "recoveredPpl1")
class(recoveredPpl1)
## [1] "pipeline"
Running the pipeline recovered from a pipeline result archive should, within the same glue system, also produce a pipeline result identical to the original pipeline result—pipeline results, like module results, ‘round trip.’
OpenAPI version 0.3 introduced module hosts to help solve the ‘dependency problem’ ('Module host' in OpenAPI version 0.3). In this section we describe the dependency problem, and how module hosts offer a solution to this problem. We then describe changes to module hosts in OpenAPI version 0.5. We conclude this section with a demonstration of two types of module hosts, docker and vagrant, and examples of how support for these host types has been implemented in Conduit v0.5.
The dependency problem is the problem of ensuring an OpenAPI glue system can meet the hardware and software requirements of any given module's source scripts. There are three broad variations on the dependency problem: meeting the requirements of a module source script; providing a module's specified language; providing an environment for the glue system software itself.
Within a module's source scripts it is reasonable to expect
that a script author will want to make use of installable
libraries and packages within the platform. For example, an R
script author may make use of
the gridSVG package from the Comprehensive R Archive Network
(CRAN). Executing
a module with “library(gridSVG)” in
the author's source script would fail if the glue system does
not have access to an R session in which the gridSVG
package is available. Similarly, a Python script author might call
“from TwitterAPI import TwitterAPI”,
which would fail if the glue system could not access a Python
session in which
the TwitterAPI package from the Python
Package Index is available.
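For the R case, a glue system could detect this kind of failure before running a module. The sketch below is not part of Conduit or of the OpenAPI specification: missingPackages() is a hypothetical helper, and it only inspects the R session the glue system itself is running in.

```r
# Hypothetical helper: return the required packages that cannot be loaded
# in the current R session.
missingPackages <- function(packages) {
    available <- vapply(packages, requireNamespace, logical(1),
                        quietly = TRUE)
    packages[!available]
}

# 'stats' ships with R; whether 'gridSVG' is reported depends on the
# library of the session running this code.
missing <- missingPackages(c("stats", "gridSVG"))
print(missing)
```

An empty result means every listed package can be loaded locally; it says nothing about packages available on a separate host machine.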
A module author may require a specific version of a
software platform to be available for her module scripts. For
example, a module might specify that it requires
“R >= 3.0”. It would be desirable
for a glue system to indicate if it is unable to meet this
requirement.
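One way a glue system might test such a requirement is sketched below. OpenAPI does not define a requirement syntax here; the “name operator version” string format and the helper meetsRequirement() are assumptions made for the example.

```r
# Hypothetical helper: check an installed version against a requirement
# string such as "R >= 3.0".
meetsRequirement <- function(requirement, installed) {
    pattern <- paste0("^[[:space:]]*([[:alnum:]._]+)[[:space:]]*",
                      "(>=|<=|==|>|<)[[:space:]]*([0-9.]+)[[:space:]]*$")
    parts <- regmatches(requirement, regexec(pattern, requirement))[[1]]
    if (length(parts) != 4) {
        stop("cannot parse requirement: ", requirement)
    }
    # parts: full match, platform name, operator, required version
    compare <- match.fun(parts[3])
    compare(numeric_version(installed), numeric_version(parts[4]))
}

# Check the running R session against the requirement from the text
currentOK <- meetsRequirement("R >= 3.0", as.character(getRversion()))
print(currentOK)
```

Comparing with numeric_version() rather than as strings or numbers means that, for example, version 3.10 is correctly treated as newer than 3.9.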
Conceivably a module could have even more specific platform
requirements, including fine-grained details about the system
on which the platform software is run. How might a glue system
provide “R >= 3.1” alongside
“java >= 1.7” on an
“Ubuntu 14.04 64-bit” system?
A glue system itself might require a particular software or
hardware environment to run. For
example, conduit,
a glue system distributed as an R package, was created to run
in “R >= 3” on Ubuntu
14.04. Though it can probably be installed in R on Windows, it
almost certainly will not work there, as it makes assumptions about
system paths which are only satisfied on Linux.
Module hosts were introduced in OpenAPI v0.3 to solve the first variation of the dependency problem: meeting module source script dependencies. A module host is a real or virtual computer, accessible by the glue system, which meets the software and system dependencies of a module's source script; the glue system executes a module's source scripts on the host machine instead of executing the code locally. In this section we describe the changes to the specification of module hosts in OpenAPI v0.5.
In OpenAPI v0.3 a module host was a machine which could be accessed using the SSH protocol. The module host was specified using a host attribute, as in the following example:
<module language="R" host="conduit@openapi.org:2222">
...
</module>
The use of the SSH protocol allowed a glue system to connect to many types of host machines, both physical and virtual, using a single interface. However, using the SSH method made the glue system responsible for managing authentication with the remote host, and did not allow a module author to take advantage of host machine-specific authentication and execution methods.
In OpenAPI version 0.5 the module host attribute has been
replaced with host
elements. A host
element can contain elements describing how the glue system can
connect to a variety of host machine types. A glue system can
support any host
type specified in the architecture
specification. OpenAPI v0.5 specifies docker
and vagrant
host elements for a glue system to
connect to a host on
a Docker container and on
a Vagrant machine,
respectively. An OpenAPI glue system is responsible for preparing
the resources required for a module host
machine to
execute a module's source scripts, and for retrieving the outputs
resulting from executing a module.
The following sections provide details for specifying a host for module execution using Docker and Vagrant.
A module author can specify that a module is to be executed on
a Docker container host using the docker
element. This element has one required attribute,
‘image’, and one optional attribute,
‘guestdir’. The ‘image’ attribute accepts
the name of
a Docker
image to be used for execution. The ‘guestdir’
attribute accepts the file path of the directory where the module
source scripts will be executed on the Docker container (guest
machine). A module author specifies a module should be executed
using a Docker container which uses the Docker image
“rocker/r-base” as in the following example:
<module language="R">
<host>
<docker image="rocker/r-base"/>
</host>
...
</module>
Support for docker host elements has been
implemented in Conduit
v0.5. When the ‘guestdir’ attribute is not provided, as in
the above example, Conduit will execute the module in the
“/home/conduit” directory on the Docker
container. Preparation and retrieval of module input and output
objects is simplified for docker hosts in Conduit
v0.5 by syncing the glue system's module output directory directly
with the Docker container. The Conduit package requires that
Docker be installed on the system, and that the user running the R
session is a member of the ‘docker’ group. Conduit 0.5
was tested using Docker version 1.11.1. Adding a user to the
‘docker’ group in Ubuntu Linux is described in
the Docker documentation.
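To make the idea of a synced directory concrete, the following sketch assembles arguments for a “docker run” call that shares a local directory with the container and executes a script there. It is not Conduit's implementation: the helper dockerRunArgs(), the local path, and the script name are all assumptions, and a bind mount is only one plausible way of syncing the two directories.

```r
# Hypothetical helper (NOT Conduit's implementation): build arguments for
# 'docker run' so that hostdir is mounted at guestdir and the script is
# executed from guestdir inside the container.
dockerRunArgs <- function(image, hostdir, script,
                          guestdir = "/home/conduit") {
    c("run", "--rm",                          # remove the container afterwards
      "-v", paste0(hostdir, ":", guestdir),   # mount the module directory
      "-w", guestdir,                         # working directory in container
      image, "Rscript", script)
}

# The local directory and script name below are made up for illustration
dockerArgs <- dockerRunArgs(image = "rocker/r-base",
                            hostdir = "/tmp/conduit/dockerModule",
                            script = "script.R")
command <- paste("docker", paste(dockerArgs, collapse = " "))
print(command)
# With Docker installed, system2("docker", dockerArgs) would run the command.
```

With a shared directory of this kind, any files the script writes under the guest directory appear in the local directory without a separate copy step.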
The following example demonstrates how a module with
a docker
module host can be executed in Conduit
v0.5. The module
file dockerModule.xml
contains host
XML specifying that the module source
scripts should be executed on a Docker container made from the
“rocker/r-base” image:
<host>
<docker image="rocker/r-base"/>
</host>
The module is read into Conduit in the usual fashion:
dockerModule <- loadModule(name = "dockerModule", ref = "dockerModule.xml")
Then the module is executed in the usual fashion:
result1 <- runModule(module = dockerModule)
If the user running Conduit is in the ‘docker’
group, and Docker is installed, the module source scripts will be
executed on the Docker container. This will produce
a moduleResult
object which can be exported to a
module result archive.
export1 <- export(result1)
The module result archive from this example is available for inspection at dockerModule.tar.gz.
A module author can specify that a module is to be executed on
a Vagrant machine using the vagrant
element. The vagrant
element has one required
attribute, ‘vagrantfile’, and two optional attributes,
‘guestdir’ and ‘hostdir’. The
‘vagrantfile’ attribute requires a file path to
a Vagrantfile
on the local system. The ‘hostdir’ attribute allows
the module author to name a directory on the machine running a
glue system (host system) which is to
be synced
with the Vagrant machine. The ‘guestdir’ attribute
allows the module author to specify the directory on the Vagrant
machine (guest machine) to be synced with
‘hostdir’. The following demonstrates how a module
author specifies a Vagrant host machine:
<module language="R">
<host>
<vagrant vagrantfile="~/vagrant/vagrant-conduit/Vagrantfile"/>
</host>
...
</module>
In this example the module source scripts will be executed using the Vagrant machine defined using the Vagrantfile found at “~/vagrant/vagrant-conduit/Vagrantfile” on the machine running a glue system.
Support for vagrant
host elements has been
implemented
in Conduit
v0.5. When the module author does not specify a
‘hostdir’ Conduit will use the directory containing
the specified ‘vagrantfile’ as the synced folder. If
the author does not specify a ‘guestdir’ Conduit will
use the “/vagrant” directory as the target for the
synced folder on the Vagrant machine (guest machine). Conduit v0.5
prepares the host machine by preparing the ‘hostdir’,
and thus the ‘guestdir’, with the resources required
to execute the module. After executing the module in
‘guestdir’ on the Vagrant machine, Conduit returns the
subsequent outputs to the glue system's own module output
directory. For the Vagrant host XML shown above, Conduit will use
the default values of both ‘hostdir’ and ‘guestdir’.
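The default rules described in this paragraph can be written down as a small function. This is a sketch of the rules as stated, not Conduit's code; the helper name resolveVagrantHost() and the example path are assumptions.

```r
# Hypothetical helper: fill in the defaults described above for a
# vagrant host element.
resolveVagrantHost <- function(vagrantfile, hostdir = NULL, guestdir = NULL) {
    vagrantfile <- path.expand(vagrantfile)
    list(vagrantfile = vagrantfile,
         # default hostdir: the directory containing the Vagrantfile
         hostdir = if (is.null(hostdir)) dirname(vagrantfile)
                   else path.expand(hostdir),
         # default guestdir: "/vagrant" on the Vagrant machine
         guestdir = if (is.null(guestdir)) "/vagrant" else guestdir)
}

# Example path is made up for illustration
host <- resolveVagrantHost("/srv/vagrant-conduit/Vagrantfile")
str(host)
```

Expanding “~” up front keeps the resolved paths usable whether or not they are later passed through a shell.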
The Conduit package requires that Vagrant be installed on the system, and that the file named in ‘vagrantfile’ exists on the local filesystem. It also requires that the Vagrant machine named be running when the module is executed—Conduit will not start a stopped Vagrant machine. Conduit 0.5 was tested using Vagrant version 1.8.1.
The following example demonstrates how a module with
a vagrant
module host can be executed in Conduit
v0.5. The module
file vagrantModule.xml
contains host
XML specifying that the module source
scripts should be executed on a Vagrant machine described in a
Vagrantfile found on the local machine at
“~/vagrant/vagrant-conduit/Vagrantfile”, with this
machine having already been started. The Vagrantfile used in this
example, and its provisioning scripts, can
be downloaded
from github.
<host>
<vagrant vagrantfile="~/vagrant/vagrant-conduit/Vagrantfile"/>
</host>
The module is read into Conduit in the usual fashion:
vagrantModule <- loadModule(name = "vagrantModule", ref = "vagrantModule.xml")
Then the module is executed in the usual fashion:
result2 <- runModule(module = vagrantModule)
If: (a) Vagrant is installed on the machine running Conduit;
(b) there is a Vagrantfile at
“~/vagrant/vagrant-conduit/Vagrantfile”; and (c) this
Vagrant machine has been started, the module source scripts will
be executed in the Vagrant machine. This will produce
a moduleResult
object which can be exported to a
module result archive.
export2 <- export(result2)
The module result archive from this example is available for inspection at vagrantModule.tar.gz.
In this report we have described the introduction of persistent result archives for modules and pipelines in OpenAPI version 0.5. These result archives allow for the outputs produced by an OpenAPI module or pipeline to be preserved and recovered at a later time, or even in another glue system or on another machine. Within a glue system this provides a mechanism for caching module results, preventing a glue system from having to execute a computationally- or time-intensive task multiple times. It also provides a mechanism for sharing these results with other users. A pipeline result can serve as a simplified method for a user to incorporate results from modules which may not be practical to execute on her local machine, while still employing module and pipeline XML.
We have also described how module host machines can help to
solve an aspect of the dependency problem, by guaranteeing a glue
system can meet the system requirements of a module's source
scripts. The changes to module host
elements
implemented in OpenAPI v0.5 provide more nuanced access to Docker
and Vagrant virtual computer environments, allowing a module
author to take advantage of each of these systems. Via this
mechanism a module result archive is something like a
self-contained executable mini-program, producing output results
on any machine with a suitable glue system environment. The
OpenAPI v0.5 specification also provides a path for adding other
host types in the future. The authors believe that a more
considered implementation of an ssh host type is
desirable in future developments of OpenAPI.
Two technical reports demonstrate how docker hosts
can be used in authoring OpenAPI
pipelines: An
Improved Pipeline for CPI Data (2016)
and An
OpenAPI Pipeline for NZ Crime Data (2016), both written using
Conduit v0.5.
The vagrant host example uses a Vagrant machine built from the Vagrantfile and provisioning scripts found in version 0.5 of the vagrant-conduit repository on github.
The docker host example in this report uses the module file dockerModule.xml.
The vagrant host example in this report uses the module file vagrantModule.xml.
An Improved Pipeline for CPI Data (2016) uses the docker host type to build a user-friendly web front-end for running an OpenAPI pipeline.
An OpenAPI Pipeline for NZ Crime Data (2016) uses the docker host type to build a pipeline to explore NZ Crime data.