OpenAPI version 0.5

Ashley Noel Hinton
ahin017@aucklanduni.ac.nz

Paul Murrell
paul@stat.auckland.ac.nz

Department of Statistics, University of Auckland

10 May 2016

The OpenAPI architecture is designed to help people connect with data. The architecture specifies an XML schema for wrapping pieces of data analysis code in modules, and combining modules in pipelines. The architecture also specifies the requirements of OpenAPI glue systems—software which can interpret and execute modules and pipelines (Introducing OpenAPI). In this report we describe the changes to the OpenAPI architecture implemented in version 0.5.

Version 0.3 of OpenAPI introduced a host attribute to the module specification as a method for guaranteeing that module source script requirements would be met (OpenAPI version 0.3). This report describes how the module host attribute has been replaced with host elements in version 0.5, a change which permits more control over the host types supported, and how vagrant and docker hosts are supported in this version. Version 0.5 of OpenAPI also introduces a persistent result format for saving and sharing the results of executing modules and pipelines.

The improvements described in this report have been incorporated into Conduit version 0.5. The Conduit package is a prototype OpenAPI glue system implemented as an R package.

Persistent result formats for modules and pipelines

Version 0.5 of the OpenAPI architecture introduces a persistent result format for returning the results of executing a module or a pipeline. An OpenAPI result object provides any glue system with persistent access to a module's or pipeline's output objects. This provides a possible means for a glue system to cache module and pipeline results. The result format also allows the result of pipeline execution to be consumed by the same glue system at a future time, or on another glue system or machine entirely.

Requirements

The main mechanisms for producing persistent module and pipeline results are module and pipeline XML files. For a module result a glue system should produce a module XML file which, on execution, produces the same named output objects as the original module, with the same vessel types and formats. Similarly, for a pipeline a glue system should produce a pipeline XML file which, on execution, produces the same named output objects as the component modules in the original pipeline.

A glue system must be able to produce a gzipped tar archive which contains the module result for each module it has executed. This archive should have a single directory named for the module at its top level. This directory should contain a module XML file, which echoes the name of the top directory. For example, the result archive of a module named ‘blockdata’ should contain a directory named blockdata which contains the file blockdata.xml. This directory can also contain any other file resources required by the module result XML to produce the module's outputs.
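The archive layout described above can be illustrated with a minimal Python sketch using only the standard library. This is not how Conduit builds its archives; the function names, the placeholder XML string and the resource file name are all hypothetical, chosen only to show the required directory structure.

```python
import io
import tarfile


def build_module_result_archive(module_name, module_xml, resources=None):
    """Return the bytes of a gzipped tar archive laid out as <name>/<name>.xml.

    `resources` maps extra file names to their byte content (hypothetical
    helper files that the module result XML needs to produce its outputs).
    """
    # The module XML file echoes the name of the top-level directory.
    files = {f"{module_name}.xml": module_xml.encode("utf-8")}
    files.update(resources or {})

    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        # A single top-level directory named for the module.
        top = tarfile.TarInfo(name=module_name)
        top.type = tarfile.DIRTYPE
        top.mode = 0o755
        archive.addfile(top)
        for filename, payload in files.items():
            info = tarfile.TarInfo(name=f"{module_name}/{filename}")
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


def list_archive_members(archive_bytes):
    """List the member names of a gzipped tar archive held in memory."""
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:gz") as archive:
        return sorted(archive.getnames())
```

Calling `build_module_result_archive("blockdata", ...)` yields an archive whose members all sit under `blockdata/`, with `blockdata/blockdata.xml` as the module result XML file.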

A glue system must also be able to produce a gzipped tar archive which contains the pipeline result for each pipeline it has executed. As with module result archives, the pipeline result archive should contain at its top level a directory named for the executed pipeline. This directory should contain a pipeline result XML file named pipeline.xml. This directory can also contain any other file resources required by the pipeline result XML to produce its components' outputs. Only pipeline.xml is required by the pipeline result archive specification, but it is expected that an archive will also contain the pipeline's module XML files, and the files these modules require to produce their outputs. For example, Conduit v0.5 includes a named directory for each module in the pipeline result archive; each module directory has the same structure as a module result archive.

A glue system should also be able to unpack module and pipeline results from the gzipped tar archives described above. Once unpacked a glue system can easily read and execute the archive's module and pipeline XML files in the usual way.
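As a rough illustration of the unpacking step, the following Python sketch extracts a result archive and locates the XML file a glue system would then read. It assumes the layouts described above (pipeline.xml for pipelines, a file named after the top directory for modules); the function name is hypothetical and the sketch is not taken from Conduit.

```python
import io
import tarfile
from pathlib import Path


def unpack_result_archive(archive_bytes, destination):
    """Unpack a result archive and return the path of its result XML file."""
    destination = Path(destination).resolve()
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:gz") as archive:
        members = archive.getmembers()
        # Refuse member names that would be written outside the destination.
        for member in members:
            target = (destination / member.name).resolve()
            if target != destination and destination not in target.parents:
                raise ValueError(f"unsafe member path: {member.name}")
        archive.extractall(destination)

    # The specification asks for a single top-level directory.
    top_dirs = {Path(member.name).parts[0] for member in members}
    if len(top_dirs) != 1:
        raise ValueError("expected exactly one top-level directory")
    top = destination / top_dirs.pop()

    # Pipeline results use pipeline.xml; module results echo the directory name.
    for candidate in (top / "pipeline.xml", top / f"{top.name}.xml"):
        if candidate.is_file():
            return candidate
    raise FileNotFoundError(f"no result XML file found in {top}")
```

The returned path can then be handed to whatever routine the glue system normally uses to read module or pipeline XML.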

Recommendations

A glue system should at the very least be able to recover modules and pipelines from the pipeline and module result archives it produces. Ideally a glue system should produce result archives from which any simple glue system can recover pipeline and module results. To achieve this it is recommended that pipeline and module result XML files have minimal system requirements. In general this means that a module or pipeline archive should contain everything that is required to produce the module outputs, and should invoke as little processing as possible to produce these outputs.

One method for creating lightweight module result archives is for the glue system to generate a ‘dummy’ module XML file, which simply names its outputs directly. For example, a module might produce the following output:

<output name="birdPicture">
  <file ref="birdpicture.pdf"/>
  <format formatType="text">PDF file</format>
</output>

The file produced by this module output could be included in the module result archive alongside the module result XML file. The module result XML could produce this file output by including the output XML above.

This technique can also be used for URL vessel types. A module might contain the following output:

<output name="episodeTable">
  <url ref="http://openapi.org/raw/episodeTable.html"/>
  <format formatType="text">HTML file</format>  
</output>

The module result XML file could include this output XML directly to provide this output.
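A glue system could generate such ‘dummy’ outputs programmatically. The Python sketch below builds output elements with the same shape as the two examples above for file and URL vessels. The function names are hypothetical, and the bare `<module>` wrapper is an assumption: a real module result XML file would need whatever else the OpenAPI schema requires.

```python
import xml.etree.ElementTree as ET


def dummy_output(name, vessel, ref, description):
    """Build an <output> element that names an existing file or URL directly."""
    if vessel not in ("file", "url"):
        raise ValueError(f"unsupported vessel type: {vessel}")
    output = ET.Element("output", {"name": name})
    ET.SubElement(output, vessel, {"ref": ref})
    fmt = ET.SubElement(output, "format", {"formatType": "text"})
    fmt.text = description
    return output


def dummy_module_xml(outputs):
    """Wrap output elements in a bare <module> element (assumed wrapper)."""
    module = ET.Element("module")
    module.extend(outputs)
    ET.indent(module)  # pretty-print; requires Python 3.9 or later
    return ET.tostring(module, encoding="unicode")


xml_text = dummy_module_xml([
    dummy_output("birdPicture", "file", "birdpicture.pdf", "PDF file"),
    dummy_output("episodeTable", "url",
                 "http://openapi.org/raw/episodeTable.html", "HTML file"),
])
```

Because the outputs only name resources that already exist, executing such a module result requires no processing beyond resolving the references.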

The case of internal vessel types is not so immediately straightforward to solve without some work by the glue system. Consider an R-language module with the following output:

<output name="suburbs">
  <internal symbol="suburbNames"/>
  <format formatType="text">R list</format>  
</output>

While a glue system will have a built-in method for passing internal outputs to the inputs of subsequent modules in normal operation, there is no guarantee that the glue system which recovers a module result will use the same mechanism. In this case the glue system must act as a module author and write a module source script which produces an internal language object to be named as an output. For example, a glue system could serialize the output named above to a file called suburbNames.rds with the R function saveRDS(). This file could be placed in the module result archive alongside the module result XML file. The glue system could wrap the following source script in the module result XML:

suburbNames <- readRDS("suburbNames.rds")

The module result XML can now name this output using the same output code as above, as this internal object will now be created upon execution of the module result.

Where a module names a host machine (see later section) for execution of its source scripts it is recommended that the subsequent module result should not require use of a host to produce its outputs. This should improve the portability of the result archive and reduce its ‘weight’. On the other hand, if the code authored by the glue system to represent the result does have significant system requirements, it may make sense to specify a (different) host for the module result.

Implementation in Conduit v0.5

The following examples demonstrate how persistent pipeline and module results have been implemented in Conduit v0.5.

The result of executing a module is a moduleResult object:

A moduleResult object can be exported to a module result tar archive using the export() function:

The resulting tar archive—mod1.tar.gz—contains the module result XML file (mod1.xml) and the files it requires to reproduce the module outputs:

Exported module result archives can be recovered using the importModule() function:

The module result produced by running this recovered module within the same glue system should be identical to the original module result, i.e. module results should ‘round trip.’

The result of running a pipeline is a pipelineResult object:

A pipelineResult object can be exported to a pipeline result tar archive using the export() function:

The resulting tar archive—pipeline1.tar.gz—contains the pipeline result XML file (pipeline.xml) and the files it requires to reproduce the pipeline outputs, including the module XML files required to produce the result of each module in the pipeline:

Exported pipeline result archives can be recovered using the importPipeline() function:

Running the pipeline recovered from a pipeline result archive should, within the same glue system, also produce a pipeline result identical to the original pipeline result—pipeline results, like module results, ‘round trip.’

Running modules on host machines

OpenAPI version 0.3 introduced module hosts to help solve the ‘dependency problem’ ('Module host' in OpenAPI version 0.3). In this section we describe the dependency problem, and how module hosts offer a solution to it. We then describe changes to module hosts in OpenAPI version 0.5. We conclude this section with a demonstration of two types of module hosts, docker and vagrant, and examples of how support for these host types has been implemented in Conduit v0.5.

What is the dependency problem?

The dependency problem is the problem of ensuring an OpenAPI glue system can meet the hardware and software requirements of any given module's source scripts. There are three broad variations on the dependency problem: meeting the requirements of a module source script; providing a module's specified language; providing an environment for the glue system software itself.

  1. Source script dependencies

    Within a module's source scripts it is reasonable to expect that a script author will want to make use of installable libraries and packages within the platform. For example, an R script author may make use of the gridSVG package from the Comprehensive R Archive Network (CRAN). Executing a module with “library(gridSVG)” in the author's source script would fail if the glue system does not have access to an R session in which the gridSVG package is available. Similarly a Python script author might call “from TwitterAPI import TwitterAPI”, which would fail if the glue system could not access a Python session in which the TwitterAPI package from the Python Package Index is available.

  2. Platform dependencies

    A module author may require a specific version of a software platform to be available for her module scripts. For example, a module might specify that it requires “R >= 3.0”. It would be desirable for a glue system to indicate if it is unable to meet this requirement.

    Conceivably a module could have even more specific platform requirements, including fine-grained details about the system on which the platform software is run. How might a glue system provide “R >= 3.1”, alongside “java >= 1.7”, on an “Ubuntu 14.04 64-bit” system?

  3. Glue system dependencies

    A glue system itself might require a particular software or hardware environment to run. For example, Conduit, a glue system distributed as an R package, was created to run in “R >= 3” on Ubuntu 14.04. Though it can probably be installed in R on Windows, it almost certainly will not work there, as it makes assumptions about system paths which are only satisfied in Linux.
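To make the platform variation concrete, the following Python sketch shows how a glue system might report whether a requirement such as “R >= 3.0” can be met. OpenAPI does not define a syntax for such requirements, so the requirement format, the function names and the dictionary of available platforms are all assumptions made for illustration.

```python
import operator
import re

# Comparison operators accepted by this hypothetical requirement syntax.
_OPERATORS = {
    ">=": operator.ge,
    "<=": operator.le,
    "==": operator.eq,
    ">": operator.gt,
    "<": operator.lt,
}

_REQUIREMENT = re.compile(
    r"^\s*(?P<platform>[A-Za-z][\w.+-]*)\s*"
    r"(?P<op>>=|<=|==|>|<)\s*"
    r"(?P<version>\d+(?:\.\d+)*)\s*$"
)


def parse_version(text):
    """Turn a dotted version string such as '3.2.5' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def meets_requirement(requirement, available):
    """Check a requirement string against a dict of platform -> version."""
    match = _REQUIREMENT.match(requirement)
    if match is None:
        raise ValueError(f"cannot parse requirement: {requirement!r}")
    installed = available.get(match["platform"])
    if installed is None:
        return False  # the platform is not available at all
    have = parse_version(installed)
    need = parse_version(match["version"])
    # Pad with zeros so that '3' and '3.0' compare as equal.
    width = max(len(have), len(need))
    have += (0,) * (width - len(have))
    need += (0,) * (width - len(need))
    return _OPERATORS[match["op"]](have, need)
```

A check of this kind only detects an unmet requirement; providing the missing platform is the harder part of the problem, which module hosts address.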

Module hosts in OpenAPI v0.5

Module hosts were introduced in OpenAPI v0.3 to solve the first variation of the dependency problem: meeting module source script dependencies. A module host is a real or virtual computer, accessible by the glue system, which meets the software and system dependencies of a module's source script. The glue system executes a module's source scripts on the host machine instead of executing the code locally. In this section we describe the changes to the specification of module hosts in OpenAPI v0.5.

In OpenAPI v0.3 a module host was a machine which could be accessed using the SSH protocol. The module host was specified using a host attribute, as in the following example:

<module language="R" host="conduit@openapi.org:2222">
  ...
</module>

The use of the SSH protocol allowed a glue system to connect to many types of host machines, both physical and virtual, using a single interface. However, using the SSH method made the glue system responsible for managing authentication with the remote host, and did not allow a module author to take advantage of host machine-specific authentication and execution methods.

In OpenAPI version 0.5 the module host attribute has been replaced with host elements. A host element can contain elements describing how the glue system can connect to a variety of host machine types. A glue system can support any host type specified in the architecture specification. OpenAPI v0.5 specifies docker and vagrant host elements for a glue system to connect to a host on a Docker container and on a Vagrant machine, respectively. An OpenAPI glue system is responsible for preparing the resources required for a module host machine to execute a module's source scripts, and for retrieving the outputs resulting from executing a module.

The following sections provide details for specifying a host for module execution using Docker and Vagrant.
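A glue system reading a module needs to find the host element and decide whether it supports the host type named there. The Python sketch below shows one way this dispatch could look. It assumes that the host element is a direct child of the module element, that the first supported host type wins, and that a host with no supported type is an error; none of these behaviours is stated by the specification, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Host types specified by OpenAPI v0.5.
SUPPORTED_HOSTS = ("docker", "vagrant")


def local_name(tag):
    """Strip any XML namespace prefix of the form '{uri}name'."""
    return tag.rsplit("}", 1)[-1]


def read_host(module_xml):
    """Return (host_type, attributes) for a module, or None if it has no host."""
    root = ET.fromstring(module_xml)
    for child in root:
        if local_name(child.tag) != "host":
            continue
        kinds = []
        for spec in child:
            kind = local_name(spec.tag)
            kinds.append(kind)
            if kind in SUPPORTED_HOSTS:
                return kind, dict(spec.attrib)
        # Assumed behaviour: fail loudly rather than silently run locally.
        raise ValueError(f"no supported host type among: {kinds}")
    return None
```

A module without a host element returns `None`, which a glue system would treat as a request for local execution.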

Docker host containers

A module author can specify that a module is to be executed on a Docker container host using the docker element. This element has one required attribute, ‘image’, and one optional attribute, ‘guestdir’. The ‘image’ attribute accepts the name of a Docker image to be used for execution. The ‘guestdir’ attribute accepts the file path of the directory where the module source scripts will be executed on the Docker container (guest machine). A module author specifies a module should be executed using a Docker container which uses the Docker image “rocker/r-base” as in the following example:

<module language="R">
  <host>
    <docker image="rocker/r-base"/>
  </host>
  ...
</module>

Support for docker host elements has been implemented in Conduit v0.5. When the ‘guestdir’ is not provided, as in the above example, Conduit will execute in the “/home/conduit” directory on the Docker container. Preparing and retrieving module input and output objects is simplified for docker hosts in Conduit v0.5 by syncing the glue system's module output directory directly with the Docker container. The Conduit package requires that Docker be installed on the system, and that the user running the R session is a member of the ‘docker’ group. Conduit 0.5 was tested using Docker version 1.11.1. Adding a user to the ‘docker’ group in Ubuntu Linux is described in the Docker documentation.
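The Python sketch below illustrates how a docker host element could be turned into a `docker run` command line. It is a guess at one plausible approach, not a description of Conduit's internals: the use of a bind mount (`--volume`) to sync the output directory, the function name, and the example command are assumptions. Only the ‘image’ and ‘guestdir’ attributes and the “/home/conduit” default come from the text above.

```python
import xml.etree.ElementTree as ET

# Default working directory used by Conduit v0.5 when 'guestdir' is absent.
DEFAULT_GUESTDIR = "/home/conduit"


def docker_run_args(host_xml, output_dir, command):
    """Build an argument list for running `command` in the named image.

    `output_dir` is the glue system's module output directory on the local
    machine; it is mounted at the guest directory (an assumed mechanism).
    """
    docker = ET.fromstring(host_xml).find("docker")
    if docker is None:
        raise ValueError("host element has no docker child")
    image = docker.get("image")
    if not image:
        raise ValueError("docker element requires an 'image' attribute")
    guestdir = docker.get("guestdir", DEFAULT_GUESTDIR)
    return [
        "docker", "run", "--rm",
        "--volume", f"{output_dir}:{guestdir}",  # sync the output directory
        "--workdir", guestdir,                   # execute in the guest directory
        image,
        *command,
    ]
```

The resulting list could be passed to a process launcher such as `subprocess.run`; the list form avoids shell quoting problems with paths that contain spaces.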

The following example demonstrates how a module with a docker module host can be executed in Conduit v0.5. The module file dockerModule.xml contains host XML specifying that the module source scripts should be executed on a Docker container made from the “rocker/r-base” image:

<host>
  <docker image="rocker/r-base"/>
</host>

The module is read into Conduit in the usual fashion:

Then the module is executed in the usual fashion:

If the user running Conduit is in the ‘docker’ group, and Docker is installed, the module source scripts will be executed on the Docker container. This will produce a moduleResult object which can be exported to a module result archive.

The module result archive from this example is available for inspection at dockerModule.tar.gz.

Vagrant host machines

A module author can specify that a module is to be executed on a Vagrant machine using the vagrant element. The vagrant element has one required attribute, ‘vagrantfile’, and two optional attributes, ‘guestdir’ and ‘hostdir’. The ‘vagrantfile’ attribute requires a file path to a Vagrantfile on the local system. The ‘hostdir’ attribute allows the module author to name a directory on the machine running a glue system (host system) which is to be synced with the Vagrant machine. The ‘guestdir’ attribute allows the module author to specify the directory on the Vagrant machine (guest machine) to be synced with ‘hostdir’. The following demonstrates how a module author specifies a Vagrant host machine:

<module language="R">
  <host>
    <vagrant vagrantfile="~/vagrant/vagrant-conduit/Vagrantfile"/>
  </host>
  ...
</module>

In this example the module source scripts will be executed using the Vagrant machine defined using the Vagrantfile found at “~/vagrant/vagrant-conduit/Vagrantfile” on the machine running a glue system.

Support for vagrant host elements has been implemented in Conduit v0.5. When the module author does not specify a ‘hostdir’ Conduit will use the directory containing the specified ‘vagrantfile’ as the synced folder. If the author does not specify a ‘guestdir’ Conduit will use the “/vagrant” directory as the target for the synced folder on the Vagrant machine (guest machine). Conduit v0.5 prepares the host machine by populating the ‘hostdir’, and thus the ‘guestdir’, with the resources required to execute the module. After executing the module in ‘guestdir’ on the Vagrant machine, Conduit returns the resulting outputs to the glue system's own module output directory. For the Vagrant host XML shown above, which specifies neither attribute, Conduit will use the defaults for both ‘hostdir’ and ‘guestdir’.
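The default rules for ‘hostdir’ and ‘guestdir’ can be summarised in a short Python sketch. The function name and the returned dictionary are hypothetical, and expanding “~” in paths is an assumption about how such paths would be interpreted; the defaults themselves are those described above for Conduit v0.5.

```python
import os
import xml.etree.ElementTree as ET

# Default synced folder target on the Vagrant machine in Conduit v0.5.
DEFAULT_GUESTDIR = "/vagrant"


def resolve_vagrant_host(host_xml):
    """Resolve vagrantfile, hostdir and guestdir, applying the defaults."""
    vagrant = ET.fromstring(host_xml).find("vagrant")
    if vagrant is None:
        raise ValueError("host element has no vagrant child")
    vagrantfile = vagrant.get("vagrantfile")
    if not vagrantfile:
        raise ValueError("vagrant element requires a 'vagrantfile' attribute")
    vagrantfile = os.path.expanduser(vagrantfile)

    hostdir = vagrant.get("hostdir")
    if hostdir:
        hostdir = os.path.expanduser(hostdir)
    else:
        # Default: the directory containing the Vagrantfile.
        hostdir = os.path.dirname(vagrantfile)

    guestdir = vagrant.get("guestdir") or DEFAULT_GUESTDIR
    return {"vagrantfile": vagrantfile, "hostdir": hostdir, "guestdir": guestdir}
```

With the resolved values a glue system would copy the module's resources into `hostdir` before execution and collect the outputs from it afterwards.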

The Conduit package requires that Vagrant be installed on the system, and that the file named in ‘vagrantfile’ exists on the local filesystem. It also requires that the Vagrant machine named be running when the module is executed—Conduit will not start a stopped Vagrant machine. Conduit 0.5 was tested using Vagrant version 1.8.1.

The following example demonstrates how a module with a vagrant module host can be executed in Conduit v0.5. The module file vagrantModule.xml contains host XML specifying that the module source scripts should be executed on a Vagrant machine described in a Vagrantfile found on the local machine at “~/vagrant/vagrant-conduit/Vagrantfile”, with this machine having already been started. The Vagrantfile used in this example, and its provisioning scripts, can be downloaded from GitHub.

<host>
  <vagrant vagrantfile="~/vagrant/vagrant-conduit/Vagrantfile"/>
</host>

The module is read into Conduit in the usual fashion:

Then the module is executed in the usual fashion:

If: (a) Vagrant is installed on the machine running Conduit; (b) there is a Vagrantfile at “~/vagrant/vagrant-conduit/Vagrantfile”; and (c) this Vagrant machine has been started, the module source scripts will be executed in the Vagrant machine. This will produce a moduleResult object which can be exported to a module result archive.

The module result archive from this example is available for inspection at vagrantModule.tar.gz.

Discussion

In this report we have described the introduction of persistent result archives for modules and pipelines in OpenAPI version 0.5. These result archives allow for the outputs produced by an OpenAPI module or pipeline to be preserved and recovered at a later time, or even in another glue system or on another machine. Within a glue system this provides a mechanism for caching module results, preventing a glue system from having to execute a computationally- or time-intensive task multiple times. It also provides a mechanism for sharing these results with other users. A pipeline result can serve as a simplified method for a user to incorporate results from modules which may not be practical to execute on her local machine, while still employing module and pipeline XML.

We have also described how module host machines can help to solve an aspect of the dependency problem, by guaranteeing a glue system can meet the system requirements of a module's source scripts. The changes to module host elements implemented in OpenAPI v0.5 provide more nuanced access to Docker and Vagrant virtual computer environments, allowing a module author to take advantage of each of these systems. Via this mechanism a module result archive is something like a self-contained executable mini-program, producing output results on any machine with a suitable glue system environment. The OpenAPI v0.5 specification also provides a path for adding other host types in the future. The authors believe that a more considered implementation of an SSH host type is desirable in future developments of OpenAPI.

Two technical reports demonstrate how docker hosts can be used in authoring OpenAPI pipelines. An Improved Pipeline for CPI Data (2016) and An OpenAPI Pipeline for NZ Crime Data (2016) are written using Conduit v0.5.

Technical requirements

Resources