OpenAPI version 0.6

Ashley Noel Hinton
ahin017@aucklanduni.ac.nz

Paul Murrell
paul@stat.auckland.ac.nz

Department of Statistics, The University of Auckland

20 June 2016

The OpenAPI architecture specifies a method for sharing and combining pieces of data analysis code so as to help people connect with data. OpenAPI describes how pieces of data analysis code can be wrapped in modules, and how these modules can be combined in pipelines which describe a whole data analysis workflow. Modules and pipelines are created using simple XML, and are intended to promote sharing and re-use. Modules and pipelines can be executed in OpenAPI glue system software (Introducing OpenAPI, OpenAPI version 0.3, OpenAPI version 0.5).

This report describes the changes to the OpenAPI architecture in version 0.6. Module host elements were implemented in OpenAPI version 0.5; version 0.6 has extended module hosts so it is possible to provide a module's host information through a module input. The way a module author specifies a module's language has also been changed in version 0.6, with the introduction of the language element. This element also allows the author to specify details of the version of the language required.

The changes described in this report have been implemented in Conduit version 0.6. Conduit is a prototype OpenAPI glue system implemented as an R package. Further details of changes to Conduit v0.6 are provided at the end of this report.

Passing module host information through a module input

Module host elements were introduced in OpenAPI v0.5 to provide a method for meeting module source script dependencies (OpenAPI version 0.5). One limitation of the docker and vagrant host elements described in OpenAPI v0.5 was that a module author had to assume the host machine described had already been created and would be available when the module was executed. Module host elements were static, meaning module authors had no way of creating a custom host machine as part of an OpenAPI pipeline. Anyone wishing to run a module which required a host machine was also responsible for ensuring this host machine was ready to be accessed.

OpenAPI version 0.6 introduces the moduleInput element to the list of elements available inside a module host element. Unlike docker or vagrant host elements, the moduleInput element indicates that the module's host is to be provided in one of the module's input elements. This allows a module author to create a module host machine and reference it in the output of a module. A pipeline can then pass this output to a module waiting for its host to be provided by one of its inputs.

The following code shows how a module author can specify a module input as the source of the host machine details:

<module xmlns="http://www.openapi.org/2014/">
  <language>R</language>
  <host>
    <moduleInput name="hostMachine"/>
  </host>
  <input name="hostMachine">
    <file ref="hostMachine.xml"/>
    <format formatType="text">XML file</format>
  </input>
</module>

The moduleInput element has a name attribute, which the module author uses to name the input that will provide the host machine XML. In the above code the “hostMachine” input is named. Note that the moduleInput name attribute must exactly match the name of one of the module's inputs. When this module is executed the glue system reads the details of the host machine described in “hostMachine.xml” and executes the module on this host machine. This XML document should contain only the XML describing a docker or vagrant host machine, as introduced in OpenAPI v0.5.
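
The name-matching rule above can be sketched in Python. This is an illustration only, not Conduit's implementation: the function name host_input_file and its error handling are assumptions, and only the element and attribute names come from the module XML shown above.

```python
import xml.etree.ElementTree as ET

# Namespace used by the module XML in this report.
OPENAPI_NS = {"oa": "http://www.openapi.org/2014/"}


def host_input_file(module_xml):
    """Return the file ref of the input named by <host><moduleInput/></host>.

    Hypothetical helper: returns None when the module has no moduleInput
    host, and raises ValueError when the name matches no input.
    """
    module = ET.fromstring(module_xml)
    module_input = module.find("oa:host/oa:moduleInput", OPENAPI_NS)
    if module_input is None:
        return None
    wanted = module_input.get("name")
    for inp in module.findall("oa:input", OPENAPI_NS):
        # The match must be exact, as the text above requires.
        if inp.get("name") == wanted:
            file_el = inp.find("oa:file", OPENAPI_NS)
            return file_el.get("ref") if file_el is not None else None
    raise ValueError("moduleInput names unknown input: %r" % wanted)


EXAMPLE_MODULE = """<module xmlns="http://www.openapi.org/2014/">
  <language>R</language>
  <host>
    <moduleInput name="hostMachine"/>
  </host>
  <input name="hostMachine">
    <file ref="hostMachine.xml"/>
    <format formatType="text">XML file</format>
  </input>
</module>"""

print(host_input_file(EXAMPLE_MODULE))  # prints: hostMachine.xml
```

Raising an error on a mismatched name is a design choice of this sketch; a glue system could equally report the problem when the module is loaded.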

The ‘simpleInputHost’ pipeline example demonstrates how a module host can be passed into a module as an input. The pipeline contains two modules, ‘createVagrant’ and ‘normalList’. A pipe connects the ‘createVagrant’ output ‘vagrantMachine.xml’ to an input of the same name in ‘normalList’. The XML code for simpleInputHost/pipeline.xml follows:

<?xml version="1.0"?>
<pipeline xmlns="http://www.openapi.org/2014/">
  <description>demonstrate use of &lt;host&gt;&lt;moduleInput/&gt;&lt;/host&gt;</description>
  <component name="createVagrant" type="module">
    <file ref="createVagrant.xml"/>
  </component>
  <component name="normalList" type="module">
    <file ref="normalList.xml"/>
  </component>
  <pipe>
    <start component="createVagrant" output="vagrantMachine.xml"/>
    <end component="normalList" input="vagrantMachine.xml"/>
  </pipe>
</pipeline>
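
Because the host of ‘normalList’ only exists once ‘createVagrant’ has run, a glue system has to execute components in an order that respects the pipes. The following Python sketch shows one way to derive such an order from pipeline XML. The function name execution_order is hypothetical and the approach is an assumption for illustration, not a description of how Conduit schedules modules.

```python
import xml.etree.ElementTree as ET

NS = {"oa": "http://www.openapi.org/2014/"}


def execution_order(pipeline_xml):
    """Order components so that every pipe's start runs before its end."""
    pipeline = ET.fromstring(pipeline_xml)
    names = [c.get("name") for c in pipeline.findall("oa:component", NS)]
    # waiting_on[name] holds the components whose outputs feed `name`.
    waiting_on = {name: set() for name in names}
    for pipe in pipeline.findall("oa:pipe", NS):
        start = pipe.find("oa:start", NS).get("component")
        end = pipe.find("oa:end", NS).get("component")
        waiting_on[end].add(start)
    order = []
    remaining = list(names)
    while remaining:
        # A component is ready once everything it waits on has been ordered.
        ready = [n for n in remaining if waiting_on[n] <= set(order)]
        if not ready:
            raise ValueError("pipes form a cycle")
        order.extend(ready)
        remaining = [n for n in remaining if n not in ready]
    return order


EXAMPLE_PIPELINE = """<pipeline xmlns="http://www.openapi.org/2014/">
  <component name="createVagrant" type="module">
    <file ref="createVagrant.xml"/>
  </component>
  <component name="normalList" type="module">
    <file ref="normalList.xml"/>
  </component>
  <pipe>
    <start component="createVagrant" output="vagrantMachine.xml"/>
    <end component="normalList" input="vagrantMachine.xml"/>
  </pipe>
</pipeline>"""

print(execution_order(EXAMPLE_PIPELINE))  # prints: ['createVagrant', 'normalList']
```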

The ‘createVagrant’ module wraps a bash script, which creates a Vagrantfile on the local machine and starts the vagrant machine. The script then creates an XML file, ‘vagrantMachine.xml’, containing OpenAPI vagrant host XML. The module returns ‘vagrantMachine.xml’ as an output. The XML code for createVagrant.xml follows:

<?xml version="1.0"?>
<module xmlns="http://www.openapi.org/2014/">
  <language>bash</language>
  <description>create a vagrant machine and spin it up</description>
  <source>
    <script><![CDATA[#! /bin/bash
# create a vagrant machine from the 'hashicorp/precise32' box in
# ~/vagrant/precise32, then write OpenAPI host XML to the current directory

## make vagrant machine directory and go to it
vagrantdir=~/vagrant/precise32
if [ ! -e $vagrantdir ]
then
    mkdir -p $vagrantdir
fi
cd $vagrantdir

## create Vagrantfile
if [ ! -e Vagrantfile ]
then
    vagrant init hashicorp/precise32
fi

## do vagrant up
vagrant up

## back to old dir to create OpenAPI host XML
cd $OLDPWD
echo "<vagrant vagrantfile=\"~/vagrant/precise32/Vagrantfile\"/>" > vagrantMachine.xml]]></script>
  </source>
  <output name="vagrantMachine.xml">
    <file ref="vagrantMachine.xml"/>
    <format formatType="text">XML File</format>
  </output>
</module>
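
The file written by this script contains a single host element. A minimal Python sketch of how such a file could be read is given below. The function name read_host_description is hypothetical, and the sketch simply returns whatever attributes are present rather than assuming a fixed attribute list.

```python
import xml.etree.ElementTree as ET


def read_host_description(host_xml):
    """Return (host type, attributes) from a one-element host XML document."""
    root = ET.fromstring(host_xml)
    # Drop a namespace prefix such as "{http://www.openapi.org/2014/}" if present.
    host_type = root.tag.rsplit("}", 1)[-1]
    if host_type not in ("docker", "vagrant"):
        raise ValueError("unsupported host type: %r" % host_type)
    return host_type, dict(root.attrib)


# The XML below is the string echoed by the createVagrant script.
host_type, attrs = read_host_description(
    '<vagrant vagrantfile="~/vagrant/precise32/Vagrantfile"/>'
)
print(host_type, attrs["vagrantfile"])  # prints: vagrant ~/vagrant/precise32/Vagrantfile
```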

The ‘normalList’ module wraps a “python2” script. The module has a moduleInput host, which names the input ‘vagrantMachine.xml’. The module has one input, ‘vagrantMachine.xml’. The script will be executed on the host machine passed in by this input. The script produces a list of random numbers, which is returned as output ‘x’. The XML code for normalList.xml follows:

<?xml version="1.0"?>
<module xmlns="http://www.openapi.org/2014/">
  <language>python2</language>
  <host>
    <moduleInput name="vagrantMachine.xml"/>
  </host>
  <description>generate list of 10 numbers from norm(0,1)</description>
  <input name="vagrantMachine.xml">
    <file ref="vagrantMachine.xml"/>
    <format formatType="text">XML file</format>
  </input>
  <source>
    <script><![CDATA[#! /usr/bin/python2
import random

## generate list of 10 from norm(0,1)
x = [0] * 10
for i in range(len(x)):
    x[i] = random.gauss(0, 1)]]></script>
  </source>
  <output name="x">
    <internal symbol="x"/>
    <format formatType="text">python list</format>
  </output>
</module>

When this pipeline is executed in an OpenAPI glue system, the glue system will execute ‘normalList’ on the host machine created by ‘createVagrant’, even if this machine did not exist when the pipeline was loaded. The pipeline can be executed in Conduit, on a machine with Vagrant installed, with the following code. Please note that executing the ‘createVagrant’ module will create and start a new vagrant machine in the directory ‘~/vagrant/precise32’ on the machine running Conduit. If this directory does not exist it will be created. If the ‘hashicorp/precise32’ Vagrant box has not been downloaded to your machine it will be downloaded first. The pipeline does not halt the vagrant machine, which should be halted manually from ‘~/vagrant/precise32’.

simpleInputHost <- loadPipeline(name = "simpleInputHost", ref = file.path("examples",
    "simpleInputHost", "pipeline.xml"))

result1 <- runPipeline(pipeline = simpleInputHost)
result1Tarball <- export(result1)

The vagrant host XML created by the ‘createVagrant’ module can be found at pipelines/simpleInputHost/createVagrant/vagrantMachine.xml. The result of running ‘simpleInputHost’ can be found at simpleInputHost.tar.gz.

Specifying module language version details

In OpenAPI v0.3 a module author specifies the execution language for a module's source scripts using the module language attribute (OpenAPI version 0.3). In OpenAPI v0.6 this attribute has been replaced with a language element. OpenAPI v0.6 also introduces attributes to the language element for specifying the language version required to execute a module's source scripts.

Each OpenAPI module must contain a language element as its first element. The value of the language element should be the language used to execute the module's source scripts. The following code demonstrates how a module author indicates a module should be executed using the “python” language:

<module xmlns="http://www.openapi.org/2014/">
  <language>python</language>
  ...
</module>

A module author can now also specify the minimum and maximum version of the language required using either the minVersion or the maxVersion attribute, or both. These attributes should be provided with a version number string appropriate to the language named. The following code demonstrates how a module author can specify that a version of the “R” language between versions “2.14.1” and “3.0.2” should be used:

<module xmlns="http://www.openapi.org/2014/">
  <language minVersion="2.14.1" maxVersion="3.0.2">R</language>
  ...
</module>
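
A glue system that chooses to check these attributes needs to compare version strings numerically rather than alphabetically (“2.9.0” is older than “2.14.1”). The Python sketch below shows one way to do this. It makes several assumptions that OpenAPI does not state: both bounds are treated as inclusive, only the digits in a version string are compared, and missing components count as zero. The function names are hypothetical.

```python
import re


def version_key(version):
    """Turn '3.0.2' into (3, 0, 2); non-digit characters are ignored."""
    return tuple(int(part) for part in re.findall(r"\d+", version))


def compare_versions(a, b):
    """Return -1, 0 or 1 as version a is older than, equal to or newer than b."""
    ka, kb = version_key(a), version_key(b)
    width = max(len(ka), len(kb))
    # Pad with zeros so that '2.8' and '2.8.0' compare as equal.
    ka += (0,) * (width - len(ka))
    kb += (0,) * (width - len(kb))
    return (ka > kb) - (ka < kb)


def in_range(actual, min_version=None, max_version=None):
    """Check actual against optional minVersion and maxVersion values."""
    if min_version is not None and compare_versions(actual, min_version) < 0:
        return False
    if max_version is not None and compare_versions(actual, max_version) > 0:
        return False
    return True


print(in_range("3.0.1", min_version="2.14.1", max_version="3.0.2"))  # prints: True
print(in_range("3.3.0", min_version="2.14.1", max_version="3.0.2"))  # prints: False
```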

If a module author instead requires an exact version of a language for script execution she can specify this using the new version attribute of the language element. The version attribute should be provided with a version number string appropriate to the language named. The following code demonstrates how a module author can specify that version “3.5.1+” of the “python” language should be used:

<module xmlns="http://www.openapi.org/2014/">
  <language version="3.5.1+">python</language>
  ...
</module>

Supporting language versions in OpenAPI glue systems

While OpenAPI version 0.6 has introduced the ability to specify the version of a language used to execute a module's source scripts, it is not a requirement of OpenAPI v0.6 that a glue system must respect such specifications. Rather it is intended that module language version information provide a glue system user with a means for debugging problematic module execution. For example, a glue system may provide an option to warn a module user when version requirements are not met. Language version attributes may also serve as a means for recording the actual version of a language used when creating module and pipeline result objects. This could provide useful information about how certain module and pipeline results were achieved, which could aid in the reproducibility of a module or pipeline.

The next section includes a description of how Conduit version 0.6 has implemented module language version information.

Changes in Conduit version 0.6

Conduit is a prototype OpenAPI glue system implemented as an R package. Conduit version 0.6 implements the two changes to OpenAPI v0.6 described in this report: passing module host information through a module input, and specifying module language version details.

This section describes how module language elements have been implemented in Conduit v0.6 to optionally provide a warning to the user when the language used to execute a module does not meet the module author's requirements. Conduit will also record the exact version of a language used for module execution in its persistent module results. Finally, language version information has been used to determine which of the two commonly installed versions of the Python language is to be used for module execution.

Warning about language version violation

In Conduit version 0.6 a new argument, warnVersion, has been added to the runModule() function, which is used to execute a module. When this argument is left at its default value of FALSE, module execution behaves as in previous versions. When warnVersion is set to TRUE, Conduit will give a warning if any of a module's language version attributes are violated.
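
The opt-in behaviour can be illustrated with a short Python sketch. It is not Conduit's R code: the function warn_language_version and its arguments are hypothetical, and for brevity it checks only an exact version requirement by string equality.

```python
import warnings


def warn_language_version(language, actual, required, module_name="module",
                          warn_version=False):
    """Warn when `actual` differs from the exact `required` version.

    Nothing happens unless warn_version is True, mirroring the default
    described above. Returns True when a warning was issued.
    """
    if not warn_version or required is None or actual == required:
        return False
    warnings.warn(
        "%s %s was not exactly version %s when executing module %s"
        % (language, actual, required, module_name)
    )
    return True


# With the default warn_version=False the mismatch is silently accepted.
print(warn_language_version("R", "3.3.0", "2.14.1"))  # prints: False
# With warn_version=True a UserWarning is emitted as well.
print(warn_language_version("R", "3.3.0", "2.14.1", warn_version=True))  # prints: True
```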

The ‘listOfThings’ module specifies that it is to be executed using version “2.14.1” of the “R” language. The XML code for listOfThings.xml follows:

<?xml version="1.0"?>
<module xmlns="http://www.openapi.org/2014/">
  <language version="2.14.1">R</language>
  <source>
    <script><![CDATA[### create a list of things
listOfThings <- list(
    one = rnorm(n = 100),
    two = LETTERS,
    three = iris,
    four = outer(1:12, 1:12))
]]></script>
  </source>
  <output name="listOfThings">
    <internal symbol="listOfThings"/>
    <format formatType="text">R list object</format>
  </output>
</module>

If this module is executed with runModule(..., warnVersion = TRUE) in a version of R other than 2.14.1, Conduit will give a warning. The following demonstrates the ‘listOfThings’ module being executed in an R version other than 2.14.1:

listOfThings <- loadModule(name = "listOfThings", ref = file.path("examples",
    "warnVersion", "listOfThings.xml"))
result2 <- runModule(module = listOfThings, targetDirectory = tempdir(), warnVersion = TRUE)
## Warning in warnLanguageVersion(module = module, moduleResult =
## moduleResult): R 3.3.0 was not exactly version 2.14.1 when executing module
## listOfThings
result2tarball <- export(result2)

The result of running ‘listOfThings’ can be found at listOfThings.tar.gz.

Recording language version used for execution

Persistent result formats for modules were introduced in OpenAPI version 0.5. In Conduit v0.6 the exact version of the language used to execute a module's source scripts is stored in the module result created. Information on the language used to execute the ‘listOfThings’ example above can be seen with the following R code:

cat(paste0("Language: '", result2$component$language$language, "'"), paste0("Version: '",
    result2$component$language$version, "'"), sep = "\n")
## Language: 'R'
## Version: '3.3.0'
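
The same idea, storing the version of the language actually used alongside the result, can be sketched for a Python-language module. The function run_and_record and the layout of the returned dictionary are hypothetical and do not reflect Conduit's result format.

```python
import platform


def run_and_record(script_source):
    """Execute a Python script and return its symbols plus language metadata."""
    symbols = {}
    exec(script_source, symbols)
    return {
        # platform.python_version() reports the interpreter running this code.
        "language": {"language": "python", "version": platform.python_version()},
        "symbols": symbols,
    }


result = run_and_record("x = [0] * 10")
print("Language: '%s'" % result["language"]["language"])
print("Version: '%s'" % result["language"]["version"])
```

The version printed depends on the interpreter used to run the sketch.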

Using language version to choose major Python version

There are currently two major versions of Python in common use, Python versions 2.7.11 and 3.5.1 being the most recent stable releases. The Python Software Foundation says “Python 3.x is the present and future of the language” (Should I use Python 2 or Python 3 for my development activity?). However, many common software libraries only support Python 2, and the 2.7 release of Python will receive security and bug fixes from the core development team until 2020 (PEP 373 -- Python 2.7 Release Schedule).

Previous versions of Conduit used the system command /usr/bin/python to execute the source scripts of “python”-language modules. On many Linux systems this meant that Python v2.x would be used. From version 0.6 Conduit will try to execute any module which specifies “python” as its language using Python v3.x via the system command /usr/bin/python3. Versions 2.x or 3.x of Python can be invoked by providing “python2” and “python3” respectively as a module's language.

In Conduit v0.6 the following two module XML examples will be executed using Python v3.x:

<module xmlns="http://www.openapi.org/2014/">
  <language>python</language>
  ...
</module>
<module xmlns="http://www.openapi.org/2014/">
  <language>python3</language>
  ...
</module>

If a module author wishes to execute a module's source scripts using Python v2.x she can specify the “python2” language as in the following module XML:

<module xmlns="http://www.openapi.org/2014/">
  <language>python2</language>
  ...
</module>

The introduction of module language version attributes provides another mechanism for controlling the version of Python used to execute module source scripts. If a module author specifies that a module's scripts should be executed using the “python” language with a maxVersion less than version “3.0.0”, Conduit v0.6 will execute these scripts as if the author had specified “python2” as the module language. The following module XML will be executed using Python v2.x in Conduit v0.6:

<module xmlns="http://www.openapi.org/2014/">
  <language maxVersion="2.8">python</language>
  ...
</module>

Similarly, if a module author uses the version attribute to specify that a module requires an exact version of “python” less than version “3.0.0”, Conduit v0.6 will execute this module's scripts as if “python2” had been specified. This will occur even if the version of Python v2.x used by Conduit does not match the version attribute exactly. The following module XML will be executed using Python v2.x in Conduit v0.6:

<module xmlns="http://www.openapi.org/2014/">
  <language version="2.7.11+">python</language>
  ...
</module>
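
The selection rules described in this section can be summarised in a short Python sketch. This is a restatement of the rules for illustration, not Conduit's R implementation. The function names are hypothetical, and the sketch assumes that comparing the leading major version number against 3 is equivalent to comparing the whole version string against “3.0.0”.

```python
import re


def major_version(version):
    """Return the leading integer of a version string, e.g. 2 for '2.7.11+'."""
    match = re.match(r"\s*(\d+)", version)
    if match is None:
        raise ValueError("no major version in %r" % version)
    return int(match.group(1))


def python_command(language, version=None, max_version=None):
    """Choose the interpreter path for a python-language module."""
    if language == "python2":
        return "/usr/bin/python2"
    if language == "python3":
        return "/usr/bin/python3"
    if language != "python":
        raise ValueError("not a Python language name: %r" % language)
    # A version or maxVersion below 3.0.0 selects Python v2.x.
    for bound in (version, max_version):
        if bound is not None and major_version(bound) < 3:
            return "/usr/bin/python2"
    # Plain "python" otherwise means Python v3.x from Conduit v0.6.
    return "/usr/bin/python3"


print(python_command("python"))                     # prints: /usr/bin/python3
print(python_command("python", max_version="2.8"))  # prints: /usr/bin/python2
print(python_command("python", version="2.7.11+"))  # prints: /usr/bin/python2
```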

While Conduit v0.6 will select a Python version for execution in the way described above, it necessarily depends on there being Python executables at /usr/bin/python2 and /usr/bin/python3. Most Linux distributions provide a method to install Python versions 2.x and 3.x from their software repositories, and many install one or both by default.
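
Whether these executables are present can be checked before a module is run. The Python sketch below is one way to do so. The function available_interpreters is hypothetical, and the two default paths are the ones named above.

```python
import os


def available_interpreters(paths=("/usr/bin/python2", "/usr/bin/python3")):
    """Map each interpreter path to True if it exists and is executable."""
    return {
        path: os.path.isfile(path) and os.access(path, os.X_OK)
        for path in paths
    }


for path, present in available_interpreters().items():
    print(path, "found" if present else "missing")
```

The output depends on which Python versions are installed on the machine running the sketch.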

Technical requirements

Resources


OpenAPI version 0.6 by Ashley Noel Hinton and Paul Murrell is licensed under a Creative Commons Attribution 4.0 International License.