20 June 2016
The OpenAPI architecture specifies a method for sharing and combining pieces of data analysis code so as to help people connect with data. OpenAPI describes how pieces of data analysis code can be wrapped in modules, and how these modules can be combined in pipelines which describe a whole data analysis workflow. Modules and pipelines are created using simple XML, and are intended to promote sharing and re-use. Modules and pipelines can be executed in OpenAPI glue system software (Introducing OpenAPI, OpenAPI version 0.3, OpenAPI version 0.5).
This report describes the changes to the OpenAPI architecture in version 0.6. Module host elements were implemented in OpenAPI version 0.5; version 0.6 has extended module hosts so it is possible to provide a module's host information through a module input. The way a module author specifies a module's language has also been changed in version 0.6, with the introduction of the language element. This element also allows the author to specify details of the version of the language required.
The changes described in this report have been implemented in Conduit version 0.6. Conduit is a prototype OpenAPI glue system implemented as an R package. Further details of changes to Conduit v0.6 are provided at the end of this report.
Module host elements were introduced in OpenAPI v0.5 to provide a method for meeting module source script dependencies (OpenAPI version 0.5). One limitation of the docker and vagrant host elements described in OpenAPI v0.5 was that a module author had to assume the host machine described had already been created and would be available when the module was executed. Module host elements were static, meaning module authors had no way of creating a custom host machine as part of an OpenAPI pipeline. Anyone wishing to run a module which required a host machine was also responsible for ensuring that the host machine was ready to be accessed.
OpenAPI version 0.6 introduces the moduleInput element to the list of elements available inside a module host element. Unlike docker or vagrant host elements, the moduleInput element indicates that the module's host is to be provided in one of the module's input elements. This allows a module author to create a module host machine and reference it in the output of a module. A pipeline can then pass this output to a module waiting for its host to be provided by one of its inputs.
The following code shows how a module author can specify a module input as the source of the host machine details:
<module xmlns="http://www.openapi.org/2014/">
<language>R</language>
<host>
<moduleInput name="hostMachine"/>
</host>
<input name="hostMachine">
<file ref="hostMachine.xml"/>
<format formatType="text">XML file</format>
</input>
</module>
The moduleInput element has a name attribute, which the module author uses to name the input that will provide the host machine XML. In the above code the “hostMachine” input is named. Note that the moduleInput name attribute must exactly match the name of one of the module's input elements. When this module is executed the glue system reads the details of the host machine described in “hostMachine.xml” and executes the module on this host machine. This XML document should contain only the XML describing a docker or vagrant host machine, as introduced in OpenAPI v0.5.
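The matching rule described above can be sketched in Python. This is only an illustration of the check a glue system might make before executing a module; the function name host_input_name and the use of Python's xml.etree.ElementTree module are assumptions of this example, not part of OpenAPI or Conduit.

```python
import xml.etree.ElementTree as ET

# OpenAPI namespace as used in the module XML above
NS = {"oa": "http://www.openapi.org/2014/"}

MODULE_XML = """\
<module xmlns="http://www.openapi.org/2014/">
  <language>R</language>
  <host>
    <moduleInput name="hostMachine"/>
  </host>
  <input name="hostMachine">
    <file ref="hostMachine.xml"/>
    <format formatType="text">XML file</format>
  </input>
</module>
"""


def host_input_name(module_xml):
    """Return the name of the input that supplies the host machine XML.

    Returns None when the module has no <host><moduleInput/></host>
    element, and raises ValueError when the moduleInput name does not
    match any of the module's input names. (Hypothetical helper.)
    """
    root = ET.fromstring(module_xml)
    module_input = root.find("oa:host/oa:moduleInput", NS)
    if module_input is None:
        return None
    name = module_input.get("name")
    input_names = [i.get("name") for i in root.findall("oa:input", NS)]
    if name not in input_names:
        raise ValueError(
            "moduleInput name %r matches none of the inputs %r"
            % (name, input_names))
    return name


print(host_input_name(MODULE_XML))
```

For the module shown above the function returns “hostMachine”; renaming the input without renaming the moduleInput would raise an error instead.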
The ‘simpleInputHost’ pipeline example demonstrates how a module host can be passed into a module as an input. The pipeline contains two modules, ‘createVagrant’ and ‘normalList’. A pipe connects the ‘vagrantMachine.xml’ output of ‘createVagrant’ to an input of the same name in ‘normalList’. The XML code for simpleInputHost/pipeline.xml follows:
<?xml version="1.0"?>
<pipeline xmlns="http://www.openapi.org/2014/">
  <description>demonstrate use of &lt;host&gt;&lt;moduleInput/&gt;&lt;/host&gt;</description>
  <component name="createVagrant" type="module">
    <file ref="createVagrant.xml"/>
  </component>
  <component name="normalList" type="module">
    <file ref="normalList.xml"/>
  </component>
  <pipe>
    <start component="createVagrant" output="vagrantMachine.xml"/>
    <end component="normalList" input="vagrantMachine.xml"/>
  </pipe>
</pipeline>
The ‘createVagrant’ module wraps a bash script, which creates a Vagrantfile on the local machine, and starts the vagrant machine. The script creates an XML file, ‘vagrantMachine.xml’, containing OpenAPI vagrant host XML. The module returns the ‘vagrantMachine.xml’ as an output. The XML code for createVagrant.xml follows:
<?xml version="1.0"?>
<module xmlns="http://www.openapi.org/2014/">
  <language>bash</language>
  <description>create a vagrant machine and spin it up</description>
  <source>
    <script><![CDATA[#! /bin/bash
# create a vagrant machine from 'hashicorp/precise32' box in current
# directory

## make vagrant machine directory and go to it
vagrantdir=~/vagrant/precise32
if [ ! -e $vagrantdir ]
then
    mkdir -p $vagrantdir
fi
cd $vagrantdir

## create Vagrantfile
if [ ! -e Vagrantfile ]
then
    vagrant init hashicorp/precise32
fi

## do vagrant up
vagrant up

## back to old dir to create OpenAPI host XML
cd $OLDPWD
echo "<vagrant vagrantfile=\"~/vagrant/precise32/Vagrantfile\"/>" > vagrantMachine.xml]]></script>
  </source>
  <output name="vagrantMachine.xml">
    <file ref="vagrantMachine.xml"/>
    <format formatType="text">XML File</format>
  </output>
</module>
The ‘normalList’ module wraps a “python2” script. The module has a moduleInput host, which names the input ‘vagrantMachine.xml’. The module has one input, ‘vagrantMachine.xml’. The script will be executed on the host machine passed in by this input. The script produces a list of random numbers, which is returned as output ‘x’. The XML code for normalList.xml follows:
<?xml version="1.0"?>
<module xmlns="http://www.openapi.org/2014/">
  <language>python2</language>
  <host>
    <moduleInput name="vagrantMachine.xml"/>
  </host>
  <description>generate list of 10 numbers from norm(0,1)</description>
  <input name="vagrantMachine.xml">
    <file ref="vagrantMachine.xml"/>
    <format formatType="text">XML file</format>
  </input>
  <source>
    <script><![CDATA[#! /usr/bin/python2
import random

## generate list of 10 from norm(0,1)
x = [0] * 10
for i in range(len(x)):
    x[i] = random.gauss(0, 1)]]></script>
  </source>
  <output name="x">
    <internal symbol="x"/>
    <format formatType="text">python list</format>
  </output>
</module>
When this pipeline is executed in an OpenAPI glue system, the glue system will execute ‘normalList’ on the host machine created by ‘createVagrant’, even if this machine did not exist when the pipeline was loaded. This pipeline can be executed in Conduit on a machine with Vagrant installed. Please note that executing the ‘createVagrant’ module will create and start a new vagrant machine in the directory ‘~/vagrant/precise32’ on the machine running Conduit. If this directory does not exist it will be created. If the ‘hashicorp/precise32’ Vagrant box has not been downloaded to your machine it will be downloaded first. The vagrant machine will not be halted by this pipeline, and should be halted manually from ‘~/vagrant/precise32’. The following code executes the pipeline:
simpleInputHost <- loadPipeline(name = "simpleInputHost",
                                ref = file.path("examples", "simpleInputHost",
                                                "pipeline.xml"))
result1 <- runPipeline(pipeline = simpleInputHost)
result1Tarball <- export(result1)
The vagrant host XML created by the ‘createVagrant’ module can be found at pipelines/simpleInputHost/createVagrant/vagrantMachine.xml. The result of running ‘simpleInputHost’ can be found at simpleInputHost.tar.gz.
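To make the ordering implied by this example concrete, the following Python sketch works out which component's output supplies the host for each module that uses a moduleInput host. It is a simplified illustration only: the function host_sources and the host_inputs dictionary are hypothetical names for this example, the pipeline XML is abbreviated (the description element is omitted), and nothing here describes how Conduit resolves hosts internally.

```python
import xml.etree.ElementTree as ET

# OpenAPI namespace as used in the pipeline XML above
NS = {"oa": "http://www.openapi.org/2014/"}

PIPELINE_XML = """\
<pipeline xmlns="http://www.openapi.org/2014/">
  <component name="createVagrant" type="module">
    <file ref="createVagrant.xml"/>
  </component>
  <component name="normalList" type="module">
    <file ref="normalList.xml"/>
  </component>
  <pipe>
    <start component="createVagrant" output="vagrantMachine.xml"/>
    <end component="normalList" input="vagrantMachine.xml"/>
  </pipe>
</pipeline>
"""


def host_sources(pipeline_xml, host_inputs):
    """Map each component to the (component, output) that supplies its host.

    host_inputs maps a component name to the input named by that
    module's <moduleInput/> element. (Hypothetical helper.)
    """
    root = ET.fromstring(pipeline_xml)
    sources = {}
    for pipe in root.findall("oa:pipe", NS):
        start = pipe.find("oa:start", NS)
        end = pipe.find("oa:end", NS)
        consumer = end.get("component")
        # Only pipes ending at the host-providing input are of interest
        if host_inputs.get(consumer) == end.get("input"):
            sources[consumer] = (start.get("component"), start.get("output"))
    return sources


# 'normalList' names the input 'vagrantMachine.xml' in its moduleInput host
print(host_sources(PIPELINE_XML, {"normalList": "vagrantMachine.xml"}))
```

The result shows that ‘createVagrant’ has to run before the host for ‘normalList’ is known, which is why the host machine need not exist when the pipeline is loaded.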
In OpenAPI v0.3 a module author specified the execution language for a module's source scripts using the module language attribute (OpenAPI version 0.3). In OpenAPI v0.6 this attribute has been replaced with a language element. OpenAPI v0.6 also introduces attributes on the language element for specifying the language version required to execute a module's source scripts.
Each OpenAPI module must contain a language element as its first element. The value of the language element should be the language in which the module's source scripts are to be executed. The following code demonstrates how a module author indicates that a module should be executed using the “python” language:
<module xmlns="http://www.openapi.org/2014/">
<language>python</language>
...
</module>
A module author can now also specify the minimum and maximum version of the language required using either the minVersion or the maxVersion attribute, or both. These attributes should be provided with a version number string appropriate to the language named. The following code demonstrates how a module author can specify that a version of the “R” language between versions “2.14.1” and “3.0.2” should be used:
<module xmlns="http://www.openapi.org/2014/">
<language minVersion="2.14.1" maxVersion="3.0.2">R</language>
...
</module>
If a module author instead requires an exact version of a language for script execution she can specify this using the new version attribute of the language element. The version attribute should be provided with a version number string appropriate to the language named. The following code demonstrates how a module author can specify that version “3.5.1+” of the “python” language should be used:
<module xmlns="http://www.openapi.org/2014/">
<language version="3.5.1+">python</language>
...
</module>
While OpenAPI version 0.6 has introduced the ability to specify the version of a language used to execute a module's source scripts, it is not a requirement of OpenAPI v0.6 that a glue system must respect such specifications. Rather it is intended that module language version information provide a glue system user with a means for debugging problematic module execution. For example, a glue system may provide an option to warn a module user when version requirements are not met. Language version attributes may also serve as a means for recording the actual version of a language used when creating module and pipeline result objects. This could provide useful information about how certain module and pipeline results were achieved, which could aid in the reproducibility of a module or pipeline.
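As a rough illustration of the kind of check a glue system might offer, the following Python sketch compares the version of a language actually in use against version, minVersion and maxVersion values. The function names, the decision to ignore a trailing “+”, and the zero-padding of short version strings are all assumptions made for this example; OpenAPI v0.6 does not define how version strings are to be compared.

```python
import re


def parse_version(text):
    """Convert a version string such as '2.14.1' or '3.5.1+' to a tuple.

    Assumption: only the leading digits of each dot-separated part are
    used, and short versions are zero-padded to three parts.
    """
    parts = []
    for part in text.strip().split("."):
        match = re.match(r"\d+", part)
        if match is None:
            break
        parts.append(int(match.group()))
    if not parts:
        raise ValueError("no version number found in %r" % text)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def version_problems(actual, version=None, min_version=None, max_version=None):
    """Return a list of messages, one per violated version attribute."""
    actual_v = parse_version(actual)
    problems = []
    if version is not None and actual_v != parse_version(version):
        problems.append("%s is not exactly version %s" % (actual, version))
    if min_version is not None and actual_v < parse_version(min_version):
        problems.append("%s is below minVersion %s" % (actual, min_version))
    if max_version is not None and actual_v > parse_version(max_version):
        problems.append("%s is above maxVersion %s" % (actual, max_version))
    return problems


# a version inside the range given by minVersion="2.14.1" maxVersion="3.0.2"
print(version_problems("3.0.1", min_version="2.14.1", max_version="3.0.2"))
# a version that differs from version="2.14.1"
print(version_problems("3.3.0", version="2.14.1"))
```

An empty list means all stated requirements are met; a glue system could choose to turn any returned messages into warnings rather than errors, in keeping with the advisory role described above.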
The next section includes a description of how Conduit version 0.6 has implemented module language version information.
Conduit is a prototype OpenAPI glue system implemented as an R package. Conduit version 0.6 implements the changes to OpenAPI v0.6 described in this report: passing module host information through a module input and specifying module language version details.
This section describes how module language elements have been implemented in Conduit v0.6 to optionally provide a warning to the user when the language used to execute a module does not meet the module author's requirements. Conduit will also record the exact version of a language used for module execution in its persistent module results. Finally, language version information has been used to determine which of the two commonly installed versions of the Python language is to be used for module execution.
In Conduit version 0.6 a new argument, warnVersion, has been added to the runModule() function, which is used to execute a module. When this argument is left at its default value of FALSE the module will behave no differently than in previous versions. However, when warnVersion is set to TRUE Conduit will give a warning if any of a module's language version attributes are violated.
The ‘listOfThings’ module specifies that it is to be executed using version “2.14.1” of the “R” language. The XML code for listOfThings.xml follows:
<?xml version="1.0"?>
<module xmlns="http://www.openapi.org/2014/">
  <language version="2.14.1">R</language>
  <source>
    <script><![CDATA[### create a list of things
listOfThings <- list(
    one = rnorm(n = 100),
    two = LETTERS,
    three = iris,
    four = outer(1:12, 1:12))
]]></script>
  </source>
  <output name="listOfThings">
    <internal symbol="listOfThings"/>
    <format formatType="text">R list object</format>
  </output>
</module>
If this module is executed in a version of R other than 2.14.1 and the module is executed with runModule(..., warnVersion = TRUE) Conduit will give a warning. The following demonstrates the ‘listOfThings’ module being executed in an R version other than 2.14.1:
listOfThings <- loadModule(name = "listOfThings",
                           ref = file.path("examples", "warnVersion",
                                           "listOfThings.xml"))
result2 <- runModule(module = listOfThings, targetDirectory = tempdir(),
                     warnVersion = TRUE)
## Warning in warnLanguageVersion(module = module, moduleResult =
## moduleResult): R 3.3.0 was not exactly version 2.14.1 when executing module
## listOfThings
result2tarball <- export(result2)
The result of running ‘listOfThings’ can be found at listOfThings.tar.gz.
Persistent result formats for modules were introduced in OpenAPI version 0.5. In Conduit v0.6 the exact version of the language used to execute a module's source scripts is stored in the module result created. Information on the language used to execute the ‘listOfThings’ example above can be seen with the following R code:
cat(paste0("Language: '", result2$component$language$language, "'"),
    paste0("Version: '", result2$component$language$version, "'"),
    sep = "\n")
## Language: 'R'
## Version: '3.3.0'
There are currently two major versions of Python in common use, Python versions 2.7.11 and 3.5.1 being the most recent stable releases. The Python Software Foundation says “Python 3.x is the present and future of the language” (Should I use Python 2 or Python 3 for my development activity?). However, many common software libraries only support Python 2, and the 2.7 release of Python will receive security and bug fixes from the core development team until 2020 (PEP 373 -- Python 2.7 Release Schedule).
Previous versions of Conduit used the system command /usr/bin/python to execute the source scripts of “python”-language modules. On many Linux systems this meant that Python v2.x would be used. From version 0.6 Conduit will try to execute any module which specifies “python” as its language using Python v3.x via the system command /usr/bin/python3. Versions 2.x or 3.x of Python can be invoked by providing “python2” and “python3” respectively as a module's language.
In Conduit v0.6 the following two module XML examples will be executed using Python v3.x:
<module xmlns="http://www.openapi.org/2014/">
<language>python</language>
...
</module>
<module xmlns="http://www.openapi.org/2014/">
<language>python3</language>
...
</module>
If a module author wishes to execute a module's source scripts using Python v2.x she can specify the “python2” language as in the following module XML:
<module xmlns="http://www.openapi.org/2014/">
<language>python2</language>
...
</module>
The introduction of module language version attributes provides another mechanism for controlling the version of Python used to execute module source scripts. If a module author specifies that a module's scripts should be executed using the “python” language with a maxVersion less than version “3.0.0”, Conduit v0.6 will execute these scripts as if the author had specified “python2” as the module language. The following module XML will be executed using Python v2.x in Conduit v0.6:
<module xmlns="http://www.openapi.org/2014/">
<language maxVersion="2.8">python</language>
...
</module>
Similarly, if a module author uses the version attribute to specify that a module requires an exact version of “python” less than version “3.0.0”, Conduit v0.6 will execute this module's scripts as if “python2” had been specified. This will occur even if the version of Python v2.x used by Conduit does not match version exactly. The following module XML will be executed using Python v2.x in Conduit v0.6:
<module xmlns="http://www.openapi.org/2014/">
<language version="2.7.11+">python</language>
...
</module>
While Conduit v0.6 will select a Python version for execution in the way described above, it necessarily depends on there being Python executables at /usr/bin/python2 and /usr/bin/python3. Most Linux distributions provide a method to install Python versions 2.x and 3.x from their software repositories, and many install one or both by default.
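The selection rules described in this section can be summarised in a short Python sketch. Conduit itself is written in R, so this is not its implementation; the function names python_command and major_version are hypothetical, and the sketch relies on the observation that a version below “3.0.0” is one whose leading number is less than 3.

```python
import re

PYTHON2 = "/usr/bin/python2"
PYTHON3 = "/usr/bin/python3"


def major_version(text):
    """Return the leading integer of a version string such as '2.7.11+'."""
    match = re.match(r"\s*(\d+)", text)
    if match is None:
        raise ValueError("no version number found in %r" % text)
    return int(match.group(1))


def python_command(language, version=None, max_version=None):
    """Pick an interpreter path for a python-family module language.

    Mirrors the rules described in the text: 'python2' and 'python3'
    are explicit, while plain 'python' means Python v3.x unless the
    version or maxVersion attribute is below 3.0.0.
    """
    if language == "python2":
        return PYTHON2
    if language == "python3":
        return PYTHON3
    if language != "python":
        raise ValueError("not a Python language: %r" % language)
    if max_version is not None and major_version(max_version) < 3:
        return PYTHON2
    if version is not None and major_version(version) < 3:
        return PYTHON2
    return PYTHON3


# the four language elements shown in this section
print(python_command("python"))
print(python_command("python2"))
print(python_command("python", max_version="2.8"))
print(python_command("python", version="2.7.11+"))
```

A fuller version would also check that the chosen executable exists before use, since, as noted above, the selection depends on both interpreters being installed at these paths.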
OpenAPI version 0.6 by Ashley Noel Hinton and Paul Murrell is licensed under a Creative Commons Attribution 4.0 International License.