The problem

Why help people connect with data

Decision-making in the twenty-first century is both characterised and complicated by an abundance of data. Increasing amounts of data are becoming available to the public through initiatives around open government and open data. Technology has not only made massive data collection and analysis possible, it has also delivered the means to store and analyse data into the hands and homes of ordinary people. As Rufus Pollock of Open Knowledge says, "what matters is having the data, of whatever size, that helps us solve a problem or address the question we have" (Pollock, 2014).

The need to help people connect with data has been recognised by institutions and governments throughout the world. In 1989 the Association of College and Research Libraries (ACRL) Presidential Committee on Information Literacy described the challenges faced by society in the Information Age. They described the Information Age as an age of increasing abundance of information. The committee found that for people to gain benefit from these swiftly accumulating repositories of data they must be information literate. One cannot be information literate if one cannot "locate, evaluate, and use effectively" the data or information which is required to make good decisions (American Library Association. Presidential Committee on Information Literacy, 1989).

In 2009 the New Zealand Government recognised the importance of making data available by launching the website data.govt.nz to serve as a catalogue for publicly available government data (Guy, 2009). In 2010 the Government released the New Zealand Government Open Access and Licensing framework (NZGOAL) to guide agencies in releasing their data, through data.govt.nz and otherwise. NZGOAL established standard, open and permissive licenses for government agencies to use for releasing data (Ryall, 2010). The NZGOAL framework was updated in 2015 ("NZGOAL version 2 released", 2015). In 2011 the Government further recognised the importance of data to a democratic society by making a commitment "to releasing high value public data actively for re-use" ("Declaration on open and transparent government", 2011).

The UK Government has also recognised the importance of making public data available, with the launch of the website data.gov.uk in 2010. The site was launched in recognition of government data as a public resource ("Government launches one-stop shop for data", 2010).

In 2014 the New Zealand Data Futures Forum (NZDFF) was established to advise government ministers on how the increasing amounts of government, business and personal data being collected, and made available, would affect the public service (English & Williamson, 2014). In a discussion document the NZDFF establishes 'inclusion' as one of the four key values New Zealand should embrace in managing access to and use of data (New Zealand Data Futures Forum, 2014). The report states that "[a]ll parts of New Zealand society should have the opportunity to benefit from data use" (New Zealand Data Futures Forum, 2014, p. 10).

It is not only the initiatives of large organisations and public institutions which demonstrate the need to help people connect with data. Individuals and small organisations are also working to increase access to and use of data. Open Knowledge is a non-profit organisation which sees data as a means "to make informed choices about how we live, what we buy and who gets our vote." ("Open knowledge: about", 2015) The organisation advocates for the release of data in open and permissive formats. It is also involved in educating businesses and individuals in the ways open data can be used in their work.

Individuals and informal groups, too, are connecting with data, and helping other people to do so. In New Zealand people working on or interested in open government and open data projects can discuss and share their projects through the Open Government Ninjas discussion list ("Open New Zealand", 2015; "The open government ninjas", 2015). Around the world people participate in open data hackathons. Hackathons are events where people meet, often over several days, to work collaboratively and voluntarily on new applications for and visualisations of public data sets. These events are intended both to make use of open data, and to increase awareness of the availability and utility of these data sets. ("International open data hackathon", 2015; "Hackathon winners make life easier for public transport users", 2014)

Not enough people are connecting with data

Many further examples could establish that people need and want to connect with data, and that they are in fact doing so. I will now focus on those people who are not currently connecting with data, and demonstrate that many people are still unable to connect with data. According to the UN Secretary General's Independent Expert Advisory Group on a Data Revolution for Sustainable Development (IEAG), "[t]here are huge and growing inequalities in access to data and information and in the ability to use it" (The UN Secretary General's Independent Expert Advisory Group on a Data Revolution for Sustainable Development, 2014, p. 2). It is the ability to use data and information that will be the focus of these examples.

While much more data is becoming available to the public, many people are unable to connect meaningfully with data due to their inability to access or make use of them; "[t]he sheer abundance of information will not in itself create a more informed citizenry without a complementary cluster of abilities necessary to use information effectively" ("Information literacy competency standards for higher education", 2000, p. 2). To be information literate a person must be able to "[access] needed information effectively and efficiently" ("Information literacy competency standards for higher education", 2000, p. 9). She must also be able to "[manipulate] data, as needed, transferring them from their original locations and formats to a new context." ("Information literacy competency standards for higher education", 2000, p. 13) The availability of data is of no help to someone who does not know how to access, or how to use these data.

In its consultation with stakeholders the NZDFF "was struck by the low level of general awareness of what is currently occurring [in the areas of data collection, access and use] and the opportunities potentially available" (New Zealand Data Futures Forum, 2014, p. 56). NZDFF argues that "open data is of limited value to New Zealanders if they need a degree in statistics to be able to interpret it" (New Zealand Data Futures Forum, 2014, p. 57). NZDFF believes that New Zealanders require data literacy to gain benefit from open data. NZDFF acknowledges that data visualisation and the availability of data visualisation tools are central to helping people understand data, commenting that "[t]here is some activity in the market, but it is an area in which New Zealand could be a lot more proactive" (New Zealand Data Futures Forum, 2014, pp. 56-57).

Hinton and Murrell (2015a) believe that, beyond access to data, the following things are required for people to connect with data: domain knowledge, data science skills, statistical graphics skills and graphical design skills. While the participants in open data hackathons and on open data discussion lists might possess several of these skills, it is far from common to find someone who possesses all of them. NZDFF suggests data journalism as a means of helping people make sense of data (New Zealand Data Futures Forum, 2014). However, during a 2010 data journalism hackathon, participant and journalist Jerry Vermanen confessed that the combination of journalistic skills and the skills needed to work with data was "not a common presence" (Vermanen, 2015).

A software solution

One way of helping people to connect with data is to design software for this purpose. Many software solutions already exist for manipulating and for analysing data; many of these are specialist in nature, and focus on solving particular sets of data problems in specific contexts.

I believe a general framework for creating, describing, and executing data analysis workflows using computers is important for helping more people connect with data. I believe the following features are important in designing a system for describing and executing such workflows:

Workflows should be modular: workflows should be split into components or nodes, each of which can be replaced with another.
Workflows should be easy to reuse and share: it should be easy to share workflows with other people, and for these people to run workflows on their own machines.
The system should be extensible: it should be easy for users to create new nodes or components for use in workflows, and it should be easy to share these nodes.
The system should be open and free: the solution should be freely available for users to access, use, modify and share.
The system should not be highly technical: it should be possible for users to make use of workflows in the system without having high levels of analysis and programming skills.
The system should not be monolithic: a user of the system should not be stuck using the system exclusively.
The ideal system is an infrastructure which meets the previous criteria, and on which convenient interfaces can be built; it is not just a convenient interface.
The system should be aware of reproducible research: it should be possible both to make use of scripts embedded in reproducible research documents, and to embed the system's workflows in such documents.

The OpenAPI project describes a framework for creating, sharing and executing workflows of data analysis called 'pipelines'. A 'pipeline' is composed of several 'modules', each of which does a small task described by a script. Modules are connected by 'pipes' which connect one module's output to another module's inputs; this can be described visually by a directed node and edge graph. Modules and pipelines are described by lightweight XML documents which can easily be shared and executed by 'glue systems' (Hinton & Murrell, 2015a).

Of course, OpenAPI is not the only software project which might be used to help connect people with data. In the next several sections I will consider several software projects which might be used to solve this problem. I believe the projects I have chosen are representative of broader categories of software. The projects were also chosen based on my perception of how fair it was to test them using my criteria, as these criteria might not have been part of their design.

First I will examine KNIME (Version 2.11.3, 2015), which I have chosen to represent the class of software which attempts to provide a visual programming interface to data analysis. Other examples of this class include R Analytic Flow (2015) and Red-R (2010).

The second project, Galaxy, is a web-based platform for performing bio-medical analysis. Though Galaxy has very specific aims it has commonalities with other projects: like KNIME it uses visual programming in parts of its interface; it is a web-based data analysis solution, like OpenCPU (2015) and R Cloud (2015).

Lastly I will examine Gapminder World (2015), a web-based tool for viewing animated visualisations of countries' development statistics. This tool was chosen as a representative of websites and web-based tools which display visualisations of public data to help increase public knowledge through data. Wiki New Zealand (2015) is another example of such a website. I do not expect that such web-based software will meet many of my desired features; these sites attempt to help connect people with data through data visualisations for analysis and discussion rather than by putting the full analysis suite in the users' hands. I hope that the examination of Gapminder World will not seem unfair, but will rather serve to highlight some of the different approaches to the problem.

For each of these projects I have attempted to create a simple workflow represented by three nodes:

consume some data,
process these data, and
produce a visualisation of these data.
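
The three-node workflow above can be sketched as three interchangeable functions. This is a minimal illustration in Python rather than code from any of the projects examined; the function names, the text-bar "visualisation" and the inline sample data are all assumptions made for the example.

```python
import csv
import io
from statistics import mean

# Hypothetical sample data standing in for a file such as 'iris.csv'.
SAMPLE_CSV = """sepal length,sepal width
5.1,3.5
4.9,3.0
6.3,3.3
5.8,2.7
"""


def consume(text):
    """Node 1: read CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def process(rows, column):
    """Node 2: reduce one column to its mean."""
    return mean(float(row[column]) for row in rows)


def visualise(label, value, scale=4):
    """Node 3: draw the value as a one-line text bar."""
    return f"{label}: {'#' * round(value * scale)} ({value:.2f})"


def run_workflow(text, column, reader=consume, processor=process,
                 plotter=visualise):
    """Chain the three nodes; any node can be replaced via a keyword."""
    rows = reader(text)
    return plotter(column, processor(rows, column))


if __name__ == "__main__":
    print(run_workflow(SAMPLE_CSV, "sepal width"))
```

Passing a different `reader`, `processor` or `plotter` substitutes one node without touching the other two, which is the property the first question below asks about.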

Guided by the list of desirable features for workflow software I have attempted to answer the following questions:

Can a user substitute the data node, the processing node, or the visualisation node for another?
Can the workflow be reused and shared?
Can a user create a new node? Can the user choose the language used in the new node?
How open is the software?

KNIME

KNIME is an open source application for managing data analysis. Data analysis workflows are divided into constituent nodes representing steps in the process. The workflow is created and managed in a visual programming environment, using node and edge graphs to represent nodes and how information is passed between them ("KNIME quickstart guide", 2015). I installed KNIME version 2.11.3 in Ubuntu 14.04 for the testing done below.

To examine KNIME I followed the instructions for 'Building a Workflow' in the KNIME quickstart guide ("KNIME quickstart guide", 2015, pp. 5–10). I did not include the 'Interactive Table' node described in the guide. I have (1) consumed data from the file 'iris.csv' using the 'File Reader' node, (2) performed a cluster analysis on the data using the 'k-Means' node, and (3) produced a scatterplot of the results of the analysis using the 'Scatter Plot' node. It should be noted that I was unable to locate the 'IrisDataSet' directory mentioned in the quickstart guide. Instead I used the file 'iris.csv', found in the 'Example Workflow' directory which KNIME creates in the first workspace nominated when running KNIME for the first time.

Nodes are selected from the KNIME application's Node Repository panel, and are dragged with the mouse into the workflow editor. A node is connected to another by clicking and dragging from one node's output port to another node's input port. A node is configured by right-clicking and selecting 'Configure...'. A window appears in which the node can be configured. For example, configuring the 'File Reader' node from the quickstart guide allows the user to select a source file containing the data required. Nodes cannot be run until they have been configured.

Workflows, or parts of workflows, are executed by right clicking a node and selecting 'Execute'. When executed a node will first execute all the nodes which precede it in the workflow, if they have been configured. If executing a node produces graphical outputs, as does the 'Scatter Plot' node, these outputs can be viewed by right-clicking the nodes and selecting the appropriate 'View:' option.
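
The execution behaviour described above, where executing a node first executes the nodes which precede it, amounts to a depth-first walk over the workflow graph. The sketch below illustrates that idea in Python. It is not KNIME's implementation: the `Node` class and its fields are hypothetical, and raising an error for an unconfigured upstream node is an assumption about how "nodes cannot be run until they have been configured" might be enforced.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A hypothetical workflow node; not a KNIME class."""
    name: str
    configured: bool = True
    predecessors: list = field(default_factory=list)
    executed: bool = False


def execute(node, log=None):
    """Execute upstream nodes first, then the node itself.

    Returns the names of the nodes executed by this call, in order.
    Raises RuntimeError if a required node has not been configured.
    """
    log = [] if log is None else log
    if node.executed:
        return log  # already run: nothing to do
    if not node.configured:
        raise RuntimeError(f"node '{node.name}' must be configured first")
    for upstream in node.predecessors:
        execute(upstream, log)
    node.executed = True
    log.append(node.name)
    return log


if __name__ == "__main__":
    reader = Node("File Reader")
    kmeans = Node("k-Means", predecessors=[reader])
    plot = Node("Scatter Plot", predecessors=[kmeans])
    print(execute(plot))
```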

The example workflow, including the 'iris.csv' data, has been exported and is available in quickstart_example_base.zip.

Swapping out nodes

Within KNIME one node in a workflow can be substituted for another node using the same process as when a workflow is created. For step (1), a different data set can be selected by choosing a different source file in the 'File Reader' node's configuration. I have provided an example workflow where data is read from the file kyphosis.csv. This file was produced from the 'kyphosis' dataset in R. The workflow created with this dataset was exported to quickstart_example_data.zip. For step (2) a 'Fuzzy c-Means' node replaces the 'k-Means' node (exported to quickstart_example_process.zip). For step (3) a 'Scatter Matrix' node replaces the 'Scatter Plot' node (exported to quickstart_example_plot.zip). These workflows were all variations on the original workflow.

Reuse and sharing

Workflows can be exported from KNIME via the menu command 'File > Export KNIME Workflow...'. The workflow to be exported is selected from a workflow browser window. The file to which the workflow is to be exported is selected through a file browser window. Data can be embedded in the exported file, or excluded by selecting 'Exclude data from export' ("KNIME quickstart guide", 2015, pp. 20–21).

Workflows exported in this manner can be imported into KNIME using the menu command 'File > Import KNIME Workflow...'. The user selects 'Select archive file' in the dialog window, and specifies the archive file using a file browser window ("KNIME quickstart guide", 2015, pp. 19–20).

The exported workflows in the previous section were all exported with their data embedded. The example workflow for our basic example has been exported without data embedded for the sake of comparison at quickstart_example_base_nodata.zip.

Creating new nodes

Nodes in KNIME are created using a customised version of Eclipse IDE plugins ("Developer guide", 2015, Section 1). While a process exists for creating new nodes which are native to KNIME, creating native nodes is beyond the scope of this paper. Rather, I have investigated whether new nodes can be created from existing R scripts. R objects can be used in KNIME by installing the 'KNIME Interactive R Statistics Integration' extension. To install this extension the user selects the KNIME menu item 'File > Install KNIME Extensions...', types 'R Statistics' into the filter pane, selects 'KNIME & Extensions > KNIME Interactive R Statistics Integration', and clicks 'Next'. The user should then follow the on-screen instructions to complete the installation ("KNIME update site", 2015).

KNIME then has an R menu in its 'Node Repository' pane. I have attempted to create a node for step (3) of the workflow, which will plot the results of the k-means analysis using R. I used the 'R View (Table)' node, which "[a]llows execution of an R script from within KNIME. The view resulting from this script is returned in the output image port of this node" (KNIME version 2.11.3, 2015, Help: R View (Table)). This node provides a data frame called 'knime.in' to the R script. The script is provided to the node through its configuration window. The following R script was provided to the node:

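library(lattice)
# Plot sepal width against sepal length, with points grouped by the
# 'Cluster' column of the data frame supplied by KNIME.
xyplot(knime.in$"sepal width" ~ knime.in$"sepal length", data = knime.in,
	groups = Cluster, xlab = "sepal length", ylab = "sepal width",
	main = "Scatterplot of iris data")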

This workflow has been exported as quickstart_example_Rscript.zip.

As well as R integration, other software languages can be integrated into KNIME using the 'Install KNIME Extensions...' option. These include integration of Python, Perl, and Weka Data Mining.

Accessibility and openness

KNIME is freely available for download to Windows, Mac OS X, and Linux platforms. It is distributed under a GPL license, allowing it "to be downloaded, distributed, and used freely" (KNIME version 2.11.3, 2015).

The R software integration used in the 'Creating new nodes' section above requires that R and the 'rJava' R package are installed.

Workflows are stored as XML files, as are the configuration files for nodes within a workflow; data in exported workflows is stored in a binary format (by inspection). Nodes are described through the Eclipse plugin architecture ("Developer guide", 2015, Section 1).
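
Because exported workflows are archives containing XML files, they can be inspected without KNIME. The following Python sketch lists the members of an archive which parse as XML and reports the root element of each. It assumes only that the export is a standard zip file, as the .zip extension suggests; nothing about the internal layout of a KNIME export is assumed, and the function name is hypothetical.

```python
import sys
import zipfile
import xml.etree.ElementTree as ET


def xml_members(archive_path):
    """Return (member name, root tag) pairs for each XML file in a zip.

    Every member is tried; members which do not parse as XML
    (for example binary data files) are skipped.
    """
    found = []
    with zipfile.ZipFile(archive_path) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            try:
                root = ET.fromstring(archive.read(info))
            except (ET.ParseError, ValueError):
                continue  # not well-formed XML
            found.append((info.filename, root.tag))
    return found


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python inspect_export.py quickstart_example_base.zip
    for name, tag in xml_members(sys.argv[1]):
        print(f"{name}: <{tag}>")
```

Reading every member into memory is acceptable for small example workflows; a large export with embedded data would call for a filter on member names or sizes.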

KNIME summary

KNIME workflows are modular, and can be shared with other users for use on other machines. It is possible for users to create new nodes for use in KNIME, but not trivial. Creating a new native KNIME node requires knowledge of programming Eclipse plugins. It is however possible to create custom nodes containing scripts for various languages, e.g. R scripts. The KNIME software interface is reasonably non-technical, employing visual programming and common graphical user interface layouts. The KNIME software is free and open to use. KNIME workflows are moderately monolithic; KNIME workflows are only intended to be opened by KNIME software. However, as workflows are exported as XML files it would be possible to read and write KNIME workflows with other software. The underlying XML seems intended as a means for working within the KNIME visual programming environment, rather than as a general framework for describing and sharing workflows.

Galaxy

Galaxy is a web-based platform for performing bio-medical data analysis. Analyses are arranged in 'Histories', consisting of the initial datasets, the transformed data at every step in the analysis, and the final dataset. The user selects 'Tools' to transform or analyse datasets and return another dataset. 'Workflows' describe the steps performed in an analysis, but do not contain the datasets produced ("Learn Galaxy", 2015). I assessed Galaxy using the public instance at https://usegalaxy.org/. It is not necessary to register an account to use this instance, but Histories and Workflows will not be preserved if a user does not log in.

To examine Galaxy I attempted to reproduce an analysis similar to the analysis done for KNIME above. I (1) loaded the data, (2) converted the data to another format, and (3) plotted the data. For step (1) I loaded the file 'iris.csv' into Galaxy by selecting 'Get Data > Upload File from your computer' from the 'Tools' panel. I selected 'Choose local file' and selected 'iris.csv' using the file browser. I clicked 'Start', and then 'Close'. Galaxy loaded the file, and displayed the result in the History panel.

The tools in Galaxy are focused on genetic analysis. As I have no experience in this area I opted not to use any of these tools, as I am unable to meaningfully assess their output. As such, step (2) consists of transforming the data to a tab-separated format. I selected 'Text Manipulation > Convert delimiters to TAB' from the 'Tools' panel. From the subsequent dialog I selected 'Convert all > Commas', and 'In Dataset: 1: iris.csv', then selected 'Execute'.
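
The same transformation can be carried out outside Galaxy. The sketch below uses Python's standard csv module to rewrite comma-separated text as tab-separated text. It illustrates the processing step only and is not how Galaxy's tool is implemented; in particular, it parses quoted fields, which a plain character substitution would not.

```python
import csv
import io


def commas_to_tabs(csv_text):
    """Rewrite comma-separated text as tab-separated text."""
    reader = csv.reader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerows(reader)
    return out.getvalue()


if __name__ == "__main__":
    # Illustrative rows only; not the contents of 'iris.csv'.
    print(commas_to_tabs("sepal length,sepal width\n5.1,3.5\n"), end="")
```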

Finally, for step (3), I plotted the data. I selected 'Graph/Display Data > Scatterplot of two numeric columns'. I selected '2: Convert on data 1' for the Dataset, 'Column: 1' for the x axis, and 'Column: 3' for the y axis, and clicked 'Execute'. I was able to view the resulting scatterplot by selecting the 'View data' icon from '3: Scatterplot on data 3' in the 'History' pane.

This History was exported to file as Galaxy-History-examplebase.tar.gz.

Swapping out Tools

Within Galaxy a new tool can be added to a History using the process described above. For step (3) the 'Graph/Display Data > Histogram of a numeric column' tool replaces 'Scatterplot' in the exported History Galaxy-History-exampleplot.tar.gz. This new 'History' was created by selecting 'History options > Copy History' from the 'History' pane.

Due to my lack of familiarity with the tools in Galaxy I have not provided an example where step (2) has been substituted.

As Galaxy stores the results of tool transformations as data sets I employed a different approach to substituting the data step. First, a 'Workflow' was extracted from the original example by selecting 'History options > Extract Workflow' from the 'History' pane, then selecting 'Create workflow' on the subsequent page. A new 'History' was created by choosing 'History options > Create New' from the 'History' pane. The data file kyphosis.csv was uploaded to the 'History' in the same way as in the earlier example. The extracted 'Workflow' was run by navigating to the 'Workflow' page and selecting run from the drop-down menu of the 'Workflow' listed on this page. On the subsequent page '1: kyphosis.csv' was selected as the Input Dataset. This failed to run, as 'kyphosis.csv' did not have numeric data in the columns selected by the Scatterplot tool in the extracted 'Workflow'. 'Run this job again' was selected from the failed Scatterplot object in the 'History' pane, and 'Column: 2' was selected for the x axis. The resulting 'History' has been exported as Galaxy-History-exampledata.tar.gz. The extracted 'Workflow' has been exported as Galaxy-Workflow-Workflow_constructed_from_history__example_base_.ga.
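
The failure described above could be anticipated by checking which columns of a new data file are numeric before reusing a workflow that plots fixed column positions. The following Python sketch performs such a check. The function name is hypothetical, and the sample rows are illustrative values shaped like the kyphosis data, not the contents of kyphosis.csv.

```python
import csv
import io


def numeric_columns(csv_text):
    """Return the 1-based indices of columns whose values are all numeric.

    The first row is treated as a header and skipped.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return []
    header, body = rows[0], rows[1:]
    numeric = []
    for index in range(len(header)):
        try:
            for row in body:
                float(row[index])
        except (ValueError, IndexError):
            continue  # a non-numeric or missing value rules the column out
        numeric.append(index + 1)
    return numeric


if __name__ == "__main__":
    sample = "Kyphosis,Age,Number,Start\nabsent,71,3,5\npresent,128,4,5\n"
    print(numeric_columns(sample))
```

With data shaped like this, column 1 holds text, so a scatterplot configured to use 'Column: 1' for the x axis cannot be drawn until another column is chosen.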

Reuse and sharing

A Galaxy 'History' can be exported to a file by selecting 'History options > Export to file' from the 'History' pane of the Galaxy's 'Analyze Data' page.

A 'Workflow' can be exported by navigating to the 'Workflow' page and selecting 'Download or Export' from the 'Workflow' object's drop-down menu. The downloadable 'Workflow' file is made available through a hyperlink on the subsequent page.

A 'Workflow' can be imported by selecting 'Upload or import workflow' on the 'Workflow' page. An exported 'Workflow' file is selected through the file browser, and imported by selecting 'Import'.

Creating new Tools

Tools in Galaxy are executable scripts which can be executed on the server hosting the Galaxy instance. Details on how to specify a new tool can be found at https://wiki.galaxyproject.org/Admin/Tools. As Galaxy does not appear to provide any basic interface for wrapping a script in a tool interface, and due to time and space limitations, I have not attempted to create a new tool.

Accessibility and openness

Galaxy is freely available to use at http://usegalaxy.org/. An account is required to save the user's 'Workflow' and 'History' objects, which requires submitting an email address. Galaxy is also available on several other public servers (https://wiki.galaxyproject.org/PublicGalaxyServers). It can also be installed on the user's own server under an open source Academic Free License allowing free use, modification, and distribution. The user will need a machine running UNIX, Linux, or Mac OS X, and Python 2.6 or 2.7 ("Galaxy download and installation", 2015). Some of Galaxy's tools have further software dependencies. The Galaxy server software is available at no charge.

'Workflow' and 'History' objects are stored in a JSON format; datasets inside exported 'History' objects are stored in the original format, or as text files for the results of tools; graphical output in exported 'History' objects is stored as PDF files (by inspection).
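
As the exported objects are JSON, they can be read by any software with a JSON parser. The Python sketch below lists the steps in a workflow document. The layout it expects (a top-level "steps" object keyed by step number, each step carrying a "name") is an assumption made for illustration and has not been checked against the exported files described here; the sample document is likewise hypothetical.

```python
import json

# Hypothetical workflow document; the real export may differ.
SAMPLE_WORKFLOW = """
{
  "name": "example workflow",
  "steps": {
    "1": {"name": "Convert"},
    "0": {"name": "Input dataset"},
    "2": {"name": "Scatterplot"}
  }
}
"""


def list_steps(workflow_text):
    """Return (step number, step name) pairs in step order."""
    workflow = json.loads(workflow_text)
    steps = workflow.get("steps", {})
    ordered = sorted(steps.items(), key=lambda item: int(item[0]))
    return [(int(key), step.get("name", "")) for key, step in ordered]


if __name__ == "__main__":
    for number, name in list_steps(SAMPLE_WORKFLOW):
        print(number, name)
```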

Galaxy summary

Galaxy workflows are modular, and can be reused and shared by other users on other machines. For most users Galaxy would be used on a public server, as setting up an instance of the Galaxy server software is technically demanding. It is possible for users to create new Galaxy tools, but this requires knowledge of creating executable scripts on Linux machines. The Galaxy software is open and free to use. Workflows are created in a browser-based interface which would be familiar to most users; workflows are represented using a visual programming interface. Galaxy workflows and histories are mostly monolithic, as they are only intended to be used in a Galaxy server environment. However, as workflows and histories can be exported as JSON files it would be possible to read and write Galaxy workflows in other software. Galaxy provides a means for publishing and sharing analyses through the built-in Galaxy Pages feature ("What are Galaxy Pages?", 2014).

Gapminder World

Gapminder World is an online application which uses the Trendalyzer software project to display time series of countries' development statistics ("About Gapminder", 2015). The resulting graphic is an interactive bubble chart. The user can select two datasets to be plotted along the x- and y-axes. The bubble plotted for each country varies in size according to a third dataset selected by the user. These three datasets are plotted for a given year; the user can change the year displayed, or the graphic can be played, which animates the changes in the data over time. The user can choose to view a pre-prepared graph from a menu, or can select the datasets to be displayed through drop-down menus. The software comes with a large range of datasets, but new datasets cannot be loaded into the system by the user.

While Gapminder World allows the user to display different combinations of datasets, the workflow of the software is essentially fixed. However, it is suggested on Gapminder's 'Frequently Asked Questions' page that a user can use Google's Motion Chart Gadget to produce an animated bubble chart with the user's choice of data ("Frequently asked questions (FAQ)", 2015). Therefore, rather than examine Gapminder World directly I have followed the instructions on displaying data using a Google Motion Chart ("Quick guide to the Motion Chart", 2015). Motion Chart is part of the Google Drive (https://encrypted.google.com/drive/) software platform; use of this software requires a Google account. I have attempted to recreate the graph "Is child mortality falling?" (2015) from Gapminder World.

The data sources for this example, provided by Gapminder World, are "GDP per capita by purchasing power parities" (2015), "Infant mortality rate" (2015), and "Total population" (2015). The Excel files provided by these sources are available at gd001-gdp-per-capita.xlsx, gd002-infant-mortality-rate.xlsx, and gd003-total-population.xlsx.

The first step (1) was to load the data into a Google Drive spreadsheet. This involved copying the data from the Excel data files into the spreadsheet directly, or preparing a spreadsheet in some other software and importing it into Google Drive. Due to the manual nature of this task only four countries were plotted: Australia, New Zealand, South Africa and United Kingdom. No explicit data processing step (2) was performed. The data was plotted and animated (3) by selecting 'Insert > Chart > Trend > Motion Chart' in the open spreadsheet in Google Drive. The 'Infant Mortality Rate' data was selected for the x-axis, 'GDP' data for the y-axis. A log transformation was selected for both of these axes. The bubble 'Size' and 'Color' were set to 'Population'. The spreadsheet used in this example, along with the resulting motion chart, is available at https://docs.google.com/spreadsheets/d/1Og-yUSVCx0s_S3Hrw97l0O5HHynyQy8dJJjjYkOyNFA/edit?usp=sharing; the axis transformations and choices of bubble size and colour are not preserved in the shared link, and must be manually set whenever the sheet is opened.
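
The manual preparation in step (1) could in principle be scripted. The Python sketch below combines several indicator tables into a single table with one row per country and year. It assumes that each source table is 'wide' (one row per country, one value per year) and that a 'long' layout is the convenient one for charting; both are assumptions made for illustration, the function name is hypothetical, and the values in the usage example are invented.

```python
def wide_to_long(tables, countries):
    """Combine wide indicator tables into long (country, year, ...) rows.

    tables maps an indicator name to {country: {year: value}}.
    Only country and year pairs present in every table are kept.
    The first row returned is a header.
    """
    names = list(tables)
    header = ["Country", "Year"] + names
    if not names:
        return [header]
    rows = []
    for country in countries:
        series = [tables[name].get(country, {}) for name in names]
        shared_years = set(series[0])
        for other in series[1:]:
            shared_years &= set(other)
        for year in sorted(shared_years):
            rows.append([country, year] + [s[year] for s in series])
    return [header] + rows


if __name__ == "__main__":
    # Invented values, for illustration only.
    tables = {
        "GDP": {"New Zealand": {2000: 1.0, 2001: 1.1}},
        "Population": {"New Zealand": {2000: 3.9, 2001: 3.9, 2002: 4.0}},
    }
    for row in wide_to_long(tables, ["New Zealand"]):
        print(row)
```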

Swapping out nodes

Gapminder World allows the user to select different data sources from its available sources. New data sources cannot be loaded. Gapminder World only provides one plotting and animating step, using Trendalyzer software. Using Google's Motion Chart means the user has available the full flexibility of the Google Drive Sheets software, the extent of which I will not examine here.

Reuse and sharing

Gapminder World graphs can be shared using a web-link provided by the 'Share graph' button in the program. This link will preserve the data changes the user has made in the program.

Creating new nodes

Gapminder World does not offer any facility for creating new nodes.

Accessibility and openness

Gapminder World is freely available to use at http://www.gapminder.org/world. The software is available under a Terms of Service agreement found at http://www.gapminder.org/world_includes/tou.html. An offline version of Gapminder World for Windows, Mac and Linux computers can be downloaded from http://www.gapminder.org/world-offline/. The terms of service for this application are available from inside the application.

Workflows are stored internally to the program, and cannot be easily accessed or edited.

Gapminder World summary

Gapminder World is not intended for creating, executing or sharing data analysis workflows; it is intended for displaying visualisations of data, and allowing users to interact with these visualisations. Gapminder World graphs feature modular data sources, but have fixed processing and visualisation steps. The data sources can only be replaced by datasets included in the software. It is not possible to create new nodes in the workflow for a Gapminder World graphic; indeed the workflow is not made visible to the user at all. Gapminder World is free to use, and uses a simple graphical user interface which should be accessible to most users. The visualisations created by Gapminder World can be shared with other users, who must view them on the Gapminder World website.

The OpenAPI solution

The aim of the OpenAPI project is to create a software framework which helps to connect people with data. By choosing representatives of several classes of possible software solutions I have identified several ways in which current software does not meet the requirements I have identified:

Visual programming software solutions, such as KNIME, are somewhat monolithic: to take advantage of such a system a user must commit to using it more or less exclusively. Extension of workflows in these systems tends to be technically demanding.
Web-based software solutions, such as Galaxy, also tend to require that the user commit to using one particular piece of software, and are technically demanding to extend.
Online visualisation sites, like Gapminder World, do not provide modular workflows for sharing and reuse. While they provide useful output for helping people to understand particular datasets, they are not useful for exploring new datasets or producing customised variations of analyses.

OpenAPI is a framework for creating, executing, and sharing data workflows. It consists of three main parts:

Modules which represent simple tasks,
Pipelines which describe a workflow consisting of several modules, and,
Glue systems which read and execute modules and pipelines.

OpenAPI modules and pipelines are specified using XML. Module XML wraps a script which describes a job to be done in a language like R or Python. The module XML describes the inputs required by this script, and the outputs it will produce. Pipeline XML consists of a list of modules, and of pipes; a pipe describes how a module's output is to be provided as another module's input. Pipelines allow the user to create simple and complex workflows. A glue system is any software which can read module and pipeline XML, execute the wrapped module scripts, and provide module outputs as inputs described in the pipeline XML (Hinton & Murrell, 2015).
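The relationship between modules, pipes and a glue system can be sketched in a few lines of Python. This is only an illustration of the idea and not the 'conduit' implementation: the names Module, Pipe and run_pipeline are hypothetical, and plain Python functions stand in for the wrapped R or Python scripts that a real glue system would read from module XML.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Module:
    """Hypothetical stand-in for a module: named inputs, named outputs, a job."""
    name: str
    inputs: List[str]
    outputs: List[str]
    run: Callable[[Dict[str, object]], Dict[str, object]]


@dataclass
class Pipe:
    """Connects (module name, output name) to (module name, input name)."""
    start: Tuple[str, str]
    end: Tuple[str, str]


def run_pipeline(modules: List[Module], pipes: List[Pipe]) -> Dict[str, Dict[str, object]]:
    """Run each module once all modules feeding its inputs have finished."""
    results: Dict[str, Dict[str, object]] = {}
    pending = {module.name: module for module in modules}
    while pending:
        progressed = False
        for name, module in list(pending.items()):
            feeds = [pipe for pipe in pipes if pipe.end[0] == name]
            if all(pipe.start[0] in results for pipe in feeds):
                # Gather each input from the output named by its pipe.
                inputs = {pipe.end[1]: results[pipe.start[0]][pipe.start[1]]
                          for pipe in feeds}
                results[name] = module.run(inputs)
                del pending[name]
                progressed = True
        if not progressed:
            raise ValueError("pipeline has a cycle or a pipe from an unknown module")
    return results


# Two toy modules: one provides data, the other summarises it.
read = Module("read", [], ["values"], lambda inp: {"values": [1.0, 2.0, 3.0, 4.0]})
mean = Module("mean", ["values"], ["mean"],
              lambda inp: {"mean": sum(inp["values"]) / len(inp["values"])})

results = run_pipeline([mean, read], [Pipe(("read", "values"), ("mean", "values"))])
print(results["mean"]["mean"])
```

The modules are deliberately listed out of order in the call to run_pipeline: the pipes, not the order of the list, determine which module runs first.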

To examine OpenAPI I have created a pipeline which employs the same steps as the KNIME example above. In step (1) I read data in from the iris.csv CSV file, in step (2) I perform a k-means cluster analysis, and in step (3) I plot the results of this analysis. The pipeline XML file created for this example can be found at pipeline.xml. Step (1) consists of two modules, iris_csv.xml and readCSV.xml. The processing step (2) was done by the module kmeans_cluster.xml. The plotting step (3) was done by the module clusterplot.xml.

The example pipeline and its modules were executed using the 'develop' branch of the prototype glue system 'conduit', an R package, as it was on 2015-06-19. This was installed in R Version 3.2.0 64-bit on an Ubuntu Linux 14.04 machine. A tarball of the version of 'conduit' used is available at conduit_package_demo.tar.gz. The R code used to run the 'conduit' example pipelines can be found at conduit-examples.R. This closely follows the instructions for using 'conduit' version 0.1 (Hinton, 2015). The plot produced by the basic example can be found at Rplots.pdf.

Swapping out modules

OpenAPI pipelines are modular; each module in a pipeline can be replaced by another module or pipeline of modules. To swap out a module a user replaces, in the pipeline XML file, the component XML of the module being replaced and the pipes connecting this module's inputs and outputs. The replacement module(s) must consume appropriate inputs and produce appropriate outputs. Where the desired replacement does not do so, the OpenAPI framework makes it possible to convert to appropriate inputs and outputs by means of further modules.
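The bookkeeping involved in a swap can be sketched as follows. This is a minimal illustration under stated assumptions: the pipeline is held as a Python dictionary rather than as pipeline XML, the function swap_module and the replacement module 'altplot' are hypothetical, and the input and output names ('file', 'data', 'clusters', 'groups') are invented for the example. Only the four module names are taken from the example pipeline described earlier.

```python
def swap_module(pipeline, old, new, rename_inputs=None, rename_outputs=None):
    """Return a copy of the pipeline with module `old` replaced by `new`.

    Pipes touching the old module are redirected to the new one. The optional
    rename maps translate input or output names that differ between the two.
    """
    rename_inputs = rename_inputs or {}
    rename_outputs = rename_outputs or {}
    components = [new if name == old else name for name in pipeline["components"]]
    pipes = []
    for (src, out), (dst, inp) in pipeline["pipes"]:
        if src == old:
            src, out = new, rename_outputs.get(out, out)
        if dst == old:
            dst, inp = new, rename_inputs.get(inp, inp)
        pipes.append(((src, out), (dst, inp)))
    return {"components": components, "pipes": pipes}


# Hypothetical description of the example pipeline; port names are assumed.
pipeline = {
    "components": ["iris_csv", "readCSV", "kmeans_cluster", "clusterplot"],
    "pipes": [
        (("iris_csv", "file"), ("readCSV", "file")),
        (("readCSV", "data"), ("kmeans_cluster", "data")),
        (("kmeans_cluster", "clusters"), ("clusterplot", "clusters")),
    ],
}

# Swap the plotting step (3) for an alternative whose input is named differently.
swapped = swap_module(pipeline, "clusterplot", "altplot",
                      rename_inputs={"clusters": "groups"})
print(swapped["pipes"][-1])
```

The original pipeline dictionary is left unchanged, so both variants remain available for sharing.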

An archive of a modified version of the 'conduit' example pipeline with a different data source (1) and its modules can be found at conduit_data.tar.gz, and the plot produced by this pipeline at Rplots.pdf. An archive of the pipeline which produces an alternative plot (3) can be found at conduit_plot.tar.gz, and the plot produced by this pipeline at Rplots.pdf.

Reuse and sharing

Modules and pipelines are described using XML files, which can easily be shared with other users. The pipeline archives in the previous section, and the basic example conduit_base.tar.gz, can be unpacked and executed on any machine capable of running the 'conduit' package. Modules may also refer to text source scripts which can be shared as easily as the module XML.

Creating new modules

OpenAPI workflows are extensible, as new modules can be easily created by wrapping a script in module XML. The XML used in module and pipeline files can be read and understood by humans as well as machines; OpenAPI users do not need to have technical knowledge beyond XML formatting to use or create modules and pipelines. OpenAPI module XML allows scripts to be entered directly into the XML, or a script file can be referenced by file location (Hinton & Murrell, 2015). The modules used to demonstrate swapping out modules in the earlier section were all created in this manner.
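The idea of wrapping a script in XML can be illustrated with Python's standard xml.etree.ElementTree library. The element and attribute names used here ('module', 'input', 'output', 'source', 'language', 'name') are assumptions made for this sketch and should not be read as the OpenAPI module schema; the specification (Hinton & Murrell, 2015) defines the actual format. The function wrap_script is likewise hypothetical.

```python
import xml.etree.ElementTree as ET


def wrap_script(language, script, inputs, outputs):
    """Wrap a script in illustrative (non-OpenAPI) module XML and return it as text."""
    module = ET.Element("module", {"language": language})
    for name in inputs:
        ET.SubElement(module, "input", {"name": name})
    for name in outputs:
        ET.SubElement(module, "output", {"name": name})
    # The script is entered directly into the XML as element text;
    # ElementTree escapes characters such as "<" when serialising.
    source = ET.SubElement(module, "source")
    source.text = script
    return ET.tostring(module, encoding="unicode")


script = "clusters <- kmeans(data, centers = 3)"
xml_text = wrap_script("R", script, inputs=["data"], outputs=["clusters"])
print(xml_text)

# Reading the XML back recovers the script unchanged.
parsed = ET.fromstring(xml_text)
print(parsed.find("source").text)
```

Because the result is plain text, it can be inspected and edited in any text editor, which is the property the paragraph above relies on.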

Authors of new OpenAPI modules are limited to the languages supported by the glue system used to execute pipelines and modules. The 'conduit' package currently supports R, Python, and shell scripts. Authors of glue systems can choose which language source scripts they will support.
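One way a glue system might support several script languages is to keep a table mapping each language to the command used to run it. The following sketch assumes such a design; it does not describe how 'conduit' is implemented. The function run_script and the INTERPRETERS table are hypothetical, and the commands 'Rscript' and 'sh' are assumed to be installed and on the search path of the machine running the example.

```python
import os
import subprocess
import sys
import tempfile

# Assumed mapping from script language to the command which executes it.
INTERPRETERS = {
    "python": [sys.executable],
    "R": ["Rscript"],
    "shell": ["sh"],
}


def run_script(language, script):
    """Write the script to a temporary file, run it, and return its standard output."""
    if language not in INTERPRETERS:
        raise ValueError(f"unsupported language: {language}")
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as handle:
        handle.write(script)
        path = handle.name
    try:
        done = subprocess.run(INTERPRETERS[language] + [path],
                              capture_output=True, text=True, check=True)
        return done.stdout
    finally:
        # Remove the temporary script whether or not it ran successfully.
        os.remove(path)


output = run_script("python", "print(6 * 7)")
print(output.strip())
```

Under this design, supporting a further language means adding one entry to the table, which reflects the freedom glue system authors have to choose the languages they support.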

Accessibility and openness

OpenAPI is open and free: the pipeline and module XML frameworks are described in a publicly shared document specification; developers are free to write software which can consume OpenAPI XML files. There is no single OpenAPI glue system software; rather the OpenAPI framework asks that any glue system meet certain requirements, which include being able to execute a pipeline's module scripts in the intended order (Hinton & Murrell, 2015).

The 'conduit' package for R is a prototype glue system which is available for users to install locally under a GPL open source license (Conduit Version 0.1-1, 2015). The 'conduit' package has only been tested on Ubuntu Linux 12.04 and 14.04 machines at present. This package is in an early stage of development and is not easy for non-technical users to install and use.

OpenAPI is designed as an infrastructure for creating, executing and sharing data workflows, while not demanding that a user be trapped in the OpenAPI system by using it. In fact, OpenAPI attempts to make it possible to capture scripts and workflows from authors who had not been working with any knowledge of OpenAPI. Further, modules and pipelines preserve the details of their source scripts' inputs and outputs as XML, meaning the workings of an OpenAPI module or pipeline can be easily consumed by software other than an OpenAPI glue system (Hinton & Murrell, 2015).

References

About Gapminder. (2015). Stockholm, Sweden: Gapminder Foundation. Retrieved from http://www.gapminder.org/about-gapminder/

American Library Association. Presidential Committee on Information Literacy. (1989). American Library Association Presidential Committee on Information Literacy: Final report (Research report). Washington, D.C.: Author. Retrieved from http://www.ala.org/acrl/publications/whitepapers/presidential

Blankenberg, D., Kuster, G. V., Coraor, N., Ananda, G., Lazarus, R., Mangan, M., … Taylor, J. (2010). Galaxy: A web-based genome analysis tool for experimentalists. Current Protocols in Molecular Biology, 19–10.

Declaration on open and transparent government. (2011, August). Retrieved March 29, 2015, from https://www.ict.govt.nz/guidance-and-resources/open-government/declaration-open-and-transparent-government/

Developer guide. (2015). Zurich, Switzerland: KNIME.com AG. Retrieved from https://tech.knime.org/developer-guide

English, B., & Williamson, M. (2014, February 12). Government considers data use of the future [Media release]. Retrieved March 29, 2015, from http://beehive.govt.nz/release/government-considers-data-use-future

Frequently asked questions (FAQ): Can I buy the software to make my own animations? (2015). Stockholm, Sweden: Gapminder Foundation. Retrieved from http://www.gapminder.org/faq_frequently_asked_questions/#1

Galaxy download and installation. (2015). In Galaxy Wiki. Galaxy Project. Retrieved from https://wiki.galaxyproject.org/Admin/GetGalaxy

Gapminder World. (2015). Stockholm, Sweden: Gapminder Foundation. Retrieved from http://www.gapminder.org/world

GDP per capita by purchasing power parities: Gapminder documentation 001. (2015). (Version 14). Gapminder Foundation. Retrieved from http://www.gapminder.org/data/documentation/gd001/

Giardine, B., Riemer, C., Hardison, R. C., Burhans, R., Elnitski, L., Shah, P., … Nekrutenko, A. (2005). Galaxy: A platform for interactive large-scale genome analysis. Genome Research, 15(10), 1451–1455.

Goecks, J., Nekrutenko, A., Taylor, J., & The Galaxy Team. (2010). Galaxy: A comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biology, 11(8), R86.

Government launches one-stop shop for data. (2010, January 20). Retrieved March 29, 2015, from http://web.archive.org/web/20120209213422/http://webarchive.nationalarchives.gov.uk/+/http://www.cabinetoffice.gov.uk/newsroom/news_releases/2010/100121-data.aspx

Guy, N. (2009, November 4). Government takes steps to demystify data [Media release]. Retrieved March 29, 2015, from http://www.beehive.govt.nz/release/government-takes-steps-demystify-data

Hackathon winners make life easier for public transport users. (2014, May 26). Retrieved March 29, 2015, from https://at.govt.nz/about-us/news-events/hackathon-winners-make-life-easier-for-public-transport-users/

Hinton, A. N. (2015, March 4). Guide to using conduit. Retrieved from https://anhinton.github.io/usingConduit/usingConduit.html

Hinton, A. N., & Murrell, P. (2015a). Introducing OpenAPI (Technical report No. 1). Auckland, New Zealand: Department of Statistics, University of Auckland. Retrieved from http://stattech.wordpress.fos.auckland.ac.nz/2015-01-introducing-openapi/

Hinton, A. N., & Murrell, P. (2015b, February 18). Conduit: Prototype glue system for OpenAPI (Version 0.1-1). Retrieved from https://github.com/anhinton/conduit

Infant mortality rate: Gapminder documentation 002. (n.d.). (Version 2). Gapminder Foundation. Retrieved from http://www.gapminder.org/data/documentation/gd002/

Information literacy competency standards for higher education. (2000). Chicago, Illinois: Association of College and Research Libraries; American Library Association. Retrieved from http://www.ala.org/acrl/standards/informationliteracycompetency

International open data hackathon. (2015). Retrieved March 29, 2015, from http://opendataday.org/

Is child mortality falling? (2015). Stockholm, Sweden: Gapminder Foundation. Retrieved from http://www.bit.ly/QISGYa

KNIME quickstart guide. (2015). Zurich, Switzerland: KNIME.com AG. Retrieved from http://tech.knime.org/files/KNIME_quickstart.pdf

KNIME update site. (2015). Zurich, Switzerland: KNIME.com AG. Retrieved from https://www.knime.org/downloads/update

KNIME version 2.11.3. (2015). Zurich, Switzerland: KNIME.com AG. Retrieved from http://www.knime.org/knime

Learn Galaxy. (2015). In Galaxy Wiki. Galaxy Project. Retrieved from https://wiki.galaxyproject.org/Learn

New Zealand Data Futures Forum. (2014). Harnessing the economic and social power of data (Discussion document). New Zealand: Author. Retrieved from https://www.nzdatafutures.org.nz/sites/default/files/NZDFF_harness-the-power.pdf

NZGOAL version 2 released. (2015, April 23). Retrieved June 15, 2015, from https://www.ict.govt.nz/news-and-updates/government-ict-updates/nzgoal-version-2-released/

Oaglue: Prototype R-based glue system for OpenAPI project. (2014, August 19). Retrieved from https://github.com/pmur002/oaglue

Open knowledge: About. (2015). Retrieved March 28, 2015, from https://okfn.org/about/

Open New Zealand. (2015). Retrieved March 29, 2015, from https://wiki.open.org.nz/wiki/display/main/Welcome

OpenCPU: An API for embedded scientific computing. (2015). Retrieved from https://www.opencpu.org/

Pollock, R. (2013, April 25). Forget big data, small data is the real revolution. Retrieved March 28, 2015, from http://www.theguardian.com/news/datablog/2013/apr/25/forget-big-data-small-data-revolution

Quick guide to the Motion Chart. (2015). Stockholm, Sweden: Gapminder Foundation. Retrieved from http://www.gapminder.org/upload-data/motion-chart/

R Analytic Flow. (2015). Tokyo, Japan: Ef-prime, inc. Retrieved from http://www.ef-prime.com/products/ranalyticflow_en/support.html

RCloud. (2015). AT&T. Retrieved from http://stats.research.att.com/RCloud/

Red-R: Visual programming for R. (2010). Retrieved from https://web.archive.org/web/20130620231300/http://red-r.org/

Ryall, T. (2010, August 6). More government information for reuse [Media release]. Retrieved April 16, 2015, from http://beehive.govt.nz/release/more-government-information-reuse

The open government ninjas. (2015). Retrieved March 29, 2015, from http://groups.open.org.nz/groups/ninja-talk/

The UN Secretary General’s Independent Expert Advisory Group on a Data Revolution for Sustainable Development. (2014). A world that counts: Mobilising the data revolution for sustainable development (Research report). Author. Retrieved from http://www.undatarevolution.org/report/

Total population: Gapminder documentation 003. (2015). (Version 3). Gapminder Foundation. Retrieved from http://www.gapminder.org/data/documentation/gd003/

Vermanen, J. (2015). Harnessing external expertise through hackathons. Retrieved March 29, 2015, from http://datajournalismhandbook.org/1.0/en/in_the_newsroom_6.html

What are Galaxy Pages? (2014). In Galaxy Wiki. Galaxy Project. Retrieved from https://wiki.galaxyproject.org/Learn/GalaxyPages

Wiki New Zealand: The place to play with New Zealand’s data. (2015). Wiki New Zealand Trust. Retrieved from http://wikinewzealand.org/
