API Design Concept
Going beyond the API concepts realized in previous projects in this area, such as OpenTox and Open PHACTS, is necessary for two reasons:
- A much broader scope of data and tool services will be integrated and harmonized in OpenRiskNet and
- the semantic interoperability layer adding richer scientific annotation imposes additional demands.
OpenRiskNet aims to bring together all aspects of chemical and nanomaterial risk assessment, including access to data sources as well as processing, analysis and modelling tools from areas such as hazard prediction using read-across or QSAR, toxicogenomics and biokinetics modelling. As a European infrastructure, it needs to be sustainable and prepared for future extensions regarding the integrated data types and modelling approaches. This is only possible with a more flexible approach, which on the one hand allows for extensions but on the other hand avoids so much variety that the interoperability of the services is jeopardized.
Additionally, OpenRiskNet will work on making the interfaces smarter by adding a semantic interoperability layer. By querying this layer, a service should provide the following information to be compliant with the OpenRiskNet infrastructure (a hypothetical sketch of such a service self-description is given after the list):
- Scientific background of the service; this can be just a link to the relevant publication, but also links to manuals, tutorials and other training material;
- Technical background like links to source code, installation instructions, license information and deployment options;
- Capabilities of the service; for databases this includes, amongst others, the data schemata used, i.e. the description of the stored data and the associated metadata, as well as the search options, and, for software tools, the type and amount of generated output, including the options and parameters which can be chosen by the user to optimize the results; and
- Requirements on input data types and formats and options for the output format.
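To make this more concrete, the following is a minimal sketch (in Python, printing plain JSON) of the kind of self-description such a compliant service could return; all field names, URLs and values are hypothetical placeholders rather than an agreed OpenRiskNet schema.

```python
import json

# A minimal sketch of a service self-description; every field name, URL and
# value below is a hypothetical placeholder, not an agreed OpenRiskNet schema.
service_description = {
    "title": "Example QSAR prediction service",
    "scientific_background": {
        "publications": ["https://example.org/publications/qsar-method"],
        "training_material": ["https://example.org/docs/tutorial"],
    },
    "technical_background": {
        "source_code": "https://example.org/code/qsar-service",
        "installation": "https://example.org/docs/install",
        "license": "Apache-2.0",
        "deployment_options": ["docker", "virtual machine"],
    },
    "capabilities": {
        "operations": ["predict"],
        "user_parameters": {"model": ["modelA", "modelB"]},
        "output": "predicted property value with applicability-domain flag",
    },
    "io": {
        "input_formats": ["chemical/x-daylight-smiles"],
        "output_formats": ["application/json"],
    },
}

# Print the description as JSON, as a service might return it when queried.
print(json.dumps(service_description, indent=2))
```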
When looking at the diverse set of requirements, it became clear that a top-down approach, in which the OpenRiskNet consortium publishes the specifications for OpenRiskNet-compliant APIs and then asks associated partners to adapt their software to these, would result in many iteration cycles before the requirements of all the different areas of risk assessment are satisfied. Additionally, if the first iterations of the APIs were based on only a subset of the types of services that need to be supported, design decisions made in these could only be undone with great effort if they prove to be unfit for types of services added later.
Therefore, it was decided to change the design concept to a bottom-up approach, described in more detail in the next section, in which the existing concepts and APIs of partners and associated partners are collected and consolidated first. These will then be harmonized and made interoperable by moving them all more or less at the same time to state-of-the-art API specifications such as Open API 3.0 and by integrating detailed descriptions and semantic annotations in a stepwise manner.
Stepwise API Design
With the growing amount of biological/toxicological data available, two problems arise: first, the problem of findability (i.e. how to find the available data) and, second, the problem of interoperability (i.e. how to use, combine and harmonize this data and make use of it as one unit). There are a number of domain-specific efforts across different projects that aim to solve both of these problems. In the space of data findability there are, for example:
In the space of data interoperability there are, for example:
Most of the listed approaches base their work on linked data and associated technologies. Bizer et al. [Bizer C, Heath T, Berners-Lee T. Linked Data - The Story So Far. Int J Semant Web Inf Syst. 2009: 1–22] define linked data as “a method of publishing structured data so that it can be interlinked and become more useful through semantic queries. It builds upon standard Web technologies such as HTTP, RDF and URIs, but rather than using them to serve web pages for human readers, it extends them to share information in a way that can be read automatically by computers. This enables data from different sources to be connected and queried.” The data-related parts (and associated problems) of OpenRiskNet are closely aligned with the wider domain outlined above. Therefore, it makes sense to adopt the solutions developed by these approaches.
However, the aim of OpenRiskNet is not only data but also operations over data and the exposure of these operations via APIs. This is a space that is much less explored. There are existing general approaches, technologies and standards available (e.g. Open Api, JSON API), but there are no existing standards for handling operations (and APIs) in the domain of linked data.
It is important to observe that, when it comes to operations over data, both problems of findability and interoperability apply as well. It would therefore be desirable to treat them in a similar way to data, or at least with the same toolset. Taking for granted that the solutions will be sought in the space of linked data, and considering the OpenRiskNet setup in which partners will contribute data services and tool services (operations) that will then be made discoverable and interoperable, the bottom-up approach is more appropriate for the following reasons:
- When it comes to the implementation of linked data solutions, it is important to strike the right balance between generality and specificity. If a solution is too general, then everything is possible but the level of interoperability might be reduced, or consuming such services might become prohibitively complicated. If a solution is too specific, then a higher level of interoperability is possible but the scope can be too limiting. Understanding the domain (the list of operations and data shapes) first is therefore of great benefit.
- While the concept of linked data is becoming more widely understood, some of the underlying technologies are still relatively new and evolving, and not everyone has experience in using them. To ease the entry into this space, it is therefore beneficial to take a gradual approach from non-linked-data solutions to linked-data solutions and to break the process into manageable chunks of work.
The more detailed proposal of this process is as follows:
Step 1: identify and collect the available data and tools
In this step, the OpenRiskNet partners have identified the data and tools they want to include in the core part of the infrastructure provided by the OpenRiskNet consortium and will continue to do so throughout the project to enrich the infrastructure with services from associated partners. The information collected at this step is:
- Title or name of a dataset or tool
- Short description
- Possible URL(s) where the data or tools are currently available
- Partner name (organisation)
- Contact person (name, email)
This information is now used to establish a registry of all the components involved in OpenRiskNet and to track their status.
If you want to add your service to this collection, please consider becoming an associated partner.
Step 2: provide description of data and services
In this ongoing step, the OpenRiskNet partners and associated partners provide a more detailed description of the datasets and tools to be included in OpenRiskNet. At this stage, they are free to describe the existing or planned objects without any constraints related to the API endpoints or data shapes. However, the partners will provide the descriptions in a unified way. To facilitate this, we propose to use the Open Api specification 3.0 (currently rc2), which should cover both data and operation definitions. When it comes to operations, it is especially important for partners to describe both inputs and outputs, as illustrated by the sketch below.
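The following sketch shows a minimal Open Api 3.0 description of a hypothetical prediction endpoint, built as a Python dictionary and printed as JSON; the path, schemas and property names are invented for the example and do not correspond to any particular partner service.

```python
import json

# A minimal sketch of an Open Api 3.0 document describing one hypothetical
# operation with its input (request body) and output (response) schemas.
openapi_fragment = {
    "openapi": "3.0.0",
    "info": {"title": "Example prediction service", "version": "0.1.0"},
    "paths": {
        "/predict": {
            "post": {
                "summary": "Predict a property for a compound",
                "requestBody": {
                    "content": {
                        "application/json": {
                            "schema": {
                                "type": "object",
                                "properties": {"smiles": {"type": "string"}},
                                "required": ["smiles"],
                            }
                        }
                    }
                },
                "responses": {
                    "200": {
                        "description": "Predicted value",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "type": "object",
                                    "properties": {"value": {"type": "number"}},
                                }
                            }
                        },
                    }
                },
            }
        }
    },
}

print(json.dumps(openapi_fragment, indent=2))
```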
The APIs of the following services have already been described according to this specification:
- Jaqpot Quatro
- scaiview
- Squonk
- WikiPathways and BridgeDb
- Weka/lazar
- ChemID Converter
- Tox21/ToxCast
- ToxRefDB
- Open TG-GATES
Step 3: analyze the available descriptions and look for patterns
In this step, we will examine the provided definitions and look for the following patterns:
- Common data types (e.g. compound)
- Common operation types (e.g. molecular identifier conversion)
We will use the identified patterns to harmonize the underlying types (using existing definitions where available and formally defining new ones). At this step, we will also propose a set of preferred vocabularies and ontologies that partners can use to annotate their definitions (in the next step), as sketched below.
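The following sketch illustrates what such a mapping from common types and operations to preferred terms might look like; the IRIs are placeholders only, since the actual vocabularies and ontologies will be selected in this step.

```python
# Placeholder mapping from common types/operations identified in step 3 to
# preferred vocabulary or ontology terms; the IRIs are illustrative only and
# would be replaced by the terms agreed by the consortium.
preferred_terms = {
    "compound": "https://example.org/vocab/Compound",
    "assay": "https://example.org/vocab/Assay",
    "molecular identifier conversion": "https://example.org/vocab/IdentifierConversion",
}

for concept, iri in preferred_terms.items():
    print(f"{concept} -> {iri}")
```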
A couple of points regarding the types are in order here. When it comes to operations, it is very important to define operation inputs (types) and outputs (types). This provides the means to match and combine the different operations and therefore achieve a higher level of interoperability. To use an analogy from functional programming languages: if function f takes type a and returns type b (f : a -> b) and function g takes type b and returns type c (g : b -> c), one can chain them as g.f : a -> c. This in turn opens the possibility to develop workflow systems that have well-defined underlying semantics, as in the sketch below.
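The sketch below illustrates this idea with two hypothetical typed operations in Python; the functions and types are invented purely to show how matching input and output types enable chaining.

```python
# Two hypothetical typed operations; the names and types are invented purely
# to illustrate chaining based on matching input and output types.
def f(a: str) -> int:          # f : a -> b
    """Count the characters of a (toy) chemical identifier string."""
    return len(a)

def g(b: int) -> float:        # g : b -> c
    """Derive a (toy) score from the character count."""
    return b / 2.0

def compose(outer, inner):
    """Return outer . inner, valid because inner's output type matches outer's input type."""
    return lambda x: outer(inner(x))

h = compose(g, f)              # h = g.f : a -> c
print(h("c1ccccc1"))           # a str goes in, a float comes out
```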
Furthermore, operations themselves can be annotated, which provides a means for discoverability. For example, if all operations dealing with chemical ID conversion are annotated as such (provided that such a concept is defined), it is possible to build user interfaces in which users can browse through different operations and, in combination with the type information, decide which are the most appropriate for them, regardless of the actual chemical ID used by a particular service. A minimal sketch of such annotation-based filtering is given below.
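The sketch below uses a hypothetical registry of annotated operations; the concept IRI, type names and operation identifiers are placeholders used only to show the filtering logic.

```python
# A hypothetical registry of annotated operations; the concept IRI, type names
# and operation identifiers are placeholders, not agreed OpenRiskNet terms.
CHEM_ID_CONVERSION = "https://example.org/concepts/ChemicalIdConversion"

operations = [
    {"id": "inchi-to-smiles", "annotations": [CHEM_ID_CONVERSION],
     "input": "InChI", "output": "SMILES"},
    {"id": "name-to-inchi", "annotations": [CHEM_ID_CONVERSION],
     "input": "chemical name", "output": "InChI"},
    {"id": "predict-logp", "annotations": ["https://example.org/concepts/PropertyPrediction"],
     "input": "SMILES", "output": "number"},
]

def find_operations(concept, input_type):
    """Return operations annotated with `concept` that accept `input_type`."""
    return [op["id"] for op in operations
            if concept in op["annotations"] and op["input"] == input_type]

print(find_operations(CHEM_ID_CONVERSION, "InChI"))   # ['inchi-to-smiles']
```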
Step 4: annotate the existing definitions and make the transition to linked data
In this step, partners will annotate the existing definitions with vocabulary and ontology terms and, by doing so, build the linked data semantic layer. Here the actual implementation details become of great importance, for a couple of reasons.
First, it is reasonable to expect that some sections of the definitions will be hard to annotate, since appropriate terms may simply not be available.
Second, the choice of tools will have an impact on the actual implementations and the ease of use for the developers. With these two constraints in mind, we propose JSON-LD as the serialization format, for three important reasons:
- Most developers are already familiar with JSON and enjoy the ease of use JSON provides. The use of JSON-LD therefore lowers the barrier of adoption and makes the transition from non-linked data to linked data almost seamless.
- Since JSON-LD allows blank nodes to be used for predicates, it makes cases where users cannot find appropriate terms less severe (and still technically correct). More on this subject can be found here: http://manu.sporny.org/2013/rdf-identifiers/
- JSON-LD provides the means to separate the annotations from the data by allowing the JSON-LD context to be passed in the HTTP response Link header. This is an important implementation detail that allows developers to migrate their existing APIs without modifying the underlying response-building code (see the sketch after this list).
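The following sketch (using Flask) illustrates this pattern; the endpoint path, response body and context URL are hypothetical, while the link relation http://www.w3.org/ns/json-ld#context is the one defined by the JSON-LD specification for this purpose.

```python
# A sketch of an existing JSON endpoint made JSON-LD aware via the Link header;
# the path, body and context URL are hypothetical placeholders.
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/compounds/1")
def get_compound():
    # The existing response-building code stays untouched ...
    resp = jsonify({"name": "benzene", "smiles": "c1ccccc1"})
    # ... and the semantic annotations are attached out-of-band via the context.
    resp.headers["Link"] = (
        '<https://example.org/contexts/compound.jsonld>; '
        'rel="http://www.w3.org/ns/json-ld#context"; '
        'type="application/ld+json"'
    )
    return resp

if __name__ == "__main__":
    app.run()
```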
Step 5: decide on (or develop) the underlying discoverability protocol
The findability requirement of OpenRiskNet demands that all APIs expose information about their capabilities in a unified manner and make it available for consumption by the discovery service(s).
Currently, we have identified two possible solutions, but given that both are under active development we need to postpone the decision on which one to use. Also, findings from steps 1-4 may provide further means for a more educated choice. Below is an outline of the two possible approaches along with the identified benefits and downsides.
Open Api
Open Api v3 (the successor of Swagger v2) is an API documentation specification. It provides the means for API developers to define API endpoints (operations) including their inputs and outputs. A tool ecosystem is available around this specification that allows, for example, the generation of interactive documentation from the API specification.
The benefits of choosing Open Api lie in the fact that many developers are already familiar with it and that it is gaining prominence in the API design community. Also, we are already proposing to use Open Api v3 in step 2.
The downside of Open Api is that it currently does not target linked data but rather traditional REST APIs that use JSON as the serialisation mechanism. This means we would have to extend Open Api to provide the desired semantic layer. Open Api does provide the means for introducing extensions using x- attributes, and these could be used to achieve our goal, although such extensions would of course be non-standard (a sketch follows below). Combined with JSON-LD properties such as allowing blank nodes for properties and the possible separation between response bodies and context, this may be a feasible solution.
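The sketch below shows how such an extension could look on a single Open Api parameter; the attribute name "x-semantic-type" and the IRI are hypothetical, not an agreed OpenRiskNet convention.

```python
import json

# A sketch of an Open Api parameter carrying a semantic annotation via a
# specification extension; the "x-semantic-type" attribute name and the IRI
# are hypothetical placeholders.
annotated_parameter = {
    "name": "smiles",
    "in": "query",
    "required": True,
    "schema": {"type": "string"},
    "x-semantic-type": "https://example.org/vocab/SMILES",
}

print(json.dumps(annotated_parameter, indent=2))
```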
Finally, Open Api does not uniquely prescribe the API envelope (paging, query parameter formation), which is usually needed and becomes important when it comes to API integration. We could use ideas from the JSON API specification to define these.
Hydra
Hydra is a vocabulary for hypermedia-driven APIs. It aims to solve for linked data the same problem that Open Api solves for traditional REST APIs. It uses JSON-LD as the serialisation mechanism and also provides ready-to-use means for API navigation, including paging.
The benefit of using Hydra is that it targets linked data and therefore seamlessly combines the linked data and API definitions using the same (linked data) tools. A sketch of a Hydra-style response is given below.
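For illustration, the following sketch shows a Hydra-style JSON-LD collection response with paging, built as a Python dictionary and printed as JSON; the dataset and member IRIs are hypothetical, and the term names follow the current Hydra draft as resolved through the published Hydra context.

```python
import json

# A sketch of a Hydra-style JSON-LD collection response with paging; the
# dataset and member IRIs are hypothetical placeholders.
hydra_collection = {
    "@context": "http://www.w3.org/ns/hydra/context.jsonld",
    "@id": "https://example.org/compounds?page=2",
    "@type": "Collection",
    "totalItems": 120,
    "member": [
        {"@id": "https://example.org/compounds/41"},
        {"@id": "https://example.org/compounds/42"},
    ],
    "view": {
        "@id": "https://example.org/compounds?page=2",
        "@type": "PartialCollectionView",
        "first": "https://example.org/compounds?page=1",
        "previous": "https://example.org/compounds?page=1",
        "next": "https://example.org/compounds?page=3",
        "last": "https://example.org/compounds?page=6",
    },
}

print(json.dumps(hydra_collection, indent=2))
```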
Hydra is not yet a W3C specification and is in a draft state (as of 16 January 2017). As a consequence, it might still change. Another consideration is the available tooling around the Hydra specification. While general linked data tools could be used to work with Hydra API specifications, it is often helpful in API design to have more specific tooling available. To provide a better assessment of this point, we will therefore collect more hands-on experience using Hydra.