Case Study
DataCure – Data curation and creation of pre-reasoned datasets and searching

Summary

Users will be able to access different OpenRiskNet data sources and retrieve specific entries. These entries can then be manually curated using an OpenRiskNet service and re-submitted to the data source. In an extended version, text mining facilities could be used for data annotation.

A first step will be to define the API and provide the semantic annotation for selected databases (i.e. diXa, FDA datasets, ToxCast and ChEMBL). During the preparation for these case studies, it became clear that the existing ontologies do not cover all requirements of the semantic interoperability layer. Therefore, ontology development and the design of the annotation process, as either an online or an offline (preprocessing) step, form a central part of this case study.

Objectives

  • This case study will serve as the entry point for the curation of all data sources to be used by the remaining use cases;
  • Semantic annotation and API definition for the selected databases will also be carried out in this use case.

Risk assessment framework

DataCure covers the identification of the use scenario and chemical of concern and the collection of existing information (Tier 0 in the selected framework), including the steps related to:

  • Identification of molecular structure;
  • Collection of support data;
  • Identification of analogues / suitability assessment and existing data.

Use Cases Associated

This case study is associated with UC1 - Merge existing data by a common structure identifier, where a user searches for existing assay information, selects the desired information, and merges the results based on a unique structure identifier. Specifically, the steps to achieve the different objectives of DataCure include:

  • The user identifies and visualises the molecular structure:

    1. Generation of molecular identifiers for database search
    2. Searching all databases
    3. Data curation
    4. Tabular representation
    5. Visualisation
  • The user collects support data:

    1. Provide data access scheme using the interoperability layer
    2. Access selected databases or flat files in a directory
    3. Query the ontology metadata service and select the ontologies to be used for annotation
    4. Annotate and index all data sets using text mining extraction infrastructure
    5. Pass the annotated data to the ontology reasoning infrastructure
    6. Generate a database of pre-reasoned datasets (semantic integration)
    7. Allow for manual curation
  • The user identifies chemical analogues:

    1. Inventory of molecules (commercially available or listed in databases)
    2. Generate list of chemically similar compounds
    3. Collect data of similar compounds
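
The merge step of UC1 (combining search results on a unique structure identifier) can be illustrated with a minimal sketch. It is not the OpenRiskNet implementation: the function name, the record layout, the `inchikey` field name and the placeholder identifiers and values are all assumptions made for illustration. Records without an identifier are skipped here, on the assumption that they would be routed to manual curation.

```python
from collections import defaultdict


def merge_by_identifier(sources, key="inchikey"):
    """Group records from several sources by a shared structure identifier.

    `sources` maps a (hypothetical) source name to a list of record dicts.
    Returns {identifier: {source_name: [record_without_key, ...]}}.
    """
    merged = defaultdict(lambda: defaultdict(list))
    for source_name, records in sources.items():
        for record in records:
            identifier = record.get(key)
            if not identifier:
                # No usable identifier: leave the entry out of the merge
                # (assumption: such entries need manual curation first).
                continue
            # Store the record without repeating the identifier itself.
            payload = {k: v for k, v in record.items() if k != key}
            merged[identifier][source_name].append(payload)
    # Convert nested defaultdicts into plain dicts for the caller.
    return {ident: dict(by_source) for ident, by_source in merged.items()}


# Placeholder data; identifiers and values are invented for illustration.
sources = {
    "source_a": [
        {"inchikey": "KEY-A", "assay": "assay_1", "value": 1.5},
        {"inchikey": "KEY-B", "assay": "assay_1", "value": 0.2},
    ],
    "source_b": [
        {"inchikey": "KEY-A", "endpoint": "endpoint_x", "value": 12.0},
        {"inchikey": None, "endpoint": "endpoint_x", "value": 3.0},
    ],
}

merged = merge_by_identifier(sources)
for identifier, by_source in sorted(merged.items()):
    print(identifier, sorted(by_source))
```

Keeping the records grouped per source, rather than flattening them into one row, preserves provenance, which matters when curated entries are to be re-submitted to the data source they came from.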

Databases and tools

The following data sources and tools are proposed for use within DataCure:

  • Physchem, toxicological and omics databases and cheminformatics tools: RDKit, CDK, Chemical Identifier Resolver (NIH), PubChem, registries (e.g. ECHA, INCI), Data Explorer (DC)
  • Ontology/terminology/annotation: SCAIView / JProMiner / BELIEF (Fraunhofer), openBEL

Service integration

A set of physicochemical property prediction services and ontology services will be integrated.

Currently available services:

Related resources

Tutorial
Workflow: Access TG-GATEs data for selected compounds, select differentially expressed genes and identify relevant pathways
Thomas Exner
13 Sep 2018
Abstract:
Example workflow based on OpenRiskNet tools - Pathway identification workflow related to the DataCure and AOPlink case studies. This notebook downloads TG-GATEs data for 4 compounds and selects genes overexpressed in all samples. The Affymetrix probe sets are then translated into Ensembl gene identifiers using the BridgeDB service, and pathways associated with the genes are identified using the WikiPathways service.
Related services:
Jupyter Notebooks

Publisher: OpenRiskNet
Target audience: Risk assessors, Researchers, Data modellers
Open access: yes
Licence: Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
Organisations involved: DC
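
The gene-selection step described in the tutorial abstract above (keeping only genes overexpressed in all samples) can be sketched as a set intersection. This is not the code of the notebook itself: the function name, the data layout, the log2 fold-change threshold and the probe and sample names are assumptions made for illustration. The subsequent identifier mapping (BridgeDB) and pathway lookup (WikiPathways) are not shown, because their interfaces are not described here.

```python
def probes_up_in_all_samples(expression, threshold=1.0):
    """Return probe IDs whose log2 fold change exceeds `threshold` in every sample.

    `expression` maps a sample name to {probe_id: log2_fold_change}.
    Both the layout and the threshold are illustrative assumptions.
    """
    per_sample = [
        {probe for probe, log2fc in values.items() if log2fc > threshold}
        for values in expression.values()
    ]
    if not per_sample:
        return set()
    # A probe is kept only if it passed the threshold in all samples.
    return set.intersection(*per_sample)


# Placeholder data; names and values are invented for illustration.
expression = {
    "sample_1": {"probe_1": 2.3, "probe_2": 1.4, "probe_3": -0.5},
    "sample_2": {"probe_1": 1.8, "probe_2": 0.3, "probe_3": 2.0},
    "sample_3": {"probe_1": 3.1, "probe_2": 1.9, "probe_3": 1.2},
}

selected = probes_up_in_all_samples(expression)
print(sorted(selected))
```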