DataCure – Data curation and creation of pre-reasoned datasets and searching
Users will be able to access different OpenRiskNet data sources and retrieve specific entries. These entries can then be manually curated using an OpenRiskNet service and re-submitted to the data source. In an extended version, text mining facilities could be used for data annotation.
A first step will be to define the API and provide the semantic annotation for selected databases (i.e. diXa, FDA datasets, ToxCast and ChEMBL). During the preparation for these case studies, it became clear that the existing ontologies do not cover all requirements of the semantic interoperability layer. Therefore, ontology development and design of the annotation process as an online or an offline/preprocessing step form a central part of this case study.
- This case study will serve as the entry point of curation of all data sources to be used by the remaining use cases;
- Semantic annotation and API definition for the selected databases will also be carried out in this use case.
Risk assessment framework
DataCure covers Tier 0 of the selected framework (identification of the use scenario and chemical of concern, and collection of existing information), specifically the steps related to:
- Identification of molecular structure;
- Collection of support data;
- Identification of analogues / suitability assessment and existing data.
Use Cases Associated
This case study is associated with UC1 - Merge existing data by a common structure identifier, where a user searches for existing assay information, selects the desired information, and merges the results based on a unique structure identifier. Specifically, the steps to achieve the different objectives of DataCure include:
The user identifies and visualises the molecular structure:
- Generation of molecular identifiers for database search
- Searching all databases
- Data curation
- Tabular representation
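The merge step of UC1 can be sketched as follows. This is a minimal illustration, not the OpenRiskNet implementation: the function name `merge_by_identifier`, the source names, the field names and the `KEY-1`/`KEY-2` identifiers are all hypothetical placeholders, and the search results are assumed to already be available as lists of records that share an identifier field.

```python
from collections import defaultdict


def merge_by_identifier(sources, key="inchikey"):
    """Merge records from several sources into one row per structure identifier."""
    merged = defaultdict(dict)
    for source_name, records in sources.items():
        for record in records:
            identifier = record[key]
            row = merged[identifier]
            row[key] = identifier
            for field, value in record.items():
                if field != key:
                    # Prefix each field with its source so columns do not collide.
                    row[f"{source_name}.{field}"] = value
    # One row per identifier, sorted for a stable tabular representation.
    return [merged[identifier] for identifier in sorted(merged)]


# Hypothetical search results from two data sources (placeholder identifiers).
sources = {
    "source_a": [
        {"inchikey": "KEY-1", "assay": "cytotoxicity", "value": 12.5},
        {"inchikey": "KEY-2", "assay": "cytotoxicity", "value": 3.1},
    ],
    "source_b": [
        {"inchikey": "KEY-1", "logp": 1.2},
    ],
}

table = merge_by_identifier(sources)
for row in table:
    print(row)
```

In this sketch, a second record with the same identifier from the same source overwrites the first; a real curation step would need to decide how to handle such duplicates.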
The user collects support data:
- Provide data access scheme using the interoperability layer
- Access selected databases or flat files in a directory
- Query the ontology metadata service and select the ontologies to be used for annotation
- Annotate and index all data sets using the text mining extraction infrastructure
- Pass the annotated data to the ontology reasoning infrastructure
- Generate a database of pre-reasoned datasets (semantic integration)
- Allow for manual curation
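The annotation and indexing steps above could look roughly like the following dictionary-based sketch. It is only an assumption about how such a step might work and does not use the text mining infrastructure named in this case study: the term dictionary, the `EX:` identifiers (placeholders, not real ontology IDs), the function names and the example texts are all hypothetical.

```python
import re

# Hypothetical term dictionary: label -> placeholder identifier.
# The "EX:" identifiers are invented for illustration and are not real ontology IDs.
TERMS = {
    "liver": "EX:0001",
    "hepatocyte": "EX:0002",
    "cytotoxicity": "EX:0003",
}


def annotate(text, terms=TERMS):
    """Return sorted (start, end, label, term_id) tuples for each whole-word match."""
    hits = []
    for label, term_id in terms.items():
        pattern = re.compile(r"\b" + re.escape(label) + r"\b", re.IGNORECASE)
        for match in pattern.finditer(text):
            hits.append((match.start(), match.end(), label, term_id))
    return sorted(hits)


def build_index(documents, terms=TERMS):
    """Map each term identifier to the set of document ids that mention it."""
    index = {}
    for doc_id, text in documents.items():
        for _, _, _, term_id in annotate(text, terms):
            index.setdefault(term_id, set()).add(doc_id)
    return index


# Hypothetical data set descriptions.
documents = {
    "ds1": "Cytotoxicity measured in primary hepatocyte cultures.",
    "ds2": "Liver weight and histopathology after 28 days.",
}

index = build_index(documents)
print(index)
```

The resulting annotations would then be the input for reasoning and for manual curation; exact string matching like this misses plurals and synonyms, which is one reason a dedicated text mining service is proposed.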
The user identifies chemical analogues:
- Inventory of molecules (commercially available or listed in databases)
- Generate list of chemically similar compounds
- Collect data of similar compounds
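Generating a list of chemically similar compounds is commonly done by comparing structural fingerprints with the Tanimoto coefficient. The sketch below assumes that approach; the case study does not specify the similarity measure. Fingerprints are represented as plain sets of integer feature ids, and the compound names, fingerprints, function names and the 0.5 threshold are hypothetical. In practice the fingerprints would come from a cheminformatics toolkit.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) coefficient between two sets of fingerprint features."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def similar_compounds(query, inventory, threshold=0.5):
    """Return (name, score) pairs at or above the threshold, most similar first."""
    scored = [(name, tanimoto(query, fp)) for name, fp in inventory.items()]
    hits = [(name, score) for name, score in scored if score >= threshold]
    return sorted(hits, key=lambda item: (-item[1], item[0]))


# Hypothetical inventory: compound name -> set of fingerprint feature ids.
inventory = {
    "compound_a": {1, 2, 3, 4},
    "compound_b": {1, 2, 5},
    "compound_c": {7, 8, 9},
}
query = {1, 2, 3}

hits = similar_compounds(query, inventory)
print(hits)
```

The names returned here would then be used to collect existing data for the analogues from the selected databases.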
Databases and tools
The following data sources and tools are proposed for use within DataCure:
- Physchem, toxicological and omics databases: RDKit, CDK, Chemical Identifier Resolver (NIH), PubChem, registries (e.g. ECHA, INCI), Data Explorer (DC)
- Ontology/terminology/annotation: SCAIView / JProMiner / BELIEF (Fraunhofer), openBEL
A set of physical-chemical properties prediction and ontology services will be integrated.
Currently available services:
- Collection of toxicological data sources exposed via OpenTox. Service type: Database / data source, Application
- Interactive computing and workflows sharing. Service type: Helper tool, Visualisation tool, Processing tool, Analysis tool, Software, Workflow
- Scientific workflows made simple. Service type: Database / data source, Service, Workflow
Example workflow based on OpenRiskNet tools: a pathway identification workflow related to the DataCure and AOPlink case studies. This notebook downloads TG-GATEs data for 4 compounds and selects the genes overexpressed in all samples. The Affymetrix probe sets are then translated into Ensembl gene identifiers using the BridgeDb service, and the pathways associated with these genes are identified using the WikiPathways service.
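The logic of this workflow can be sketched in plain Python. This is not the notebook itself: the BridgeDb and WikiPathways lookups are replaced by local dictionaries, and all probe, gene, pathway and sample names, the function names, and the log2 fold change threshold of 1.0 are hypothetical assumptions made for illustration.

```python
def overexpressed_in_all(fold_changes, threshold=1.0):
    """Return probe sets whose log2 fold change exceeds the threshold in every sample."""
    selected = None
    for values in fold_changes.values():
        up = {probe for probe, lfc in values.items() if lfc > threshold}
        selected = up if selected is None else selected & up
    return selected or set()


def translate(probes, mapping):
    """Translate probe sets to gene identifiers (stand-in for an identifier mapping service)."""
    genes = set()
    for probe in probes:
        genes.update(mapping.get(probe, ()))
    return genes


def pathways_for(genes, pathway_members):
    """Return names of pathways containing at least one of the genes (stand-in for a pathway service)."""
    return sorted(name for name, members in pathway_members.items() if genes & members)


# Hypothetical expression data: sample -> {probe set: log2 fold change}.
fold_changes = {
    "sample_1": {"probe_1": 2.3, "probe_2": 0.2, "probe_3": 1.8},
    "sample_2": {"probe_1": 1.5, "probe_2": 1.9, "probe_3": 1.1},
}
# Hypothetical probe-to-gene and pathway membership tables.
mapping = {"probe_1": ["GENE_A"], "probe_3": ["GENE_B", "GENE_C"]}
pathway_members = {
    "pathway_x": {"GENE_A", "GENE_D"},
    "pathway_y": {"GENE_E"},
    "pathway_z": {"GENE_C"},
}

probes = overexpressed_in_all(fold_changes)
genes = translate(probes, mapping)
pathways = pathways_for(genes, pathway_members)
print(sorted(probes), sorted(genes), pathways)
```

Taking the intersection across samples mirrors the "overexpressed in all samples" criterion; a probe set that misses the threshold in any single sample is dropped before identifier translation.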