Use case 2
Building a (predictive) model

Diagram:

Use Case 2 Diagram

Name:

Building and using a prediction model

Brief Description:

A user applies an algorithm to a dataset and uses the resulting model on a different dataset. The results are stored as novel dataset within the system.

Actors:

A researcher with background in model building.

Basic Flow:

The user selects an algorithm and a dataset within the system to induce a model. The selection of the algorithm is based on the intended uses, i.e. within a supervised setting using a classification or regression algorithm. The algorithm possesses a number of default parameters, which can be adjusted to the users specifications. The user starts the induction process. Upon termination of the algorithm the user receives the result in form of a model. The user can supply an already existing second dataset and apply the model to it. The results are returned as novel dataset URI.

Preconditions:

  • A dataset, identified by its URI, containing:
    • One or more substance (URIs) each with
      • a target feature (categorical/numerical)
      • optionally additional numerical features (URI)
  • An existing algorithm with default parameters for:
    • Classification, requires
      • a categorical target feature
    • Regression, requires
      • a numerical target feature

Postconditions:

Success End Condition:

  • A model has been induced.
  • If applicable, prediction on the second dataset have been produced and stored using this model.

Failure End Condition:

  • No model has been induced.
  • If applicable, no prediction on the second dataset could be produced and/or the results of applying the model to the second dataset could not be stored.

Minimal Guarantee:

  • None

Relationship to other use cases:

When exchanging the actor Researcher to a validation algorithm, this use case could be part of a possible Validation use case.

Alternate Flows:

NA

Exception Flows:

Possible exceptions:

  • Alternative 1: Only applicable algorithms can be selected:
    • The dataset is empty/not accessible: the system returns an error code
    • The supplied parameters are out of scope: the system returns an error code
  • Alternative 2: User supplies dataset and algorithms without a selection process:
    • As above, but extra:
    • The supplied algorithm cannot be applied to the dataset: the system returns an error code

Extensions:

  • Adopt model construction for Clustering
  • Provide possibilities for feature construction
  • Allow other types of features

Selecting a training data set:

  • Include descriptor calculation services/APIs from SMILES (CDK), PDB files (crystallographic data, MOPAC), images (image analysis software).
  • Preprocess (normalize or scale) of data.
  • Include categorical features.
  • Select subsets of the dataset (compounds and/or descriptors)
  • Datasets are annotated with the the type of variables that they contain (numerical/categorical)