Investigating a model’s data output requires handling vast amounts of data efficiently and reliably.
We developed the
dantro Python package for this task and integrated it into Utopia.
dantro implements a data processing pipeline which streamlines handling, transforming, and visualizing data via a sequence of configurable operations.
As with the Utopia project, the package aims to improve the efficiency, reliability, and reproducibility of the scientific workflow.
Using dantro, Utopia automates the process of generating and evaluating data:
- Performing a model simulation that generates the model's data output
- Loading the data into a uniform hierarchical format
- Carrying out a set of pre-configured data analysis and visualization routines
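The three stages above can be sketched as a single automated pipeline. Note that all function names below are hypothetical placeholders for illustration, not the actual Utopia or dantro interfaces:

```python
# Illustrative sketch of the automated workflow; all names are
# hypothetical placeholders, not the real Utopia/dantro API.

def run_simulation(cfg: dict) -> dict:
    """Stand-in for a model run that produces raw output data."""
    return {"universe_0": {"state": [cfg["seed"] * i for i in range(5)]}}

def load_data(raw: dict) -> dict:
    """Stand-in for loading output into a uniform hierarchical tree."""
    return {f"data/{grp}/{key}": vals
            for grp, d in raw.items() for key, vals in d.items()}

def evaluate(tree: dict, plots: list) -> dict:
    """Stand-in for pre-configured analysis/visualization routines."""
    return {name: sum(tree["data/universe_0/state"]) for name in plots}

raw = run_simulation({"seed": 2})
tree = load_data(raw)
results = evaluate(tree, plots=["summed_state"])
print(results)  # {'summed_state': 20}
```

The point of the sketch is that, once each stage is driven by configuration, chaining them requires no manual intervention between simulation and evaluation.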
Similar to the configuration-based definition of simulations, the data processing pipeline puts configuration files center stage. At all stages, operations are dynamically configurable via YAML, the “human friendly data serialization standard”.
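To illustrate what configuration-driven operations can look like, the following sketch dispatches a sequence of operations from a dictionary of the kind a YAML file would parse into. The configuration keys and operation names are hypothetical, not dantro's actual schema:

```python
# Sketch of configuration-driven operations. The dict below is what a
# hypothetical YAML file like the following might parse into:
#
#   transform:
#     - operation: increment
#       args: [3]
#     - operation: square
#
cfg = {"transform": [{"operation": "increment", "args": [3]},
                     {"operation": "square"}]}

# Registry mapping operation names (as used in the config) to callables
OPERATIONS = {
    "increment": lambda x, n=1: x + n,
    "square": lambda x: x * x,
}

def apply_pipeline(value, cfg):
    """Applies the configured sequence of operations to a value."""
    for step in cfg["transform"]:
        op = OPERATIONS[step["operation"]]
        value = op(value, *step.get("args", []))
    return value

print(apply_pipeline(2, cfg))  # (2 + 3) ** 2 = 25
```

Because the pipeline is defined purely in data, changing the analysis means editing a YAML file rather than touching code.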
To handle large amounts of data, dantro does not load all data directly, but creates proxy objects that load data only when needed. Furthermore, it integrates with dask to allow out-of-memory computations.
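The proxy mechanism can be sketched in a few lines; this is a minimal illustration of deferred loading, not dantro's actual proxy classes:

```python
# Minimal sketch of proxy-based lazy loading (illustrative only).

class DataProxy:
    """Defers loading until the data is actually accessed."""

    def __init__(self, loader):
        self._loader = loader
        self._data = None
        self.loaded = False

    def resolve(self):
        if not self.loaded:
            self._data = self._loader()  # load on first access only
            self.loaded = True
        return self._data

def expensive_load():
    """Stand-in for reading a large file from disk."""
    return list(range(1000))

proxy = DataProxy(expensive_load)
assert not proxy.loaded           # nothing has been read yet
total = sum(proxy.resolve())      # data is loaded here, on demand
assert proxy.loaded
print(total)  # 499500
```

In a data tree with many such proxies, only the branches an analysis actually touches ever incur loading cost.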
With the data transformation framework, dantro and Utopia aim to separate the tasks of processing and visualizing data. The benefit of this approach is that each module can focus on what it does best, and that modules can simply be combined. Given the configuration-based approach, this requires making data transformations accessible via a declarative configuration, just as is done for the visualization.
The data transformation framework allows selecting the desired data and applying transformations to it. Transformation operations are assembled into a directed acyclic graph (DAG), which makes a number of optimizations possible: computing only the necessary operations, re-using computed results, caching computationally costly results and reloading them in a later session, and more.
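A toy version of such a DAG evaluation conveys the idea: each node names an operation and its dependencies, and resolving a target node recursively computes only what that target needs, memoizing shared results. The node names and structure here are illustrative assumptions, not dantro's actual framework:

```python
# Sketch of a transformation DAG: each node maps to (callable, deps).
# Only nodes required by the requested target are computed, and each
# result is computed at most once (memoized in a cache).

nodes = {
    "data":    (lambda: [1, 2, 3, 4], []),
    "doubled": (lambda d: [x * 2 for x in d], ["data"]),
    "total":   (lambda d: sum(d), ["doubled"]),
    "unused":  (lambda d: max(d), ["data"]),  # not needed for "total"
}

def compute(target, nodes, cache=None, log=None):
    """Resolves a node, recursively computing only its dependencies."""
    cache = {} if cache is None else cache
    if target in cache:
        return cache[target]  # re-use an already computed result
    func, deps = nodes[target]
    args = [compute(dep, nodes, cache, log) for dep in deps]
    result = func(*args)
    cache[target] = result
    if log is not None:
        log.append(target)
    return result

log = []
print(compute("total", nodes, log=log))  # 20
print(log)  # ['data', 'doubled', 'total'] -- 'unused' never evaluated
```

Persisting the cache to disk is what allows costly intermediate results to be reloaded in a later session instead of being recomputed.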