Close to a Functional Package – DataONE Notebooks

One major goal of my summer internship is to develop a package with a set of basic modules for model-data intercomparison. With the two modules added this week, the package is almost functional, which means we can start building useful provenance-aware workflows with it.

The two new modules I’ve been working on this week are temporal aggregation and univariate statistics. Temporal aggregation extracts summaries at a user-specified temporal granularity, such as the monthly mean of a climate variable. This is especially useful for detecting temporal trends in climate data (e.g. Daymet data). In the module I developed, users can easily choose among different temporal granularities (yearly, seasonal, monthly, etc.) for the aggregation.
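To make the idea concrete, here is a minimal sketch of temporal aggregation in plain Python. It is not the module's actual code: the names `temporal_aggregate` and `period_key` are made up for illustration, the input is assumed to be simple (date, value) pairs instead of gridded Daymet arrays, and the seasonal grouping assumes meteorological seasons with December counted toward the following year's winter.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

# Assumption: meteorological seasons (DJF, MAM, JJA, SON).
SEASONS = {
    12: "DJF", 1: "DJF", 2: "DJF",
    3: "MAM", 4: "MAM", 5: "MAM",
    6: "JJA", 7: "JJA", 8: "JJA",
    9: "SON", 10: "SON", 11: "SON",
}


def period_key(day, granularity):
    """Map a date to the period it belongs to at the given granularity."""
    if granularity == "yearly":
        return (day.year,)
    if granularity == "monthly":
        return (day.year, day.month)
    if granularity == "seasonal":
        # Assumption: December belongs to the next year's winter (DJF).
        year = day.year + 1 if day.month == 12 else day.year
        return (year, SEASONS[day.month])
    raise ValueError("unsupported granularity: %r" % granularity)


def temporal_aggregate(samples, granularity="monthly", reducer=mean):
    """Group (date, value) pairs by period and reduce each group to one value."""
    groups = defaultdict(list)
    for day, value in samples:
        groups[period_key(day, granularity)].append(value)
    return {key: reducer(values) for key, values in sorted(groups.items())}


# Hypothetical daily values of one climate variable.
samples = [
    (date(2012, 1, 1), 2.0),
    (date(2012, 1, 2), 4.0),
    (date(2012, 2, 1), 10.0),
    (date(2012, 12, 15), 6.0),
]
monthly_mean = temporal_aggregate(samples, "monthly")
yearly_mean = temporal_aggregate(samples, "yearly")
seasonal_mean = temporal_aggregate(samples, "seasonal")
```

Passing a different `reducer` (for example `max` or `sum`) gives other summaries over the same periods.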

The univariate statistics module calculates basic statistics along different axes for a single variable. The statistics currently supported are mean, median, sum, standard deviation, and variance. The module is quite flexible: users can specify any combination of axes, so they can get statistics along the time axis (e.g. temporal sum) or along lat/lon (e.g. spatial mean).
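The "any combination of axes" idea can be sketched roughly as follows. This is an illustration only, not the module's implementation: `reduce_axes` and its helpers are hypothetical names, the data is a small nested list indexed as [time][lat][lon] rather than a real array type, and the standard deviation and variance here are the population versions, which is my assumption for the example.

```python
import statistics
from collections import defaultdict
from itertools import product

# The five statistics mentioned above. Population std/variance is an assumption.
STATISTICS = {
    "mean": statistics.mean,
    "median": statistics.median,
    "sum": sum,
    "std": statistics.pstdev,
    "variance": statistics.pvariance,
}


def shape_of(data):
    """Return the shape of a regular nested list, e.g. (time, lat, lon)."""
    shape = []
    while isinstance(data, list):
        shape.append(len(data))
        data = data[0]
    return tuple(shape)


def element_at(data, index):
    """Look up one value in a nested list by a tuple of indices."""
    for i in index:
        data = data[i]
    return data


def reduce_axes(data, axis_names, reduce_over, statistic="mean"):
    """Apply a statistic along the named axes; keys index the remaining axes."""
    func = STATISTICS[statistic]
    kept = [i for i, name in enumerate(axis_names) if name not in reduce_over]
    groups = defaultdict(list)
    for index in product(*(range(n) for n in shape_of(data))):
        key = tuple(index[i] for i in kept)
        groups[key].append(element_at(data, index))
    return {key: func(values) for key, values in groups.items()}


# Hypothetical variable with 2 time steps on a 2 x 2 lat/lon grid.
axes = ("time", "lat", "lon")
data = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[5.0, 6.0], [7.0, 8.0]],
]
temporal_sum = reduce_axes(data, axes, {"time"}, "sum")        # keyed by (lat, lon)
spatial_mean = reduce_axes(data, axes, {"lat", "lon"}, "mean")  # keyed by (time,)
```

Reducing over all three axes at once collapses the variable to a single value stored under the empty key `()`.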

The modules I developed previously, such as spatial subset, mosaic, and regrid, all perform basic data processing and preparation. These two new modules start to touch on data analysis. Their output can be visualized using existing visualization modules in Vistrails/UV-CDAT or using new tools that will be developed by a DataONE post-doc who specializes in visualization techniques. With all these basic modules in place, we can start to explore building concrete workflows.

I also spent a lot of time this week exploring the possibility of putting workflow execution online. If this can be done, users will not need to install Vistrails/UV-CDAT on their computers to use workflows. Vistrails can be run in server mode, but the setup is really nontrivial; I finally got it to work after quite a bit of effort. The basic mechanism is this: when Vistrails/UV-CDAT is running in server mode, a client or another program can send XML-RPC calls asking the server to execute a workflow stored in a MySQL database on the server side. Both Python and PHP can be used to send XML-RPC calls to the server.
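On the client side, sending such a call from Python might look roughly like the sketch below, which uses the standard library's `xmlrpc.client` (called `xmlrpclib` in Python 2). Everything specific here is a placeholder and not the real Vistrails server API: the method name `run_workflow`, its arguments, and the server URL are assumptions for illustration, and the actual method names and parameters should be taken from the Vistrails server documentation.

```python
import xmlrpc.client


def build_call(method_name, *params):
    """Return the XML-RPC request body a client would POST to the server."""
    return xmlrpc.client.dumps(params, methodname=method_name)


def execute_remote_workflow(server_url, method_name, *params):
    """Send one XML-RPC call to the server and return its response."""
    proxy = xmlrpc.client.ServerProxy(server_url)
    return getattr(proxy, method_name)(*params)


# Hypothetical method name and arguments: a database name and a workflow id.
payload = build_call("run_workflow", "climate_db", 7)

# With a server actually running, the call itself would be something like:
# result = execute_remote_workflow(
#     "http://localhost:8080", "run_workflow", "climate_db", 7)
```

Printing `payload` shows the XML document that travels over HTTP, which is why any language with an XML-RPC library, PHP included, can act as the client.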

Now that Vistrails/UV-CDAT is set up as a server, the next steps are to build an interface for interacting with the server and to add parallel processing capability to some of our modules. If the Vistrails/UV-CDAT server resides on a high-performance computer that supports parallel processing, users who have not installed Vistrails on their own computers can use the server's computing resources to execute workflows.
