I was multitasking this week. First, I continued developing the mosaic module, which was not finished last week; second, I started work on the Daymet scenario, which was identified at the face-to-face meeting at ORNL.
The mosaic module is the most challenging of all the modules I have developed so far. There are no ready-to-use mosaicking functions in existing NetCDF libraries (e.g. cdms2), so I had to write everything myself. This involves constructing a new grid from the input files, calculating mean values for overlapping areas, and so on. The trickier parts are skewed alignment, padding areas, and missing coordinates in curvilinear NetCDF files (e.g. Daymet files). I finally got it to work. However, because a lot of computation is involved, the module is slow when the files to be merged are large. This is a serious issue, as climate data now usually have very high spatial and temporal resolution and data volumes keep growing. A possible solution is parallel computing: once the final grid has been generated, the computation can be distributed across multiple cores or machines to speed up processing.
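As a rough illustration of the overlap-averaging step only, here is a minimal sketch. It assumes all tiles already share one regular, axis-aligned grid with spacing `cell`, so it does not cover the skewed-alignment, padding, or curvilinear cases described above. The function name `mosaic_mean` and the tile layout are hypothetical, and plain Python lists stand in for NetCDF arrays to keep the example dependency-free.

```python
def mosaic_mean(tiles, cell):
    """Merge tiles on a shared regular grid, averaging where they overlap.

    tiles: list of (x0, y0, rows). (x0, y0) is the lower-left corner and rows
    is a list of equal-length lists (row 0 is the southernmost row).
    None marks a missing value.
    """
    # Extent of the output grid that covers every input tile.
    xmin = min(x0 for x0, _, _ in tiles)
    ymin = min(y0 for _, y0, _ in tiles)
    xmax = max(x0 + len(rows[0]) * cell for x0, _, rows in tiles)
    ymax = max(y0 + len(rows) * cell for _, y0, rows in tiles)
    nx = int(round((xmax - xmin) / cell))
    ny = int(round((ymax - ymin) / cell))

    # Running sum and number of contributing tiles for each output cell.
    total = [[0.0] * nx for _ in range(ny)]
    count = [[0] * nx for _ in range(ny)]
    for x0, y0, rows in tiles:
        j0 = int(round((x0 - xmin) / cell))  # column offset in the output grid
        i0 = int(round((y0 - ymin) / cell))  # row offset in the output grid
        for i, row in enumerate(rows):
            for j, value in enumerate(row):
                if value is not None:
                    total[i0 + i][j0 + j] += value
                    count[i0 + i][j0 + j] += 1

    # Mean where at least one tile contributed, None elsewhere.
    return [[total[i][j] / count[i][j] if count[i][j] else None
             for j in range(nx)] for i in range(ny)]


west = (0.0, 0.0, [[1.0, 1.0], [1.0, 1.0]])
east = (1.0, 0.0, [[3.0, 3.0], [3.0, None]])  # overlaps the west tile by one column
merged = mosaic_mean([west, east], cell=1.0)
```

Because each output cell depends only on its own sum and count, the accumulation could in principle be split by tile or by block of rows, which is what makes the parallel approach plausible.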
The Daymet scenario is one of the workflows we want to develop during this summer internship. Daymet data are daily, gridded surfaces of temperature, precipitation, humidity, and radiation for the United States, Mexico, and Canada (south of 52 degrees North) from 1980 to 2011. Climate scientists need to get data from the Daymet repository, process them (e.g. spatial subsetting), and derive spatial or temporal summaries (e.g. monthly averages). This week I developed two modules (GetDaymetTileList and DownloadDaymetTiles) to access Daymet data. In both modules, the user only needs to specify the spatial extent (minimum and maximum latitude and longitude) and the time range of interest; the modules then calculate the tile IDs and either generate a tile list or download the tiles into a local directory. Speed is still a major issue here. Again, downloading the data in parallel would be an improvement.
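The tile-lookup step can be sketched as a bounding-box intersection test. This is not the actual GetDaymetTileList code: the function name `tiles_in_extent`, the tile IDs, and the tile bounds below are all made up for illustration, and a real module would read the tile index from the Daymet repository.

```python
def tiles_in_extent(tile_bounds, lat_min, lat_max, lon_min, lon_max):
    """Return the sorted ids of tiles whose bounding box overlaps the extent.

    tile_bounds maps tile id -> (lat_min, lat_max, lon_min, lon_max).
    Tiles that only touch the extent along an edge are not counted.
    """
    hits = []
    for tile_id, (t_lat_min, t_lat_max, t_lon_min, t_lon_max) in tile_bounds.items():
        overlaps_lat = t_lat_min < lat_max and t_lat_max > lat_min
        overlaps_lon = t_lon_min < lon_max and t_lon_max > lon_min
        if overlaps_lat and overlaps_lon:
            hits.append(tile_id)
    return sorted(hits)


# Hypothetical tile index: ids and bounds are invented for this example.
EXAMPLE_TILES = {
    "T1": (38.0, 40.0, -86.0, -84.0),
    "T2": (38.0, 40.0, -84.0, -82.0),
    "T3": (40.0, 42.0, -86.0, -84.0),
}

selected = tiles_in_extent(EXAMPLE_TILES, 38.5, 39.5, -85.0, -83.0)
```

The resulting list could either be written out as a tile list or handed to a downloader, one request per tile and year, which is also the natural unit to parallelize.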
Many modules could benefit from parallel processing, especially for large datasets. Therefore, I started to explore how to incorporate parallel processing into Vistrails modules. My initial finding is that parallel processing code (e.g. Python multiprocessing) cannot run directly inside a Vistrails module. One solution would be to call a separate Python script from Vistrails. I’ll explore this further next week.
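One way the "separate script" idea might look is sketched below, under several assumptions: the worker script, its command-line interface, and the helper `run_in_separate_process` are hypothetical; the Vistrails module wrapper is omitted; and the code targets current Python 3 rather than the Python version bundled with Vistrails. The worker is written to a temporary file only so that the example is self-contained.

```python
import os
import subprocess
import sys
import tempfile

# Stand-in for a standalone worker script that would normally live in its own file.
WORKER_SOURCE = '''
import multiprocessing
import sys

def square(x):
    # Placeholder for the real per-item work (e.g. processing one tile).
    return x * x

if __name__ == "__main__":
    values = [int(v) for v in sys.argv[1:]]
    with multiprocessing.Pool(2) as pool:
        results = pool.map(square, values)
    print(" ".join(str(r) for r in results))
'''


def run_in_separate_process(values):
    """Run the worker script in its own interpreter and parse its output."""
    with tempfile.TemporaryDirectory() as tmp:
        script = os.path.join(tmp, "worker.py")
        with open(script, "w") as handle:
            handle.write(WORKER_SOURCE)
        # The calling module only launches the script and waits for it;
        # all multiprocessing happens inside the child interpreter.
        result = subprocess.run(
            [sys.executable, script] + [str(v) for v in values],
            capture_output=True, text=True, check=True,
        )
    return [int(token) for token in result.stdout.split()]
```

Calling `run_in_separate_process([1, 2, 3])` launches the worker and returns its parsed results. Passing data through command-line arguments and standard output is only practical for small inputs; for large NetCDF files the script would more likely receive file paths and write its output to disk.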