Xray and Dask: Out-Of-Core, Labeled Arrays in Python

Loic · on June 12, 2015

Since it is written in Python, is it compatible with numba[0] when using the @jit(nogil=True) decorator? It would be great if f in ds.groupby('some variable').apply(f) could be a JIT-compiled numba function.

[0]: http://numba.pydata.org/

shoyer · on June 12, 2015

Yeah, Numba makes it awesomely easy to write fast functions in Python that release the GIL. You can already do this directly with dask.array by passing a Numba compiled function to the map_blocks method: http://dask.readthedocs.org/en/latest/array-api.html#dask.ar... -- it should be pretty straightforward to wrap this with xray.
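For illustration, here is a minimal sketch of that pattern. The kernel name clip_negative, the array contents, and the chunk sizes are made up for the example, and the try/except fallback is only there so the snippet still runs where Numba is not installed.

```python
import numpy as np
import dask.array as da

try:
    from numba import jit
except ImportError:
    # Fallback (assumption for this sketch): run the plain Python function
    # when Numba is unavailable, so the example stays runnable.
    def jit(*args, **kwargs):
        return lambda func: func


@jit(nopython=True, nogil=True)
def clip_negative(block):
    # Hypothetical block-wise kernel: replace negative values with zero.
    out = np.empty_like(block)
    for i in range(block.shape[0]):
        for j in range(block.shape[1]):
            v = block[i, j]
            out[i, j] = v if v > 0.0 else 0.0
    return out


source = np.arange(-8.0, 8.0).reshape(4, 4)
x = da.from_array(source, chunks=(2, 2))

# map_blocks applies the function to each chunk independently; because the
# compiled kernel releases the GIL, dask's threads can work on chunks in parallel.
result = x.map_blocks(clip_negative, dtype=x.dtype).compute()
```

With nogil=True the per-chunk calls do not serialize on the interpreter lock, which is where the speedup over a plain Python loop would come from.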

lqdc13 · on June 12, 2015

Why not just sample further or use a machine where this dataset fits in memory? A minute to compute the mean seems unreasonable if the goal is to perform more complex tasks in the future.

shoyer · on June 12, 2015

Indeed, those are both great options when possible. But easy access to parallel computing is still quite useful.

For interactive analysis or building statistical models, you probably do still want your data to fit in memory. But often it's most useful to make your data smaller by calculating some sort of summary statistics instead of subsampling. For example, if you're interested in climate change, you might want to work with monthly means instead of the original daily or sub-daily data. Currently, climate scientists usually do this sort of thing with command line tools.
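As a rough sketch of the monthly-means idea, using the xarray package (the name xray was later released under): the variable name temperature and the synthetic daily values are invented for the example, and a real dataset would be opened from disk rather than built in memory.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic daily series for January through March 2000 (illustrative data only).
time = pd.date_range("2000-01-01", "2000-03-31", freq="D")
daily = xr.DataArray(
    np.arange(time.size, dtype=float),
    coords={"time": time},
    dims=["time"],
    name="temperature",
)

# Reduce daily values to one mean per calendar month ("MS" = month start).
monthly = daily.resample(time="MS").mean()
```

The same resample call should also work on a dask-backed array, in which case the reduction is evaluated chunk by chunk instead of loading everything at once.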

As for machines where datasets fit into memory -- that's also great, if you have access to them. But even then, for most operations numpy will be limited to a single core. Calculating the mean of 51GB of data is still pretty slow, even if it already is in memory. Your machine with 256 GB of memory almost assuredly has 32+ cores to go along with it, and it's a shame to let them sit idle.
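A small sketch of what using those cores looks like with dask.array; the array size and chunk shape here are arbitrary stand-ins, far smaller than the 51GB case.

```python
import numpy as np
import dask.array as da

# Stand-in for a large in-memory array (sizes chosen only for illustration).
data = np.random.RandomState(0).rand(4000, 1000)

# Split into 8 row-wise chunks; each chunk's partial sum can run on its own thread.
x = da.from_array(data, chunks=(500, 1000))

# Building the expression is lazy; compute() runs it on dask's threaded scheduler.
mean_parallel = x.mean().compute()
```

NumPy releases the GIL inside its reductions, so the per-chunk work can proceed on several threads at once.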

This post by Nikolay Koldunov gives some more context about the value of dask.array for weather data: http://earthpy.org/dask.html

ngoldbaum · on June 12, 2015

Although you can use numexpr for some easy (albeit modest) threaded speedups and streamlined memory usage. I've had more success coding custom array processing routines in cython where I can easily exploit threads.
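For reference, a minimal numexpr sketch; the expression and array sizes are arbitrary examples, not anything from the post.

```python
import numpy as np
import numexpr as ne

a = np.linspace(0.0, 1.0, 1_000_000)
b = np.linspace(1.0, 2.0, 1_000_000)

# numexpr evaluates the whole expression in blocks across threads and avoids
# allocating the full-size temporaries that plain NumPy would create.
result = ne.evaluate("2 * a + 3 * b ** 2", local_dict={"a": a, "b": b})
```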

devty · on June 12, 2015

excellent work!