Ray + Zarr: a better way to scale plate reconstruction workflows
Ray — a unified framework for scaling AI and Python applications.

If you work with plate tectonic reconstructions at scale — hundreds of time steps, global raster grids, millions of sample points — you’ve probably hit the ceiling of single-machine processing. I’ve been exploring different distributed computing frameworks for these workflows, and I’m now firmly in the Ray + Zarr camp. Here’s why.
Why Ray
Ray is a distributed computing framework that gets a lot right for scientific workloads. Three features stand out.
Shared object store
Ray runs a shared-memory object store on every node. When you place a large array in the store with ray.put(), every worker on that node can read it via zero-copy access — no serialization, no duplication. The array is memory-mapped directly into each worker’s address space.
For plate reconstruction workflows, this is transformative. A global rotation model or a reference raster might be hundreds of megabytes. With Ray, you store it once and every worker reads it at memory speed:
```python
import ray
import numpy as np

ray.init()

# Store a large reference raster once
base_raster = np.load("global_topography.npy")  # 500 MB
raster_ref = ray.put(base_raster)  # stored in shared memory

@ray.remote
def sample_raster(raster, points):
    # raster is a zero-copy view — no 500 MB copy per worker
    return raster[points[:, 0], points[:, 1]]

# All 16 workers share the same memory
futures = [sample_raster.remote(raster_ref, pts) for pts in point_batches]
results = ray.get(futures)
```
In a framework that copies inputs to each worker, 16 workers means 16 copies — 8 GB of RAM for a single array. Ray uses 500 MB.
Actors for stateful computation
Ray’s actor model lets you keep expensive objects alive across many task invocations. In plate tectonics, loading a rotation model is slow — you don’t want to do it for every time step. With Ray actors, you load it once:
```python
@ray.remote
class ReconstructionWorker:
    def __init__(self, rotation_file, topology_file):
        import pygplates
        self.model = pygplates.RotationModel(rotation_file)
        self.topologies = pygplates.FeatureCollection(topology_file)

    def reconstruct_raster(self, raster, time):
        # self.model stays loaded across calls
        rotated = reconstruct(raster, self.model, time)
        return rotated

# Create a pool of workers — each loads the model once
workers = [ReconstructionWorker.remote(rot_file, topo_file)
           for _ in range(num_cpus)]
pool = ray.util.ActorPool(workers)

# Process 200 time steps across the pool
results = list(pool.map(
    lambda w, t: w.reconstruct_raster.remote(base_raster_ref, t),
    range(0, 200),
))
```
Each worker initialises its rotation model once, then processes dozens of time steps without reloading. This alone can cut wall-clock time dramatically.
Decentralised scheduling
Ray distributes scheduling decisions across every node in the cluster. Each node’s Raylet daemon manages local resources and places tasks with awareness of data locality — preferring nodes that already hold the required objects. There’s no central scheduler bottleneck, which means Ray scales comfortably to thousands of nodes and millions of tasks.
Why Zarr
Zarr is a chunked, compressed, N-dimensional array format designed for parallel and cloud-native access. It stores each chunk as an independent object — a file on disk or an object in S3/GCS.
For spatio-temporal raster data, this chunk-per-object architecture is exactly what you want:
- Parallel I/O: each chunk can be read independently, so N workers can read N chunks simultaneously with no contention.
- Selective access: reading a single time step from a `(time, lat, lon)` array only touches the chunks for that slice — not the entire dataset.
- Concurrent writes: workers can write to non-overlapping chunks of the same store in parallel, with no file-level locking.
- Cloud-native: unlike HDF5/NetCDF which need POSIX file semantics, Zarr works natively on S3, GCS, and Azure Blob.
The chunking scheme is the key design decision. For plate reconstruction workflows where you typically process one time step at a time, chunking along the time axis is natural:
```python
import zarr
from numcodecs import Blosc  # zarr v2-style compressor

# Create a Zarr store chunked by time step
store = zarr.open("reconstructions.zarr", mode="w")
store.create_dataset(
    "temperature",
    shape=(200, 1801, 3601),  # 200 time steps, 0.1° global grid
    chunks=(1, 1801, 3601),   # one chunk per time step
    dtype="float32",
    compressor=Blosc(cname="zstd", clevel=3),
)
```
Each time step is a single chunk. Workers can read and write individual time steps without touching any other part of the store.
Ray + Zarr together
The combination is natural: Zarr provides chunk-addressable storage, Ray provides distributed execution with shared memory. The workflow pattern is:
- Store your input rasters as Zarr, chunked to match your processing units.
- Use Ray tasks or actors to process each chunk in parallel.
- Write results back to a Zarr store, with each worker writing its own chunks concurrently.
- Share large read-only reference data (rotation models, coastline polygons) through Ray’s object store.
Here’s a complete example — extracting raster values at sample points across many time steps:
```python
import ray
import zarr
import numpy as np

@ray.remote
def extract_timestep(zarr_path, time_idx, points):
    """Extract raster values at point locations for one time step."""
    store = zarr.open(zarr_path, mode="r")
    raster = store["temperature"][time_idx, :, :]
    lats, lons = points[:, 0], points[:, 1]
    lat_idx = ((90 - lats) * 10).astype(int)
    lon_idx = ((lons + 180) * 10).astype(int)
    return time_idx, raster[lat_idx, lon_idx]

ray.init()

# Share point coordinates via object store (zero-copy on each node)
sample_points = np.load("sample_sites.npy")  # shape (N, 2)
points_ref = ray.put(sample_points)

# Process all 200 time steps in parallel
futures = [
    extract_timestep.remote("reconstructions.zarr", t, points_ref)
    for t in range(200)
]
results = dict(ray.get(futures))
```
Each task reads exactly one chunk from the Zarr store. The sample point coordinates are shared via Ray’s object store without copying. The tasks are embarrassingly parallel, and Ray schedules them across available cores with locality awareness.
But why not Dask?
This is the question I get most often, and it’s fair — Dask is excellent and deeply embedded in the geoscience Python ecosystem. Xarray uses it natively. If your workflow is ds.mean(dim='time').compute(), Dask is hard to beat.
But for the kind of workflows I’m describing — reconstructing rasters across deep time, sampling at millions of points, running heterogeneous processing pipelines — Ray has meaningful advantages:
Shared memory vs. copies. Dask’s multiprocessing scheduler copies task inputs to each worker. Ray’s object store provides zero-copy reads. When your reference data is large, this difference is measured in gigabytes of RAM.
Stateful workers. Dask’s computation model is purely functional — every task is a stateless function call. If you need to keep a rotation model loaded across many time steps, you either reload it every time or contort your task graph. Ray actors solve this cleanly.
Scheduling at scale. Dask uses a centralised scheduler that tracks every task in the graph. This is fine for moderate workloads, but becomes a bottleneck with millions of fine-grained tasks. Ray’s decentralised Raylet architecture avoids this entirely.
Dynamic workflows. Ray tasks can spawn new tasks based on intermediate results. Dask’s task graphs are typically constructed up front. For workflows where the next step depends on the previous result — say, adaptive mesh refinement or iterative convergence — Ray is more natural.
When I still use Dask: interactive xarray analysis, quick exploratory work on a single machine, and anywhere I want lazy evaluation baked into the xarray API. Dask is the right tool for those jobs. But for production-scale reconstruction pipelines, I reach for Ray.
Getting started
```bash
pip install "ray[default]" zarr
```
Ray’s documentation is thorough, and the core walkthrough covers tasks, actors, and the object store in about 15 minutes. For Zarr, the Cloud-Native Geospatial guide is an excellent starting point.
If you’re working with plate reconstructions specifically, GPlately provides the reconstruction primitives — the patterns above show how to scale them with Ray and Zarr.

I am an ARC Industry Research Fellow in the School of Geography, Earth and Atmospheric Sciences at The University of Melbourne. I am an expert in fusing Earth evolution models with data to understand how groundwater moves critical minerals through the landscape. Related research interests include the cycling of volatiles within the Earth, probabilistic thermal models of the lithosphere to unravel past tectonic and climatic events, and understanding how enigmatic volcanoes form.
I am a vocal advocate for the integral role of geoscience in responding to challenges we face in transitioning to the carbon-neutral economy. As an expert in my field, I have been interviewed in national and international print media, TV, and radio on a wide variety of subjects including earthquakes, volcanoes, groundwater, and critical minerals.