Dask is another parallel computing framework, but a little more lightweight than Apache Spark.
In task scheduling we break our program into many medium-sized tasks, or units of computation, often a function call on a non-trivial amount of data. We represent these tasks as nodes in a graph, with edges between nodes if one task depends on data produced by another. A task scheduler is then called to execute this graph. Task scheduling logic often hides within other, larger frameworks.
Dask uses a specification to encode a graph – a directed acyclic graph of tasks with data dependencies – using common Python objects: dicts, tuples, and callables. A Dask graph is a dictionary that maps keys to computations:

{'x': 1,
 'y': 2,
 'z': (add, 'x', 'y'),
 'w': (sum, ['x', 'y', 'z']),
 'v': [(sum, ['w', 'z']), 2]}


A key is any hashable value that is not a task. A task is a tuple with a callable as its first element; the remaining elements of the tuple are arguments to that function.
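Because the graph is just a dict, any scheduler can execute it directly. A minimal sketch, assuming the graph above with `operator.add` standing in for the `add` callable, using Dask's synchronous single-machine scheduler `dask.get`:

```python
from operator import add

import dask

# The example graph from above; keys map to values or task tuples
dsk = {'x': 1,
       'y': 2,
       'z': (add, 'x', 'y'),
       'w': (sum, ['x', 'y', 'z']),
       'v': [(sum, ['w', 'z']), 2]}

# Ask the scheduler for a key; it resolves dependencies first
w = dask.get(dsk, 'w')   # sum([1, 2, 3]) -> 6
v = dask.get(dsk, 'v')   # [sum([6, 3]), 2] -> [9, 2]
```

Lists inside a value are traversed too, which is why the task nested in 'v' is also evaluated.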

import h5py
import dask.array as da

dataset = h5py.File('myfile.hdf5')['/x']

# Wrap the on-disk dataset, one dask chunk per HDF5 chunk
x = da.from_array(dataset, chunks=dataset.chunks)

y = x[::10] - x.mean(axis=0)
y.compute()


import dask.dataframe as dd

# Assumes CSV files with 'timestamp' and 'value' columns
df = dd.read_csv('data/*.csv', parse_dates=['timestamp'])

df.groupby(df.timestamp.dt.hour).value.mean().compute()


import dask.bag as db
import json

# Assumes line-delimited JSON files with 'name' and 'id' fields
records = db.read_text('data/*.json').map(json.loads)

records.filter(lambda d: d['name'] == 'Alice').pluck('id').frequencies()


Besides the built-in collections, Dask also provides a wide variety of ways to parallelize custom applications. When using dask.delayed, keep these guidelines in mind:

• Call delayed on the function, not the result.
• Call compute() on lots of computation at once. Ideally, make many dask.delayed calls to define your computation and then call dask.compute only once at the end.
• Break up computations into many pieces. You achieve parallelism by having many dask.delayed calls, not by using only a single one.
• Avoid calling delayed within delayed functions.
• Don’t call delayed on other Dask collections. When you place a dask array or dask dataframe into a delayed call, that function receives the NumPy or pandas equivalent. Instead, it is more common to use methods like da.map_blocks or df.map_partitions, or to turn your arrays or dataframes into many delayed objects:
partitions = df.to_delayed()
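Putting the guidelines together, a small sketch (the inc and double helpers are hypothetical, just to illustrate the pattern): many dask.delayed calls define the graph, and a single dask.compute at the end runs everything in parallel.

```python
import dask

@dask.delayed
def inc(x):
    # hypothetical helper: one small unit of work
    return x + 1

@dask.delayed
def double(x):
    # hypothetical helper: another small unit of work
    return 2 * x

# Many delayed calls build up the task graph lazily...
results = [double(inc(i)) for i in range(10)]

# ...and one compute() at the end executes them all at once
totals = dask.compute(*results)
```

Note that nothing executes until compute() is called; each list element is a lazy Delayed object until then.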