Getting Started with Dask and Kubernetes
Dask enables parallel and out-of-core computation. We couple blocked algorithms with dynamic and memory aware task scheduling to achieve a parallel and out-of-core NumPy clone. We show how this extends the effective scale of modern hardware to larger datasets and discuss how these ideas can be more broadly applied to other parallel collections.
― Matthew Rocklin
If you have never heard of Dask, it is a parallel programming library for Python. The project was presented at SciPy 2015 by Matthew Rocklin and is sponsored by NumFOCUS.
Dask is built on top of NumPy and Pandas and extends their familiar interfaces to larger-than-memory and parallel computing environments. Moreover, it has a promising future, as Pandas, Jupyter and scikit-learn maintainers also maintain Dask.
Unlike Apache Spark, which runs on the JVM, Dask also enables distributed computing in pure Python.
This guide uses Kubernetes (through kubectl and kops) and Helm.
Kubernetes
First, we install kubectl and kops using Homebrew or an equivalent package manager.
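As a minimal sketch, assuming Homebrew is already installed, the installation might look like this. The two version checks at the end are only a sanity test and are not part of the original guide.

```shell
# Install both command-line tools with Homebrew.
brew install kubectl kops

# Confirm that both binaries are on the PATH.
kubectl version --client
kops version
```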
Second, we give kops a bucket to store its cluster configuration.
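One way to create such a bucket is with the AWS CLI. This is a sketch under several assumptions: the AWS CLI is installed and configured with credentials, the bucket lives in us-east-1, and the bucket name dask-kops-state-store is a hypothetical placeholder (S3 bucket names must be globally unique).

```shell
# Hypothetical bucket name; replace it with a globally unique one.
BUCKET=dask-kops-state-store

# Create the bucket in us-east-1. Other regions additionally need
# --create-bucket-configuration LocationConstraint=<region>.
aws s3api create-bucket --bucket "$BUCKET" --region us-east-1

# Enable versioning so an earlier cluster configuration can be recovered.
aws s3api put-bucket-versioning \
  --bucket "$BUCKET" \
  --versioning-configuration Status=Enabled
```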
Then we set the environment variables for kops to use.
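A sketch of those variables follows. KOPS_STATE_STORE is the variable kops reads to locate its state bucket; NAME is only a shell convenience for later commands. Both values are assumed placeholders: the bucket name is hypothetical, and the cluster name ends in .k8s.local, which kops has traditionally treated as a gossip-based cluster that needs no DNS zone.

```shell
# Cluster name (assumed placeholder).
export NAME=dask.k8s.local

# Location of the kops state store (hypothetical bucket name).
export KOPS_STATE_STORE=s3://dask-kops-state-store
```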
Now we choose an EC2 instance type. Note that, especially for embarrassingly parallel problems, we usually do not need expensive machines; we can instead take advantage of more workers.
Now we launch the cluster creation.
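The launch could be sketched as follows. The zone, instance type and node count are illustrative assumptions rather than the values used in the original guide, and the command expects KOPS_STATE_STORE to be set in the environment.

```shell
# Assumed values; adjust them to your own account and workload.
NAME="${NAME:-dask.k8s.local}"   # cluster name (hypothetical)
ZONES=us-east-1a                 # availability zone (assumption)
NODE_SIZE=t3.medium              # a modest instance type (assumption)
NODE_COUNT=4                     # several small workers rather than one large one

# Create the cluster; --yes applies the changes immediately
# instead of only previewing them.
kops create cluster \
  --name="$NAME" \
  --zones="$ZONES" \
  --node-size="$NODE_SIZE" \
  --node-count="$NODE_COUNT" \
  --yes
```

Raising NODE_COUNT while keeping NODE_SIZE small follows the advice above about preferring more workers to bigger machines.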
The cluster is now starting and should be ready in a few minutes.
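To see whether the cluster is ready, something like the following may help. It assumes KOPS_STATE_STORE is set and uses the hypothetical cluster name dask.k8s.local unless NAME is already defined.

```shell
# Ask kops whether the control plane and all nodes are healthy.
kops validate cluster --name "${NAME:-dask.k8s.local}"

# List the nodes as Kubernetes sees them.
kubectl get nodes
```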
Once you are done, REMEMBER to tear the cluster down. Otherwise you will have to pay for the uptime.
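A teardown sketch, again assuming KOPS_STATE_STORE is set and using the hypothetical cluster name unless NAME is already defined:

```shell
# Delete the cluster and the cloud resources kops created for it.
# Without --yes, kops only lists what it would delete.
kops delete cluster --name "${NAME:-dask.k8s.local}" --yes
```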
Helm
Let’s get helm started.
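The exact setup depends on the Helm version. Helm 2 required running helm init to install Tiller into the cluster; the sketch below instead assumes Helm 3, which has no such step, and assumes Homebrew for the installation. The local repository alias dask is our own choice.

```shell
# Install the Helm client (assumes Homebrew).
brew install helm

# Register the Dask chart repository under the alias "dask"
# and refresh the local chart index.
helm repo add dask https://helm.dask.org/
helm repo update
```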
Finally, we install the Dask Helm Chart.
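With Helm 3 syntax, the installation might look like this. The release name my-dask is a hypothetical placeholder, and the chart reference dask/dask assumes the Dask repository was added under the alias dask.

```shell
# Install the chart as a release named "my-dask" (hypothetical name).
helm install my-dask dask/dask

# Show the state of the release and any notes printed by the chart.
helm status my-dask
```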
After a few minutes, we check the running services.
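A sketch of that check, assuming the chart was installed into the current namespace:

```shell
# List the services created by the chart, with their addresses and ports.
kubectl get services

# Optionally confirm that the pods behind them are running.
kubectl get pods
```

The service names are derived from the Helm release name, so they will differ from one installation to another.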
When launching the Jupyter server, you will be prompted for a password. The default password is dask.