Getting Started with Dask and Kubernetes

Dask enables parallel and out-of-core computation. We couple blocked algorithms with dynamic and memory aware task scheduling to achieve a parallel and out-of-core NumPy clone. We show how this extends the effective scale of modern hardware to larger datasets and discuss how these ideas can be more broadly applied to other parallel collections.
― Matthew Rocklin

If you never heard of Dask, it is parallel programming library for Python. The project was presented at SciPy 2015 by Matthew Rocklin ans is sponsored by NumFOCUS.

Dask is built on top of NumPy and Pandas and extends their familiar interfaces to larger-than-memory and parallel computing environments. Moreover, it has a promising future, as Pandas, Jupyter and scikit-learn maintainers also maintain Dask.

Dask also enables distributed computing in pure Python as opposed to Apache Spark.

This guide uses following tools,

Kubernetes

First, we install kubectl and kops using Homebrew or equivalent,

brew install kubectl kops

Second, we give kops a bucket to store the configurations.

aws s3api create-bucket \
    --bucket g4brielvs-kops-state-store
aws s3api put-bucket-versioning \
    --bucket g4brielvs-kops-state-store \
    --versioning-configuration Status=Enabled
aws s3api put-bucket-encryption \
    --bucket g4brielvs-kops-state-store \
    --server-side-encryption-configuration '{"Rules": [{"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}]}'

Then we set the environment variables for kops to use,

export KOPS_CLUSTER_NAME=g4brielvs.k8s.local
export KOPS_STATE_STORE=s3://g4brielvs-kops-state-store

Now we choose from EC2 instance types. Note that specially when solving embarassingly parallel problems, we will not require more often than not expensive machine, rather we may take advantage of more workers.

kops create cluster \
    --node-count=4 \
    --node-size=t3.micro \
    --zones=us-east-1a

Now we launch the cluster creation,

kops update cluster --name ${KOPS_CLUSTER_NAME} --yes

Now cluster is starting and it should be ready in a few minutes.

kops validate cluster 

Once you are done, REMEMBER to tear the cluster down. Otherside you will have to pay for the uptime.

kops delete cluster --name ${KOPS_CLUSTER_NAME} --yes

Helm

Let’s get helm started,

brew install kubernetes-helm
helm init
kubectl create serviceaccount --namespace kube-system tiller
kubectl create clusterrolebinding tiller-cluster-rule --clusterrole=cluster-admin --serviceaccount=kube-system:tiller
kubectl patch deploy --namespace kube-system tiller-deploy -p '{"spec":{"template":{"spec":{"serviceAccount":"tiller"}}}}'
helm init --service-account tiller --upgrade

Finally, we install the Dask Helm Chart,

helm install stable/dask

After a few minutes, we check the running services,

kubectl get services

When launching the Jupyter server, you will be prompted for a password. The default password is dask.