Hosting Dash/Plotly and Bokeh Apps on Compute Studio


Over the past month, I’ve been working to add the capability for Compute Studio (C/S) to host plots made with Dash/Plotly and Bokeh. This work also paves the way for supporting Shiny and other plotting libraries. For quite some time, I've admired how easy and fast it is to create valuable tools for exploring data using Bokeh or Dash. It makes me incredibly excited that C/S can now host apps created with them. This is also a way for C/S to meet developers where they are by supporting the tools that they already use every day. In turn, C/S offers them the capability to share their work with colleagues, friends, or the public at large.

As always, publishing apps on C/S is free, but the app developer, an app sponsor, or the app user must pay for the compute costs. One of the requirements when building this out was to make sure that apps are only running while they are being used. The result is a low-cost way for developers to deploy and share their Bokeh and Dash/Plotly creations.

If you read this and are interested in publishing your own app on C/S, head on over to the Publish page. Not sure where to get started? Send me an email at hank@compute.studio and I'll walk you through the publishing process.

Step 1: Run a Dash app in Kubernetes

One of the things that I’ve learned in the short time that I’ve been working with Kubernetes is that you should start simple and check your progress often. So before I started trying to automate anything with the Kubernetes Python API, I created a couple basic YAML files describing the Deployment, Service, and Ingress objects needed to serve my example project.

Once I got this to work, I knew that I was going in the right direction and could more confidently write the Python code required to automatically create the necessary Kubernetes objects.

Step 2: Automate publishing a Dash app in Kubernetes

Next, I translated the YAML files into a new Server class in the cs_workers package, which is used to automatically create, delete, and check the status of visualization deployments and services.

from cs_workers.models.clients import Server

viz = Server(
    owner="hdoupe",
    title="myviz",
    tag="v1",
    callable_name="start_dash_server",
    incluster=False,
)

# Create a visualization
viz.create()

# Check visualization status
viz.ready_stats()

# Delete visualization
viz.delete()
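
For reference, creating a Deployment with the official Kubernetes Python client looks roughly like the sketch below. This is a simplified illustration rather than the actual cs_workers implementation, and the image name, namespace, and port are made up:

from kubernetes import client, config

# Load credentials from ~/.kube/config (use config.load_incluster_config() inside the cluster).
config.load_kube_config()

labels = {"app": "myviz"}
deployment = client.V1Deployment(
    api_version="apps/v1",
    kind="Deployment",
    metadata=client.V1ObjectMeta(name="myviz", labels=labels),
    spec=client.V1DeploymentSpec(
        replicas=1,
        selector=client.V1LabelSelector(match_labels=labels),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels=labels),
            spec=client.V1PodSpec(
                containers=[
                    client.V1Container(
                        name="myviz",
                        image="registry.example.com/hdoupe/myviz:v1",  # hypothetical image
                        ports=[client.V1ContainerPort(container_port=8050)],  # Dash's default port
                    )
                ]
            ),
        ),
    ),
)

client.AppsV1Api().create_namespaced_deployment(namespace="default", body=deployment)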

Step 3: Embed the Viz!

After starting at the low level of just getting the visualization server running and viewing it in my browser, I needed to see it embedded on a C/S webpage. Beyond being exciting to see, this was a helpful sanity check to make sure that embedding it as an IFrame was the right approach.

It works!

Next, it was time to share some early progress. I deployed it to the C/S development site and got a big message from my browser saying that I had a problem, specifically an HTTPS problem. For good reason, you can’t embed a page on a site served over HTTPS if the embedded page does not also support HTTPS. So, I set about learning how to add TLS to a service in a Kubernetes cluster.

Step 4: Enter Traefik

Traefik is a routing service that can be deployed in several ways, including with Docker and Kubernetes. Traefik routes traffic from your domain to your services based on simple rules such as:

Host(`viz.compute.studio`) && PathPrefix(`/hdoupe/ccc-widget/dash/`)

which corresponds to this URL: viz.compute.studio/hdoupe/ccc-widget/dash/.

Further, it’s straightforward to integrate with your DNS provider and Let’s Encrypt to make your routes secure with TLS and HTTPS.

One of the cool things about open-source software is that you can learn a lot just by following a project and seeing how they approach problems. For example, I learned a lot about Traefik by reading about how JupyterHub and Dask-Gateway use it. This was an enormous help when setting Traefik up for this project.

I’ll spare you the technical details for this section, but the takeaway is that the viz was live at https://viz.compute.studio/hdoupe/ccc-widget/dash/ and rendering correctly as an IFrame on the development server.

Step 5: Bringing it all together

Now that the cluster is more or less ready to start serving visualizations, it’s time to bring in the brains of the operation, a.k.a. the frontend of C/S. A key design decision for C/S is that the compute cluster stays simple but reliable. The frontend is where most data is stored and where the higher-level cluster automation logic lives.

There are three data sources that are important for this project: the deployment records stored in the frontend’s database, the timestamp of each deployment’s most recent full page load, and the timestamp of its most recent ping.
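
As a rough sketch, the deployment record tracked by the frontend looks something like this (the field names here are hypothetical, not C/S's actual schema):

from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class VizDeployment:
    # Hypothetical frontend record for one visualization deployment.
    owner: str                               # e.g. "hdoupe"
    title: str                               # e.g. "myviz"
    tag: str                                 # image tag, e.g. "v1"
    created_at: datetime                     # when the deployment was spun up
    last_load_at: Optional[datetime] = None  # most recent full page load
    last_ping_at: Optional[datetime] = None  # most recent "phone home" ping
    deleted_at: Optional[datetime] = None    # set when the deployment is spun down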

Step 6: Spin them up, spin them down

We now have the data we need to answer some key questions, and we have the operations needed to act on these answers. The general lifecycle of a visualization is as follows:

  1. A user goes to the visualization page.

    Is there a deployment running? If not, let’s start one up. A new deployment object is created in the frontend’s database, and a new visualization deployment is created in the compute cluster. While the deployment is spinning up, a JavaScript function pings a REST API endpoint to check the status of the deployment. Once it is ready, the snippet embeds the IFrame in the HTML page. The user can now interact with it.

    Another JavaScript function is triggered that pings the REST API endpoint, letting it know the deployment is still in use. The snippet only “phones home” while document.hasFocus() returns true. If the user navigates to another tab, opens another window, or their computer falls asleep, the document loses focus and the page no longer phones home.

  2. Another user comes to the visualization page.

    Is there a deployment running? Yes! The timestamp of the deployment’s most recent full page load is updated, and the visualization page continues to phone home as long as the user’s webpage has focus.

  3. It’s been some time since we’ve heard from the viz.

    Is anyone still using this deployment? Every 15 minutes, a process runs that asks this question for every live deployment. If there hasn’t been a full page load or ping in over 30 minutes, the deployment is spun down. The frontend does this by sending a DELETE request to a REST API endpoint on the compute cluster, as sketched below.
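
A minimal sketch of that periodic check, assuming deployment records with the timestamp fields described above and a hypothetical REST route on the compute cluster (this is not the actual C/S code):

from datetime import datetime, timedelta

import requests

STALE_AFTER = timedelta(minutes=30)

def cleanup_stale_deployments(deployments, cluster_url, token):
    # Spin down any deployment that hasn't seen a full page load or ping recently.
    now = datetime.utcnow()
    for dep in deployments:  # e.g. rows loaded from the frontend's database
        last_seen = max(t for t in (dep.created_at, dep.last_load_at, dep.last_ping_at) if t)
        if now - last_seen <= STALE_AFTER:
            continue
        # Tell the compute cluster to delete the deployment (hypothetical route).
        resp = requests.delete(
            f"{cluster_url}/deployments/{dep.owner}/{dep.title}/",
            headers={"Authorization": f"Bearer {token}"},  # auth token (see Step 7)
        )
        resp.raise_for_status()
        dep.deleted_at = now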

These rules will need some tweaking. Bots caused problems early on because their constant pinging of the viz page meant that the deployment was never spun down. This was corrected to some extent by adding the viz pages to the robots.txt file. However, there are more measures that can be taken in the future to mitigate this.

The check for deleting deployments could also be tweaked by setting a different stale-after time limit for the most recent load time and most recent ping time.

Step 7: Security measures

Now that some of the Kubernetes resources are controlled by a REST API endpoint, I needed to make sure that not just anyone can manipulate the compute cluster. After some research, it seemed like there are two approaches that could be used to authenticate requests to the compute cluster: OAuth 2.0 and JSON Web Tokens (JWTs). I noticed that JupyterHub uses OAuth, but for our situation, adding OAuth seemed like it would add a great deal of complexity to the architecture of C/S since it would require an external authentication service. In the end, the simplest way forward was a JWT approach where the frontend and compute cluster use a securely stored shared secret to encode and decode JWTs. When the cluster receives a request, it ensures that it can decode the JWT in the request header with the shared secret.
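
As a rough illustration of the shared-secret flow (this sketch uses the PyJWT library and made-up claims; the actual C/S implementation may differ):

import datetime

import jwt  # PyJWT

SHARED_SECRET = "replace-with-a-securely-stored-secret"  # known to both the frontend and the cluster

# Frontend: sign a short-lived token and send it in the request header.
token = jwt.encode(
    {"sub": "cs-frontend", "exp": datetime.datetime.utcnow() + datetime.timedelta(minutes=5)},
    SHARED_SECRET,
    algorithm="HS256",
)
headers = {"Authorization": f"Bearer {token}"}

# Compute cluster: reject any request whose token does not decode with the same secret.
def authenticate(request_headers):
    try:
        raw = request_headers["Authorization"].split(" ", 1)[1]
        return jwt.decode(raw, SHARED_SECRET, algorithms=["HS256"])
    except (KeyError, IndexError, jwt.InvalidTokenError):
        raise PermissionError("Invalid or missing JWT")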

Step 8: Put it in production

Time to deploy! The capability to host Bokeh and Dash/Plotly apps has been live for about 10 days now. Deployments are being spun up and down quite smoothly. This isn’t to say that there haven’t been any bugs. However, the primary components of serving apps to users and spinning them up and down quickly have worked as expected.

Here are the numbers for the first Dash app published on Compute Studio: it has been live for 10 days, has been deployed 54 times, and has been running for about 25% of that period (roughly 60 of the 240 hours). On average, each deployment is live for about an hour.

Wrapping up

Over the past few months, I have spent most of my time improving Compute Studio’s infrastructure to make it more scalable, more maintainable, and cheaper. Each of these projects has pushed me to learn a different part of the Kubernetes API. By the end of each project, I feel like I have gained at least twice as much Kubernetes and general infrastructure knowledge as I had at the beginning. It’s been an immensely rewarding (and at times frustrating) experience.

This project had two major pain points, and I hope that others can learn from them:

  1. Before trying Traefik, I tried to use cert-manager and could not get it to work. I spent two days on it, debugging countless errors and bugging my friend who does this stuff full-time. Finally, I swapped over to Traefik, and within a couple of hours, I had it working. Further, Traefik’s routing support wound up being extremely useful for routing requests to the right viz deployment.
  2. The Kubernetes Python client is great if you know what you are looking for. I needed to manage Traefik’s IngressRoute object, which is a Custom Resource Definition. You are looking for the CustomObjectsApi, and here’s an example of how to use it with Traefik’s IngressRoute: ingressroute.py. A minimal sketch follows below.
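
For anyone hunting for the same thing, here is a minimal sketch of creating an IngressRoute through CustomObjectsApi. The namespace, names, and certificate resolver are made up, and the CRD group/version may differ depending on your Traefik version:

from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster

ingress_route = {
    "apiVersion": "traefik.containo.us/v1alpha1",
    "kind": "IngressRoute",
    "metadata": {"name": "myviz", "namespace": "default"},
    "spec": {
        "entryPoints": ["websecure"],
        "routes": [
            {
                "kind": "Rule",
                "match": "Host(`viz.compute.studio`) && PathPrefix(`/hdoupe/myviz/dash/`)",
                "services": [{"name": "myviz", "port": 8050}],
            }
        ],
        "tls": {"certResolver": "letsencrypt"},  # hypothetical resolver name
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="traefik.containo.us",
    version="v1alpha1",
    namespace="default",
    plural="ingressroutes",
    body=ingress_route,
)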

Thanks for reading! As always, I’d love to hear your feedback either about the visualizations or the engineering behind them. Please feel welcome to email me at hank at compute.studio, join our community chat, or open an issue at the compute-studio repo.