Jupyter and other interactive environments are the go-to tools for most data scientists. However, many production data pipeline platforms (e.g., Airflow, Kubernetes) pull them into non-interactive development paradigms. Hence, when moving to production, the data scientist's code needs to migrate from the interactive environment to a more traditional software environment (e.g., declaring workflows as Python classes). This creates friction since the code must cross this gap every time the data scientist deploys their work. Data scientists often pair with software engineers to do the conversion, but this is time-consuming and costly. It's also frustrating because it's just busy work.
We encountered this problem while working in the data space. Eduardo was a data scientist at Fidelity for a few years. He deployed ML models and always found it annoying and wasteful to port the code from his notebooks into a production framework like Airflow or Kubernetes. Ido worked as a consultant at AWS and repeatedly saw data science projects spend about 30% of their time converting a notebook prototype into a production pipeline.
Interactive environments have historically been used for prototyping and are considered unsuitable for production; this is reasonable because, in our experience, most code developed interactively lives in a single file with little to no structure (e.g., a gigantic notebook). However, we believe it's possible to bring software engineering best practices to the interactive development world so data scientists can produce maintainable projects and streamline deployment.
Ploomber allows data scientists to quickly develop their code as modular pipelines rather than a single giant file. When developed this way, their code is suitable for deployment to production platforms; we currently support exporting to Kubernetes, AWS Batch, Airflow, Kubeflow, and SLURM with no code changes. Our integration with Jupyter/VSCode/PyCharm lets them iteratively build these modular pipelines without leaving the interactive environment. In addition, modularizing the work enables them to create more maintainable and testable projects. Our goal is ease of use, with minimal disturbance to the data scientist's existing workflows.
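To make the "modular pipelines" idea concrete, here is a rough sketch of what a Ploomber pipeline spec can look like. The file names, task order, and product paths below are made up for illustration; the general shape (a `pipeline.yaml` listing tasks, where each task is a script or notebook with declared products, and upstream dependencies are declared inside the scripts themselves) follows Ploomber's spec format, but consult the project's documentation for the exact schema:

```yaml
# pipeline.yaml -- hypothetical three-step pipeline (names are illustrative)
tasks:
  # Each task is a plain script/notebook; Ploomber assembles them into a DAG
  - source: scripts/get.py
    product:
      nb: output/get.ipynb       # executed copy of the script, for inspection
      data: output/raw.csv
  - source: scripts/clean.py     # declares `upstream = ['get']` in the script
    product:
      nb: output/clean.ipynb
      data: output/clean.csv
  - source: scripts/train.py     # declares `upstream = ['clean']`
    product:
      nb: output/train.ipynb
      model: output/model.pickle
```

Because each step is its own file with explicit inputs and outputs, the pieces can be edited interactively in Jupyter, tested in isolation, and then exported as a whole pipeline.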
Users can install Ploomber with pip, open Jupyter/VSCode/PyCharm, and start building in minutes. We've put significant effort into keeping the tool simple so people can get started quickly and learn the advanced features only when they need them. Ploomber is available at https://github.com/ploomber/ploomber under the Apache 2.0 license. In addition, we are working on a cloud version to help enterprises operationalize models. We're still working out the pricing details, but if you'd like us to let you know when we open the private beta, you can sign up here: https://ploomber.io/cloud. However, the core of our offering is the open-source framework, and it will remain free.
We're thrilled to share Ploomber with you! If you're a data scientist who has experienced these endless cycles of porting code for deployment, an ML engineer who helps data scientists deploy their work, or anyone with feedback, please share your thoughts! We love chatting about this domain since exchanging ideas always sheds light on aspects we haven't considered before. You may also reach out to me at [email protected].