I started my career searching for Supersymmetry and the Higgs boson at CERN's Large Hadron Collider, then moved to industry. I spent the last four years building ML infrastructure at Cruise. In both academia and industry, I witnessed researchers, data scientists, and ML engineers spending an absurd share of their time building makeshift tooling, stitching up infrastructure, and battling obscure systems, instead of focusing on their core area of expertise: extracting insights and predictions from data.
This was painfully apparent at Cruise, where the ML Platform team had to grow linearly with the number of users it supported and models it shipped to the car. What should have taken a single click (e.g. retraining a model when world conditions change: COVID parklets, road construction sites, deployment to new cities) often required weeks of painstaking work. Existing tools for prototyping and productionizing ML/DS models did not let developers become autonomous and tackle new projects instead of babysitting current ones.
For example, a widely adopted tool such as Kubeflow Pipelines requires users to learn an obscure Python API and to package and deploy their code and dependencies by hand, and it does not offer exhaustive tracking and visualization of artifacts beyond simple metadata.
To become autonomous, users needed a dead-simple way to iterate seamlessly between local and cloud environments (change code, validate locally, run at scale in the cloud, repeat) and to visualize objects (metrics, plots, datasets, configs) in a UI. Strong guarantees around dependency packaging, traceability of artifact lineage, and reproducibility would have to come out of the box.
Sematic lets ML/DS developers build and run pipelines of arbitrary complexity with nothing more than minimalistic Python APIs. Business logic, dynamic pipeline graphs, configurations, resource requirements, etc. — all with only Python. We are bringing the lovable aspects of Jupyter Notebooks (iterative development, visualizations) to the actual pipeline.
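To give a flavor of the style (this is a hedged, pure-Python sketch, not Sematic's actual API; the function names and the S3 path are made up for illustration, and the real library adds its own decorators and tracking on top):

```python
# Pipeline steps are ordinary Python functions with type hints.
def load_dataset(path: str) -> list[float]:
    # Stand-in for real data loading from `path`.
    return [0.25, 0.5, 0.75]

def train_model(data: list[float], epochs: int) -> float:
    # Stand-in training loop returning a single "metric".
    return sum(data) / len(data) * epochs

def pipeline(path: str, epochs: int) -> float:
    # The pipeline graph is just Python: branching, loops,
    # and nested calls all shape the graph dynamically.
    data = load_dataset(path)
    return train_model(data, epochs=epochs)

metric = pipeline("s3://bucket/data.csv", epochs=2)
```

The point is that configuration, control flow, and composition stay in one language, so iterating on a pipeline feels like iterating on any Python code.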
How it works: Sematic resolves dynamic nested graphs of pipeline steps (simple Python functions) and intercepts all inputs and outputs of each step to type-check, serialize, version, and track them. Individual steps are orchestrated as Kubernetes jobs according to required resources (e.g. GPU, high-memory), and all tracking and visualization information is surfaced in a modern UI. Build assets (user code, third-party dependencies, drivers, static libraries) are packaged and shipped to remote workers at runtime, which enables a fast and seamless iterative development experience.
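As a rough illustration of the interception idea (again a sketch, not Sematic's implementation: the `tracked_step` decorator and `ARTIFACT_LOG` are hypothetical stand-ins for the real tracking machinery):

```python
import functools
import json
import typing

# Stand-in for a tracking database of serialized artifacts.
ARTIFACT_LOG: list[dict] = []

def tracked_step(fn):
    """Wrap a pipeline step: check keyword arguments and the return
    value against the function's type hints, then record a serialized
    artifact for lineage tracking."""
    hints = typing.get_type_hints(fn)

    @functools.wraps(fn)
    def wrapper(**kwargs):
        # Type-check every input against its annotation.
        for name, value in kwargs.items():
            expected = hints.get(name)
            if expected is not None and not isinstance(value, expected):
                raise TypeError(f"{fn.__name__}.{name}: expected {expected.__name__}")
        result = fn(**kwargs)
        # Type-check the output.
        expected_out = hints.get("return")
        if expected_out is not None and not isinstance(result, expected_out):
            raise TypeError(f"{fn.__name__}: bad return type")
        # Serialize and track inputs and output.
        ARTIFACT_LOG.append({
            "step": fn.__name__,
            "inputs": json.dumps(kwargs),
            "output": json.dumps(result),
        })
        return result

    return wrapper

@tracked_step
def train(learning_rate: float) -> float:
    return 1.0 - learning_rate  # stand-in "accuracy"

@tracked_step
def evaluate(accuracy: float) -> str:
    return "ship" if accuracy > 0.9 else "retrain"

decision = evaluate(accuracy=train(learning_rate=0.01))
```

Because every step's inputs and outputs pass through the wrapper, type errors surface at the step boundary rather than deep inside a black-box job, and every artifact is recorded for later visualization.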
Sematic lets you achieve results much faster by not wasting time on packaging dependencies, foraging for output artifacts to visualize, investigating obscure failures in black-box container jobs, bookkeeping configurations, writing complex YAML templates to run multiple experiments, etc.
It can run on a local machine or be deployed to leverage cloud resources (GPUs, high-memory instances, map/reduce clusters, etc.) with minimal external dependencies: Python, PostgreSQL, and Kubernetes.
Sematic is open-source and free to use locally or self-hosted in your own cloud. We will provide a SaaS offering to enable access to cloud resources without the hassle of maintaining a cloud deployment. To get started, simply run `$ pip install sematic; sematic start`. Check us out at https://sematic.dev, star our GitHub repo, and join our Discord for updates, feature requests, and bug reports.
We would love to hear about your experience building reliable end-to-end ML training pipelines, and anything else you'd like to share in the comments!