We’re JY & Julien, co-founders of Data Mechanics (https://www.datamechanics.co), a big data platform striving to offer the simplest way to run Apache Spark.
Apache Spark is an open-source distributed computing engine. It’s the most used technology in big data. First, because it’s fast (10-100x faster than Hadoop MapReduce). Second, because it offers simple, high-level APIs in Scala, Python, SQL, and R. In a few lines of code, data scientists and engineers can explore data, train machine learning models, and build batch or streaming pipelines over very large datasets (size ranging from 10GBs to PBs).
While writing Spark applications is pretty easy, managing their infrastructure, deploying them and keeping them performant and stable in production over time is hard. You need to learn how Apache Spark works under the hood, become an expert with YARN and the JVM, manually choose dozens of infrastructure parameters and Spark configurations, and go through painfully slow iteration cycles to develop, debug, and productionize your app.
As you can tell, before starting Data Mechanics, we were frustrated Spark developers. Julien was a data scientist and data engineer at BlaBlaCar and ContentSquare. JY was the Spark infrastructure team lead at Databricks, the data science platform founded by the creators of Spark. We’ve designed Data Mechanics so that our peer data scientists and engineers can focus on their core mission - building models and pipelines - while the platform handles the mechanical DevOps work.
To realize this goal, we needed a way to tune infrastructure parameters and Spark configurations automatically. There are dozens of such parameters but the most critical ones are the amount of memory and cpu allocated to each node, the degree of parallelism of Spark, and the way Spark handles all-to-all data transfer stages (called shuffles). It takes a lot of expertise and trial-and-error loops to manually tune those parameters. To do it automatically, we first run the logs and metadata produced by Spark through a set of heuristics that determines if the application is stable and performant. A Bayesian optimization algorithm uses this analysis as well as data from past recent runs to choose a set of parameters to use on the next run. It’s not perfect - it needs a few iterations like an engineer would. But the impact is huge because this happens automatically for each application running on the platform (which would be too time-consuming for an engineer). Take the example of an application gradually going unstable as the input data grows over time. Without us, the application crashes on a random day, and an engineer must spend a day remediating the impact of the outage and debugging the app. Our platform can often anticipate and avoid the outage altogether.
The other way we differentiate is by integrating with the popular tools from the data stack. Enterprise data science platforms tend to require their users to abandon their tools to adopt their own end-to-end suite of proprietary solutions: their hosted notebooks, their scheduler, their way of packaging dependencies and version-controlling your code. Instead, our users can connect their Jupyter notebook, their Airflow scheduler, and their favourite IDE directly to the platform. This enables a seamless transition from local development to running at scale on the platform.
We also deploy Spark directly on Kubernetes, which wasn’t possible until recently (Spark version 2.3) - most Spark platforms run on YARN instead. This means our users can package their code dependencies on a Docker image and use a lot of k8s-compatible projects for free (for example around secrets management and monitoring). Kubernetes does have its inherent complexity. We hide it from our users by deploying Data Mechanics in their cloud account on a Kubernetes cluster that we manage for them. Our users can simply interact with our web UI and our API/CLI - they don’t need to poke around Kubernetes unless they really want to.
The platform is available on AWS, GCP, and Azure. Many of our customers use us for their ETL pipelines, they appreciate the ease of use of the platform and the performance boost from automated tuning. We’ve also helped companies start their first Spark project: a startup is using us to parallelize chemistry computations and accelerate the discovery of drugs. This is our ultimate goal - to make distributed data processing accessible to all.
Of course, we share this mission with many companies out there, but we hope you’ll find our angle interesting! We’re excited to share our story with the HN community today and we look forward to hearing about your experience in the data engineering and data science spaces. Have you used Spark and did you feel the frustrations we talked about? If you consider Spark for your next project, does our platform look appealing? We don’t offer self-service deployment yet, but you can schedule a demo with us from the website and we’ll be happy to give you a free trial access in exchange for your feedback.