Sarus (YC W22) – Work on sensitive data with differential privacy

Hi HN! Maxime, Nicolas, and Vincent here, founders of Sarus (https://www.sarus.tech). Sarus is privacy engineering software that lets data scientists work on data without needing to access it. It works as a proxy between the practitioner and the data. All queries and data processing jobs are executed on the original data, and their outputs carry the privacy guarantees of differential privacy.

When data is sensitive, getting access can be a huge pain. It means going through a long manual validation process that includes designing and implementing appropriate data anonymization. This takes weeks to months, and some data utility may be lost to the masking requirements.

Sarus makes all of that irrelevant by letting analysts work on data they never access directly. Analysts only see the outputs of their data jobs, and those can be protected with appropriate privacy measures.

With past lives in healthtech, finance, and marketing, we’ve experienced first-hand that data governance takes up a huge part of data operations. Protecting data is a legitimate objective, but it should not have to hamstring all innovation. For most data science or analytics objectives, the analyst has no interest in the information of a given individual. They look for patterns that are valid across the dataset. Access to user-level information is just an unfortunate way to get there.

We decided to build Sarus so that data access is no longer a requirement.

The Sarus API proxies all queries, compiles them into a privacy-safe version, runs them on the original data (which never moves outside of our clients’ infrastructure), and outputs the protected results to the practitioner. The protection relies on differential privacy, a mathematical definition of privacy already used by leading tech companies. Differential privacy works by adding calibrated randomness to outputs so that the information of any given individual cannot be reliably inferred. One of its main benefits is that it does not make any assumption about what is sensitive in the data or what the recipient of the output may already know or do. This makes it the ideal candidate for replacing all manual data governance processes with something fully automated. Sarus rewrites each query so that it implements the core principles of differential privacy.
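To make "calibrated randomness" concrete, here is a minimal sketch of the textbook Laplace mechanism for a count and a clipped sum. It is not Sarus's implementation: the function names, the `ages` list, and the epsilon value are made up for illustration, and a production system would need a hardened noise sampler rather than Python's `random` module.

```python
import random

def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def dp_count(values, predicate, epsilon, rng):
    # Adding or removing one record changes a count by at most 1,
    # so the sensitivity is 1 and the noise scale is 1 / epsilon.
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon, rng)

def dp_bounded_sum(values, lower, upper, epsilon, rng):
    # Clip every value into [lower, upper] so that a single record
    # can move the sum by at most max(|lower|, |upper|).
    clipped = [min(max(v, lower), upper) for v in values]
    sensitivity = max(abs(lower), abs(upper))
    return sum(clipped) + laplace_noise(sensitivity / epsilon, rng)

# Hypothetical data and parameters, for illustration only.
rng = random.Random(0)
ages = [23, 35, 41, 29, 62, 58, 37, 44]
print(dp_count(ages, lambda a: a >= 40, epsilon=1.0, rng=rng))
print(dp_bounded_sum(ages, lower=0, upper=100, epsilon=1.0, rng=rng))
```

A smaller epsilon means a larger noise scale: stronger privacy, less accurate answers. The clipping bounds in `dp_bounded_sum` are the kind of "range of input data" parameter that has to be set before a query can run.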

For the core primitives of differential privacy, we leverage the latest research (Dwork & Roth 2014, Abadi 2016, Dong 2019, Koskela 2020, Wilson 2019) and open source implementations (tensorflow-privacy, Google Differential Privacy, OpenDP, SmartNoise). Our key contribution is to bundle everything into an API that can be queried without seeing the data in the first place. This requires proper privacy accounting (we use PLD accounting as in Koskela 2020) as well as setting all the technical parameters that the framework requires (estimating the range of input data, allocating privacy budget across computation steps…). We also optimize the privacy-utility trade-off by memoizing previous queries as much as possible.
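The interplay between budget accounting and memoization can be sketched as follows. This is a simplified illustration, not how Sarus does it: it uses basic sequential composition (epsilons simply add up), which is looser than the PLD accounting mentioned above, and the class name, query keys, and the stand-in `noisy_count` mechanism are all hypothetical.

```python
import random

class BudgetedQueryCache:
    """Tracks spent epsilon and reuses previously released answers."""

    def __init__(self, total_epsilon):
        self.total_epsilon = total_epsilon
        self.spent = 0.0
        self._cache = {}

    def run(self, query_id, epsilon, mechanism):
        key = (query_id, epsilon)
        if key in self._cache:
            # Re-releasing an already published output is post-processing:
            # it reveals nothing new, so it costs no extra budget.
            return self._cache[key]
        if self.spent + epsilon > self.total_epsilon + 1e-12:
            raise RuntimeError("privacy budget exhausted")
        result = mechanism(epsilon)
        self.spent += epsilon  # basic sequential composition
        self._cache[key] = result
        return result

rng = random.Random(0)

def noisy_count(epsilon):
    # Stand-in for a differentially private query on the real data.
    true_count = 128  # made-up value
    scale = 1.0 / epsilon
    return true_count + rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

budget = BudgetedQueryCache(total_epsilon=1.0)
first = budget.run("count_users", 0.4, noisy_count)
again = budget.run("count_users", 0.4, noisy_count)  # served from the cache
print(first == again, budget.spent)
```

Without the cache, asking the same question twice would spend budget twice and let an analyst average away the noise. With it, repeated queries are free and the remaining budget goes to new questions.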

Wait, but the first thing data scientists do is check out the data. How do I do that now? Not a problem: by default, the API provides synthetic data samples with the same schema and statistical distribution. This effectively replaces the need to see any record, and data scientists can still do feature engineering and test and debug their code with it. Of course, synthetic data is not something you would want to build insights or ML models on; you’d use the API to do that on the original data.

How it works: the app is deployed in the client’s cloud infrastructure (any cloud vendor is compatible). The data admin lists relevant data sources from the UI or the API, and grants learning access to practitioners by applying a privacy policy chosen from predefined templates. The synthetic data sample is generated automatically. From there, data scientists can run their analyses with their usual tools (pandas, numpy, TF, scikit-learn, Metabase, Redash, Tableau…), through either a Python SDK or a hiveSQL connector.

Curious? We have released a self-serve demo for you to try it out. It lets you make a dataset available through the Sarus proxy, set up access policies and then, as a data practitioner, use it for analytics and machine learning. It is limited to a handful of datasets but should give you a good understanding of Sarus. You can sign up at https://demo.sarus.tech/signup and begin using Sarus for free, no credit card required (tutorial at https://www.sarus.tech/post/we-just-released-an-open-demo-tr...).

Our model is a software license to run on our clients’ cloud. Our pricing is on a per-dataset per-month basis and starts at $600/month.

Please let us know what you think! We look forward to hearing your questions, feedback, ideas, and experience!
