Flower (YC W23) – Train AI models on distributed or sensitive data

Hey HN - we're Daniel, Taner, and Nic, and we're building Flower (https://flower.dev/), an open-source framework for training AI on distributed data. We move the model to the data instead of moving the data to the model. This enables regulatory compliance (e.g. HIPAA) and ML use cases that are otherwise impossible. Our GitHub is at https://github.com/adap/flower, and we have a tutorial here: https://flower.dev/docs/tutorial/Flower-0-What-is-FL.html.

Flower lets you train ML models on data that is distributed across many user devices or “silos” (separate data sources) without having to move the data. This approach is called federated learning.

A silo can be anything from a single user device to the data of an entire organization. For example, your smartphone keyboard suggestions and auto-corrections can be driven by a personalized ML model learned from your own private keyboard data, as well as data from other smartphone users, without the data being transferred from anyone’s device.

Most of the famous AI breakthroughs—from ChatGPT and Google Translate to DALL·E and Stable Diffusion—were trained with public data from the web. When the data is all public, you can collect it in a central place for training. This “move the data to the computation” approach fails when the data is sensitive or distributed across organizational silos and user devices.

Many important use cases are affected by this limitation:

* Generative AI: Many scenarios require sensitive data that users or organizations are reluctant to upload to the cloud. For example, users might want to put themselves and friends into AI-generated images, but they don't want to upload and share all their photos.

* Healthcare: We could potentially train cancer detection models better than any doctor, but no single organization has enough data.

* Finance: Preventing financial fraud is hard because individual banks are subject to data regulations, and in isolation, they don't have enough fraud cases to train good models.

* Automotive: Autonomous driving would be awesome, but individual car makers struggle to gather the data to cover the long tail of possible edge cases.

* Personal computing: Users don't want certain kinds of data to be stored in the cloud, hence the recent success of privacy-enhancing alternatives like the Signal messenger or the Brave browser. Federated methods open the door to using sensitive data from personal devices while maintaining user privacy.

* Foundation models: These get better with more, and more diverse, data to train on. But again, most data is sensitive and thus can't be incorporated, even as these models keep growing bigger and needing more data.

Each of us has worked on ML projects in various settings (e.g., corporate environments, open-source projects, research labs). We’ve worked on AI use cases for companies like Samsung, Microsoft, Porsche, and Mercedes-Benz. One of our biggest challenges was getting the data to train AI while staying compliant with regulations or company policies. Sometimes this was due to legal or organizational restrictions; other times, the difficulty was physically moving large quantities of data, or natural concerns over user privacy. We realized issues of this kind were making it too difficult for many ML projects to get off the ground, especially in domains like healthcare and finance.

Federated learning offers an alternative — it doesn't require moving data in order to train models on it, and so has the potential to overcome many barriers for ML projects.

In early 2020, we began developing the open-source Flower framework to simplify federated learning and make it user-friendly. Last year, we saw a surge in Flower's adoption among industry users, which led us to apply to YC. In the past, we funded our work through consulting projects; looking ahead, we’re going to offer a managed version for enterprises and charge per deployment or federation. At the same time, we’ll continue to run Flower as an open-source project that everyone can use and contribute to.

Federated learning can train AI models on distributed and sensitive data by moving the training to the data. Only model updates leave each device or silo; the data itself stays where it is. Because the data never moves, we can train AI on sensitive data spread across organizational silos and user devices, improving models with data that could never be leveraged before.

Here’s how it works:

(0) Initialize the global model parameters on the server.

(1) Send the model parameters to a number of organizations/devices (client nodes).

(2) Train the model locally on the data of each organization/device (client node).

(3) Return the updated model parameters to the server.

(4) On the server, aggregate the model updates (e.g., by averaging them) into a new global model.

(5) Repeat steps 1 to 4 until the model converges.
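To make the loop concrete, here is a minimal, framework-agnostic sketch of these steps in plain NumPy, simulating federated averaging (FedAvg) across a handful of clients. The linear model, synthetic data, and hyperparameters are illustrative assumptions, not part of Flower:

    import numpy as np

    rng = np.random.default_rng(42)
    true_w = np.array([2.0, -1.0, 0.5])

    # Step 0: initialize the global model parameters on the "server".
    global_w = np.zeros(3)

    # Each client holds a private data shard that never leaves the client.
    clients = []
    for _ in range(5):
        X = rng.normal(size=(200, 3))
        y = X @ true_w + 0.1 * rng.normal(size=200)
        clients.append((X, y))

    def local_train(w, X, y, lr=0.05, epochs=5):
        # Step 2: train locally on one client's data (plain gradient descent here).
        w = w.copy()
        for _ in range(epochs):
            grad = 2 * X.T @ (X @ w - y) / len(y)
            w -= lr * grad
        return w

    for round_num in range(10):
        # Step 1: send the global parameters to the client nodes.
        # Step 3: each client sends back its updated parameters.
        updates = [local_train(global_w, X, y) for X, y in clients]
        # Step 4: aggregate the updates into a new global model (simple average).
        global_w = np.mean(updates, axis=0)
        # Step 5: repeat steps 1 to 4 until convergence.

    print("learned:", np.round(global_w, 2), "true:", true_w)

Note that only parameter vectors cross the client/server boundary; the (X, y) shards are never shared.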

This, of course, is more challenging than centralized learning: we must move AI models to data silos or user devices, train locally, send updated models back, aggregate them, and repeat. Flower provides the open-source infrastructure to do this easily, along with support for other privacy-enhancing technologies (PETs). It is compatible with PyTorch, TensorFlow, JAX, Hugging Face, Fastai, Weights & Biases, and the other tools regularly used in ML projects. The only server-side dependency is NumPy, and even that can be dropped if necessary. Flower uses gRPC under the hood, so a basic client can easily be auto-generated, even for languages that are not yet supported.
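As a rough illustration, here is a sketch of what a Flower client can look like, based on the NumPyClient interface in Flower 1.x (exact names may differ between versions). The linear model, synthetic data, and server address are assumptions made up for this example:

    import flwr as fl
    import numpy as np

    # This client's private data; in a real federation, each node has its own shard.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))
    y = X @ np.arange(1.0, 6.0) + 0.1 * rng.normal(size=100)

    class LinearClient(fl.client.NumPyClient):
        def __init__(self):
            self.w = np.zeros(5)  # local copy of the model parameters

        def get_parameters(self, config):
            return [self.w]  # parameters travel as a list of NumPy ndarrays

        def fit(self, parameters, config):
            self.w = parameters[0].copy()  # receive the current global model
            for _ in range(10):  # a few local gradient steps
                grad = 2 * X.T @ (X @ self.w - y) / len(y)
                self.w -= 0.01 * grad
            return [self.w], len(y), {}  # send back updated parameters, not data

        def evaluate(self, parameters, config):
            loss = float(np.mean((X @ parameters[0] - y) ** 2))
            return loss, len(y), {}

    # Connect to a running Flower server, e.g. one started with:
    #   fl.server.start_server(config=fl.server.ServerConfig(num_rounds=3))
    fl.client.start_numpy_client(server_address="127.0.0.1:8080", client=LinearClient())

The server coordinates the rounds and aggregates the updates (FedAvg by default); the client only ever transmits parameter arrays.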

Flower is open-source (Apache 2.0 license) and runs in all kinds of environments: on a personal workstation for development and simulation, on Google Colab, on a compute cluster for large-scale simulations, on a cluster of Raspberry Pis (or similar devices) for building research systems, or deployed on public cloud instances (AWS, Azure, GCP, and others) or private on-prem hardware. We are happy to help users deploy Flower systems, and will soon make this even easier through our managed cloud service.

You can find PyTorch example code here: https://flower.dev#examples, and more at https://github.com/adap/flower/tree/main/examples.

We believe that AI technology must evolve to be more collaborative, open and distributed than it is today (https://flower.dev/blog/2023-03-08-flower-labs/). We’re eager to hear your feedback and your experiences with difficulties in training, data access, data regulation, privacy, and anything else related to federated (or similar) learning methods!