‘Hot-swapping’ lets you serve different models on the same GPU machine with only ~2 second swap times (~150x faster than baseline). You can see this in action in a live demo, where you can try the same prompts on different open source large language models, at https://hotswap.outerport.com; the docs are at https://docs.outerport.com.
Running AI models in the cloud is expensive. Outerport came out of our own experience building AI services and struggling with the cost.
Cloud GPUs are billed by time used. Long start-up times (from loading models into GPU memory) mean that, to serve requests quickly, we need to keep extra GPUs with models pre-loaded as spare capacity (i.e. ‘overprovision’). The time spent loading models also adds to the cost. Both lead to inefficient use of expensive hardware.
The long start-up times are caused by how massive modern AI models are, particularly large language models. These models are often several gigabytes to terabytes in size. Their sizes continue to grow as models evolve, exacerbating the issue.
GPU capacity also needs to adapt dynamically to demand, which complicates things further. Starting up a new GPU machine takes time, and so does transferring a large model to it.
Traditional container-based solutions and orchestration systems (like Docker and Kubernetes) are not optimized for these large, storage-intensive AI models; they are designed for smaller, more numerous containerized applications (usually 50MB to 1GB in size). What is needed is a system designed specifically for model weights (floating-point arrays) running on GPUs, one that can take advantage of layer sharing, caching, and compression.
We built Outerport, a specialized system for managing and deploying AI models, to address these problems and cut GPU costs.
Outerport is a caching system for model weights: read-only models are cached in pinned RAM for fast loading into GPU memory. The cache is also hierarchical, spanning S3, local SSD, RAM, and GPU memory, and is optimized to reduce data transfer costs and balance load.
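To make the pinned-RAM layer concrete, here is a minimal PyTorch sketch of the general technique (not Outerport's implementation; the function names and checkpoint path are ours): weights staged in page-locked host memory can be copied to the GPU asynchronously, skipping the disk read on every load.

    # A sketch of the idea, assuming PyTorch and a local checkpoint file.
    import torch

    def stage_in_pinned_ram(checkpoint_path):
        # Read the checkpoint from SSD once and pin each tensor in host RAM.
        state_dict = torch.load(checkpoint_path, map_location="cpu")
        return {name: t.pin_memory() for name, t in state_dict.items()}

    def load_to_gpu(pinned_weights):
        # Pinned (page-locked) memory allows fast, asynchronous host-to-device copies.
        return {name: t.to("cuda", non_blocking=True) for name, t in pinned_weights.items()}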
Within Outerport, models are managed by a dedicated daemon process that handles transfers to the GPU, loads models from the registry, and orchestrates the ‘hot swapping’ of multiple models on one machine.
‘Hot-swapping’ lets you provision a single GPU machine to be ‘multi-tenant’, so that multiple services backed by different models can run on the same machine. For example, this makes it possible to A/B test two different models, or to serve a text generation endpoint and an image generation endpoint from the same machine.
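As a rough illustration of what hot-swapping looks like from the serving side (a toy sketch under our own assumptions, not Outerport's API; the model names and checkpoint paths are made up), the inactive model's weights stay parked in pinned RAM so that a swap is a fast host-to-device copy rather than a cold load from disk or S3:

    import torch

    class HotSwapCache:
        def __init__(self, checkpoints):
            # Stage every model's weights in pinned host RAM up front.
            self.pinned = {
                name: {k: t.pin_memory() for k, t in torch.load(path, map_location="cpu").items()}
                for name, path in checkpoints.items()
            }
            self.active_name = None
            self.active_weights = None

        def activate(self, name):
            # Swap the requested model onto the GPU only if it is not already resident.
            if name != self.active_name:
                self.active_weights = None  # drop the old model's GPU tensors for reuse
                self.active_weights = {
                    k: t.to("cuda", non_blocking=True) for k, t in self.pinned[name].items()
                }
                torch.cuda.synchronize()  # make sure the copies finished before serving
                self.active_name = name
            return self.active_weights

    # cache = HotSwapCache({"text": "llm.pt", "image": "diffusion.pt"})
    # weights = cache.activate("image")   # a host-to-device copy, not a cold start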
We have been running simulations to estimate the cost reductions from this multi-model service scheme compared with multiple single-model services. Our initial results show a roughly 40% reduction in GPU running-time costs. The improvement comes from the multi-model service's ability to smooth out traffic peaks, which enables more effective horizontal scaling: less time is wasted acquiring additional machines and loading models. Our hypothesis is that the savings are substantial enough to support a viable business while still saving customers a significant amount of money.
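For intuition about why pooling helps (a toy model with invented numbers, far simpler than our actual simulations): two services whose traffic peaks at different times need fewer provisioned GPU-hours when they share a pool than when each is provisioned for its own peak.

    import math

    # Hourly demand for two hypothetical services with offset peaks (in GPUs of work).
    def demand_a(h): return 4 + 3 * math.sin(2 * math.pi * h / 24)
    def demand_b(h): return 4 + 3 * math.sin(2 * math.pi * (h - 12) / 24)

    def gpus_needed(load, headroom):
        # 'headroom' models overprovisioning to cover slow cold starts.
        return math.ceil(headroom * load)

    dedicated = sum(gpus_needed(demand_a(h), 1.3) + gpus_needed(demand_b(h), 1.3) for h in range(24))
    shared = sum(gpus_needed(demand_a(h) + demand_b(h), 1.1) for h in range(24))
    print(f"dedicated fleets: {dedicated} GPU-hours vs shared pool: {shared} GPU-hours")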
We think there are lots of exciting directions to take from here, from more sophisticated compression algorithms to a central platform for model management and governance. Towaki worked on ML systems and model compression at NVIDIA, and Allen used to do research in operations research, so this problem excites us as something that combines both.
We’re super excited to share Outerport with you all. We also intend to release as much of this as possible under an open-core model when we’re ready. We would love to hear what you think, your experiences working on this or related problems, and any other ideas you might have!