Show HN: Run any Llama model finetune and more, instantly

featherless.ai

7 points by pico_creator 4 days ago

Hi there, looking for feedback on my new project "Featherless.AI"

The idea is to let users run any model on Hugging Face instantly, via an OpenAI-compatible API endpoint.
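Because the endpoint is OpenAI-compatible, an existing OpenAI client should work by pointing it at the Featherless base URL. A minimal sketch of what such a request body looks like — the base URL and model id here are illustrative assumptions, not confirmed values:

```python
import json

# Hypothetical base URL -- check the Featherless docs for the real one.
BASE_URL = "https://api.featherless.ai/v1"

# An OpenAI-style chat completion request body; any OpenAI SDK
# pointed at BASE_URL would send a payload shaped like this.
payload = {
    "model": "meta-llama/Meta-Llama-3-8B-Instruct",  # any supported HF model id
    "messages": [
        {"role": "user", "content": "Hello! Which model are you?"},
    ],
    "max_tokens": 64,
}

body = json.dumps(payload)
```

The point of API compatibility is that you only change the base URL and model name in code you already have.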

Why? Because it's a real chore to download models and spin up GPUs, especially if you want to test multiple models. Not to mention GPUs cost multiple dollars an hour to rent. And if we want more people to use open source AI, we've got to make it easier for them to try and play with all of them.

So what if, instead of spinning up dedicated GPUs per model (which is what every provider does), we start up an LLM server that multiple users can share, and hot-swap the model it's serving in seconds? Makes sense, right? That's what we do for apps, like shared-CPU PHP hosting.
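The hot-swap idea can be pictured as a server that holds exactly one model at a time and replaces it only when a request needs a different one. This is a toy illustration of the concept, not the actual Featherless internals:

```python
class HotSwapServer:
    """Toy model of a GPU server that serves one model at a time."""

    def __init__(self):
        self.loaded_model = None  # name of the model currently "in GPU memory"

    def serve(self, model_name, prompt):
        # If a different model is requested, swap it in first.
        if self.loaded_model != model_name:
            self._swap(model_name)
        return f"[{self.loaded_model}] response to: {prompt}"

    def _swap(self, model_name):
        # In the real system this is the hard part: streaming GBs of
        # weights from fast storage into the GPU in seconds.
        self.loaded_model = model_name


server = HotSwapServer()
a = server.serve("llama-8b", "hi")
b = server.serve("rwkv-7b", "hi")  # triggers a hot swap
```

The economics follow from this shape: one pool of servers is shared across many models, instead of one idle GPU per model.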

---

Turns out that's an incredibly complex problem, one that even Hugging Face, Together AI, and various other companies have failed to successfully build. Swapping GBs of data from disk to GPU in sub-seconds is hard, and it involves optimizing every small thing in the chain:

- had to set up and tune a high-speed storage cluster

- find a GPU provider with very fast networking that would allow scaling up/down

- write a custom pipeline for streaming data from the high-speed storage cluster

- write custom code to stream that data straight into the GPU as fast as possible
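The streaming steps above can be imagined as reading weights in large chunks and pushing each chunk toward the device as soon as it arrives, rather than downloading the whole file first. A simplified stand-in, with a `bytearray` playing the role of GPU memory (the real pipeline deals with pinned buffers, NICs, and CUDA transfers):

```python
import io

CHUNK = 1 << 20  # 1 MiB chunks; a real pipeline tunes this to storage/NIC speed

def stream_to_device(src, device_buf):
    """Copy `src` into `device_buf` chunk by chunk, a stand-in for
    streaming weights from a storage cluster straight into GPU memory."""
    offset = 0
    while True:
        chunk = src.read(CHUNK)
        if not chunk:
            break
        device_buf[offset:offset + len(chunk)] = chunk  # "upload" this chunk
        offset += len(chunk)
    return offset

weights = b"\x01" * (3 * CHUNK + 123)   # fake model weights
gpu = bytearray(len(weights))           # stand-in for GPU memory
copied = stream_to_device(io.BytesIO(weights), gpu)
```

The chunked loop is why every link in the chain matters: the slowest stage (disk, network, or PCIe) caps the whole swap time.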

After all that, you now have a GPU server that can serve one model at a time, but can switch models quickly. So to make this idea economical (since I'm not billing anyone $8/hour):

- you need to set up a cluster to handle multiple requests concurrently

- while writing routing code to ensure all requests for the same model go to the same server

- unless that server hits its limit, in which case you need to load-balance across the servers assigned to that model

- and back off the load balancing so you can free up servers

- to be hot-swapped for other requests
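The routing rules above — per-model affinity, spillover on saturation, and backoff to free idle servers — can be sketched in a few lines. This is a minimal illustration under assumed names and a made-up per-server capacity, not the production router:

```python
from collections import defaultdict

class ModelRouter:
    """Toy router: requests for a model stick to one server until it is
    saturated, then spill over; servers that go idle are released."""

    def __init__(self, capacity_per_server=2):
        self.capacity = capacity_per_server
        self.assigned = defaultdict(list)   # model -> server ids serving it
        self.load = defaultdict(int)        # server id -> in-flight requests
        self.free_servers = ["gpu-0", "gpu-1", "gpu-2"]

    def route(self, model):
        # Affinity: prefer an already-assigned server with spare capacity.
        for server in self.assigned[model]:
            if self.load[server] < self.capacity:
                self.load[server] += 1
                return server
        # Spillover: claim a free server for this model.
        server = self.free_servers.pop(0)
        self.assigned[model].append(server)
        self.load[server] += 1
        return server

    def finish(self, model, server):
        # Backoff: once a server goes idle, free it for hot-swapping.
        self.load[server] -= 1
        if self.load[server] == 0:
            self.assigned[model].remove(server)
            self.free_servers.append(server)

router = ModelRouter()
s1 = router.route("llama-8b")
s2 = router.route("llama-8b")  # same server, still under capacity
s3 = router.route("llama-8b")  # capacity hit, spills to a new server
```

Keeping requests for one model pinned to as few servers as possible is what keeps hot swaps rare and the GPU pool small.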

Oh, also: downloading 450+ models apparently takes up many TBs of expensive high-speed clustered storage.

But the end result is infrastructure that dynamically scales (up or down) to the exact cluster workload, for a large collection of HF models (more models than GPUs, of course).

All so that people can use "all the open source AI models" for a few dollars a month, without worrying about token pricing.

So do give it a try: the free trial account can test any 8B model, and a subscribed account has full OpenAI API access. And give us some feedback, and maybe even a Product Hunt vote!

PS: This is a work in progress. I have yet to rewrite the custom pipeline code for non-llama / non-RWKV models, which is why we are starting with only those two architectures. But we do plan to scale to ALL public models on Hugging Face.

Also, downloading the other 1000+ llama models does take a long time.

jharohit 4 days ago

Love it man! UX looks good. Why only Llama? How about other new models or architectures?

  • pico_creator 4 days ago

    Thanks!

We had to rewrite a lot of code to optimize for hot-swap load speeds, so we prioritized llama, as it's the most popular group on Hugging Face, along with RWKV (which is what we work on in the open source space).

But other architectures are coming within a week or two. Up next are probably the Mixtral MoE models.

    We will keep adding until we add ALL the architectures and models =)