Deploying Open-Source LLMs As APIs

Open-source LLMs are all the rage, along with concerns about data privacy with closed-source LLM APIs. This tutorial walks through how to deploy your own open-source LLM API using Hugging Face and AWS.

Skanda Vivek
Jul 9


Super intelligent AI llama prompted by author, generated using Leonardo.AI

While ChatGPT and GPT-4 have taken the world of AI by storm over the last half year, open-source models are catching up, slowly but surely. There is still a lot of ground to cover to reach the performance of OpenAI's models. In many cases, ChatGPT and GPT-4 are clear winners due to their quality and competitive pricing.

That said, open-source models will always have advantages over closed APIs like ChatGPT/GPT-4 for certain business cases. I have spoken with folks in industries like legal, healthcare, and finance who have concerns over data and customer privacy. These companies would rather spend thousands of dollars a month (or more) to run open-source models on their own cloud instances (think AWS, Google Cloud, Azure) than send data through OpenAI APIs that are used by everyone. These folks understand that, right now, open-source LLMs might not perform as well as ChatGPT/GPT-4, and may end up being 10x more expensive due to the costs involved in training, deploying, and hosting models with tens or hundreds of billions of parameters. But they are willing either to test open-source models on their use cases now or to wait a few more months for open-source models to catch up with closed-source ones, and they don’t mind being set back a few months.

If you are concerned about potential data-sharing issues with closed-source APIs, or just want to understand how to use open-source LLMs and make them available to users, this article is for you.

Let’s dive in!

Deploying Hugging Face LLMs On AWS SageMaker

Over the past few months, there has been a boom in open-source LLMs, with a number of them approaching ChatGPT quality. For this example, I’m going to walk through deploying a relatively small LLM with 7 billion parameters: Dolly, released by Databricks. According to Databricks, Dolly was the world’s first open instruction-tuned LLM.
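
To get a feel for what "relatively small" means when choosing hardware, it can help to estimate how much memory the weights alone take up. The snippet below is a back-of-the-envelope sketch, not a measurement. It multiplies the parameter count by the bytes stored per parameter and ignores activations, the attention cache, and framework overhead. The helper name `weight_memory_gb` and the precision table are illustrative assumptions, not part of any library.

```python
# Rough estimate of the memory needed just to hold a model's weights.
# Assumption: every parameter is stored at the same numeric precision.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_gb(num_params: float, dtype: str = "float16") -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    # 7 billion parameters, as for the model used in this walkthrough.
    for dtype in BYTES_PER_PARAM:
        size = weight_memory_gb(7e9, dtype)
        print(f"7B parameters in {dtype}: ~{size:.0f} GB")
```

At 16-bit precision, the weights of a 7-billion-parameter model come to roughly 14 GB, so the GPU instance you pick needs more memory than that once inference overhead is added on top.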