A few months ago, I tried self-hosting an LLM on a low-end machine. While it is far from on par with the paid models in the cloud, a self-hosted LLM is still useful in certain situations.

Why not use services like ChatGPT? For me, it comes down to cost. In most cases you do not have to worry about the price when you use AI assistants for daily tasks (especially with Copilot, where you pay a fixed amount and use it as much as you like). Recently, however, I often find myself having to include a lot of data in prompts and send it to the LLM multiple times. This usually happens when you want a huge amount of custom knowledge in the context because the model was not trained on that data.

Keep in mind that every time you add a new message to an ongoing conversation, the whole conversation gets resubmitted to the LLM (the web server may remember the conversation, but the LLM itself is stateless). Therefore, a back-and-forth conversation with a big initial context costs a lot of tokens. It is not uncommon for the initial context alone to be tens of thousands of tokens, so the total for a long conversation quickly approaches 1 million tokens. OpenAI typically charges about $1 per 1 million tokens (depending on the model and some other factors). I know that is already super cheap, and a tiny amount of money for a corporation, but for personal use it is not always negligible ($1.5 can get you a really good meal where I live).
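
To make the arithmetic concrete, here is a rough back-of-the-envelope sketch in Python. The context size, per-turn size, and price are illustrative assumptions, not real figures from any provider:

```python
# Rough estimate of cumulative input tokens for a chat with a large
# initial context. All figures below are illustrative assumptions.

INITIAL_CONTEXT = 30_000   # tokens of custom knowledge pasted up front
TOKENS_PER_TURN = 500      # rough size of one question + one answer
PRICE_PER_1M = 1.00        # assumed price in $ per 1M input tokens

def total_input_tokens(turns: int) -> int:
    # Every turn resubmits the initial context plus all prior turns,
    # because the LLM itself keeps no memory between requests.
    return sum(INITIAL_CONTEXT + TOKENS_PER_TURN * t for t in range(1, turns + 1))

for turns in (5, 10, 20):
    tokens = total_input_tokens(turns)
    print(f"{turns:>2} turns -> {tokens:>9,} tokens (~${tokens / 1e6 * PRICE_PER_1M:.2f})")
```

Twenty turns on a 30k-token context is already around 700k input tokens; a slightly bigger context or a longer chat pushes past the million mark.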

I decided to try hosting the LLaMA model on my potato PC: a machine with 8GB of RAM and no GPU, running Meta-LLaMA-3.1-8B. The LLM spat out only about 2 words per second, which is unacceptably slow for things like chatbots in commercial apps. However, it does serve the following purposes:
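
For reference, here is a minimal sketch of one way to run such a model on CPU, using llama-cpp-python with a quantized GGUF build (the file name and parameters are illustrative assumptions, not my exact setup):

```python
# Minimal CPU-only inference sketch with llama-cpp-python.
# An 8B model needs an aggressively quantized (~4-bit) GGUF file to fit in 8GB RAM.
from llama_cpp import Llama

llm = Llama(
    model_path="Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf",  # hypothetical local file
    n_ctx=4096,    # context window; larger values cost more RAM
    n_threads=4,   # match your physical CPU core count
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain tokenization in two sentences."}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```

Quantization is the trade that makes this possible at all: a 4-bit build fits in memory and runs on a CPU, at the cost of some quality and, obviously, speed.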

  • Experiments: Sometimes you call the LLM programmatically from a script whose token usage is not fully under your control and may be higher than expected. Because of that programmatic nature, a buggy or greedy script can use up your entire token budget in a short amount of time.
  • Non-real-time tasks: When you are not building something interactive like a chatbot. In one of my personal projects, I needed to generate text from a huge amount of input text, and I would not mind if the task took a day or two to complete (see the sketch after this list).
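
For the second kind of task, the whole pipeline can be a simple, resumable batch loop. Below is a hypothetical sketch (the directory names, model file, and prompt are made up) that skips files it has already processed, so it can run for days and survive restarts:

```python
# Resumable batch job: run the slow local model over many input files,
# skipping anything already finished in a previous run.
# Paths, model file, and the prompt template are illustrative assumptions.
from pathlib import Path

from llama_cpp import Llama

llm = Llama(model_path="Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf", n_ctx=4096)

in_dir, out_dir = Path("inputs"), Path("outputs")
out_dir.mkdir(exist_ok=True)

for src in sorted(in_dir.glob("*.txt")):
    dst = out_dir / src.name
    if dst.exists():  # checkpoint: this file was done in an earlier run
        continue
    result = llm.create_chat_completion(
        messages=[{"role": "user", "content": f"Rewrite the following text:\n\n{src.read_text()}"}],
        max_tokens=512,
    )
    dst.write_text(result["choices"][0]["message"]["content"])
    print(f"done: {src.name}")
```

At 2 words per second the loop is glacial, but since each output lands on disk as soon as it is ready, nothing is lost if the machine reboots halfway through.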

Self-hosted LLMs still have their place for experiments and non-real-time tasks where speed is not a concern, and despite its slowness, that is exactly where mine has been useful.


UPDATE (04/02/2025): DeepSeek has just debuted a model which is 10x cheaper than OpenAI's, addressing the main reason I went self-hosted in the first place. I am considering giving it a shot.