I have been spending a lot of time messing with local LLMs and getting a feel for the architectural requirements for building LLM services. I can’t say I have done anything groundbreaking through this process, but it has been eye-opening to see how much there is to learn in this space. Right away, I was presented with dozens of terms I had no familiarity with and no basis for understanding. It felt similar to the early days of my career, wading through a bog of acronyms with nothing but a candle and a dream that one day I’d be able to see clearly enough to be useful. Since then, I feel like I have at least upgraded to a torch.
As with anything else, I took it one step at a time and committed to developing at least a baseline understanding of the keywords before diving in and trying to set anything up. I’ll give you the abridged version:
If a magic “do the math for me” button exists, I haven’t found it. An LLM of your choice can generally guide you through figuring out what your hardware will tolerate, but can you really trust it to give you accurate numbers? I had mixed results that eventually landed me in the ballpark.
Model performance is largely dependent on parameter count: a model with 14 billion parameters will generally outperform one with 3 billion, and flagship models run into the trillions. Quantization reduces the numerical precision of the weights, but it cuts memory requirements considerably. The common recommendation is to run the largest model (by parameter count) you can fit at 4-bit quantization. Not all 4-bit quantization is created equal, though. I went with a model quantized with 4-bit AWQ; that setup uses 15GB of VRAM. I was also able to run a slightly smaller model with GPTQ Int4 quantization on a GPU with 12GB of VRAM. The context windows of these models are 40k and 32k tokens respectively.
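The back-of-envelope math for whether the weights will fit is straightforward: memory in GB is roughly parameters (in billions) times bits per weight, divided by 8. A quick sketch with illustrative numbers (weights only; the KV cache and runtime overhead need room on top):

# Back-of-envelope weight memory: params (billions) * bits-per-weight / 8
# Illustrative numbers only -- KV cache and overhead are not included
awk 'BEGIN { printf "14B @ 4-bit: ~%.1f GB for weights alone\n", 14 * 4 / 8 }'
awk 'BEGIN { printf "8B @ 4-bit: ~%.1f GB for weights alone\n", 8 * 4 / 8 }'

By that math, an 8B model at 4-bit is only about 4GB of weights; the rest of the footprint comes from the KV cache and runtime overhead.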
Both models support longer context windows; however, I haven’t yet dug into what it would take to set that up. I would have to offload to system RAM in addition to the available VRAM, and that would slow things down significantly.
I decided to serve the models with vLLM because it is the most performant option. Their documentation is fairly straightforward, and the steps work as long as you’ve selected a model that will actually run on your hardware.
# Install uv (Python Package Manager)
curl -LsSf https://astral.sh/uv/install.sh | sh
# Create Virtual Environment & Install Dependencies
uv venv --python 3.13 --seed
source .venv/bin/activate # I alias this to 'activate'
uv pip install vllm --torch-backend=auto
# Start The LLM Server
uv run vllm serve Qwen/Qwen3-8B-AWQ --gpu-memory-utilization 0.9 --max-model-len 40960 --port 8084 --enable-auto-tool-choice --tool-call-parser hermes
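One thing worth knowing: vLLM pre-allocates GPU memory up front based on --gpu-memory-utilization, so the ~15GB figure above reflects that reservation (weights plus KV cache space), not just the model itself. Once the server is up, it exposes an OpenAI-compatible API; here’s a quick smoke test, using the model name and port from the command above:

# Smoke test the OpenAI-compatible endpoint
curl -s http://localhost:8084/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen3-8B-AWQ",
        "messages": [{"role": "user", "content": "Say hello."}]
      }'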
I’m not sure how much I trust the rankings, but OpenRouter has a pretty decent overview of which apps people are using to interact with LLMs. For my use case, Kilo Code and OpenClaw ended up being the best options. Both are relatively new to me, so I haven’t fully tuned them to my liking; I will most likely write a follow-up post with agent configurations and workflows. For now, I’m mostly using them to debug and automate web searches. Nothing mind-blowing, just another force multiplier.
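Pointing these clients at the local server mostly comes down to overriding the OpenAI base URL. The exact setting names vary by tool, but many OpenAI-compatible clients (and the official SDKs) respect these environment variables:

# Many OpenAI-compatible clients and SDKs read these; check each tool's docs
export OPENAI_BASE_URL="http://localhost:8084/v1"
export OPENAI_API_KEY="local"  # vLLM doesn't check the key unless started with --api-key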
I tried some other clients with mixed results. The only tool-call parser that worked consistently for me was ‘hermes’, but YMMV. MCP servers were pretty easy to set up. The Burp MCP server is just a Burp Suite extension; most of the others are Python scripts that run alongside whatever client you’re hosting them for.
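For the script-based servers, most MCP clients share the same JSON convention: you tell the client what command to launch, and it talks to the server over stdio. A minimal sketch, assuming a client that uses the common mcpServers format (the mcp.json filename and location are illustrative and vary by client; mcp-server-fetch is one of the MCP project’s reference servers):

# Minimal MCP client config (the common "mcpServers" convention; file location varies by client)
cat > mcp.json <<'EOF'
{
  "mcpServers": {
    "fetch": { "command": "uvx", "args": ["mcp-server-fetch"] }
  }
}
EOF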
The only security advice I can give with OpenClaw is, “don’t be an idiot”. If it has read/write access to sensitive information, you’re probably an idiot. If you can access it over the open internet, you’re probably an idiot. If you’re even thinking about giving it access to your bank, you’re definitely an idiot. “Everything is fine as long as nothing goes wrong” shouldn’t be your threat model.
I like what I’m seeing out of this tech, but it’s the Wild West out here. No authentication, access control, etc. I won’t name the vendor, but there are solutions out there allowing admins to arbitrarily read any/all messages between their users and LLMs without any kind of RBAC, logging, or monitoring in place. Now, IT admins have had similar access for a while, but there’s usually some kind of barrier beyond a platform login before you open that can of worms.
The only piece of advice I can give is to have fun, but be careful. This tech is currently in the stage where it will gladly let you delete System32 or, even worse, it’ll do it for you.