A Developer’s Guide to Running Local Open-Source LLMs and Planning Capacity
A complete primer for developers moving from SaaS APIs like OpenAI to running open-source LLMs locally and in the cloud. Learn what models your MacBook can handle, how to size for RAG pipelines, and how GPU servers change the economics.

From MacBook Pro to GPU Servers: A Developer’s Guide to Running LLMs and Planning Capacity
Introduction
If you’ve used OpenAI’s ChatGPT API, you know the convenience: you send a prompt, they handle the hardware. But what if you want to build your own retrieval-augmented generation (RAG) pipeline with open-source models? Suddenly, you need to think about model sizes, memory footprints, context lengths, and hardware costs.
This article is a complete primer for developers who want to move from SaaS APIs to self-hosted LLMs. We’ll start small—what you can run on a MacBook Pro with 36 GB unified memory and 14 CPU cores—and scale up to cloud GPU servers. Along the way, we’ll cover capacity planning, costs, and performance trade-offs.
What Your MacBook Pro Can Realistically Handle
Specs Recap
- Unified Memory: 36 GB (shared CPU + GPU)
- CPU Cores: 14
- SSD: 1 TB
Model Sizes You Can Run
- 7B–13B models (Best Fit)
  - Examples: LLaMA-2-7B, Mistral-7B, Gemma-7B, Phi-3-mini
  - Memory footprint: 4–8 GB in 4-bit quantization
  - Performance: 10–30 tokens/sec on Apple Silicon
  - Use cases: Chatbots, document summarization, POC RAG
- 30B models (Pushing the Limits)
  - Examples: LLaMA-30B, Code Llama 34B
  - Memory footprint: ~20–25 GB in 4-bit
  - Performance: 1–5 tokens/sec
  - Use cases: Experimentation only (too slow for interactive apps)
- 65B–120B models (Not Feasible)
  - Even quantized weights exceed 36 GB of memory
  - Swapping to SSD kills performance
How to Start a POC on a MacBook
For a step-by-step walkthrough, follow the article "Build Your Own Quiz Master AI: A Complete Beginner's Guide to Local RAG Pipelines", or start from the minimal sketch below.
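If you just want to see tokens streaming before committing to a full pipeline, something like the following is enough to confirm your Mac can serve a 7B model. This sketch assumes llama-cpp-python is installed and that you have already downloaded a 4-bit GGUF build of Mistral-7B; the file path is a placeholder.

```python
# Minimal sketch: run a 4-bit quantized 7B model on Apple Silicon
# with llama-cpp-python. The GGUF path is a placeholder -- download
# any 4-bit build of Mistral-7B or LLaMA-2-7B first.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/mistral-7b-instruct-q4_k_m.gguf",  # placeholder path
    n_ctx=4096,        # context window -- this drives KV cache size (see below)
    n_gpu_layers=-1,   # offload all layers to the Metal GPU
)

out = llm(
    "Summarize retrieval-augmented generation in two sentences.",
    max_tokens=128,
)
print(out["choices"][0]["text"])
```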
Understanding Capacity Planning
When running locally or in the cloud, capacity planning boils down to understanding three key components that drive memory and cost: the model weights, the KV cache, and user concurrency.
- Weights Storage = parameters × precision
  Each parameter in a model is a number, and the precision (FP16, INT8, 4-bit) determines how many bytes each number takes. Multiply bytes per parameter by the total parameter count to get the raw weight storage.
  - Example: 7B parameters × 0.5 bytes (4-bit) ≈ 3.5 GB of weights.
  - A 30B model at 4-bit: 30B × 0.5 bytes ≈ 15 GB of weights.
  - This is the baseline cost of just loading the model.
- KV Cache Growth ≈ 2 × layers × context length × hidden size × precision
  For every input token, the model stores a key vector and a value vector in each transformer layer (split across its attention heads); the factor of 2 accounts for keys and values. These must be kept in memory for efficient next-token prediction, so the longer the context, the more tokens and the larger the KV cache.
  - For a 7B model, the hidden size and layer count are relatively small, so the cache overhead is modest (~1–2 GB extra for a 4k context).
  - For a 30B model, with more layers and wider hidden states, the KV cache balloons (~5–8 GB extra for the same context).
  - The cache grows linearly with context length (double the context, double the cache). This is why long-context applications need much more GPU memory than the raw weight size suggests.
- Batch Size and Concurrency
  Each concurrent request (or each sequence in a batch) needs its own KV cache. One user is trivial, but 100 users with long contexts means roughly 100× the KV storage. This is why serving many users at once is far more expensive than a single local session.
Rule: Always add 20–30% headroom beyond weight storage to account for the KV cache, activation buffers, and framework overhead. Without it, you’ll run out of memory under real workloads even if the weights themselves fit. The sketch below turns these formulas into a quick calculator.
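Here is that back-of-the-envelope calculator in Python. The layer count and hidden size below are representative of a 7B LLaMA-style architecture rather than exact figures for any particular checkpoint, and the KV cache is assumed to be stored in FP16.

```python
# Back-of-the-envelope memory estimate: weights + KV cache + headroom.
# Layer count and hidden size are representative values, not exact
# figures for any particular checkpoint.

def weight_gb(params_billion: float, bytes_per_param: float) -> float:
    """Raw weight storage: parameters x precision."""
    return params_billion * 1e9 * bytes_per_param / 1e9

def kv_cache_gb(context_len: int, layers: int, hidden: int,
                bytes_per_value: float, concurrent_requests: int = 1) -> float:
    """KV cache: 2 (keys and values) x layers x tokens x hidden size x precision,
    multiplied by the number of concurrent requests."""
    per_request = 2 * layers * context_len * hidden * bytes_per_value
    return per_request * concurrent_requests / 1e9

# A 7B-class model: 4-bit weights, FP16 KV cache, 4k context, 2 concurrent users
weights = weight_gb(7, 0.5)                   # ~3.5 GB
cache = kv_cache_gb(4096, 32, 4096, 2, 2)     # ~4.3 GB for two requests
total = (weights + cache) * 1.3               # 30% headroom per the rule above
print(f"weights={weights:.1f} GB, kv={cache:.1f} GB, plan for ~{total:.1f} GB")
```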
Planning for a Local RAG Pipeline
- Indexer Memory Needs: Vector DBs like Chroma or FAISS run fine in 2–4 GB RAM for small datasets.
- Model Inference: Choose a 7B model for responsive performance.
- Context Planning: If your RAG requires long contexts (4k+ tokens), ensure you have headroom. On a 7B model, this is ~2 GB extra memory per request.
Example capacity plan on MacBook (36 GB RAM):
- Model (Mistral-7B, 4-bit): ~5 GB
- Cache for 4k context: ~2 GB per request
- Vector DB + App overhead: ~5–8 GB
- Total: ~12–15 GB used → safe within 36 GB
This means you can run 1–2 concurrent RAG queries comfortably on your Mac.
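For the retrieval side of that plan, the moving parts are small. The sketch below assumes chromadb is installed; the collection name, documents, and query are illustrative. The point to notice is that the number and size of retrieved chunks determine your prompt length, which is what drives the ~2 GB-per-request cache figure above.

```python
# Minimal sketch of the retrieval side of the plan: a local Chroma
# collection plus a prompt assembled from the retrieved chunks.
# Collection name, documents, and query are illustrative.
import chromadb

client = chromadb.PersistentClient(path="./rag_index")   # runs in a few GB of RAM
collection = client.get_or_create_collection("docs")

# In a real pipeline these chunks come from your document loader/splitter.
collection.upsert(
    ids=["chunk-1", "chunk-2"],
    documents=[
        "KV cache grows linearly with context length.",
        "A 4-bit 7B model needs roughly 4-5 GB for its weights.",
    ],
)

question = "How do I size the KV cache for a 7B model?"
hits = collection.query(query_texts=[question], n_results=2)

# Each retrieved chunk lands in the prompt, so retrieval depth directly
# drives context length -- and therefore the per-request cache budget.
context = "\n\n".join(hits["documents"][0])
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
print(prompt)
```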
Moving to Cloud GPUs
At some point, you’ll outgrow your Mac. For more users or larger models:
Common Cloud Options
- Single GPU: A100 40 GB or L4 (for 7B–13B models)
- Dual/Quad GPUs: A100 80 GB × 2–4 (for 30B–70B models)
- Specialized Inference: H100, RTX 6000 Ada for enterprise-grade serving
How Cloud GPUs Improve Performance
- Throughput: Batch 10–100 requests simultaneously
- Longer Contexts: Handle 8k–32k tokens
- Scalability: Add GPUs for more users
Cost Implications
- 7B model on a MacBook: Free, but limited.
- 7B model on A100 GPU: ~$1–2/hour depending on provider.
- 30B–70B model on multi-GPU cluster: $10–20/hour.
- 120B model on 4×H100: $30–50/hour.
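To compare these hourly rates with per-token SaaS pricing, it helps to convert them into a cost per million generated tokens. The throughput figure below is a placeholder; substitute numbers from your own benchmarks.

```python
# Rough conversion from a GPU hourly rate to cost per million output tokens.
# Throughput is a placeholder -- measure your own batched serving rate.
def cost_per_million_tokens(hourly_rate_usd: float, tokens_per_sec: float) -> float:
    tokens_per_hour = tokens_per_sec * 3600
    return hourly_rate_usd / tokens_per_hour * 1e6

# e.g. a 7B model on a single A100 at ~$1.50/hr, batched to ~1000 tok/s overall
print(f"${cost_per_million_tokens(1.50, 1000):.2f} per 1M tokens")  # ~ $0.42
```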
Conclusion
If you’re a developer starting with OpenAI SaaS APIs, moving to open-source LLMs is a natural step for control and cost savings. Your MacBook Pro with 36 GB RAM is a perfect sandbox for 7B–13B models and small RAG pipelines. It lets you validate workflows, measure context needs, and plan scaling.
When you’re ready for more users, longer contexts, or bigger models, cloud GPUs unlock performance at a cost. By understanding how model size, quantization, and cache overhead translate into memory and dollars, you can plan capacity wisely.