A Developer’s Guide to Running Local Open-Source LLMs and Planning Capacity
A complete primer for developers moving from SaaS APIs like OpenAI to running open-source LLMs locally and in the cloud. Learn what models your MacBook can handle, how to size for RAG pipelines, and how GPU servers change the economics.

From MacBook Pro to GPU Servers: A Developer’s Guide to Running LLMs and Planning Capacity
Introduction
If you’ve used OpenAI’s ChatGPT API, you know the convenience: you send a prompt, they handle the hardware. But what if you want to build your own retrieval-augmented generation (RAG) pipeline with open-source models? Suddenly, you need to think about model sizes, memory footprints, context lengths, and hardware costs.
This article is a complete primer for developers who want to move from SaaS APIs to self-hosted LLMs. We’ll start small—what you can run on a MacBook Pro with 36 GB unified memory and 14 CPU cores—and scale up to cloud GPU servers. Along the way, we’ll cover capacity planning, costs, and performance trade-offs.
What Your MacBook Pro Can Realistically Handle
Specs Recap
- Unified Memory: 36 GB (shared CPU + GPU)
- CPU Cores: 14
- SSD: 1 TB
Model Sizes You Can Run
- 7B–13B models (Best Fit)
  - Examples: LLaMA-2-7B, Mistral-7B, Gemma-7B, Phi-3-mini
  - Memory footprint: 4–8 GB in 4-bit quantization
  - Performance: 10–30 tokens/sec on Apple Silicon
  - Use cases: Chatbots, document summarization, POC RAG
- 30B models (Pushing the Limits)
  - Examples: LLaMA-30B, Code Llama 34B
  - Memory footprint: ~20–25 GB in 4-bit
  - Performance: 1–5 tokens/sec
  - Use cases: Experimentation only (too slow for interactive apps)
- 65B–120B models (Not Feasible)
  - Even quantized weights exceed 36 GB of memory
  - Swapping to SSD kills performance
How to Start a POC on a MacBook
For a step-by-step walkthrough, follow the article "Build Your Own Quiz Master AI: A Complete Beginner's Guide to Local RAG Pipelines", or start from the minimal sketch below.
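If you just want to see tokens streaming before committing to a full pipeline, something like the following is enough to confirm your Mac can serve a 7B model. This sketch assumes llama-cpp-python is installed and that you have already downloaded a 4-bit GGUF build of Mistral-7B; the file path is a placeholder.

```python
# Minimal sketch: run a 4-bit quantized 7B model on Apple Silicon
# with llama-cpp-python. The GGUF path is a placeholder -- download
# any 4-bit build of Mistral-7B or LLaMA-2-7B first.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/mistral-7b-instruct-q4_k_m.gguf",  # placeholder path
    n_ctx=4096,        # context window -- this drives KV cache size (see below)
    n_gpu_layers=-1,   # offload all layers to the Metal GPU
)

out = llm(
    "Summarize retrieval-augmented generation in two sentences.",
    max_tokens=128,
)
print(out["choices"][0]["text"])
```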
Understanding Capacity Planning
When running locally or in the cloud, capacity planning boils down to understanding three key components that drive memory and cost: the model weights, the KV cache, and user concurrency.
- Weights Storage = parameters × precision
  Each parameter in a model is a number, and the precision (FP16, INT8, 4-bit) determines how many bytes each number takes. Multiply bytes per parameter by the total parameter count to get the raw weight storage.
  - Example: 7B parameters × 0.5 bytes (4-bit) ≈ 3.5 GB of weights.
  - A 30B model at 4-bit: 30B × 0.5 bytes ≈ 15 GB of weights.
  - This is the baseline cost of just loading the model.
- KV Cache Growth ≈ 2 × layers × context length × hidden size × precision
  For every input token, the model stores a key vector and a value vector in each transformer layer (split across its attention heads); the factor of 2 accounts for keys and values. These must be kept in memory for efficient next-token prediction, so the longer the context, the more tokens and the larger the KV cache.
  - For a 7B model, the hidden size and layer count are relatively small, so the cache overhead is modest (~1–2 GB extra for a 4k context).
  - For a 30B model, with more layers and wider hidden states, the KV cache balloons (~5–8 GB extra for the same context).
  - The cache grows linearly with context length (double the context, double the cache). This is why long-context applications need much more GPU memory than the raw weight size suggests.
- Batch Size and Concurrency
  Each concurrent request (or each sequence in a batch) needs its own KV cache. One user is trivial, but 100 users with long contexts means roughly 100× the KV storage. This is why serving many users at once is far more expensive than a single local session.
Rule: Always add 20–30% headroom beyond weight storage to account for the KV cache, activation buffers, and framework overhead. Without it, you’ll run out of memory under real workloads even if the weights themselves fit. The sketch below turns these formulas into a quick calculator.
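Here is that back-of-the-envelope calculator in Python. The layer count and hidden size below are representative of a 7B LLaMA-style architecture rather than exact figures for any particular checkpoint, and the KV cache is assumed to be stored in FP16.

```python
# Back-of-the-envelope memory estimate: weights + KV cache + headroom.
# Layer count and hidden size are representative values, not exact
# figures for any particular checkpoint.

def weight_gb(params_billion: float, bytes_per_param: float) -> float:
    """Raw weight storage: parameters x precision."""
    return params_billion * 1e9 * bytes_per_param / 1e9

def kv_cache_gb(context_len: int, layers: int, hidden: int,
                bytes_per_value: float, concurrent_requests: int = 1) -> float:
    """KV cache: 2 (keys and values) x layers x tokens x hidden size x precision,
    multiplied by the number of concurrent requests."""
    per_request = 2 * layers * context_len * hidden * bytes_per_value
    return per_request * concurrent_requests / 1e9

# A 7B-class model: 4-bit weights, FP16 KV cache, 4k context, 2 concurrent users
weights = weight_gb(7, 0.5)                   # ~3.5 GB
cache = kv_cache_gb(4096, 32, 4096, 2, 2)     # ~4.3 GB for two requests
total = (weights + cache) * 1.3               # 30% headroom per the rule above
print(f"weights={weights:.1f} GB, kv={cache:.1f} GB, plan for ~{total:.1f} GB")
```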
Planning for a Local RAG Pipeline
- Indexer Memory Needs: Vector DBs like Chroma or FAISS run fine in 2–4 GB RAM for small datasets.
- Model Inference: Choose a 7B model for responsive performance.
- Context Planning: If your RAG requires long contexts (4k+ tokens), ensure you have headroom. On a 7B model, this is ~2 GB extra memory per request.
Example capacity plan on MacBook (36 GB RAM):
- Model (Mistral-7B, 4-bit): ~5 GB
- Cache for 4k context: ~2 GB per request
- Vector DB + App overhead: ~5–8 GB
- Total: ~12–15 GB used → safe within 36 GB
This means you can run 1–2 concurrent RAG queries comfortably on your Mac.
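For the retrieval side of that plan, the moving parts are small. The sketch below assumes chromadb is installed; the collection name, documents, and query are illustrative. The point to notice is that the number and size of retrieved chunks determine your prompt length, which is what drives the ~2 GB-per-request cache figure above.

```python
# Minimal sketch of the retrieval side of the plan: a local Chroma
# collection plus a prompt assembled from the retrieved chunks.
# Collection name, documents, and query are illustrative.
import chromadb

client = chromadb.PersistentClient(path="./rag_index")   # runs in a few GB of RAM
collection = client.get_or_create_collection("docs")

# In a real pipeline these chunks come from your document loader/splitter.
collection.upsert(
    ids=["chunk-1", "chunk-2"],
    documents=[
        "KV cache grows linearly with context length.",
        "A 4-bit 7B model needs roughly 4-5 GB for its weights.",
    ],
)

question = "How do I size the KV cache for a 7B model?"
hits = collection.query(query_texts=[question], n_results=2)

# Each retrieved chunk lands in the prompt, so retrieval depth directly
# drives context length -- and therefore the per-request cache budget.
context = "\n\n".join(hits["documents"][0])
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
print(prompt)
```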
Moving to Cloud GPUs
At some point, you’ll outgrow your Mac. For more users or larger models:
Common Cloud Options
- Single GPU: A100 40 GB or L4 (for 7B–13B models)
- Dual/Quad GPUs: A100 80 GB × 2–4 (for 30B–70B models)
- Specialized Inference: H100, RTX 6000 Ada for enterprise-grade serving
How Cloud GPUs Improve Performance
- Throughput: Batch 10–100 requests simultaneously
- Longer Contexts: Handle 8k–32k tokens
- Scalability: Add GPUs for more users
Cost Implications
- 7B model on a MacBook: Free, but limited.
- 7B model on A100 GPU: ~$1–2/hour depending on provider.
- 30B–70B model on multi-GPU cluster: $10–20/hour.
- 120B model on 4×H100: $30–50/hour.
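To compare these hourly rates with per-token SaaS pricing, it helps to convert them into a cost per million generated tokens. The throughput figure below is a placeholder; substitute numbers from your own benchmarks.

```python
# Rough conversion from a GPU hourly rate to cost per million output tokens.
# Throughput is a placeholder -- measure your own batched serving rate.
def cost_per_million_tokens(hourly_rate_usd: float, tokens_per_sec: float) -> float:
    tokens_per_hour = tokens_per_sec * 3600
    return hourly_rate_usd / tokens_per_hour * 1e6

# e.g. a 7B model on a single A100 at ~$1.50/hr, batched to ~1000 tok/s overall
print(f"${cost_per_million_tokens(1.50, 1000):.2f} per 1M tokens")  # ~ $0.42
```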
Conclusion
If you’re a developer starting with OpenAI SaaS APIs, moving to open-source LLMs is a natural step for control and cost savings. Your MacBook Pro with 36 GB RAM is a perfect sandbox for 7B–13B models and small RAG pipelines. It lets you validate workflows, measure context needs, and plan scaling.
When you’re ready for more users, longer contexts, or bigger models, cloud GPUs unlock performance at a cost. By understanding how model size, quantization, and cache overhead translate into memory and dollars, you can plan capacity wisely.