Why VRAM Is the Bottleneck That Decides Everything
Running DeepSeek models locally in 2026 offers cost savings and data privacy, but GPU VRAM is the single constraint that determines whether a model runs, crawls, or crashes outright. The DeepSeek model lineup now spans from 1.5 billion parameters up to 671 billion, and picking the wrong size for available hardware means either out-of-memory errors at inference time or an expensive GPU sitting idle with headroom to spare. The VRAM sizing decision should come before everything else: before downloading weights, before choosing a runtime, and before writing any application code.
Model parameter count, combined with precision format and context length, dictates the minimum VRAM requirement. Everything else is secondary.
This guide provides a concrete sizing table mapping every current DeepSeek variant to specific VRAM thresholds, a quantization decision tree, a working React-based VRAM calculator that readers can fork and embed, and a pre-flight checklist for local deployment.
DeepSeek’s 2026 Model Lineup at a Glance
Model Families and Their Use Cases
DeepSeek’s current model families serve distinct workloads. DeepSeek-R1 targets reasoning-heavy tasks such as multi-step math, logic chains, and structured problem decomposition. DeepSeek-V3 and its successors handle general-purpose chat, instruction following, and code assistance. DeepSeek-Coder-V2 is purpose-built for code generation, refactoring, and repository-scale understanding.
The R1 family ships distilled variants at 1.5B, 7B, 8B, 14B, 32B, and 70B parameter counts. These distilled models are dense transformers: existing Qwen and Llama checkpoints fine-tuned on outputs from the full model, trading some capability for far lower hardware requirements. The 7B distill, for example, needs ~14 GB at FP16, versus ~1,342 GB for the full 671B model. At the top of the lineup sits the full Mixture-of-Experts (MoE) variant at 671B total parameters.
Parameter Count ≠ Active Parameters: Understanding MoE
The 671B MoE model does not activate all 671 billion parameters for every token. Its Mixture-of-Experts architecture routes each token through approximately 37 billion active parameters per forward pass (per DeepSeek’s published MoE architecture; see the DeepSeek-V3 technical report, arXiv 2412.19437), selected from a much larger pool of expert sub-networks. During inference, computational cost resembles a dense 37B model. However, the full set of weights must still reside in memory (or be swappable from disk at steep performance cost), because any expert could be activated on the next token. This means VRAM allocation must account for the entire 671B weight payload, not just the 37B active slice.
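The gap between resident memory and per-token compute can be checked with a few lines of arithmetic. This is a minimal sketch: the function name moeFootprint and its field names are illustrative and not part of any DeepSeek tooling, while the 671B and 37B figures come from the paragraph above.

```javascript
// Weight memory for a Mixture-of-Experts model: every expert must be resident,
// even though only a fraction of the parameters is used for each token.
function moeFootprint({ totalParamsB, activeParamsB, bitsPerWeight }) {
  const bytesPerWeight = bitsPerWeight / 8;
  return {
    // Memory that must be available (VRAM, or RAM/disk at a speed penalty).
    residentGb: totalParamsB * bytesPerWeight,
    // Weights actually read during a single token's forward pass.
    activePerTokenGb: activeParamsB * bytesPerWeight,
  };
}

const fp16Footprint = moeFootprint({ totalParamsB: 671, activeParamsB: 37, bitsPerWeight: 16 });
console.log(fp16Footprint); // { residentGb: 1342, activePerTokenGb: 74 }
```

Sizing must use residentGb; activePerTokenGb only indicates how much work each token costs.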
The VRAM-First Sizing Rule
Baseline VRAM Requirements (FP16 / BF16)
At FP16 or BF16 precision, each parameter occupies 2 bytes. Multiplying parameter count by 2 gives the raw weight footprint. Quantization compresses this substantially, and most local deployments use quantized weights.
| Model Size | Min VRAM (FP16) | Approx. VRAM (Quantized) | Recommended VRAM |
|---|---|---|---|
| 1.5B | ~3 GB | ~1.5 GB (Q4) | 4 GB |
| 7B | ~14 GB | ~4.5 GB (Q4) | 6-8 GB |
| 14B | ~28 GB | ~9 GB (Q4) | 12 GB |
| 32B | ~64 GB | ~20 GB (Q4) | 24 GB |
| 70B | ~140 GB | ~40 GB (Q4) | 48 GB |
| 671B (MoE) | ~1,342 GB | ~400 GB+ (Q4) | Multi-GPU / cloud |
These quantized estimates assume 4-bit quantization and exclude KV-cache. At 2K/8K/32K context, add approximately 0.5 GB / 2 GB / 8 GB respectively (model-dependent). For the 671B model specifically, KV-cache at 128K context adds an estimated 50-100 GB; verify with your specific framework.
Why “Minimum” and “Recommended” Differ
Minimum and recommended differ because runtime costs stack on top of raw weight storage. The KV-cache, which stores key-value attention states for every token in the context window, grows linearly with context length for a fixed batch size and model architecture, and can consume several gigabytes at 32K or 128K context. Operating system and GPU driver reservations typically claim 500 MB to 1 GB (higher when a display is connected — up to 2 GB on Windows with the desktop compositor active). Batch size matters too: serving multiple concurrent requests through a local API multiplies KV-cache usage proportionally. A 15% VRAM headroom buffer above the calculated minimum is a practical target to avoid sporadic OOM events during longer generations.
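The stacking described above can be written out as a small budget function. This is a sketch under stated assumptions: the name recommendedVramGb and its parameters are hypothetical, and the 0.25 GB per 1K tokens constant is simply the ~0.5 GB / 2 GB / 8 GB rule of thumb quoted for 2K / 8K / 32K context, which is model-dependent.

```javascript
// Rule-of-thumb VRAM budget: weights + KV-cache + driver reservation, plus headroom.
const KV_GB_PER_1K_TOKENS = 0.25; // assumption taken from the 2K/8K/32K figures above

function recommendedVramGb({
  weightsGb,
  contextLength,
  batchSize = 1,   // concurrent requests multiply KV-cache usage
  reservedGb = 1,  // OS / driver reservation (closer to 2 on Windows with a display)
  headroom = 0.15, // buffer above the calculated minimum
}) {
  const kvCacheGb = (contextLength / 1024) * KV_GB_PER_1K_TOKENS * batchSize;
  const minimumGb = weightsGb + kvCacheGb + reservedGb;
  return { kvCacheGb, minimumGb, recommendedGb: minimumGb * (1 + headroom) };
}

// 7B at Q4 (~4.5 GB of weights) with a 2K context and one request at a time.
const budget = recommendedVramGb({ weightsGb: 4.5, contextLength: 2048 });
console.log(budget.minimumGb.toFixed(1), budget.recommendedGb.toFixed(1)); // 6.0 6.9
```

The result lands inside the 6-8 GB recommendation given for the 7B model in the sizing table; raising batchSize or contextLength shows how quickly that margin disappears.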
Quantization: Trading Precision for Fit
Quantization Formats Explained (GGUF, GPTQ, AWQ, EXL2)
If you need CPU+GPU hybrid execution or broad compatibility, GGUF is the default choice — it runs natively in llama.cpp and Ollama and supports the widest range of quantization levels. For GPU-only inference through frameworks like vLLM and text-generation-inference, GPTQ and AWQ target maximum throughput; AWQ often edges out GPTQ on quality preservation at equivalent bit widths in community benchmarks, though results vary by model. EXL2, the ExLlamaV2 format, takes a different approach: per-layer bit allocation that squeezes more quality from a fixed VRAM budget, at the cost of being locked to a single inference backend.
DeepSeek officially distributes only the original weights: BF16 for the distilled models, and FP8 for the 671B checkpoints. Community contributors produce and host the quantized variants on Hugging Face and Ollama registries, with GGUF being the most widely available format across all model sizes.
Choosing the Right Quant Level
Q8 (8-bit) halves the weight footprint while retaining quality very close to FP16 — the go-to when VRAM allows it.
For a balance between size and fidelity, Q5/Q6 split the difference with minimal perplexity increase over FP16 (typically under 0.5 on WikiText-2).
Most VRAM-constrained deployments land on Q4 (4-bit), which cuts weights to roughly one-quarter of FP16 size. Quality loss is benchmark-dependent, but it is rarely user-visible on standard tasks such as HumanEval pass@1 for code or straightforward chat.
Q3 and below introduce noticeable degradation: perplexity increases exceed 1.0 on WikiText-2 in community testing, and outputs on reasoning-heavy tasks lose coherence on multi-step chains.
The practical boundary is Q4 for most use cases. Drop below Q4 only when VRAM is genuinely exhausted and the alternative is stepping down an entire model size tier.
Quantization VRAM Savings Table
| Model | Q8 VRAM | Q5 VRAM | Q4 VRAM | Q3 VRAM |
|---|---|---|---|---|
| 7B | ~7 GB | ~5.5 GB | ~4.5 GB | ~3.5 GB |
| 14B | ~14 GB | ~11 GB | ~9 GB | ~7 GB |
| 32B | ~32 GB | ~25 GB | ~20 GB | ~16 GB |
| 70B | ~70 GB | ~55 GB | ~40 GB | ~35 GB |
These figures cover weights only. Add ~0.5 GB at 2K context, ~2 GB at 8K context, ~8 GB at 32K context for KV-cache (model-dependent).
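The table and the KV-cache note can be combined into a simple lookup that returns the highest-precision quant level that still fits a given card. This is a sketch, not a definitive decision tree: the function name pickQuant, the 0.75 GB overhead default, and the choice to support only the three listed context sizes are assumptions made for illustration; the numbers themselves are copied from the table above.

```javascript
// Weight footprints in GB, copied from the quantization savings table.
const QUANT_TABLE_GB = {
  "7B":  { Q8: 7,  Q5: 5.5, Q4: 4.5, Q3: 3.5 },
  "14B": { Q8: 14, Q5: 11,  Q4: 9,   Q3: 7 },
  "32B": { Q8: 32, Q5: 25,  Q4: 20,  Q3: 16 },
  "70B": { Q8: 70, Q5: 55,  Q4: 40,  Q3: 35 },
};

// Approximate KV-cache cost in GB per context length (model-dependent).
const KV_CACHE_GB = { 2048: 0.5, 8192: 2, 32768: 8 };

// Return the highest-precision quant level that fits, or null if none does.
function pickQuant(model, gpuVramGb, contextLength = 8192, overheadGb = 0.75) {
  const row = QUANT_TABLE_GB[model];
  const kvGb = KV_CACHE_GB[contextLength];
  if (!row || kvGb === undefined) {
    throw new RangeError("unknown model size or context length");
  }
  const weightBudgetGb = gpuVramGb - kvGb - overheadGb;
  for (const level of ["Q8", "Q5", "Q4", "Q3"]) {
    if (row[level] <= weightBudgetGb) return level;
  }
  return null; // nothing fits: step down a model size tier
}

console.log(pickQuant("14B", 12));     // Q4
console.log(pickQuant("7B", 8, 2048)); // Q5
console.log(pickQuant("70B", 24));     // null
```

The lookup leaves no headroom, so a result that only just fits should be treated as a tight configuration rather than a comfortable one.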
Matching Models to Common GPU Tiers
Consumer GPUs (8-16 GB)
The RTX 4060 Ti (8 GB) fits the 7B model at Q4-Q5 with room for moderate context lengths. The RTX 4070 (12 GB) opens up 7B at Q8 or 14B at Q4 with tight margins. The RTX 4080 (16 GB) handles 14B at Q4-Q5 comfortably with 8K context, or 7B at Q8 with long context windows. None of these cards should attempt 32B or larger.
Prosumer GPUs (24-32 GB)
For local inference work that demands more than a consumer card, the RTX 4090 (24 GB) and RTX 5090 (32 GB) cover different tiers. The RTX 4090 runs 14B at Q8 (~14 GB) with ample room for context, and 32B at Q4 (~20 GB) with careful context length management. 14B at FP16 (~28 GB) and 32B at Q5 (~25 GB) both exceed 24 GB. The RTX 5090’s additional 8 GB of VRAM provides more comfortable headroom for 32B models, makes 32B at Q5 feasible with short context, and enables longer context windows at Q4. A 32B distilled model at Q4 on a 24 GB card is one of the strongest capability-per-dollar configurations available in 2026.
Professional / Data Center GPUs (48-80 GB)
The RTX A6000 (48 GB) handles 70B at Q4 (~40 GB); Q5 (~55 GB) does not fit. The A100 (80 GB) and H100 (80 GB) run 70B at Q8 (~70 GB) with a modest context window, while 70B at FP16 (~140 GB) needs at least two 80 GB cards. For the 671B MoE model, even an 80 GB card is insufficient alone; multi-GPU configurations are mandatory.
Multi-GPU and Cloud Configurations
The 671B MoE model requires tensor parallelism across multiple GPUs. At Q4 (~400 GB+ of weights), a common configuration uses eight A100 80 GB cards (640 GB total), distributing weight shards across devices; four such cards (320 GB) cannot hold even the 4-bit weights. Frameworks like vLLM and DeepSpeed support tensor parallelism natively. For the 70B model, two RTX 4090s via PCIe (NVLink is not available on RTX 40-series consumer GPUs) can handle Q4 weights (~40 GB of the combined 48 GB) with a PCIe-bandwidth penalty; FP16 weights (~140 GB) are far out of reach for this pair. Use a framework with PCIe-based tensor parallelism such as vLLM.
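A rough card count for a multi-GPU setup follows from dividing the total footprint by the usable memory per card. The sketch below assumes weights and cache shard evenly across devices, which real runtimes only approximate; the function name gpusNeeded and the 15% per-card headroom default are illustrative choices.

```javascript
// Estimate how many identical GPUs are needed to hold a given footprint.
function gpusNeeded(totalGb, perGpuGb, headroom = 0.15) {
  if (totalGb <= 0 || perGpuGb <= 0) {
    throw new RangeError("totalGb and perGpuGb must be positive");
  }
  const usablePerGpuGb = perGpuGb * (1 - headroom);
  return Math.ceil(totalGb / usablePerGpuGb);
}

console.log(gpusNeeded(400, 80)); // 6  (671B at Q4 on 80 GB cards)
console.log(gpusNeeded(40, 24));  // 2  (70B at Q4 on 24 GB cards)
```

Treat the result as a lower bound. Tensor-parallel runtimes often require a GPU count that divides the model's attention head count evenly, so an estimate of 6 usually means provisioning 8 in practice; check your framework's documentation.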
The practical cost threshold where cloud inference becomes more economical than local hardware typically falls around the 671B model. Unless sustained daily usage (roughly 8+ hours of inference per day at current cloud API pricing) justifies the capital expenditure of a multi-GPU server, API access to hosted 671B endpoints costs less.
Build a VRAM Calculator with React
Prerequisites
Node.js ≥ 18.0.0 and npm ≥ 9.0.0 are required. The instructions below have been tested with Vite 5.x and React 18.x. The development server runs on http://localhost:5173 by default.
Project Setup
Scaffold a minimal Vite + React project and install dependencies:
npm create vite@latest vram-calculator -- --template react
cd vram-calculator
npm install
npm run dev
This produces a running development server in under a minute. No additional dependencies are needed for the calculator logic. To produce a production build for static hosting, run npm run build; output will be in the dist/ directory.
Core Calculation Logic
Create a file src/calculateVram.js containing the estimation function:
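export function calculateVram({
  params,                 // model size in billions of parameters
  quantBits = 4,          // bits per weight (16 = FP16/BF16)
  contextLength = 4096,   // context window in tokens
  kvCacheOverhead = null, // explicit KV-cache size in GB; overrides the heuristic
  overheadGb = 0.75,      // OS and driver reservation in GB
}) {
  if (params <= 0 || quantBits <= 0) {
    throw new RangeError("params and quantBits must be positive numbers");
  }
  if (contextLength <= 0) {
    throw new RangeError("contextLength must be a positive number");
  }

  // Weights: billions of parameters x bits per weight / 8 bits per byte = GB.
  const weightGb = (params * quantBits) / 8;

  // KV-cache heuristic: ~0.035 GB per billion parameters per 1K tokens,
  // calibrated so a 7B model at 8K context lands near the ~2 GB quoted in the sizing notes.
  const kvCacheGb =
    kvCacheOverhead !== null
      ? kvCacheOverhead
      : params * (contextLength / 1024) * 0.035;

  const totalGb = weightGb + kvCacheGb + overheadGb;

  // Round to two decimals for display.
  return {
    weightGb: Math.round(weightGb * 100) / 100,
    kvCacheGb: Math.round(kvCacheGb * 100) / 100,
    overheadGb,
    totalGb: Math.round(totalGb * 100) / 100,
  };
}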
Note on the KV-cache heuristic: The formula above ties KV-cache size to total parameter count, which is a rough approximation. In reality, KV-cache scales with the number of layers, number of KV heads, head dimension, and context length—not total parameter count. This heuristic may under- or over-estimate KV-cache by a factor of 3×; consult model architecture documentation for accurate values. For MoE models in particular, where total parameter count far exceeds the active compute path, this heuristic will significantly overestimate KV-cache.
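When the architecture is known, the KV-cache can be computed directly instead of guessed from parameter count. The sketch below uses the standard formula for multi-head or grouped-query attention; the function name kvCacheGib is hypothetical, and the example values (32 layers, 8 KV heads, head dimension 128) are illustrative assumptions rather than figures from this guide. Real values are normally listed in the model's config file.

```javascript
// KV-cache size for standard multi-head or grouped-query attention:
// 2 tensors (K and V) x layers x KV heads x head dimension x tokens x batch x bytes per element.
function kvCacheGib({
  layers,
  kvHeads,
  headDim,
  contextLength,
  batchSize = 1,
  bytesPerElement = 2, // FP16/BF16 cache entries
}) {
  const bytes =
    2 * layers * kvHeads * headDim * contextLength * batchSize * bytesPerElement;
  return bytes / 1024 ** 3; // binary gigabytes (GiB)
}

// Illustrative architecture values (assumed, not taken from any specific model card).
const exampleCache = kvCacheGib({ layers: 32, kvHeads: 8, headDim: 128, contextLength: 8192 });
console.log(exampleCache); // 1
```

With 32 KV heads instead of 8 (no grouped-query attention), the same call returns 4, which shows why architecture matters more than raw parameter count. Models that use a different attention scheme, such as the multi-head latent attention described in the DeepSeek-V3 technical report, store a compressed cache and are not covered by this formula.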
To verify the weight calculation, create a separate file src/calculateVram.test.js:
import { calculateVram } from "./calculateVram.js";

// Minimal assertion helper so the file runs with plain Node, no test framework.
function assert(condition, message) {
  if (!condition) {
    throw new Error(`ASSERTION FAILED: ${message}`);
  }
}

// Weight footprint: 7B x 4 bits / 8 = 3.5 GB.
assert(
  calculateVram({ params: 7, quantBits: 4, contextLength: 4096 }).weightGb === 3.5,
  "7B Q4 weight calculation failed: expected 3.5 GB"
);

// Weight footprint: 14B x 16 bits / 8 = 28 GB.
assert(
  calculateVram({ params: 14, quantBits: 16, contextLength: 2048 }).weightGb === 28,
  "14B FP16 weight calculation failed: expected 28 GB"
);

// Total must include KV-cache and overhead on top of weights.
const result15B = calculateVram({ params: 1.5, quantBits: 4, contextLength: 4096 });
assert(
  result15B.totalGb > result15B.weightGb,
  "1.5B total should exceed weight (overhead + kv-cache must add > 0)"
);

// Long contexts should produce a non-trivial KV-cache estimate.
const largeCtx = calculateVram({ params: 7, quantBits: 4, contextLength: 32768 });
assert(
  largeCtx.kvCacheGb > 1,
  `7B at 32K context should have kvCacheGb > 1, got ${largeCtx.kvCacheGb}`
);

// Invalid input must be rejected.
let threw = false;
try {
  calculateVram({ params: 0, quantBits: 4, contextLength: 4096 });
} catch (e) {
  threw = true;
}
assert(threw, "params=0 should throw RangeError");

console.log("All assertions passed.");
Run with node src/calculateVram.test.js during development.
The formula follows the standard approximation: VRAM ≈ (params_in_billions × quantBits / 8) + KV-cache_estimate + overhead_buffer. The KV-cache heuristic here is a simplified linear model; real KV-cache usage varies by architecture, attention head count, and whether grouped-query attention is used. The 15% headroom threshold used in the calculator’s fit assessment matches the guidance in the Sizing section above.
React Calculator Component
Replace the contents of src/App.jsx:
import { useState, useMemo } from "react";
import { calculateVram } from "./calculateVram";

const MODELS = [
  { label: "DeepSeek 1.5B", params: 1.5 },
  { label: "DeepSeek 7B", params: 7 },
  { label: "DeepSeek 8B", params: 8 },
  { label: "DeepSeek 14B", params: 14 },
  { label: "DeepSeek 32B", params: 32 },
  { label: "DeepSeek 70B", params: 70 },
  { label: "DeepSeek 671B (MoE)", params: 671 },
];

const QUANT_OPTIONS = [
  { label: "FP16 (16-bit)", bits: 16 },
  { label: "Q8 (8-bit)", bits: 8 },
  { label: "Q5 (5-bit)", bits: 5 },
  { label: "Q4 (4-bit)", bits: 4 },
  { label: "Q3 (3-bit)", bits: 3 },
];

const GPU_PRESETS = [
  { label: "RTX 4060 Ti (8 GB)", vram: 8 },
  { label: "RTX 4070 (12 GB)", vram: 12 },
  { label: "RTX 4080 (16 GB)", vram: 16 },
  { label: "RTX 4090 (24 GB)", vram: 24 },
  { label: "RTX 5090 (32 GB)", vram: 32 },
  { label: "RTX A6000 (48 GB)", vram: 48 },
  { label: "A100 / H100 (80 GB)", vram: 80 },
  { label: "Custom", vram: 0 },
];

// Red: does not fit. Yellow: less than 15% headroom. Green: comfortable.
function getFitColor(totalGb, gpuVram) {
  if (!gpuVram || gpuVram <= 0) return "#888";
  const ratio = totalGb / gpuVram;
  if (ratio > 1) return "#e74c3c";
  if (ratio > 0.85) return "#f39c12";
  return "#27ae60";
}

export default function App() {
  const [modelIdx, setModelIdx] = useState(1);
  const [quantIdx, setQuantIdx] = useState(3);
  const [contextLength, setContextLength] = useState(4096);
  const [gpuIdx, setGpuIdx] = useState(3);
  const [customVram, setCustomVram] = useState(24);

  const model = MODELS[modelIdx];
  const quant = QUANT_OPTIONS[quantIdx];
  const gpu = GPU_PRESETS[gpuIdx];
  // The "Custom" preset has vram 0, so fall back to the user-entered value.
  const gpuVram = gpu.vram > 0 ? gpu.vram : customVram;

  const result = useMemo(
    () => calculateVram({ params: model.params, quantBits: quant.bits, contextLength }),
    [model.params, quant.bits, contextLength]
  );
  const fitColor = getFitColor(result.totalGb, gpuVram);
  return (
    <div style={{ maxWidth: 520, margin: "2rem auto", fontFamily: "system-ui" }}>
      <h1>DeepSeek VRAM Calculator</h1>

      <label>Model:
        <select value={modelIdx} onChange={(e) => setModelIdx(Number(e.target.value))}>
          {MODELS.map((m, i) => <option key={m.label} value={i}>{m.label}</option>)}
        </select>
      </label>

      <label style={{ display: "block", marginTop: 12 }}>
        Quantization: <strong>{quant.label}</strong>
        <input type="range" min={0} max={QUANT_OPTIONS.length - 1} value={quantIdx}
          onChange={(e) => setQuantIdx(Number(e.target.value))} style={{ width: "100%" }} />
      </label>

      <label style={{ display: "block", marginTop: 12 }}>
        Context Length: <strong>{contextLength.toLocaleString()} tokens</strong>
        <input type="range" min={512} max={131072} step={512} value={contextLength}
          onChange={(e) => setContextLength(Number(e.target.value))} style={{ width: "100%" }} />
      </label>

      <label style={{ display: "block", marginTop: 12 }}>GPU:
        <select value={gpuIdx} onChange={(e) => setGpuIdx(Number(e.target.value))}>
          {GPU_PRESETS.map((g, i) => <option key={g.label} value={i}>{g.label}</option>)}
        </select>
      </label>

      {/* Only the "Custom" preset (vram === 0) shows the manual VRAM input. */}
      {gpu.vram === 0 && (
        <input type="number" min={1} max={10000} value={customVram}
          onChange={(e) => setCustomVram(Math.min(10000, Math.max(1, Number(e.target.value))))}
          placeholder="VRAM in GB" />
      )}

      <div style={{ marginTop: 24, padding: 16, border: `3px solid ${fitColor}`, borderRadius: 8 }}>
        <p><strong>Weights:</strong> {result.weightGb} GB</p>
        <p><strong>KV-Cache:</strong> {result.kvCacheGb} GB (rough estimate, see note below)</p>
        <p><strong>Overhead:</strong> {result.overheadGb} GB</p>
        <h2 style={{ color: fitColor }}>Total: {result.totalGb} GB / {gpuVram} GB</h2>
        <p>{result.totalGb > gpuVram ? "⛔ Will not fit" :
          result.totalGb > gpuVram * 0.85 ? "⚠️ Tight fit: reduce context or quant level" :
          "✅ Comfortable fit"}</p>
        <p style={{ fontSize: "0.85em", color: "#666" }}>
          KV-cache estimate is a rough heuristic (±300%). Actual usage depends on model architecture. Consult model docs for precise values.
        </p>
      </div>
    </div>
  );
}
This component renders dropdowns for model and GPU selection, sliders for quantization level and context length, and a color-coded output panel. Green signals at least 15% headroom, yellow indicates the model will load but risks OOM during long generations, and red means the configuration will not fit.
Optional: Node.js GPU Detection Script
For readers who want to automate GPU detection before running the calculator, the script below queries nvidia-smi for installed GPUs.
Note: This script requires Node.js ≥ 18.0.0 (the same baseline as the calculator project) and only detects NVIDIA GPUs via nvidia-smi. For AMD GPUs, use rocm-smi; for Intel GPUs, use xpu-smi or intel_gpu_top. Save it as detect-gpu.mjs:
import { execSync } from "node:child_process";

// Query nvidia-smi and return [{ name, totalVramGb }] for each NVIDIA GPU found.
function detectGpuVram() {
  try {
    const output = execSync(
      "nvidia-smi --query-gpu=name,memory.total --format=csv,noheader,nounits",
      { encoding: "utf-8", timeout: 5000 }
    );
    return output
      .trim()
      .split(/\r?\n/)
      .map((line) => {
        const parts = line.split(",");
        if (parts.length < 2) return null;
        // The last field is memory in MiB; everything before it is the GPU name.
        const memMb = parts[parts.length - 1].trim();
        const name = parts.slice(0, parts.length - 1).join(",").trim();
        const parsed = parseInt(memMb, 10);
        if (isNaN(parsed)) {
          console.error(`Could not parse VRAM for GPU: "${name}" (raw: "${memMb}")`);
          return null;
        }
        return {
          name,
          totalVramGb: Math.round((parsed / 1024) * 100) / 100,
        };
      })
      .filter(Boolean);
  } catch (err) {
    // execSync reports a timeout as err.code === "ETIMEDOUT"; err.killed is kept as a fallback.
    if (err.code === "ETIMEDOUT" || err.killed) {
      console.error("nvidia-smi timed out after 5 seconds. Check GPU driver status.");
    } else {
      console.error(
        "nvidia-smi not found or failed to run. For AMD GPUs use rocm-smi; for Intel GPUs use xpu-smi or intel_gpu_top."
      );
    }
    return [];
  }
}

const gpus = detectGpuVram();
if (gpus.length > 0) {
  gpus.forEach((gpu, i) =>
    console.log(`GPU ${i}: ${gpu.name}: ${gpu.totalVramGb} GB VRAM`)
  );
} else {
  console.log("No NVIDIA GPUs detected.");
}
Run with node detect-gpu.mjs. It queries nvidia-smi for each installed GPU’s name and total memory, returning values in GB that can be fed directly into the calculator. The script includes a 5-second timeout to prevent hanging if nvidia-smi stalls during driver initialization.
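One way to connect detection to the sizing logic is to classify each detected GPU against a required total, using the same 85% threshold as the calculator's color coding. This is a sketch: fitStatus is a hypothetical helper, and both the sample GPU list and the requiredGb value are made-up placeholders standing in for real detection output and a real calculateVram result.

```javascript
// Classify a configuration against one GPU using the calculator's thresholds.
function fitStatus(requiredGb, gpuVramGb) {
  if (!(gpuVramGb > 0)) return "unknown";
  const ratio = requiredGb / gpuVramGb;
  if (ratio > 1) return "will-not-fit";
  if (ratio > 0.85) return "tight";
  return "comfortable";
}

// Placeholder input shaped like the detection script's return value.
const detected = [
  { name: "GPU A", totalVramGb: 23.99 },
  { name: "GPU B", totalVramGb: 8 },
];
const requiredGb = 7.21; // placeholder for a totalGb value from the calculator

for (const gpu of detected) {
  console.log(`${gpu.name}: ${fitStatus(requiredGb, gpu.totalVramGb)}`);
}
// GPU A: comfortable
// GPU B: tight
```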
Pre-Flight Checklist Before You Download
- Confirm GPU model and total VRAM by running nvidia-smi.
- Subtract 500 MB to 1 GB for OS and driver VRAM reservation (up to 2 GB on Windows with a display connected).
- Decide on target context length (2K, 8K, 32K, 128K). Longer context windows consume proportionally more KV-cache VRAM.
- Pick the largest model whose quantized footprint fits remaining VRAM with at least 15% headroom.
- Choose a quantization format supported by the target runtime: GGUF for llama.cpp and Ollama, AWQ or GPTQ for vLLM.
- Download weights from Hugging Face or the Ollama registry.
- Run a short test prompt (e.g., “Explain quicksort in three sentences”) and monitor VRAM live with nvtop or nvidia-smi -l 1.
- Benchmark tokens per second. For interactive chat with models ≤32B, if output falls below 10 tok/s, step down one model tier or switch to a lower-bit quantization. For 70B models, 2-5 tok/s is typical and expected.
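The final benchmark step reduces to one division and a threshold check. The sketch below assumes your runtime reports a generated-token count and a generation time in nanoseconds (Ollama's generate response, for instance, exposes eval_count and eval_duration in that form). The helper names are hypothetical, and treating 2 tok/s as the floor for models above 32B is an interpretation of the "2-5 tok/s is typical" guidance rather than a rule from this checklist.

```javascript
// Throughput from a token count and a duration in nanoseconds.
function tokensPerSecond(tokenCount, durationNs) {
  if (durationNs <= 0) throw new RangeError("durationNs must be positive");
  return tokenCount / (durationNs / 1e9);
}

// Apply the checklist thresholds: 10 tok/s up to 32B, an assumed 2 tok/s floor above that.
function throughputVerdict(tokPerSec, paramsB) {
  const floor = paramsB <= 32 ? 10 : 2;
  return tokPerSec >= floor ? "ok" : "step down a tier or use a lower-bit quant";
}

const rate = tokensPerSecond(256, 16e9); // 256 tokens generated in 16 seconds
console.log(rate, throughputVerdict(rate, 14)); // 16 ok
```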
Common Pitfalls and Troubleshooting
OOM at Inference but Not at Load
Weights fit in VRAM at load time, but the KV-cache expansion during generation triggers OOM — especially on long outputs or when context length exceeds what remaining VRAM can hold. Reduce max context length or step down one quantization level.
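To pick a safe context cap before this happens, divide the VRAM left after loading weights by the per-token cache cost. The sketch below reuses this guide's rough figure of about 0.25 GB of KV-cache per 1K tokens (derived from the 0.5 GB / 2 GB / 8 GB estimates), which is model-dependent; maxContextTokens and its parameter names are hypothetical.

```javascript
// Largest context (in whole 1K-token steps) whose KV-cache fits in the VRAM left after weights.
function maxContextTokens({ gpuVramGb, weightsGb, overheadGb = 0.75, kvGbPer1k = 0.25 }) {
  const freeGb = gpuVramGb - weightsGb - overheadGb;
  if (freeGb <= 0) return 0;
  return Math.floor(freeGb / kvGbPer1k) * 1024;
}

// 32B at Q4 (~20 GB of weights) on a 24 GB card.
console.log(maxContextTokens({ gpuVramGb: 24, weightsGb: 20 })); // 13312
```

Setting the runtime's maximum context somewhat below this number leaves room for allocator fragmentation and concurrent requests.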
Slower Than Expected
When a model technically fits but inference is painfully slow, the runtime is probably offloading layers to system RAM. Ollama splits layers between GPU and CPU automatically when the model does not fully fit, and llama.cpp places only as many layers on the GPU as its GPU-layer setting (-ngl / --n-gpu-layers) allows. Check layer allocation logs. If partial offloading is active, reduce model size or quantize further to keep all layers on the GPU.
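A quick estimate shows how many layers can stay on the GPU for a given amount of free VRAM, which helps interpret the allocation logs. This sketch assumes all layers are roughly the same size and ignores the KV-cache; gpuLayerCount is a hypothetical helper, and the 64-layer figure in the example is an illustrative assumption.

```javascript
// Estimate how many transformer layers fit in the VRAM available for weights.
function gpuLayerCount({ weightsGb, totalLayers, freeVramGb }) {
  const perLayerGb = weightsGb / totalLayers; // assumes evenly sized layers
  const fit = Math.floor(freeVramGb / perLayerGb);
  return Math.min(totalLayers, Math.max(0, fit));
}

// ~20 GB of weights split over an assumed 64 layers, with 12 GB of VRAM free.
console.log(gpuLayerCount({ weightsGb: 20, totalLayers: 64, freeVramGb: 12 })); // 38
```

If the result is lower than the model's layer count, some layers will run on the CPU and throughput will drop accordingly.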
Quality Feels Off After Quantization
If quantization degrades coherence or breaks reasoning chains, step up one quant level (for example, Q4 to Q5). Alternatively, try a different quantization format; AWQ and EXL2 preserve quality better than GPTQ at the same bit width in reported community perplexity benchmarks, though results vary by model.
Your Decision in 30 Seconds
Choose the model based on VRAM first, everything else second.
Use the React calculator above with actual GPU specs to get a personalized fit assessment in real time. Bookmark the pre-flight checklist for each new deployment. As DeepSeek releases additional models through the remainder of 2026, the same sizing logic applies: parameter count times bits per weight, plus KV-cache, plus overhead, compared against available VRAM.

