The question comes up in every engineering team eventually: should we run AI coding models on our own hardware, or use a cloud service?
On one side, self-hosted setups promise total control. Your code never leaves your network. You pick the model, configure the hardware, and own the entire stack. On the other side, cloud tools like GitHub Copilot, Claude Code, and Lurus Code offer frontier model quality, zero maintenance, and instant setup.
The honest answer in 2025 is that both approaches have real strengths and real limitations. This guide walks through the practical trade-offs so your team can make an informed choice.
The Self-Hosted Stack: How It Works
A typical self-hosted AI coding setup in 2025 looks like this:
Model runtime: Ollama is the most popular option. It handles model downloading, quantization, and GPU acceleration with a simple CLI. Run `ollama pull qwen2.5-coder:14b` and you have a capable coding model running locally in minutes. Ollama supports over 100 open-weight models and runs on macOS, Linux, and Windows.
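Under the hood, Ollama exposes a local REST API (on port 11434 by default) that editor integrations talk to. The sketch below is a minimal Python client against the documented `/api/generate` endpoint; the model name assumes you have already run `ollama pull qwen2.5-coder:14b`:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_payload(model: str, prompt: str) -> dict:
    # stream=False asks Ollama to return the whole completion as one JSON object
    return {"model": model, "prompt": prompt, "stream": False}


def complete(model: str, prompt: str) -> str:
    """Send a prompt to the local Ollama server and return the generated text."""
    data = json.dumps(build_payload(model, prompt)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]


# With a local server running (`ollama serve`) and the model pulled, you would call:
#   complete("qwen2.5-coder:14b", "Write a Python function that reverses a string.")
print(build_payload("qwen2.5-coder:14b", "reverse a string")["stream"])  # False
```

Extensions like Continue.dev connect to this same local endpoint, which is what makes mixing local and cloud providers in one setup straightforward.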
IDE integration: Continue.dev is an open-source VS Code and JetBrains extension that connects to local models via Ollama or to cloud APIs. It provides chat, inline editing, and autocomplete. You can mix local and cloud models within the same setup.
Models: The best open-weight coding models in 2025 include Qwen2.5-Coder (7B, 14B, and 32B parameter sizes), DeepSeek-Coder-V2, and Mistral’s Codestral (22B parameters). Mistral also released Devstral, an open-weight model optimized for agentic coding tasks that outperformed several much larger proprietary models on coding agent benchmarks.
The appeal is obvious: your code stays on your machine, you pay no per-token fees, and you own the entire stack.
Hardware Requirements: The Real Cost of Local AI
Here is where the self-hosted story gets complicated. Running large language models locally demands serious hardware, and model quality climbs steeply with hardware investment.
GPU Memory (VRAM) Is the Bottleneck
Model quality scales with parameter count, and parameter count demands VRAM. Here are realistic requirements for popular coding models using 4-bit quantization (the standard trade-off between quality and memory):
| Model | Parameters | VRAM Required (4-bit) | Typical GPU |
|---|---|---|---|
| Qwen2.5-Coder 7B | 7 billion | ~8 GB | RTX 4060 Ti |
| Qwen2.5-Coder 14B | 14 billion | ~16 GB | RTX 4080 |
| Qwen2.5-Coder 32B | 32 billion | ~24 GB | RTX 4090 |
| DeepSeek-Coder-V2 Lite | 16 billion | ~12 GB | RTX 4070 Ti |
| Codestral 22B | 22 billion | ~16 GB | RTX 4080 |
| Llama 3.1 70B | 70 billion | ~40 GB | 2x RTX 4090 or A6000 |
An NVIDIA RTX 4090 with 24 GB VRAM costs roughly €1,800-2,200. A workstation-class GPU like the RTX A6000 (48 GB VRAM) runs €4,000-6,000. For 70B+ models, you need multi-GPU setups or enterprise hardware costing €10,000+.
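A back-of-envelope way to read the table above: each parameter costs quant_bits/8 bytes, so the weights alone set a hard floor on VRAM. The sketch below computes that floor; the table's figures are higher because the KV cache grows with context length and the runtime needs working headroom:

```python
def weight_footprint_gb(params_billion: float, quant_bits: int) -> float:
    """Weights-only memory footprint: one parameter costs quant_bits/8 bytes,
    so a billion parameters at 8-bit is roughly 1 GB."""
    return params_billion * quant_bits / 8


# Weights alone, at 4-bit quantization:
for params in (7, 14, 32, 70):
    print(f"{params}B @ 4-bit: {weight_footprint_gb(params, 4):.1f} GB of weights")
# 7B -> 3.5 GB, 14B -> 7.0 GB, 32B -> 16.0 GB, 70B -> 35.0 GB
```

The gap between this floor and the table's recommended VRAM is the context budget: longer prompts and longer generations inflate the KV cache on top of the static weights.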
For comparison, a dedicated GPU server from Hetzner starts at around €184/month (excluding VAT). That avoids the upfront capital investment, but you are still responsible for setup, maintenance, and model management.
CPU Fallback and Apple Silicon
Ollama can run models on CPU when GPU memory is insufficient, but inference is dramatically slower. Where a GPU generates 30-50 tokens per second, CPU inference typically produces 5-10 tokens per second for a 7B model and becomes nearly unusable for anything larger.
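These throughput numbers translate directly into wait time. A quick sketch, assuming the 30-50 tok/s GPU and 5-10 tok/s CPU ranges above:

```python
def generation_seconds(tokens: int, tokens_per_second: float) -> float:
    # Time to stream a completion at a given decode rate
    return tokens / tokens_per_second


# A ~400-token completion, roughly a medium-sized function plus explanation:
print(f"GPU at 40 tok/s: {generation_seconds(400, 40):.0f} s")  # 10 s
print(f"CPU at  7 tok/s: {generation_seconds(400, 7):.0f} s")   # 57 s
```

A ten-second wait is workable in a chat sidebar; a minute per response is not, which is why CPU fallback is an emergency mode rather than a plan.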
Apple’s M-series chips (M3 Max, M4 Max) offer unified memory shared between CPU and GPU workloads. An M4 Max with 128 GB of unified memory can run 70B models at usable speeds, making high-end MacBooks surprisingly capable inference machines. The catch: Apple Silicon is still slower than dedicated NVIDIA GPUs, and the hardware is expensive. An M4 Max MacBook Pro with 128 GB memory costs roughly €4,500-5,000.
Model Quality: The Gap That Matters Most
This is the most important factor in the self-hosted vs cloud decision, and the one that is most often glossed over in enthusiast discussions.
Cloud services like Claude Code, GitHub Copilot, and Lurus Code use frontier models: Claude Sonnet/Opus, GPT-4o, Gemini Pro. These models have hundreds of billions of parameters and are trained on massive datasets with extensive RLHF. They are significantly better at complex coding tasks than any model you can run locally in mid-2025.
How big is the gap? Benchmark comparisons show the best local model (Qwen2.5-Coder-32B) scoring at roughly 85-90% of Claude Sonnet's level on straightforward function generation. That sounds close, but the gap widens on harder tasks:
Where local models perform well:
- Code completion and autocomplete (especially Qwen2.5-Coder)
- Simple refactoring and formatting
- Code explanation for small functions
- Boilerplate generation from patterns
Where cloud models pull ahead:
- Multi-file reasoning across large codebases
- Complex architectural refactoring
- Understanding domain-specific business logic
- Multi-step agentic tasks (planning, executing, verifying)
- Long-context understanding (100K+ token windows)
- Nuanced bug detection and security analysis
The difference is most visible in agentic workflows. Ask a cloud-based agent to “refactor this authentication module to use OAuth2, update all affected tests, and fix any type errors,” and it can execute a multi-step plan across dozens of files. Local 7B-14B models struggle with this coordination. The 32B models handle it better but still fall short on reliability.
The quality gap is narrowing with each quarter. But in mid-2025, it remains meaningful for professional development work.
Latency and Throughput
Self-hosted models have one clear advantage: no network round-trip. For autocomplete suggestions, where milliseconds matter, a local model on a fast GPU feels snappier than a cloud service.
For longer interactions (chat, code review, multi-file edits), the picture reverses. Cloud services run inference on GPU clusters optimized for throughput, generating tokens faster than a local 32B model on a single RTX 4090. In practice, most developers find cloud latency acceptable for interactive work. But if you work with poor or restricted internet connectivity, local inference becomes a practical necessity.
Maintenance and Operational Burden
This is the hidden cost of self-hosting that often gets underestimated.
With a cloud tool, your setup looks like: install extension, enter API key, start coding. Updates happen automatically. Model improvements appear without any action on your part. When a better model is released, you switch with a settings change.
With a self-hosted stack, you are responsible for:
- Hardware maintenance. GPUs fail, drivers need updating, CUDA versions must match your Ollama version.
- Model management. Which model is best for your use case? When should you upgrade? Someone needs to stay current.
- Quantization decisions. 14B at 8-bit or 32B at 4-bit? Each choice involves quality/speed trade-offs that require testing.
- Multi-user infrastructure. Ten developers sharing one GPU server means managing concurrent inference, queueing, and resource allocation.
- Reliability. What happens when the model server goes down?
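The quantization decision in the list above can be made concrete. On a 16 GB card, 14B at 8-bit and 32B at 4-bit occupy nearly the same weight budget, so the choice comes down to quality-per-parameter versus raw capacity, and only testing on your own code settles it. A small sketch of the arithmetic:

```python
def weight_gb(params_billion: float, bits: int) -> float:
    # One parameter takes bits/8 bytes; 1B params at 8-bit is roughly 1 GB
    return params_billion * bits / 8


# Two candidates for a 16 GB card -- nearly identical weight footprints:
print(f"14B @ 8-bit: {weight_gb(14, 8):.0f} GB")  # 14 GB
print(f"32B @ 4-bit: {weight_gb(32, 4):.0f} GB")  # 16 GB
# 4-bit quantization costs some per-parameter quality, while 32B brings more
# raw capacity -- benchmark both on your own codebase before committing.
```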
For a solo developer running Ollama on their laptop, the overhead is minimal. For a team of ten, it becomes a real operations commitment.
Data Sovereignty: Where Self-Hosted Wins (and Where Cloud Can Match It)
The primary argument for self-hosting is data control. Your code never leaves your infrastructure. No third-party DPA is needed. No transfer impact assessment. For organizations handling classified information or defense contracts, this level of control may be a hard requirement.
But for most commercial teams, the question is not “does our code ever leave our network?” It is “does our code stay within our legal jurisdiction, processed by entities subject to our data protection laws?” That is a different question, and one where cloud tools can offer a strong answer.
EU-hosted cloud tools process your data in European data centers, under European law, by European entities. Lurus Code runs on Hetzner infrastructure in Nuremberg (Germany) and Helsinki (Finland). There is no US data transfer, no exposure to FISA Section 702, and no reliance on the EU-US Data Privacy Framework whose long-term stability remains uncertain.
This creates a middle ground: cloud-quality AI with EU data sovereignty. You get frontier models (Claude, GPT-4o, Gemini) routed through EU infrastructure, with a single Data Processing Agreement under Article 28 GDPR with an EU entity.
For teams where data sovereignty (not air-gapped isolation) is the actual requirement, this gives you the best of both worlds. For a deeper look at how GDPR applies to AI coding tools, see our GDPR guide for AI coding tools.
Cost Comparison
Let’s put real numbers on this.
Self-hosted (single developer): An RTX 4090 (€2,000) plus a workstation (€1,500) amortized over three years, plus electricity, works out to roughly €120/month. Add 2-4 hours of maintenance time per month.
Self-hosted (team of 10): A multi-GPU server (€8,000-12,000), server hardware (€3,000), electricity (~€80/month), and 8-16 hours/month of IT operations time. Effective cost: roughly €50-80 per developer per month.
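A sketch of the amortization arithmetic behind these figures, assuming a 36-month write-off and the electricity estimates above. It covers hardware and power only; adding the IT operations hours is what pushes the team figure up into the €50-80 per developer range:

```python
def monthly_hardware_cost(capex_eur: float, months: int, electricity_eur: float) -> float:
    # Straight-line amortization of the upfront spend plus running electricity
    return capex_eur / months + electricity_eur


solo = monthly_hardware_cost(2000 + 1500, 36, 25)   # GPU + workstation, ~25 EUR power
print(f"Solo developer: ~{solo:.0f} EUR/month")      # ~122 EUR/month

team = monthly_hardware_cost(10000 + 3000, 36, 80)  # mid-range server estimate
print(f"Team of 10: ~{team:.0f} EUR/month total, ~{team / 10:.0f} EUR per developer")
```

The €25/month solo electricity figure is an assumption for illustration; actual draw depends on GPU utilization and local energy prices.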
Cloud tools: GitHub Copilot runs $10-19/seat/month (Enterprise with EU residency is custom pricing). Claude Code ranges from $20-200/month depending on usage tier. Lurus Code uses credit-based pricing starting with a free tier, scaling with usage.
The math is nuanced. For a single developer with existing GPU hardware, self-hosting can be cost-effective. For teams, operational overhead often makes cloud tools cheaper when you account for IT staff time.
Decision Framework: Which Approach Fits Your Team?
Choose self-hosted if you handle classified code that cannot leave your network, you already have GPU infrastructure, your team has ML ops staff, or internet connectivity is restricted.
Choose cloud if you need frontier model quality for complex tasks, want zero maintenance, or need structured features beyond raw inference (CI/CD integration, security scanning, code review).
Choose EU-hosted cloud if data sovereignty is a requirement but air-gapped isolation is not, you want frontier quality routed through European infrastructure, or you need the simplest compliant path without enterprise-tier pricing.
Lurus Code occupies this middle ground: frontier models (Claude, GPT-4o, Gemini) through EU-only infrastructure, with structured code review, security scanning, and CI/CD integration. Cloud quality and EU data sovereignty without maintaining your own model infrastructure.
For a broader comparison of available tools, see our comparison of AI coding tools for European developers.
The Hybrid Approach
Some teams run both. Local models via Ollama + Continue.dev handle autocomplete and simple chat, while a cloud tool handles complex tasks, code review, and security analysis. A practical hybrid setup:
- Autocomplete: Qwen2.5-Coder 7B via Ollama (local, fast)
- Complex tasks: Lurus Code or Claude Code (cloud, frontier quality)
- Code review and security scanning: Cloud-based with CI/CD integration
This gets you local speed for the most latency-sensitive task while using cloud quality where it matters most. The trade-off is managing two systems instead of one.
What the Future Looks Like
The gap between local and cloud models is closing. Qwen2.5-Coder 32B is dramatically better than anything available locally two years ago. Mistral’s Devstral shows that smaller, specialized models can outperform much larger general-purpose ones on specific tasks. Hardware is improving too, with more VRAM per generation and better quantization techniques.
But cloud models are not standing still. Frontier labs invest billions in training, data, and RLHF. The models you access through cloud services will likely always be at least one generation ahead of what you can run locally. The question is whether “one generation behind” is good enough for your work.
For most teams in 2025, cloud-hosted tools offer the better combination of quality, features, and operational simplicity. If data sovereignty matters, EU-hosted cloud options let you get frontier quality without the compliance headaches of US data transfers or the operational burden of self-hosting.
Frequently Asked Questions
Can I run a model as good as GPT-4o or Claude Sonnet locally?
Not in mid-2025. The best local coding models (Qwen2.5-Coder 32B, Codestral 22B) are capable and improving fast, but they score roughly 85-90% of frontier cloud models on straightforward tasks, and the gap widens significantly on complex multi-file reasoning and agentic workflows. Running a model that truly matches frontier quality would require hundreds of billions of parameters and hardware far beyond a typical workstation.
How much does it cost to set up a local AI coding environment?
For a single developer, expect around €2,000-3,500 for a GPU capable of running 14B-32B models comfortably (RTX 4080 or 4090), plus your existing workstation. Ongoing costs are electricity and maintenance time. For a team sharing GPU infrastructure, budget €8,000-15,000 for hardware plus ongoing IT operations time. Compare this against cloud tool subscriptions of €10-200/month per developer with zero hardware investment.
Is self-hosted AI coding GDPR-compliant by default?
If you run models entirely on your own infrastructure with no external API calls, there is no third-party data processing involved, so Article 28 GDPR Data Processing Agreement requirements do not apply to the AI tool itself. However, you are still responsible for securing the infrastructure, and you lose the benefit of having a professional provider handle model updates, security patches, and operational reliability. Self-hosting is not automatically more compliant; it shifts the compliance burden from vendor management to infrastructure management.
Can I use Continue.dev with both local and cloud models at the same time?
Yes. Continue.dev supports configuring different model providers for different tasks. You can set a local Ollama model for autocomplete (where speed matters most) and a cloud API for chat and complex reasoning (where quality matters most). This hybrid approach gives you local speed for simple tasks and cloud quality for hard problems. The configuration is handled through a JSON config file in your project or user settings.
Conclusion
Self-hosted and cloud AI coding tools serve different needs. Self-hosting gives you maximum control and eliminates third-party data exposure entirely. Cloud tools give you frontier model quality, zero maintenance, and structured features that go well beyond raw model inference.
For most teams, the deciding factor is not ideology but practicality. If you need the best possible AI quality for complex coding work, cloud tools win today. If absolute network isolation is a hard requirement, self-hosting is your only option. And if data sovereignty without operational overhead is what you need, EU-hosted cloud tools like Lurus Code offer a middle path that did not exist a few years ago.
Whatever you choose, the worst option is doing nothing. AI coding tools, whether local or cloud-hosted, are a genuine productivity multiplier. Pick the approach that fits your constraints, set it up, and iterate from there. For a detailed look at the GDPR implications of choosing between providers, we have written separately about the compliance dimension of this decision.