Host Your Own AI Code Assistant with Docker, Ollama and Continue!
You can self-host a GitHub Copilot-style code assistant using Ollama, Docker, and the Continue VS Code plugin, with no data sent to Microsoft or OpenAI. A dedicated GPU (AMD or Nvidia) with substantial VRAM is effectively required for usable performance; CPU-only setups are too slow for real-time code suggestions.
Key Concepts
| Concept | Definition |
|---|---|
| Ollama | Local LLM runtime that can leverage AMD (ROCm) or Nvidia (CUDA) GPUs; runs in Docker |
| Continue | VS Code (and soon Neovim) plugin that provides Copilot-style inline completions and chat using a local Ollama endpoint |
| Open WebUI | Web-based chat interface for Ollama, self-hosted, similar UX to ChatGPT |
| ROCm | AMD's open compute platform required to run LLMs on AMD GPUs; not well-supported on Debian — use Ubuntu |
| Code models tested | CodeLlama 7B, CodeBooga 34B, StarCoder 3B |
Notes
Motivation
- GitHub Copilot sends code (including secrets) to Microsoft servers — a privacy/security concern
- Goal: context-aware autocomplete (e.g., auto-suggest `owner`, `group`, `permissions` in an Ansible task), not full AI code generation
- Self-hosting lets you share the instance across multiple devices or users
Hardware Tested
- Latte Panda (CPU-only): Intel i5-1340P (12-core Raptor Lake), 16 GB LPDDR5 RAM (non-upgradeable)
  - Idle: ~4.6 W; under LLM load: 40–60 W
  - No dedicated GPU, so CPU inference only
  - Comparable real-world alternative: mini PC (Beelink, Topton, Minisforum), ~$300–400
- Gaming PC (GPU): Ryzen 7 5800X3D + AMD Radeon 7900 XTX (24 GB VRAM)
  - Cost: ~€1,500 (summer 2023)
  - Idle: ~63 W; LLM load: 110–425 W, average ~130 W
  - 7900 XTX fully supported by ROCm
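
To put the power figures above into perspective, here is a rough sketch that turns idle and load wattage into daily energy use. The wattages come from the notes; the two hours per day under load is a purely hypothetical usage pattern, and electricity prices are left out because the source does not give one.

```python
# Power figures in watts, as reported in the notes above.
LATTE_PANDA = {"idle_w": 4.6, "load_w": 60.0}  # load: upper end of the 40-60 W range
GAMING_PC = {"idle_w": 63.0, "load_w": 130.0}  # load: reported average under LLM use


def daily_kwh(idle_w: float, load_w: float, load_hours: float) -> float:
    """Energy for one 24-hour day: `load_hours` under load, the rest idle."""
    if not 0 <= load_hours <= 24:
        raise ValueError("load_hours must be between 0 and 24")
    watt_hours = load_w * load_hours + idle_w * (24 - load_hours)
    return watt_hours / 1000.0


if __name__ == "__main__":
    HOURS_UNDER_LOAD = 2.0  # hypothetical usage pattern, not from the source
    for name, machine in (("Latte Panda", LATTE_PANDA), ("Gaming PC", GAMING_PC)):
        kwh = daily_kwh(machine["idle_w"], machine["load_w"], HOURS_UNDER_LOAD)
        print(f"{name}: {kwh:.2f} kWh/day")
```

With these assumptions most of the gaming PC's daily consumption comes from idling, so the idle figure matters more than the peak if the machine runs around the clock.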
Software Stack
- **OS**: Ubuntu Server 22.04 (Debian dropped due to poor ROCm support)
- **AMD driver**: `amdgpu-install` script with ROCm from AMD's website
- **Runtime**: Docker + Docker Compose
- **Models**: pulled via Open WebUI browser interface
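
The notes pull models through the Open WebUI browser interface. As an alternative, the sketch below builds a request against Ollama's HTTP API instead. It assumes Ollama is reachable at `http://localhost:11434` and that the model tag `codellama:7b` exists in the Ollama library; neither is confirmed by the source. Recent Ollama releases document the request field as `model`, while older ones used `name`, so check the API docs for the installed version.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumption: Ollama on its default port


def build_pull_request(model: str, base_url: str = OLLAMA_URL) -> urllib.request.Request:
    """Build (but do not send) a POST request for Ollama's /api/pull endpoint."""
    # stream=False asks for a single JSON reply instead of progress updates.
    body = json.dumps({"model": model, "stream": False}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/pull",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def pull_model(model: str, base_url: str = OLLAMA_URL) -> dict:
    """Send the pull request and return the decoded JSON reply."""
    with urllib.request.urlopen(build_pull_request(model, base_url)) as resp:
        return json.load(resp)
```

Calling `pull_model("codellama:7b")` against a running server blocks until the download finishes, which can take a while for larger models.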
Docker Compose Setup
- Two services: `ollama` (ROCm image) + `open-webui`
- Open WebUI on port `8080`; Ollama API on port `11434`
- GPU passthrough via device mounts: `/dev/kfd` and `/dev/dri`
- Local directories mounted for model and settings persistence
- For CPU-only (Latte Panda): use standard `ollama` image, remove GPU device mounts
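
The setup described above can be sketched as a small Python helper that generates the Compose definition for either variant. This is not the author's actual file: the image tags, the host directories (`./ollama`, `./open-webui`), the container paths, and the `OLLAMA_BASE_URL` variable are assumptions based on the projects' commonly documented defaults. Only the ports, the two services, and the `/dev/kfd` and `/dev/dri` device mounts come from the notes.

```python
import json


def build_compose(gpu: bool = True) -> dict:
    """Return a Compose definition for Ollama plus Open WebUI as a plain dict."""
    ollama = {
        # Assumed tags: the notes only say "ROCm image" vs. "standard image".
        "image": "ollama/ollama:rocm" if gpu else "ollama/ollama",
        "ports": ["11434:11434"],               # Ollama API
        "volumes": ["./ollama:/root/.ollama"],  # model storage (assumed paths)
        "restart": "unless-stopped",
    }
    if gpu:
        # AMD GPU passthrough via device mounts, as described in the notes.
        ollama["devices"] = ["/dev/kfd:/dev/kfd", "/dev/dri:/dev/dri"]

    open_webui = {
        "image": "ghcr.io/open-webui/open-webui:main",  # assumed tag
        "ports": ["8080:8080"],                         # web interface
        "environment": ["OLLAMA_BASE_URL=http://ollama:11434"],
        "volumes": ["./open-webui:/app/backend/data"],  # settings (assumed paths)
        "depends_on": ["ollama"],
        "restart": "unless-stopped",
    }
    return {"services": {"ollama": ollama, "open-webui": open_webui}}


if __name__ == "__main__":
    # JSON is a subset of YAML, so the output can be saved as a Compose file.
    print(json.dumps(build_compose(gpu=True), indent=2))
```

Saving the output as `compose.json` and running `docker compose -f compose.json up -d` should start both services; calling `build_compose(gpu=False)` yields the CPU-only variant without the device mounts.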
Continue Plugin Configuration
- Install from VS Code marketplace
- Edit JSON config: set Ollama URL + separate models for **chat** (larger, e.g. 34B/70B) and **autocomplete** (lighter, e.g. 7B or 3B)
- Multiple chat models can be specified simultaneously
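
As a rough illustration of that split, the sketch below generates a Continue configuration with two chat models and one autocomplete model. The key names (`models`, `tabAutocompleteModel`, `provider`, `apiBase`) follow Continue's JSON config format as it existed around the time of these notes and may differ in newer releases; the model tags and the server address are assumptions, not values from the source.

```python
import json

OLLAMA_API = "http://localhost:11434"  # assumption: replace with the server's address


def ollama_model(title: str, model: str, api_base: str = OLLAMA_API) -> dict:
    """One model entry pointing Continue at an Ollama endpoint."""
    return {"title": title, "provider": "ollama", "model": model, "apiBase": api_base}


def build_continue_config(chat_models, autocomplete_model) -> dict:
    """chat_models: list of (title, tag) pairs; autocomplete_model: one (title, tag) pair."""
    return {
        "models": [ollama_model(title, tag) for title, tag in chat_models],
        "tabAutocompleteModel": ollama_model(*autocomplete_model),
    }


if __name__ == "__main__":
    config = build_continue_config(
        # Heavier models for chat; tags are assumed, check the Ollama library.
        chat_models=[("CodeBooga 34B", "codebooga:34b"), ("CodeLlama 7B", "codellama:7b")],
        # Lighter model for autocomplete, where latency matters most.
        autocomplete_model=("CodeLlama 7B (autocomplete)", "codellama:7b"),
    )
    print(json.dumps(config, indent=2))
```

The printed JSON is meant to be merged into the existing Continue config by hand rather than written over it, since the real file usually carries other settings.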
Model Performance (Gaming PC / GPU)
- **CodeBooga 34B**: best suggestion quality, slightly slower autocomplete, needs ~20 GB VRAM
- **CodeLlama 7B**: slightly off on file-type detection but good suggestions, faster
- **StarCoder 3B**: fast but poor quality — hallucinations, malformed output
- Power draw was ~130W average regardless of model size (7B vs 34B)
- Both the 7B and 34B models gave sensible Python suggestions in limited testing
Performance (Latte Panda / CPU-only)
- CodeBooga 34B: **unusable**; it requires 20 GB of RAM and only 16 GB is available
- CodeLlama 7B: works but text generation is very slow
- StarCoder 3B: marginally faster, but output quality collapsed after the first suggestion
- Autocomplete too slow and unreliable to be practical
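
The memory limit behind these results can be checked with back-of-the-envelope arithmetic. The sketch below assumes 4-bit quantized weights, which is an assumption and not something the source states, and estimates the size of the weights alone. The ~20 GB requirement for the 34B model and the 16 GB and 24 GB capacities are taken from the notes.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float = 4.0) -> float:
    """Approximate size of the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8.0


def fits(required_gb: float, available_gb: float) -> bool:
    """True if the reported requirement fits into the available memory."""
    return required_gb <= available_gb


if __name__ == "__main__":
    # Weights only; context and runtime overhead come on top of this.
    for name, size_b in (("34B", 34), ("7B", 7), ("3B", 3)):
        print(f"{name}: ~{weight_memory_gb(size_b):.1f} GB of weights at 4 bits")

    REQUIRED_34B_GB = 20  # requirement reported in the notes
    print("34B on Latte Panda (16 GB RAM):", fits(REQUIRED_34B_GB, 16))
    print("34B on 7900 XTX (24 GB VRAM):", fits(REQUIRED_34B_GB, 24))
```

Under the 4-bit assumption the 34B weights alone come to about 17 GB, which is in the same range as the reported ~20 GB once overhead is added, and already above the Latte Panda's 16 GB.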
Neovim Status
- `model.nvim`, `gen.nvim`: support custom prompts/macros but not inline autocomplete
- `llm.nvim`: does autocomplete but is slow even on 3B models; poor output quality
- Continue developers have a Neovim extension in progress
Actionable Takeaways
- Use **Ubuntu** (not Debian) if running AMD GPU with ROCm
- Use the ROCm Docker image for Ollama on AMD; standard image for CPU-only
- Mount `/dev/kfd` and `/dev/dri` in Docker Compose to expose AMD GPU to Ollama
- Set a **lightweight model (7B or 3B) for autocomplete** and a heavier model for chat in Continue's config
- Don't bother with CPU-only setups for real-time code suggestions — GPU is effectively required
- If you already own a gaming/workstation PC with a high-VRAM GPU, this can replace paid SaaS subscriptions
Quotes Worth Keeping
What I want from a quote-unquote AI code assistant is more intelligent and more context-aware auto-suggestions… I would have typed those anyway, but why do that if you can have the machine do it for you.
The fact that you can run a large language model… at your own house using free and open source software and consumer hardware — that's amazing. But at the same time it basically needs a high-end graphics card to work well.