Host Your Own AI Code Assistant with Docker, Ollama and Continue!

Wolfgang's Channel · 2026-05-21 · via captions

You can self-host a GitHub Copilot-style code assistant using Ollama, Docker, and the Continue VS Code plugin, with no data sent to Microsoft or OpenAI. A dedicated GPU (AMD or Nvidia) with substantial VRAM is effectively required for usable performance; CPU-only setups are too slow for real-time code suggestions.

Key Concepts

| Concept | Definition |
| --- | --- |
| Ollama | Local LLM runtime that can leverage AMD (ROCm) or Nvidia (CUDA) GPUs; runs in Docker |
| Continue | VS Code (and soon Neovim) plugin that provides Copilot-style inline completions and chat using a local Ollama endpoint |
| Open WebUI | Self-hosted, web-based chat interface for Ollama with a UX similar to ChatGPT |
| ROCm | AMD's open compute platform, required to run LLMs on AMD GPUs; not well supported on Debian, so use Ubuntu |
| Code models tested | CodeLlama 7B, Codebooga 34B, StarCoder 3B |

Notes

Motivation

  • GitHub Copilot sends code (including secrets) to Microsoft servers — a privacy/security concern
  • Goal: context-aware autocomplete (e.g., auto-suggest `owner`, `group`, `permissions` in an Ansible task), not full AI code generation
  • Self-hosting lets you share the instance across multiple devices or users

Hardware Tested

  • **Latte Panda (CPU-only)**
    • Intel i5-1340P (12-core Raptor Lake), 16 GB LPDDR5 RAM (non-upgradeable)
    • Idle: ~4.6 W; under LLM load: 40–60 W
    • No dedicated GPU, so CPU inference only
    • Comparable real-world alternative: Mini PC (Beelink, Topton, Minisforum) ~$300–400
  • **Gaming PC (GPU)**
    • Ryzen 7 5800X3D + AMD Radeon RX 7900 XTX (24 GB VRAM)
    • Cost: ~€1,500 (summer 2023)
    • Idle: ~63 W; under LLM load: 110–425 W, average ~130 W
    • 7900 XTX fully supported by ROCm
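
To put the idle figures above in perspective (~4.6 W for the CPU-only board, ~63 W for the GPU machine), here is a rough sketch of what an always-on server would consume per year. The electricity price is a placeholder assumption, not a figure from the video; substitute your own tariff.

```python
HOURS_PER_YEAR = 24 * 365

# Assumption: illustrative electricity price, NOT taken from the source.
PRICE_EUR_PER_KWH = 0.30


def annual_kwh(watts: float, hours: float = HOURS_PER_YEAR) -> float:
    """Energy in kWh for a constant draw of `watts` over `hours`."""
    return watts * hours / 1000


def annual_cost(watts: float, price_per_kwh: float) -> float:
    """Yearly cost of a constant draw of `watts` at the given price."""
    return annual_kwh(watts) * price_per_kwh


# Idle power figures as reported in the notes above.
for name, idle_watts in [("CPU-only board", 4.6), ("GPU machine", 63.0)]:
    kwh = annual_kwh(idle_watts)
    cost = annual_cost(idle_watts, PRICE_EUR_PER_KWH)
    print(f"{name}: {kwh:.1f} kWh/year idle, about EUR {cost:.0f} at the assumed price")
```

This only covers idle draw; time spent under LLM load would add to the total.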

Software Stack

  • **OS**: Ubuntu Server 22.04 (Debian dropped due to poor ROCm support)
  • **AMD driver**: `amdgpu-install` script with ROCm from AMD's website
  • **Runtime**: Docker + Docker Compose
  • **Models**: pulled via Open WebUI browser interface

Docker Compose Setup

  • Two services: `ollama` (ROCm image) + `open-webui`
  • Open WebUI on port `8080`; Ollama API on port `11434`
  • GPU passthrough via device mounts: `/dev/kfd` and `/dev/dri`
  • Local directories mounted for model and settings persistence
  • For CPU-only (Latte Panda): use standard `ollama` image, remove GPU device mounts
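
The layout described above can be sketched as a small generator that emits a Compose definition. Docker Compose files are YAML, and JSON is valid YAML, so the printed output can be saved and passed to `docker compose -f`. The image tags (`ollama/ollama:rocm`, `ghcr.io/open-webui/open-webui:main`), the container paths, the `OLLAMA_BASE_URL` variable, and the local directory names are assumptions based on the projects' public documentation, not details confirmed by the video.

```python
import json


def build_compose(use_rocm: bool = True) -> dict:
    """Return a Compose definition with an `ollama` and an `open-webui` service."""
    ollama = {
        # Assumed image tags: ROCm build for AMD GPUs, default build for CPU-only.
        "image": "ollama/ollama:rocm" if use_rocm else "ollama/ollama",
        "ports": ["11434:11434"],               # Ollama API
        "volumes": ["./ollama:/root/.ollama"],  # persist downloaded models
        "restart": "unless-stopped",
    }
    if use_rocm:
        # GPU passthrough for AMD: expose the compute and render devices.
        ollama["devices"] = ["/dev/kfd:/dev/kfd", "/dev/dri:/dev/dri"]

    webui = {
        "image": "ghcr.io/open-webui/open-webui:main",  # assumed image tag
        "ports": ["8080:8080"],                         # web interface
        "environment": ["OLLAMA_BASE_URL=http://ollama:11434"],
        "volumes": ["./open-webui:/app/backend/data"],  # persist settings
        "depends_on": ["ollama"],
        "restart": "unless-stopped",
    }
    return {"services": {"ollama": ollama, "open-webui": webui}}


# Print the AMD/ROCm variant; pass use_rocm=False for a CPU-only host.
print(json.dumps(build_compose(use_rocm=True), indent=2))
```

Keeping the GPU-specific parts behind one flag mirrors the note above: the CPU-only variant differs only in the image tag and the missing device mounts.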

Continue Plugin Configuration

  • Install from VS Code marketplace
  • Edit JSON config: set Ollama URL + separate models for **chat** (larger, e.g. 34B/70B) and **autocomplete** (lighter, e.g. 7B or 3B)
  • Multiple chat models can be specified simultaneously
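
As a rough illustration of the split between chat and autocomplete models, the snippet below builds a config of the shape Continue's JSON config used around the time of the video (`models` for chat, `tabAutocompleteModel` for inline completions). The key names, the model tags, and the server address are assumptions; check them against the Continue version you install, since newer releases may use a different config format.

```python
import json


def build_continue_config(ollama_url: str, chat_models: list[str], autocomplete_model: str) -> dict:
    """Build a Continue-style config pointing every model at one Ollama server."""

    def entry(model: str) -> dict:
        return {
            "title": model,
            "provider": "ollama",
            "model": model,
            "apiBase": ollama_url,
        }

    return {
        # Several chat models can be listed and switched between in the UI.
        "models": [entry(m) for m in chat_models],
        # A lighter model keeps inline suggestions responsive.
        "tabAutocompleteModel": entry(autocomplete_model),
    }


config = build_continue_config(
    ollama_url="http://192.168.1.50:11434",          # hypothetical server address
    chat_models=["codebooga:34b", "codellama:7b"],   # assumed Ollama model tags
    autocomplete_model="codellama:7b",
)
print(json.dumps(config, indent=2))
```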

Model Performance (Gaming PC / GPU)

  • **Codebooga 34B**: best suggestion quality, slightly slower autocomplete, needs ~20 GB VRAM
  • **CodeLlama 7B**: slightly off on file-type detection but good suggestions, faster
  • **StarCoder 3B**: fast but poor quality, with hallucinations and malformed output
  • Power draw was ~130 W average regardless of model size (7B vs 34B)
  • Both Codebooga 34B and CodeLlama 7B gave sensible Python suggestions in limited testing
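
The speed comparisons above are qualitative. One way to put a number on them, sketched here under the assumption that the server exposes Ollama's `/api/generate` endpoint and returns its `eval_count` and `eval_duration` (nanoseconds) fields, is to compute tokens per second for a fixed prompt. The server address and prompt in the usage note are hypothetical.

```python
import json
import urllib.request


def build_generate_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Prepare a non-streaming POST to Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def tokens_per_second(response: dict) -> float:
    """Generation speed from the response's eval_count and eval_duration (ns)."""
    seconds = response["eval_duration"] / 1e9
    return response["eval_count"] / seconds


def measure(base_url: str, model: str, prompt: str, timeout: float = 300.0) -> float:
    """Send one prompt to a running server and return tokens per second."""
    request = build_generate_request(base_url, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return tokens_per_second(json.load(resp))
```

For example, `measure("http://192.168.1.50:11434", "codellama:7b", "Write a Python function that reverses a string.")` would return one sample; repeating it per model gives a like-for-like comparison on the same hardware.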

Performance (Latte Panda / CPU-only)

  • Codebooga 34B: **unusable**, since it requires ~20 GB RAM and only 16 GB is available
  • CodeLlama 7B: works but text generation is very slow
  • StarCoder 3B: marginally faster, but output quality collapsed after the first suggestion
  • Autocomplete too slow and unreliable to be practical
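
A back-of-the-envelope estimate helps explain why the 34B model cannot run in 16 GB. The sketch below assumes roughly 4-bit quantized weights, which the source does not state, and counts the weights only; context cache and runtime overhead come on top, which is consistent with the ~20 GB figure quoted above.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GB (10^9 bytes)."""
    return params_billion * bits_per_weight / 8


AVAILABLE_RAM_GB = 16   # RAM of the CPU-only board, from the notes
BITS_PER_WEIGHT = 4     # assumption: 4-bit quantization, not stated in the source

for name, params_billion in [("StarCoder 3B", 3), ("CodeLlama 7B", 7), ("Codebooga 34B", 34)]:
    size_gb = weight_memory_gb(params_billion, BITS_PER_WEIGHT)
    verdict = "fits" if size_gb < AVAILABLE_RAM_GB else "does not fit"
    print(f"{name}: ~{size_gb:.1f} GB of weights, {verdict} in {AVAILABLE_RAM_GB} GB")
```

Under this assumption the 34B weights alone come to about 17 GB, already above the 16 GB installed, while the 7B and 3B models fit with room to spare but are then limited by CPU inference speed.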

Neovim Status

  • `model.nvim`, `gen.nvim`: support custom prompts/macros but not inline autocomplete
  • `llm.nvim`: does autocomplete but slow even on 3B models; poor output quality
  • Continue developers have a Neovim extension in progress

Actionable Takeaways

  1. Use **Ubuntu** (not Debian) if running AMD GPU with ROCm
  2. Use the ROCm Docker image for Ollama on AMD; standard image for CPU-only
  3. Mount `/dev/kfd` and `/dev/dri` in Docker Compose to expose AMD GPU to Ollama
  4. Set a **lightweight model (7B or 3B) for autocomplete** and a heavier model for chat in Continue's config
  5. Don't bother with CPU-only setups for real-time code suggestions — GPU is effectively required
  6. If you already own a gaming/workstation PC with a high-VRAM GPU, this can replace paid SaaS subscriptions

Quotes Worth Keeping

What I want from a quote-unquote AI code assistant is more intelligent and more context-aware auto-suggestions… I would have typed those anyway, but why do that if you can have the machine do it for you.
The fact that you can run a large language model… at your own house using free and open source software and consumer hardware — that's amazing. But at the same time it basically needs a high-end graphics card to work well.