Host Your Own AI Code Assistant with Docker, Ollama and Continue!

Wolfgang's Channel · 2026-05-21

TL;DR

You can self-host a GitHub Copilot-style code assistant using Ollama, Docker, and the Continue VS Code plugin, with no data sent to Microsoft or OpenAI. A dedicated GPU (AMD or Nvidia) with substantial VRAM is effectively required for usable performance; CPU-only setups are too slow for real-time code suggestions.

Key Concepts

Ollama

Local LLM runtime that can leverage AMD (ROCm) or Nvidia (CUDA) GPUs; runs in Docker

Continue

VS Code (and soon Neovim) plugin that provides Copilot-style inline completions and chat using a local Ollama endpoint

Open WebUI

Web-based chat interface for Ollama, self-hosted, similar UX to ChatGPT

ROCm

AMD's open compute platform required to run LLMs on AMD GPUs; not well-supported on Debian — use Ubuntu

Code models tested

CodeLlama 7B, CodeBooga 34B, StarCoder 3B

Notes

§Motivation

GitHub Copilot sends code (including secrets) to Microsoft servers — a privacy/security concern
Goal: context-aware autocomplete (e.g., auto-suggest owner, group, permissions in an Ansible task), not full AI code generation
Self-hosting lets you share the instance across multiple devices or users

§Hardware Tested

Latte Panda (CPU-only): Intel i5-1340P (12-core Raptor Lake), 16 GB LPDDR5 RAM (non-upgradeable)
Idle: ~4.6W; under LLM load: 40–60W
No dedicated GPU; CPU inference only
Comparable real-world alternative: Mini PC (Beelink, Topton, Minisforum) ~$300–400
Gaming PC (GPU): Ryzen 7 5800X3D + AMD Radeon RX 7900 XTX (24 GB VRAM)
Cost: ~€1,500 (summer 2023)
Idle: ~63W; LLM load: 110–425W, average ~130W
7900 XTX fully supported by ROCm

§Software Stack

OS: Ubuntu Server 22.04 (Debian dropped due to poor ROCm support)
AMD driver: amdgpu-install script with ROCm from AMD's website
Runtime: Docker + Docker Compose
Models: pulled via Open WebUI browser interface

§Docker Compose Setup

Two services: ollama (ROCm image) + open-webui
Open WebUI on port 8080; Ollama API on port 11434
GPU passthrough via device mounts: /dev/kfd and /dev/dri
Local directories mounted for model and settings persistence
For CPU-only (Latte Panda): use standard ollama image, remove GPU device mounts
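
The two-service layout above can be sketched in Python as a dictionary that is printed as JSON. Compose files are YAML, and JSON is a subset of YAML, so the output should be usable as a Compose file. The image tags, the container paths (/root/.ollama, /app/backend/data), the OLLAMA_BASE_URL variable, and the local directory names are assumptions that are not stated in the video notes; check them against the Ollama and Open WebUI documentation before relying on them.

```python
import json

# Assumed image tags; the notes only say "ROCm image" and "standard image".
OLLAMA_ROCM_IMAGE = "ollama/ollama:rocm"
OLLAMA_CPU_IMAGE = "ollama/ollama"
OPEN_WEBUI_IMAGE = "ghcr.io/open-webui/open-webui:main"


def build_compose(use_amd_gpu: bool = True) -> dict:
    """Return a Compose project with an ollama and an open-webui service."""
    ollama = {
        "image": OLLAMA_ROCM_IMAGE if use_amd_gpu else OLLAMA_CPU_IMAGE,
        "ports": ["11434:11434"],              # Ollama API
        "volumes": ["./ollama:/root/.ollama"],  # model persistence (assumed path)
        "restart": "unless-stopped",
    }
    if use_amd_gpu:
        # GPU passthrough for ROCm, as described in the notes.
        ollama["devices"] = ["/dev/kfd:/dev/kfd", "/dev/dri:/dev/dri"]

    open_webui = {
        "image": OPEN_WEBUI_IMAGE,
        "ports": ["8080:8080"],                 # web interface
        # Assumed variable name telling Open WebUI where Ollama lives.
        "environment": ["OLLAMA_BASE_URL=http://ollama:11434"],
        "volumes": ["./open-webui:/app/backend/data"],  # settings (assumed path)
        "depends_on": ["ollama"],
        "restart": "unless-stopped",
    }
    return {"services": {"ollama": ollama, "open-webui": open_webui}}


# AMD GPU variant; pass use_amd_gpu=False for the CPU-only machine.
print(json.dumps(build_compose(use_amd_gpu=True), indent=2))
```

Saving the output as compose.json and running `docker compose -f compose.json up -d` should start both containers. The CPU-only variant differs only in the image name and the missing device mounts.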

§Continue Plugin Configuration

Install from VS Code marketplace
Edit JSON config: set Ollama URL + separate models for chat (larger, e.g. 34B/70B) and autocomplete (lighter, e.g. 7B or 3B)
Multiple chat models can be specified simultaneously
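
A rough sketch of what that JSON config might contain, generated from Python so the chat and autocomplete models are easy to swap. The key names (models, tabAutocompleteModel, provider, apiBase) follow Continue's config.json layout as I understand it and may differ between plugin versions. The host name gpu-box.local and the model tags are hypothetical placeholders.

```python
import json


def build_continue_config(ollama_url: str, chat_models: list[str],
                          autocomplete_model: str) -> dict:
    """Build a Continue-style config: several chat models, one autocomplete model."""
    return {
        "models": [
            {
                "title": f"{name} (chat)",
                "provider": "ollama",
                "model": name,
                "apiBase": ollama_url,
            }
            for name in chat_models
        ],
        "tabAutocompleteModel": {
            "title": f"{autocomplete_model} (autocomplete)",
            "provider": "ollama",
            "model": autocomplete_model,
            "apiBase": ollama_url,
        },
    }


config = build_continue_config(
    ollama_url="http://gpu-box.local:11434",         # hypothetical host
    chat_models=["codebooga:34b", "codellama:7b"],   # assumed Ollama tags
    autocomplete_model="codellama:7b",               # lighter model for completions
)
print(json.dumps(config, indent=2))
```

The printed JSON can be merged into Continue's config file by hand. Keeping the autocomplete model small matters because it is queried on nearly every keystroke, while chat requests are occasional.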

§Model Performance (Gaming PC / GPU)

CodeBooga 34B: best suggestion quality, slightly slower autocomplete, needs ~20 GB VRAM
CodeLlama 7B: slightly off on file-type detection but good suggestions, faster
StarCoder 3B: fast but poor quality — hallucinations, malformed output
Power draw was ~130W average regardless of model size (7B vs 34B)
CodeLlama 7B and CodeBooga 34B both gave sensible Python suggestions in limited testing
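
The ~20 GB figure for the 34B model is consistent with a back-of-the-envelope estimate, sketched below. It assumes roughly 4.5 bits per weight (typical of 4-bit quantized model files) and a flat 1 GB allowance for context and runtime overhead. Neither number comes from the video, and actual usage depends on the quantization and context length.

```python
def estimated_weights_gb(params_billions: float, bits_per_weight: float = 4.5) -> float:
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_weight / 8


def fits(params_billions: float, memory_gb: float,
         bits_per_weight: float = 4.5, overhead_gb: float = 1.0) -> bool:
    """True if weights plus an assumed fixed overhead fit in the given memory."""
    needed = estimated_weights_gb(params_billions, bits_per_weight) + overhead_gb
    return needed <= memory_gb


# Model sizes and memory capacities taken from the notes.
for name, size in [("StarCoder 3B", 3), ("CodeLlama 7B", 7), ("CodeBooga 34B", 34)]:
    print(f"{name}: ~{estimated_weights_gb(size):.1f} GB weights, "
          f"fits 24 GB VRAM: {fits(size, 24)}, fits 16 GB RAM: {fits(size, 16)}")
```

Under these assumptions the 34B model lands at roughly 19 GB of weights, which fits the 24 GB card but not a 16 GB machine.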

§Performance (Latte Panda / CPU-only)

CodeBooga 34B: unusable; requires ~20 GB RAM, only 16 GB available
CodeLlama 7B: works but text generation is very slow
StarCoder 3B: marginally faster, but output quality collapsed after the first suggestion
Autocomplete too slow and unreliable to be practical
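
To put a number on "too slow", one option is to compute tokens per second from the timing fields Ollama returns with a non-streaming /api/generate call (eval_count, and eval_duration in nanoseconds). This is a minimal sketch: the sample values below are made up for illustration and are not measurements from the video.

```python
import json
import urllib.request


def tokens_per_second(response: dict) -> float:
    """Generated tokens divided by generation time (eval_duration is in ns)."""
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def generate(base_url: str, model: str, prompt: str, timeout: float = 300) -> dict:
    """POST a non-streaming request to Ollama and return the parsed JSON reply."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        f"{base_url}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return json.load(reply)


# Illustrative values only: 50 tokens generated in 2 seconds.
sample = {"eval_count": 50, "eval_duration": 2_000_000_000}
print(f"{tokens_per_second(sample):.1f} tokens/s")
```

Against a running instance, something like `tokens_per_second(generate("http://localhost:11434", "codellama:7b", "def fib(n):"))` would give a comparable figure for each machine and model (the URL and model tag are placeholders).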

§Neovim Status

model.nvim, gen.nvim: support custom prompts/macros but not inline autocomplete
llm.nvim: does autocomplete but slow even on 3B models; poor output quality
Continue developers have a Neovim extension in progress

Actionable Takeaways

1. Use Ubuntu (not Debian) if running an AMD GPU with ROCm
2. Use the ROCm Docker image for Ollama on AMD; standard image for CPU-only
3. Mount /dev/kfd and /dev/dri in Docker Compose to expose the AMD GPU to Ollama
4. Set a lightweight model (7B or 3B) for autocomplete and a heavier model for chat in Continue's config
5. Don't bother with CPU-only setups for real-time code suggestions; a GPU is effectively required
6. If you already own a gaming/workstation PC with a high-VRAM GPU, this can replace paid SaaS subscriptions

Quotes Worth Keeping

“What I want from a quote-unquote AI code assistant is more intelligent and more context-aware auto-suggestions… I would have typed those anyway, but why do that if you can have the machine do it for you.”

“The fact that you can run a large language model… at your own house using free and open source software and consumer hardware, that's amazing. But at the same time it basically needs a high-end graphics card to work well.”
