No description

Find a file

Bob Parsons 3072f5d734 Raise context to 64k with flash-attn + q8 KV cache gemma4 supports 256k, but 16GB VRAM is the limit. Flash attention plus q8 KV cache halve KV memory so 65536 fits mostly on GPU (58% layers). 128k/256k still load but offload KV into system RAM and slow down. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>		2026-06-04 22:37:35 -05:00
.env.example	Raise context to 64k with flash-attn + q8 KV cache	2026-06-04 22:37:35 -05:00
.gitignore	Ollama + gemma4:26b-a4b GPU stack with auto-pull	2026-06-04 20:49:36 -05:00
compose.yaml	Raise context to 64k with flash-attn + q8 KV cache	2026-06-04 22:37:35 -05:00
README.md	Ollama + gemma4:26b-a4b GPU stack with auto-pull	2026-06-04 20:49:36 -05:00

README.md

ollama-gemma4

Dockerized Ollama server with GPU passthrough, auto-pulling gemma4:26b-a4b-it-q4_K_M on first start. Built for opencode (or any OpenAI-compatible client) on an RTX 5060 Ti (16 GB).

Requirements

Docker + Compose plugin
NVIDIA driver + nvidia-container-toolkit (nvidia runtime registered with Docker)

Usage

docker compose up -d        # starts server, then pulls the model (~18 GB) once
docker compose logs -f model-pull   # watch the first-time download

The model-pull service exits 0 when the pull finishes — that's expected, not a crash. The ollama server keeps running.

docker compose ps           # ollama should stay "running"/healthy
docker compose down         # stop (model stays cached in the named volume)

The model cache lives in the ollama named volume, so it survives down/up.

Verify it works

curl http://localhost:11434/api/tags                      # model listed?
curl http://localhost:11434/api/generate -d '{
  "model": "gemma4:26b-a4b-it-q4_K_M",
  "prompt": "Say hi in one word.",
  "stream": false
}'

Confirm it landed on the GPU (PROCESSOR column should read 100% GPU, or close):

docker compose exec ollama ollama ps

opencode

Ollama exposes an OpenAI-compatible endpoint at http://localhost:11434/v1. Point opencode at it via opencode.json (project or ~/.config/opencode/):

{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "ollama": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "Ollama (local)",
      "options": { "baseURL": "http://localhost:11434/v1" },
      "models": {
        "gemma4:26b-a4b-it-q4_K_M": { "name": "Gemma4 26B-A4B (local)" }
      }
    }
  }
}

Then opencode and pick ollama/gemma4:26b-a4b-it-q4_K_M.

Config

Override defaults by copying .env.example to .env (e.g. swap OLLAMA_MODEL to try a different tag).