Self-Hosting Livepeer’s LLM Pipeline: Deploying an Ollama-Based GPU Runner for AI Orchestrators
Written by Mike Zupper
Member of Cloud SPE, Livepeer AI
Introduction
Livepeer is rapidly evolving into a real-time AI video and machine intelligence network.
Beyond video transcoding, orchestrators can now serve AI workloads such as:
- Image generation
- Image-to-video
- Text-to-speech
- Audio-to-text
- Large Language Model (LLM) inference
- and more; read about it in the Livepeer AI Docs
To support this shift, the Cloud SPE (Special Purpose Entity within Livepeer) has built a custom Ollama-based AI Runner optimized for running LLM inference on GPUs with as little as 8GB of VRAM.
This post walks you through exactly how to deploy that runner using Docker, configure a Livepeer AI Orchestrator to use it, and verify that everything is working correctly — with detailed logs, examples, and explanations.
If you get stuck, join the Livepeer Discord and visit the #orchestrating channel:
👉 https://discord.gg/xpKATpA7
You can always ping @mike_zoop for help.
Why We Built an Ollama-Based Runner (Cloud SPE Motivations)
The official Livepeer docs recommend GPUs with 16GB+ VRAM for AI inference — and for good reason: diffusion models and advanced pipelines often require huge memory footprints.
However, LLMs (especially in quantized formats) can run very efficiently on 8GB, 10GB, and 12GB cards.
Cloud SPE created this custom Ollama-based runner because:
✔ Many orchestrators already own GPUs like the GTX 1080, GTX 1070 Ti, RTX 2080, or RTX 3060
These cards may be idle from legacy transcoding workloads — we want to put them back to work.
✔ LLM jobs do not require massive VRAM
Ollama supports quantization and streaming inference, making 8GB GPUs perfectly viable.
✔ Lower barrier to entry → more decentralization
The more GPUs that can join the network, the healthier and more globally distributed Livepeer becomes.
✔ High-VRAM GPUs (4090, 5090, etc.) can run more complex models
Operators with modern cards gain additional earning opportunities for heavy models and emerging video-AI pipelines.
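Some back-of-the-envelope arithmetic shows why quantization makes 8GB cards viable. The sketch below estimates VRAM from parameter count and bits per weight; the flat overhead allowance is an assumption for illustration (real usage also depends on KV cache size, context length, and CUDA runtime overhead):

```python
# Rough VRAM estimate for a quantized LLM (illustrative only; real usage
# also includes KV cache, CUDA context, and runtime overhead).
def estimated_vram_gb(params_billion: float, bits_per_weight: int,
                      overhead_gb: float = 1.5) -> float:
    """Model weights in GB plus a flat overhead allowance (assumed)."""
    weights_gb = params_billion * 1e9 * bits_per_weight / 8 / 1e9
    return round(weights_gb + overhead_gb, 1)

# An 8B model at 4-bit quantization fits comfortably in 8GB:
print(estimated_vram_gb(8, 4))    # 5.5
# The same model at fp16 would not:
print(estimated_vram_gb(8, 16))   # 17.5
```

This is roughly consistent with the ~5.4GB of VRAM shown by nvidia-smi later in this post for llama3.1:8b.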
Bottom line:
This runner is designed so more orchestrators can earn and more GPUs can be useful in Livepeer’s AI future.
Hardware & System Requirements
Your orchestrator node must meet the following minimum standards.
Minimum Requirements
- GPU: NVIDIA GTX 1080 or better (≥ 8GB VRAM)
- Driver: Latest NVIDIA drivers installed
- Docker: Installed + working
- NVIDIA Container Toolkit: Installed (enables CUDA inside Docker containers)
Livepeer’s official docs (https://docs.livepeer.org/ai/orchestrators/get-started) suggest 16GB VRAM, but Cloud SPE’s 8GB-compatible runner removes this requirement.
Recommended Hardware
- NVIDIA GTX 10-series or RTX 20-, 30-, or 40-series GPUs (reserve the 4090 and other large-VRAM cards for other Livepeer AI jobs)
- 8GB+ VRAM
- Fast NVMe storage
- CPU with ≥ 8 cores
- ≥ 32GB RAM
Lower latency and faster GPUs → more job wins.
Architecture Overview
Here’s the local architecture when running an AI Orchestrator and Ollama GPU runner on the same machine:
Livepeer LLM Flow (Simplified)
Client (Gateway)
|
v
AI Orchestrator
|
v
Ollama AI Runner (llm_runner)
|
v
Ollama
|
v
GPU
Components
| Component | Purpose |
|---|---|
| Ollama AI Runner (llm_runner) | Translates Livepeer LLM pipeline requests into Ollama API calls |
| Ollama Server | Loads and executes LLMs on your GPU |
| GPU | Executes inference kernels |
| AI Orchestrator | Receives jobs from the Livepeer network and routes them to the runner |
For most operators, the orchestrator and runner sit on the same box.
Advanced users can separate them using Livepeer Remote Workers — but this requires more networking and support (ask in Discord).
Deploying the Ollama-Based AI Runner
This section walks you through the full deployment.
Step 1 — Create the persistent Ollama model volume
This ensures your model stays downloaded after container restarts.
docker volume create ollama
Step 2 — Docker Compose Stack
Create a docker-compose.yml with the following stack:
services:
  ollama-ai-runner:
    image: tztcloud/livepeer-ollama-runner:0.1.1
    container_name: llm_runner
    restart: unless-stopped
    runtime: nvidia
    # Uncomment this port if you want to verify the service is up
    #ports:
    #  - 8000:8000
    environment:
      - RUST_LOG=info
      - OLLAMA_BASE_URL=http://ollama:11434

  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    restart: unless-stopped
    runtime: nvidia
    # Uncomment this port if you want to verify the service is up
    #ports:
    #  - 11434:11434
    volumes:
      - ollama:/root/.ollama
    environment:
      - OLLAMA_GPU_ENABLED=true
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
              driver: nvidia
              count: all

volumes:
  ollama:
    external: true
Step 3 — Start the stack
docker compose up -d
Step 4 — Download the model inside the Ollama container
Once the containers are running:
docker exec -it ollama ollama pull llama3.1:8b
This stores the model inside the ollama Docker volume.
Note:
Ollama model name (llama3.1:8b) and Livepeer model name (meta-llama/Meta-Llama-3.1-8B-Instruct) are different — but they represent the same model family.
This is expected.
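The runner performs this translation internally. As an illustration, the mapping can be thought of as a simple lookup table; only the llama3.1:8b pair is confirmed by the logs shown later in this post, so treat any other entries you add as assumptions about your own setup:

```python
# Hypothetical sketch of the Livepeer-ID -> Ollama-tag translation the
# runner performs. Only the llama3.1:8b pair is confirmed in this post.
MODEL_MAP = {
    "meta-llama/Meta-Llama-3.1-8B-Instruct": "llama3.1:8b",
}

def to_ollama_tag(livepeer_model_id: str) -> str:
    """Resolve a Livepeer model ID to the Ollama tag the runner invokes."""
    try:
        return MODEL_MAP[livepeer_model_id]
    except KeyError:
        raise ValueError(f"No Ollama mapping for {livepeer_model_id!r}")

print(to_ollama_tag("meta-llama/Meta-Llama-3.1-8B-Instruct"))  # llama3.1:8b
```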
Configuring the AI Orchestrator
You must update your aiModels.json to tell the orchestrator where your runner is located.
Step 5 — Edit aiModels.json
Add:
[
  {
    "pipeline": "llm",
    "model_id": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "warm": true,
    "price_per_unit": 0.18,
    "currency": "USD",
    "pixels_per_unit": 1000000,
    "url": "http://llm_runner:8000"
  }
]
Important details:
- pipeline: "llm" enables the LLM pipeline
- model_id must match the model Livepeer expects
- url uses the Docker service name llm_runner — containers share a network
- warm: true tells the orchestration layer to preload the model
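A malformed aiModels.json is one of the most common reasons orchestrators receive no jobs, so it is worth sanity-checking before restarting. A minimal sketch, assuming the field names from the example above (adjust the sample string to your actual file contents):

```python
import json

# Field names taken from the aiModels.json example in this post.
REQUIRED_KEYS = {"pipeline", "model_id", "warm", "price_per_unit",
                 "currency", "pixels_per_unit", "url"}

def validate_ai_models(raw: str) -> list:
    """Parse aiModels.json text and verify each entry has the needed keys."""
    models = json.loads(raw)
    for entry in models:
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            raise ValueError(f"{entry.get('model_id')}: missing {missing}")
        if entry["pipeline"] == "llm" and not entry["url"].startswith("http"):
            raise ValueError("LLM runner url must be an http(s) address")
    return models

sample = '''[{"pipeline": "llm",
  "model_id": "meta-llama/Meta-Llama-3.1-8B-Instruct",
  "warm": true, "price_per_unit": 0.18, "currency": "USD",
  "pixels_per_unit": 1000000, "url": "http://llm_runner:8000"}]'''
print(validate_ai_models(sample)[0]["model_id"])
```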
Step 6 — Register with the AI Service Registry
If you skip this step, you will not receive jobs.
Consult Livepeer docs or ask in Discord for the exact commands depending on your orchestrator setup.
Verifying the Deployment
Once everything is launched, you should see specific logs.
Runner Startup Logs (llm_runner)
INFO livepeer_ollama_runner: Starting livepeer-ollama-runner
INFO livepeer_ollama_runner: Ollama base URL: http://ollama:11434
INFO livepeer_ollama_runner: Bind address: 0.0.0.0:8000
INFO livepeer_ollama_runner: Server listening on 0.0.0.0:8000
This confirms:
- The runner reached the Ollama container
- It’s listening on port 8000
- No authentication required (local only)
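If you uncommented the ports mappings in docker-compose.yml, you can also confirm reachability from the host with a plain TCP check; the hostnames and ports below assume the compose file from Step 2:

```python
import socket

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# With the ports mappings uncommented in docker-compose.yml:
# print(port_open("localhost", 8000))    # llm_runner
# print(port_open("localhost", 11434))   # ollama
```

This only proves the socket is listening; the log lines above are the real confirmation that the runner reached Ollama.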
Ollama Startup Logs
You should see:
level=INFO msg="llama runner started in 1.59 seconds"
level=INFO msg="loaded runners" count=1
level=INFO msg="waiting for llama runner to start responding"
This means:
- GPU was detected
- Ollama successfully registered its internal inference runner
- The model is ready to load when requested
When You Receive a Job
When the Livepeer gateway assigns an LLM job to your Orchestrator, llm_runner will log:
INFO llm_handler{model=Some("meta-llama/Meta-Llama-3.1-8B-Instruct")}:
livepeer_ollama_runner: Received LLM request
INFO llm_handler{model=Some("meta-llama/Meta-Llama-3.1-8B-Instruct")}:
livepeer_ollama_runner: Processing request with model=llama3.1:8b, stream=false
This verifies:
- Livepeer job → orchestrator → runner → Ollama pipeline works
- The model mapping is correct
- You’re officially serving LLM inference on the network
GPU Verification (nvidia-smi)
Run:
nvidia-smi
When jobs execute, you should see Ollama consuming VRAM:
0 N/A N/A 516194 C /usr/bin/ollama 5364MiB
This confirms:
- GPU is exposed to Docker
- Ollama is executing kernels on the GPU
- The workload is actually running (not CPU fallback)
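If you want to script this check, the Ollama row of the nvidia-smi process table can be picked apart with a few lines of Python. This is an illustrative parser for the single row format shown above, not a general nvidia-smi client:

```python
# Extract Ollama's VRAM usage (in MiB) from one nvidia-smi process-table row.
def ollama_vram_mib(nvidia_smi_line: str):
    fields = nvidia_smi_line.split()
    if any("ollama" in f for f in fields):
        mem = fields[-1]                      # e.g. "5364MiB"
        return int(mem.removesuffix("MiB"))
    return None

line = "0   N/A  N/A    516194      C   /usr/bin/ollama    5364MiB"
print(ollama_vram_mib(line))  # 5364
```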
Verify AI Capabilities
If you visit https://tools.livepeer.cloud/ai/network-capabilities, you should see the "LLM" pipeline listed and your orchestrator shown as "Warm".
Optional: Using Remote Workers
Livepeer supports “AI Remote Workers” — allowing an orchestrator to run on one box and dispatch jobs to multiple remote GPU workers.
This is advanced and requires:
- Secure networking
- Correct registration
- Gateway reachability
- Consistent worker health monitoring
If you want to explore this:
👉 Join the Discord: https://discord.gg/xpKATpA7
Ask in #orchestrating, tag @mike_zoop
FAQ (Initial Version — Will Grow Over Time)
Do I need 16GB VRAM?
No.
Cloud SPE built this runner to support 8GB GPUs like the GTX 1080, GTX 1070 Ti, RTX 2060, etc.
Higher VRAM improves throughput, but 8GB works.
Do I need to run the Orchestrator and Runner on the same machine?
Not required, but strongly recommended for simplicity.
Why is my Orchestrator not receiving jobs?
Most common reasons:
- You did not register with the AI Service Registry
- aiModels.json misconfigured
- GPU too slow → not competitive for jobs
- Network issues
- Runner not reachable by its container name llm_runner
Why use Ollama instead of raw PyTorch or TensorRT?
Ollama provides:
- Simple Docker deployment
- Fast quantized models
- Low VRAM usage
- Clean API for the runner
- Massive model library
Final Notes
If you hit issues, join the community and ask questions:
👉 Livepeer Discord: https://discord.gg/xpKATpA7
Ask in #orchestrating and tag @mike_zoop
You can also read more about my work at:
Cloud SPE is committed to lowering the barrier to entry, increasing GPU participation, and expanding Livepeer into a resilient, decentralized infrastructure layer for open AI.