
Self-Hosting Livepeer’s LLM Pipeline: Deploying an Ollama-Based GPU Runner for AI Orchestrators

Livepeer AI Orchestrators can reuse existing NVIDIA 10/20 Series GPUs to provide LLM AI Inference to the Livepeer AI Network

Written by Mike Zupper
Member of Cloud SPE, Livepeer AI


Introduction

Livepeer is rapidly evolving into a real-time AI video and machine intelligence network.
Beyond video transcoding, orchestrators can now serve AI workloads such as:

  • Image generation
  • Image-to-video
  • Text-to-speech
  • Audio-to-text
  • Large Language Model (LLM) inference
  • and more: read about it in the Livepeer AI Docs

To support this shift, the Cloud SPE (Special Purpose Entity within Livepeer) has built a custom Ollama-based AI Runner optimized for running LLM inference on GPUs with as little as 8GB of VRAM.

This post walks you through exactly how to deploy that runner using Docker, configure a Livepeer AI Orchestrator to use it, and verify that everything is working correctly — with detailed logs, examples, and explanations.

If you get stuck, join the Livepeer Discord and visit the #orchestrating channel:
👉 https://discord.gg/xpKATpA7
You can always ping @mike_zoop for help.


Why We Built an Ollama-Based Runner (Cloud SPE Motivations)

The official Livepeer docs recommend GPUs with 16GB+ VRAM for AI inference — and for good reason: diffusion models and advanced pipelines often require huge memory footprints.

However, LLMs (especially in quantized formats) can run very efficiently on 8GB, 10GB, and 12GB cards.

Cloud SPE created this custom Ollama-based runner because:

✔ Many orchestrators already own GPUs like the GTX 1080, GTX 1070 Ti, RTX 2080, and RTX 3060

These cards may be idle from legacy transcoding workloads — we want to put them back to work.

✔ LLM jobs do not require massive VRAM

Ollama supports quantization and streaming inference, making 8GB GPUs perfectly viable.

✔ Lower barrier to entry → more decentralization

The more GPUs that can join the network, the healthier and more globally distributed Livepeer becomes.

✔ High-VRAM GPUs (4090, 5090, etc.) can run more complex models

Operators with modern cards gain additional earning opportunities for heavy models and emerging video-AI pipelines.

Bottom line:
This runner is designed so more orchestrators can earn and more GPUs can be useful in Livepeer’s AI future.


Hardware & System Requirements

Your orchestrator node must meet the following minimum standards.

Minimum Requirements

  • GPU: NVIDIA GTX 1080 or better (≥ 8GB VRAM)
  • Driver: Latest NVIDIA drivers installed
  • Docker: Installed + working
  • NVIDIA Container Toolkit: Installed (enables CUDA inside Docker containers)

Livepeer’s official docs (https://docs.livepeer.org/ai/orchestrators/get-started) suggest 16GB VRAM, but Cloud SPE’s 8GB-compatible runner removes this requirement.

Recommended Hardware

  • NVIDIA GTX 10-series or RTX 20-, 30-, or 40-series GPUs (save the 4090 or other large-VRAM cards for other Livepeer AI jobs)
  • 8GB+ VRAM
  • Fast NVMe storage
  • CPU with ≥ 8 cores
  • ≥ 32GB RAM

Lower latency and faster GPUs → more job wins.
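Before going further, it is worth confirming that Docker can actually see your GPU. A quick sanity check (the CUDA image tag below is just an example; any recent CUDA base image works):

```shell
# Run nvidia-smi inside a throwaway container.
# If the NVIDIA Container Toolkit is installed correctly, this prints
# the same GPU table you see on the host.
docker run --rm --runtime=nvidia --gpus all \
  nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
```

If this fails with a "could not select device driver" error, revisit your NVIDIA Container Toolkit installation before continuing.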


Architecture Overview

Here’s the local architecture when running an AI Orchestrator and Ollama GPU runner on the same machine:

Livepeer LLM Flow (Simplified)

Client (Gateway)
   |
   v
AI Orchestrator
   |
   v
Ollama AI Runner (llm_runner)
   |
   v
Ollama
   |
   v
GPU

Components

  • Ollama AI Runner (llm_runner): translates Livepeer LLM pipeline requests into Ollama API calls
  • Ollama Server: loads and executes LLMs on your GPU
  • GPU: executes inference kernels
  • AI Orchestrator: receives jobs from the Livepeer network and routes them to the runner

For most operators, the orchestrator and runner sit on the same box.
Advanced users can separate them using Livepeer Remote Workers — but this requires more networking and support (ask in Discord).


Deploying the Ollama-Based AI Runner

This section walks you through the full deployment.

Step 1 — Create the persistent Ollama model volume

This ensures your model stays downloaded after container restarts.

docker volume create ollama
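You can confirm the volume exists and see where Docker stores it on the host:

```shell
# Print the host path backing the "ollama" volume.
docker volume inspect ollama --format '{{ .Mountpoint }}'
```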

Step 2 — Docker Compose Stack

Create a docker-compose.yml with the following stack:

services:
  ollama-ai-runner:
    image: tztcloud/livepeer-ollama-runner:0.1.1
    container_name: llm_runner
    restart: unless-stopped
    runtime: nvidia
    #Uncomment this port if you want to verify the service is up 
    #ports:
    #  - 8000:8000
    environment:
      - RUST_LOG=info
      - OLLAMA_BASE_URL=http://ollama:11434
       
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    restart: unless-stopped
    runtime: nvidia
    #Uncomment this port if you want to verify the service is up 
    #ports:
    #  - 11434:11434
    volumes:
      - ollama:/root/.ollama
    environment:
      - OLLAMA_GPU_ENABLED=true
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
              driver: nvidia
              count: all

volumes:
  ollama:
    external: true

Step 3 — Start the stack

docker compose up -d
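Before pulling a model, check that both containers came up cleanly:

```shell
# Both services should show as "running".
docker compose ps

# Follow the runner's logs and watch it connect to Ollama.
docker logs -f llm_runner
```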

Step 4 — Download the model inside the Ollama container

Once the containers are running:

docker exec -it ollama ollama pull llama3.1:8b

This stores the model inside the ollama Docker volume.

Note:
The Ollama model name (llama3.1:8b) and the Livepeer model name (meta-llama/Meta-Llama-3.1-8B-Instruct) differ, but they refer to the same underlying model: the runner maps the Livepeer model ID to the Ollama tag.
This is expected.
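To confirm the pull succeeded and the model actually loads, you can list and smoke-test it directly inside the Ollama container:

```shell
# List downloaded models; llama3.1:8b should appear.
docker exec ollama ollama list

# One-off generation to force the model to load onto the GPU.
docker exec ollama ollama run llama3.1:8b "Reply with the word OK"
```

The first run is slower because the model weights are being loaded into VRAM; subsequent requests reuse the loaded model.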


Configuring the AI Orchestrator

You must update your aiModels.json to tell the orchestrator where your runner is located.

Step 5 — Edit aiModels.json

Add:

[
    {
        "pipeline": "llm",
        "model_id": "meta-llama/Meta-Llama-3.1-8B-Instruct",
        "warm": true,
        "price_per_unit": 0.18,
        "currency": "USD",
        "pixels_per_unit": 1000000,
        "url": "http://llm_runner:8000"
    }
]

Important details:

  • pipeline: "llm" enables the LLM pipeline
  • model_id must match the model Livepeer expects
  • url uses the Docker container name llm_runner; containers in the same Compose stack share a network and resolve each other by name
  • warm: true tells the orchestration layer to preload the model
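A malformed aiModels.json (a trailing comma is the usual culprit) will break job routing, so it is worth validating the file before restarting the orchestrator:

```shell
# Validate JSON syntax with Python's stdlib; json.tool exits non-zero
# (and prints the error location) if the file is not valid JSON.
python3 -m json.tool aiModels.json > /dev/null \
  && echo "aiModels.json: valid JSON" \
  || echo "aiModels.json: SYNTAX ERROR"
```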

Step 6 — Register with the AI Service Registry

If you skip this step, you will not receive jobs.

Consult Livepeer docs or ask in Discord for the exact commands depending on your orchestrator setup.


Verifying the Deployment

Once everything is launched, you should see specific logs.

Runner Startup Logs (llm_runner)

INFO livepeer_ollama_runner: Starting livepeer-ollama-runner
INFO livepeer_ollama_runner: Ollama base URL: http://ollama:11434
INFO livepeer_ollama_runner: Bind address: 0.0.0.0:8000
INFO livepeer_ollama_runner: Server listening on 0.0.0.0:8000

This confirms:

  • The runner reached the Ollama container
  • It’s listening on port 8000
  • No authentication required (local only)

Ollama Startup Logs

You should see:

level=INFO msg="llama runner started in 1.59 seconds"
level=INFO msg="loaded runners" count=1
level=INFO msg="waiting for llama runner to start responding"

This means:

  • GPU was detected
  • Ollama successfully registered its internal inference runner
  • The model is ready to load when requested
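You can also query Ollama's HTTP API directly (this assumes you have uncommented the 11434 port mapping in docker-compose.yml):

```shell
# List the models Ollama has available over its HTTP API.
curl -s http://localhost:11434/api/tags
```

The response should include llama3.1:8b in its models array.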

When You Receive a Job

When the Livepeer gateway assigns an LLM job to your Orchestrator, llm_runner will log:

INFO llm_handler{model=Some("meta-llama/Meta-Llama-3.1-8B-Instruct")}:
livepeer_ollama_runner: Received LLM request

INFO llm_handler{model=Some("meta-llama/Meta-Llama-3.1-8B-Instruct")}:
livepeer_ollama_runner: Processing request with model=llama3.1:8b, stream=false

This verifies:

  • Livepeer job → orchestrator → runner → Ollama pipeline works
  • The model mapping is correct
  • You’re officially serving LLM inference on the network
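If you do not want to wait for a network job, you can exercise the Ollama side of the pipeline with a direct, non-streaming API call (again assuming the 11434 port mapping is uncommented):

```shell
# Send a one-off generation request straight to Ollama.
curl -s http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Say hello in five words.",
  "stream": false
}'
```

Note this bypasses the orchestrator and runner; it only verifies that Ollama itself can serve the model.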

GPU Verification (nvidia-smi)

Run:

nvidia-smi

When jobs execute, you should see Ollama consuming VRAM:

0   N/A  N/A          516194      C   /usr/bin/ollama         5364MiB

This confirms:

  • GPU is exposed to Docker
  • Ollama is executing kernels on the GPU
  • The workload is actually running (not CPU fallback)
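To watch only the compute processes rather than the full nvidia-smi table:

```shell
# CSV listing of processes using the GPU; the ollama binary should
# appear here while a job is executing.
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
```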

Verify AI Capabilities

Visit https://tools.livepeer.cloud/ai/network-capabilities: you should see the "LLM" pipeline listed, with your orchestrator shown as "Warm".

Optional: Using Remote Workers

Livepeer supports “AI Remote Workers” — allowing an orchestrator to run on one box and dispatch jobs to multiple remote GPU workers.

This is advanced and requires:

  • Secure networking
  • Correct registration
  • Gateway reachability
  • Consistent worker health monitoring

If you want to explore this:

👉 Join the Discord: https://discord.gg/xpKATpA7
Ask in #orchestrating, tag @mike_zoop


FAQ (Initial Version — Will Grow Over Time)

Do I need 16GB VRAM?

No.
Cloud SPE built this runner to support 8GB GPUs like the GTX 1080, GTX 1070 Ti, and RTX 2060.
Higher VRAM improves throughput, but 8GB works.

Do I need to run the Orchestrator and Runner on the same machine?

Not required, but strongly recommended for simplicity.

Why is my Orchestrator not receiving jobs?

Most common reasons:

  • You did not register with the AI Service Registry
  • aiModels.json misconfigured
  • GPU too slow → not competitive for jobs
  • Network issues
  • Runner not reachable by container name llm_runner

Why use Ollama instead of raw PyTorch or TensorRT?

Ollama provides:

  • Simple Docker deployment
  • Fast quantized models
  • Low VRAM usage
  • Clean API for the runner
  • Massive model library

Final Notes

If you hit issues, join the community and ask questions:

👉 Livepeer Discord: https://discord.gg/xpKATpA7
Ask in #orchestrating and tag @mike_zoop

You can also read more about my work at:

Cloud SPE is committed to lowering the barrier to entry, increasing GPU participation, and expanding Livepeer into a resilient, decentralized infrastructure layer for open AI.

