Apr 10, 2026 7 min read AI

NaaP Analytics: Project Completion Report

Livepeer’s AI network is now measurable. Read the Cloud SPE’s final project report on NaaP Analytics, featuring public dashboards, performance signals, and the data foundation for production-grade service guarantees.

NaaP Analytics: Project Completion Report

By the Cloud SPE Team — Livepeer.Cloud

Executive Summary

The Cloud SPE has completed delivery of the Network-as-a-Product (NaaP) MVP — SLA Metrics, Analytics, and Public Infrastructure project, funded by the Livepeer Treasury. This post provides a detailed accounting of what was proposed, what was built, how the architecture works, and what impact this has on the Livepeer network going forward.

Work began in November 2025 with the original proposal and discovery process. The revised pre-proposal was submitted in January 2026, passed the treasury vote, and execution proceeded through three milestones culminating in April 2026.

What Was Promised vs. What Was Delivered

Milestone 1 — Metrics Collection & Aggregation (February 2026) ✅

Promised:

Define and implement the minimal metrics set
Aggregate existing telemetry into a unified analytics layer
A basic dashboard showing sample data flowing end to end

Delivered:

A comprehensive metrics catalog covering network state, stream activity, performance, payments, reliability, and orchestrator leaderboard scoring
A Kafka-to-ClickHouse ingest pipeline using ClickHouse's Kafka Engine tables for durable, exactly-once event consumption
Materialized views routing incoming events into tables with validation rules
Normalized tables capturing event-family facts and rollups
A working Grafana dashboard demonstrating end-to-end data flow from Kafka through to visual output
Documented bootstrap schema for reproducible fresh deployments

Milestone 2 — Test Signals & Derived Analytics (March 2026) ✅

Promised:

Deploy reference load-test gateways
Launch a public dashboard with core views
APIs for ecosystem consumption

Delivered:

Reference load-test gateway operational, generating consistent AI pipeline performance signals
Four production Grafana dashboards:
- System health overview
- Real-time stream activity
- Payments and revenue metrics
- FPS, latency, WebRTC performance
Additional supply inventory dashboard tracking GPU capacity across the network
A REST API with endpoints covering many requirement specs:
- Network state and orchestrator profiles
- Stream activity and job performance
- Payment and economics data
- Reliability and SLA scoring
- Orchestrator leaderboard with composite scoring
- GPU supply inventory and capacity
OpenAPI specification embedded in the API service with Swagger UI
Prometheus metrics endpoint for operational monitoring of the API itself

Milestone 3 — Stabilization & Review (April 2026) ✅

Promised:

Harden infrastructure for reliability and cost efficiency
Document metrics, assumptions, and known gaps
Review outcomes with the community to determine next steps

Delivered:

Full production deployment across multiple infrastructure nodes with separated concerns:
- Kafka broker, MirrorMaker2 for replicating events from Confluent Cloud, and a full ClickHouse + API + Grafana stack
Traefik reverse proxy with automated TLS via Cloudflare
Prometheus monitoring with 180-day retention policy
Resolver service — a custom service that publishes corrected current and serving state into canonical stores
dbt semantic layer publishing canonical and API views over normalized tables, with automated tests
31 data-quality validation tests in a scenario-based test harness
Comprehensive documentation suite (see below)

Architecture Deep Dive

The NaaP Analytics platform follows a layered architecture designed for clarity, auditability, and extensibility:

Kafka Topics
    ↓
ClickHouse Kafka Engine Tables
    ↓
Ingest Materialized Views → accepted_raw_events / ignored_raw_events
    ↓
Normalized Tables (event-family facts + rollups)
    ↓
Resolver Service → canonical_*_store / api_*_store
    ↓
dbt Semantic Layer → canonical_* / api_* views
    ↓
Go REST API + Grafana Dashboards

Key Architectural Decisions

ClickHouse + Kafka Engine — We chose ClickHouse as the analytics store for its columnar storage efficiency and native Kafka engine support. This eliminates the need for a separate ETL service — ClickHouse consumes directly from Kafka topics, and materialized views handle routing and validation in a single step.

REST/JSON API Design — The API follows a straightforward REST design with no authentication required for public read endpoints. This maximizes accessibility for ecosystem teams while keeping the door open for org-scoped access in the future.

Tiered Serving Contract — Data flows through defined tiers: raw → normalized → canonical (resolver) → semantic (dbt) → API. Each tier has explicit contracts about freshness, correctness, and derivation rules. The resolver publishes "corrected" state — reconciling event ordering, handling late-arriving data, and producing a consistent current-state view.

The Resolver

The resolver deserves specific mention. It's a separate service that:

Reads from normalized tables
Computes corrected current state (handling out-of-order events, deduplication, state transitions)
Publishes to canonical store tables
Supports three run modes: bootstrap (full historical rebuild), tail (real-time processing), and auto (bootstrap then tail)
Includes repair capabilities for specific time windows

This is the component that transforms raw event noise into trustworthy, queryable state. Without it, dashboards would show inconsistencies from event ordering, network partitions, or delayed telemetry.

Enrichment Layer

The API includes a polling worker that enriches raw orchestrator data with:

ENS name resolution
Staking information from the Livepeer protocol
Gateway metadata
GPU inventory details

This ensures the API and dashboards present human-readable, contextually rich data rather than raw addresses and IDs.

Documentation Delivered

One of the project goals was to leave the community with not just working software, but a documented system that others can operate, extend, and contribute to:

Document	Purpose
`DESIGN.md`	Architecture overview, layer rules, tier contracts, key decisions
`PRODUCT_SENSE.md`	Product goals, success criteria, non-goals
`PLANS.md`	Phase planning and implementation status
`metrics-and-sla-reference.md`	Community-facing metrics reference: formulas, SLA targets, glossary
`architecture.md`	Layer rules and enforcement model
`system-visuals.md`	Mermaid diagrams: ingest flow, resolver, deployment topology
`data-validation-rules.md`	Behavioral contract for all 17 validation rules (31 tests)
`operations-runbook.md`	Deployment, alerting, troubleshooting, maintenance, backups
`devops-environment-guide.md`	Monitoring, local/production environment setup
`data-retention-policy.md`	Kafka and ClickHouse retention windows, replay strategy
`infra-hardening-runbook.md`	Security posture, Kafka listener architecture
`incident-response.md`	Severity definitions (P0–P3), escalation contacts, post-mortem template
`run-modes-and-recovery.md`	Resolver run modes, failure recovery, rebuild procedures
`compose-services.md`	Docker Compose services, profiles, and responsibilities

Every product spec is individually documented with requirement traceability.

Impact on the Livepeer Network

Immediate Impact

Visibility: For the first time, anyone can see how the Livepeer AI network is performing — not from marketing materials, but from live data sourced from real workloads and standardized test signals.
Comparability: Orchestrators can be compared on a level playing field — same metrics, same methodology, same data pipeline. The leaderboard scoring model uses composite metrics that reward both performance and reliability.
Ecosystem Integration: The public APIs and data model are designed for consumption. The NaaP platform itself (livepeer/naap) is an architecture where new views, tools, and integrations can be built independently by any team.
Operational Maturity: The Livepeer network now has documented SLA metrics, a metrics glossary, formulas for reliability and performance scoring, and a reference implementation of how to collect, validate, and serve this data. This is the kind of infrastructure that enterprise evaluators look for.

Foundation for Future Work

This project intentionally did not attempt to:

Enforce SLAs or modify protocol incentives
Introduce new routing logic
Make protocol changes

These are all logical next steps, and they all depend on the measurement layer we've now established. Specifically, this enables:

SLA-aware job routing — gateways can use performance data to route jobs to orchestrators that meet specific reliability or latency thresholds
Network quality scores — aggregate metrics that can be published to the Livepeer Explorer or consumed by third-party evaluation tools
Treasury accountability — future funded projects can point to observable metrics as evidence of impact
GPU market intelligence — the supply inventory data provides a real-time view of network capacity, useful for both gateway operators planning workloads and orchestrators positioning their hardware

Lessons & Insights from the Build

Community Feedback Made This Better

The original pre-proposal (October 2025) was more ambitious — it included decentralized data transport via Streamr.Network, a larger budget, and broader scope. The community's feedback was pointed and constructive: too much scope, too much cost, unnecessary architectural risk. That feedback led to a complete reset. The revised proposal dropped Streamr, cut the budget, simplified the architecture, and focused on a thin but complete MVP. The result is better software shipped faster.

Existing Infrastructure Is an Asset

A key design principle was reusing existing Livepeer infrastructure wherever possible. Telemetry data from gateways and orchestrators already existed — it just wasn't aggregated or publicly accessible. By building on top of what was already there (gateway events, orchestrator telemetry, Kafka streams from Confluent Cloud), we avoided building a new event pipeline from scratch.

The Resolver Pattern Pays Off

Early in development, we faced the classic analytics challenge: raw events are noisy, out of order, and inconsistent. Rather than trying to fix this at the ingest layer (which would add latency and complexity), we built a separate resolver service that computes correct state from normalized events. This separation of concerns kept the ingest pipeline fast and simple while giving the API layer clean, trustworthy data. The resolver's repair mode — the ability to reprocess a specific time window — has already proven invaluable during development and will continue to be useful operationally.

Documentation Is a Deliverable

We treated documentation as a first-class deliverable, not an afterthought. Every architectural decision is recorded in an ADR. Every operational procedure has a runbook. Every validation rule has a behavioral contract. This matters because this is community infrastructure — it needs to be operable by people who didn't build it.

Right-Sizing Is an Art

The leaner budget forced discipline. We couldn't build everything, so we built the right things. The metrics catalog is minimal but sufficient. The API covers well-defined requirement specs. The dashboards address four key operational domains. Every component earned its place. This is a pattern we'd recommend to any SPE: start with the smallest thing that proves value, then let the data make the case for further investment.

Open Source

All code is available at github.com/Cloud-SPE/livepeer-naap-analytics under an open source license. The repository includes:

Complete Go API source OpenAPI specs
Resolver service source
dbt warehouse models with tests
ClickHouse schema and migrations
Grafana dashboards (JSON)
Prometheus configuration
Docker Compose stacks for local development
Production deployment configurations for Docker / Portainer
Full documentation suite
Data validation test harness

Acknowledgments

This project exists because of the Livepeer community. The feedback on the original pre-proposal — from DeFine, Karolak, vires-in-numeris, j0sh, Authority_Null, rickstaa, dob, and others — was direct, constructive, and ultimately made the project significantly better. Mehrdad from the Livepeer Foundation provided ongoing guidance and confirmed alignment with the network observability roadmap. Qiang Han from Livepeer Inc endorsed the proposal and committed to collaboration throughout execution. honestly_rich championed transparency and accountability throughout the process.

This is what community-driven governance looks like when it works.

The NaaP Analytics platform is live and open source. Review the code at github.com/Cloud-SPE/livepeer-naap-analytics, explore the proposal thread, and reach out in Livepeer Discord with questions or ideas for what to build next.

NaaP Analytics: Project Completion Report

Executive Summary

What Was Promised vs. What Was Delivered

Milestone 1 — Metrics Collection & Aggregation (February 2026) ✅

Milestone 2 — Test Signals & Derived Analytics (March 2026) ✅

Milestone 3 — Stabilization & Review (April 2026) ✅

Architecture Deep Dive

Key Architectural Decisions

The Resolver

Enrichment Layer

Documentation Delivered

Impact on the Livepeer Network

Immediate Impact

Foundation for Future Work

Lessons & Insights from the Build

Community Feedback Made This Better

Existing Infrastructure Is an Asset

The Resolver Pattern Pays Off

Documentation Is a Deliverable

Right-Sizing Is an Art

Open Source

Acknowledgments

You might also like...

Your GPUs Can Do More Than Transcode Video — Here's How to Put Them to Work on Livepeer

Building the Measurement Layer Livepeer Needs: Introducing NaaP Analytics

Self-Hosting Livepeer’s LLM Pipeline: Deploying an Ollama-Based GPU Runner for AI Orchestrators

Exciting New Livepeer Treasury Proposal: AI Performance Leaderboard

Financial Report: April 2024