Last Updated: 3/13/2026
Architecture Overview
Pie is designed as a three-layer system that separates application logic, resource management, and inference execution. This architecture enables programmable LLM serving while maintaining high performance and resource efficiency.
System Layers
┌─────────────────────────────────────────────┐
│ Application Layer (Wasm) │
│ Inferlets run in sandboxed Wasm runtime │
└─────────────────────────────────────────────┘
↓ API Calls
┌─────────────────────────────────────────────┐
│ Control Layer (Rust) │
│ Resource virtualization & batch scheduling │
└─────────────────────────────────────────────┘
↓ Batch Ops
┌─────────────────────────────────────────────┐
│ Inference Layer (Python/Rust) │
│ GPU execution & model serving │
└─────────────────────────────────────────────┘Application Layer
The application layer hosts user-defined inferlets in a sandboxed WebAssembly runtime. Inferlets are compiled to Wasm and run in isolated instances with controlled access to system resources.
Key Components:
- Wasm Runtime: Based on Wasmtime, provides secure sandboxing and component model support
- Inferlet Instances: Each running inferlet gets its own isolated Wasm instance
- API Bindings: WIT (WebAssembly Interface Types) define the contract between inferlets and the control layer
Security:
- Memory isolation between inferlets
- Capability-based security model
- No direct access to GPU or network (except via controlled APIs)
Control Layer
The control layer virtualizes LLM resources and manages batch scheduling. It exposes high-level APIs to inferlets while coordinating efficient GPU utilization.
Key Components:
- Resource Virtualization
  - KV Cache Pages: Paged attention with fine-grained allocation and deallocation
  - Embedding Vectors: Pooled embedding storage for reuse across requests
  - Context Objects: Virtual handles to model state (KV cache + embeddings)
- Batch Scheduler
  - Groups operations from multiple inferlets into efficient batches
  - Prioritizes requests based on resource availability
  - Implements continuous batching for high throughput
- Model Registry
  - Tracks available models and their configurations
  - Routes requests to appropriate model instances
  - Supports multi-model serving
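The "KV Cache Pages" component above can be sketched as a simple page pool with a free list. This is an illustrative sketch, not Pie's actual implementation; the type names, page size, and allocation policy are assumptions:

```rust
// Hypothetical sketch of paged KV-cache allocation with fine-grained
// alloc/dealloc. PAGE_SIZE and all names are illustrative, not Pie's API.

/// Fixed-size page of KV-cache slots (e.g. 16 tokens per page).
const PAGE_SIZE: usize = 16;

struct PagePool {
    free: Vec<usize>, // indices of free physical pages
}

impl PagePool {
    fn new(num_pages: usize) -> Self {
        Self { free: (0..num_pages).collect() }
    }

    /// Allocate enough pages to hold `num_tokens` tokens, or None if exhausted.
    fn alloc(&mut self, num_tokens: usize) -> Option<Vec<usize>> {
        let needed = (num_tokens + PAGE_SIZE - 1) / PAGE_SIZE;
        if self.free.len() < needed {
            return None; // caller must wait for deallocations or evict
        }
        Some(self.free.split_off(self.free.len() - needed))
    }

    /// Return pages to the pool when a context is dropped.
    fn dealloc(&mut self, pages: Vec<usize>) {
        self.free.extend(pages);
    }
}
```

Because allocation is at page rather than whole-sequence granularity, contexts of very different lengths can share one physical pool without fragmentation.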
APIs Exposed to Inferlets:
- create_context(): Allocate a new context with KV cache
- fill(): Add tokens to context (prefill)
- forward(): Run model forward pass
- sample(): Sample from logits distribution
- embed(): Compute embeddings
- KV cache management (split, merge, drop)
Inference Layer
The inference layer executes batched operations on GPUs. It’s implemented primarily in Python (using PyTorch) with Rust FFI for low-latency communication.
Key Components:
- Model Backends
  - Python Backend: PyTorch-based model execution
  - IPC Bridge: Fast communication between Rust runtime and Python workers
  - Multi-GPU Support: Tensor parallelism and data parallelism
- Batching Engine
  - Receives batched operations from control layer
  - Executes on GPU with optimized kernels
  - Returns results to control layer
- Memory Management
  - Paged KV cache allocation
  - GPU memory pooling
  - Automatic garbage collection
Communication Flow
1. Inferlet Submission
Client → HTTP/gRPC → Server → Wasm Runtime
↓
Instantiate Inferlet
↓
Execute main()

2. API Call Execution
Inferlet: ctx.generate(sampler, stop_cond)
↓
Control Layer: Batch scheduler queues request
↓
Inference Layer: Execute batched forward pass
↓
Control Layer: Return logits
↓
Inferlet: Sample and continue

3. Multi-Step Generation
Inferlet Loop:
1. ctx.forward() → Control → Inference → GPU
2. Receive logits
3. ctx.sample() → Control (CPU sampling)
4. Append token to context
5. Check stop condition
6. Repeat or return

Resource Management
KV Cache Paging
Pie uses paged attention inspired by vLLM, but with application-level control:
// Inferlet can explicitly manage KV cache
let ctx1 = model.create_context();
ctx1.fill("Translate to French: ");
// Split context to reuse prefix
let ctx2 = ctx1.split(); // Shares KV cache pages with ctx1
ctx2.fill("Hello");
let ctx3 = ctx1.split();
ctx3.fill("Goodbye");
// Both ctx2 and ctx3 reuse the "Translate to French: " prefix

Benefits:
- Reduces redundant prefill for shared prefixes
- Enables tree/graph-of-thought patterns
- Supports speculative decoding and beam search
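The split() semantics above can be modeled with reference-counted pages: forking a context clones handles to immutable prefix pages rather than copying the pages themselves. A minimal sketch, with types that are illustrative assumptions rather than Pie's real implementation:

```rust
use std::rc::Rc;

// Sketch of prefix sharing: a context holds Rc handles to immutable prefix
// pages, so split() is O(1) and shares physical storage. Names are
// illustrative, not Pie's actual types.

#[derive(Clone)]
struct Context {
    pages: Vec<Rc<Vec<u32>>>, // each page holds token ids whose KV entries are cached
}

impl Context {
    fn new() -> Self {
        Self { pages: Vec::new() }
    }

    fn fill(&mut self, tokens: Vec<u32>) {
        // A real implementation would pack tokens into fixed-size pages;
        // here one fill() call becomes one shared page.
        self.pages.push(Rc::new(tokens));
    }

    /// Fork the context; prefix pages are shared, not copied.
    fn split(&self) -> Context {
        self.clone() // clones only the Rc handles
    }
}
```

Dropping a forked context decrements the reference counts, so shared prefix pages are reclaimed only when the last context using them goes away.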
Batch Scheduling
The scheduler combines continuous batching with iteration-level batching:
- Continuous Batching: New requests join ongoing batches dynamically
- Iteration-Level Batching: Each decode step is a separate batch
- Priority Queues: Prefill (long) and decode (short) operations are scheduled separately
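The token-budgeted packing at the core of such a scheduler can be sketched as follows; the `Op` type, the budget rule, and the single shared queue are illustrative assumptions, not Pie's internals:

```rust
// Sketch of iteration-level batch packing: each iteration, pending ops are
// packed into one batch up to a token budget; ops that don't fit wait for
// the next iteration. All names here are illustrative.

enum Op {
    Prefill { tokens: usize },
    Decode, // one token per iteration
}

fn op_tokens(op: &Op) -> usize {
    match op {
        Op::Prefill { tokens } => *tokens,
        Op::Decode => 1,
    }
}

/// Pack pending ops into one batch without exceeding `max_batch_tokens`.
fn pack_batch(pending: &mut Vec<Op>, max_batch_tokens: usize) -> Vec<Op> {
    let mut batch = Vec::new();
    let mut budget = max_batch_tokens;
    let mut i = 0;
    while i < pending.len() {
        if op_tokens(&pending[i]) <= budget {
            budget -= op_tokens(&pending[i]);
            batch.push(pending.remove(i));
        } else {
            i += 1; // skip; this op waits for the next iteration
        }
    }
    batch
}
```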
Scheduling Algorithm:
While requests pending:
1. Collect ready operations (prefill or decode)
2. Group by operation type
3. Pack into batch (up to max_batch_tokens)
4. Execute on GPU
5. Return results to waiting inferlets
6. Repeat

Multi-GPU Support
Pie supports two parallelism strategies:
Tensor Parallelism (TP)
Splits a single model across multiple GPUs:
[[model]]
device = ["cuda:0", "cuda:1", "cuda:2", "cuda:3"]
tensor_parallel_size = 4  # All 4 GPUs share one model

Use case: Large models that don’t fit on a single GPU
Data Parallelism (DP)
Runs independent model replicas on separate GPUs:
[[model]]
device = ["cuda:0", "cuda:1", "cuda:2", "cuda:3"]
tensor_parallel_size = 1  # Each GPU runs a full model

Use case: High-throughput serving with smaller models
Hybrid Parallelism
Combines TP and DP:
[[model]]
device = ["cuda:0", "cuda:1", "cuda:2", "cuda:3"]
tensor_parallel_size = 2  # 2 DP groups, each with 2 TP GPUs

Use case: Balance between large model support and throughput
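The relationship between the device list and tensor_parallel_size in these configs can be made concrete: contiguous chunks of tp_size devices form one data-parallel replica. A sketch (the helper function is illustrative, not part of Pie):

```rust
// Sketch of how device list + tensor_parallel_size determine the parallel
// layout: devices are chunked into DP replicas of tp_size GPUs each.

fn dp_groups(devices: &[&str], tp_size: usize) -> Vec<Vec<String>> {
    assert!(
        tp_size > 0 && devices.len() % tp_size == 0,
        "device count must be a multiple of tensor_parallel_size"
    );
    devices
        .chunks(tp_size)
        .map(|g| g.iter().map(|d| d.to_string()).collect())
        .collect()
}
```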
IPC Architecture
Communication between the Rust runtime (control layer) and Python workers (inference layer) uses FFI-based IPC:
Two-Phase Initialization
Phase 1: Create IPC channels
Rust Runtime → Create FfiIpcBackend → Named IPC server

Phase 2: Python workers connect
Python Worker → Connect to IPC server → Handshake
↓
Backend registered

Request/Response Flow
Rust (Control) → IPC Queue → Python (Inference)
↓
Execute on GPU
↓
Rust (Control) ← IPC Queue ← Python (Inference)

Optimizations:
- Zero-copy tensor transfer (shared memory)
- Batch-level communication (not per-token)
- Async I/O to avoid blocking
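The batch-level request/response flow can be sketched with in-process channels standing in for the IPC queues; a worker thread plays the role of the Python inference process. The transport and type names are illustrative assumptions (the real system uses shared-memory IPC between processes, not channels):

```rust
use std::sync::mpsc;
use std::thread;

// Sketch of batch-level communication: the control side sends one message
// per batch (not per token) and the worker replies on a response queue.

struct BatchRequest { op_ids: Vec<u64> }
struct BatchResponse { op_ids: Vec<u64> }

/// Spawn a worker that drains the request queue and echoes completions.
fn spawn_worker(rx: mpsc::Receiver<BatchRequest>, tx: mpsc::Sender<BatchResponse>) {
    thread::spawn(move || {
        for req in rx {
            // "Execute on GPU" would happen here before responding.
            tx.send(BatchResponse { op_ids: req.op_ids }).unwrap();
        }
    });
}
```

Batching at the message level is what keeps queue overhead off the per-token critical path.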
WebAssembly Component Model
Pie uses the WebAssembly Component Model for inferlet APIs. This provides:
- Language Agnostic: Write inferlets in Rust, C++, Go, or any language that compiles to Wasm
- Versioned Interfaces: WIT definitions enable API evolution without breaking changes
- Type Safety: Strong typing at the boundary between Wasm and host
WIT Interface Example
// runtime/wit/inferlet.wit
interface model {
record context {
id: u64
}
create-context: func() -> context
fill: func(ctx: context, text: string) -> result<_, string>
forward: func(ctx: context) -> result<list<f32>, string>
sample: func(logits: list<f32>, temperature: f32) -> u32
}

Inferlets import this interface and call functions directly:
use inferlet::model::{create_context, fill, forward, sample};
let ctx = create_context();
fill(ctx, "Hello, world!")?;
let logits = forward(ctx)?;
let token = sample(logits, 0.8);

Telemetry and Observability
Pie integrates with OpenTelemetry for distributed tracing:
Inferlet Span
├─ create_context Span
├─ fill Span
│ └─ batch_prefill Span (GPU)
├─ generate Span
│ ├─ forward Span
│ │ └─ batch_decode Span (GPU)
│ └─ sample Span
└─ return Span

Collected Metrics:
- Request latency (end-to-end)
- GPU utilization
- Batch size distribution
- KV cache hit rate
- Token throughput
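The spans in the trace tree above boil down to named timers whose durations are exported on completion. A hand-rolled sketch for illustration only; a real deployment would use the OpenTelemetry SDK rather than this type:

```rust
use std::time::Instant;

// Minimal illustrative span: record a name and start time, report elapsed
// time when the span ends. Not the OpenTelemetry API.

struct Span {
    name: String,
    start: Instant,
}

impl Span {
    fn start(name: &str) -> Self {
        Span { name: name.to_string(), start: Instant::now() }
    }

    /// End the span, returning (name, elapsed microseconds) for export.
    fn end(self) -> (String, u128) {
        (self.name, self.start.elapsed().as_micros())
    }
}
```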
Design Principles
- Separation of Concerns: Application, control, and inference layers have clear boundaries
- Fine-Grained APIs: Inferlets control KV cache, sampling, and generation logic
- Efficient Batching: Continuous batching maximizes GPU utilization
- Resource Virtualization: Abstract GPU resources as high-level objects
- Sandboxed Execution: Wasm provides security and portability
- Observability First: Built-in tracing and metrics for production deployments
Comparison with Traditional Serving
| Feature | Traditional Serving | Pie |
|---|---|---|
| API | Fixed (text completion, chat) | Programmable (custom generation logic) |
| KV Cache | Global policies | Per-request control |
| Batching | System-level | Exposed to application |
| Extensibility | Fork and modify | Write inferlets |
| I/O Integration | External (round-trips) | Built-in (within inferlet) |
| Isolation | None (shared process) | Wasm sandboxing |
Further Reading
- Research Papers:
  - LLM Serving as an Operating System (HotOS 2025)
  - Pie: Programmable LLM Serving (SOSP 2025)
- Related Documentation:
  - Writing Inferlets — Learn to build custom serving logic
  - Configuration Reference — Tune performance and resource allocation
  - Deployment Guide — Production deployment patterns