Last Updated: 3/13/2026
Architecture Overview
Pie is designed as a three-layer system that separates application logic, resource management, and inference execution. This architecture enables programmable LLM serving while maintaining high performance and resource efficiency.
System Layers
┌─────────────────────────────────────────────┐
│ Application Layer (Wasm) │
│ Inferlets run in sandboxed Wasm runtime │
└─────────────────────────────────────────────┘
↓ API Calls
┌─────────────────────────────────────────────┐
│ Control Layer (Rust) │
│ Resource virtualization & batch scheduling │
└─────────────────────────────────────────────┘
↓ Batch Ops
┌─────────────────────────────────────────────┐
│ Inference Layer (Python/Rust) │
│ GPU execution & model serving │
└─────────────────────────────────────────────┘Application Layer
The application layer hosts user-defined inferlets in a sandboxed WebAssembly runtime. Inferlets are compiled to Wasm and run in isolated instances with controlled access to system resources.
Key Components:
- Wasm Runtime: Based on Wasmtime, provides secure sandboxing and component model support
- Inferlet Instances: Each running inferlet gets its own isolated Wasm instance
- API Bindings: WIT (WebAssembly Interface Types) define the contract between inferlets and the control layer
Security:
- Memory isolation between inferlets
- Capability-based security model
- No direct access to GPU or network (except via controlled APIs)
Control Layer
The control layer virtualizes LLM resources and manages batch scheduling. It exposes high-level APIs to inferlets while coordinating efficient GPU utilization.
Key Components:
- Resource Virtualization
  - KV Cache Pages: Paged attention with fine-grained allocation and deallocation
  - Embedding Vectors: Pooled embedding storage for reuse across requests
  - Context Objects: Virtual handles to model state (KV cache + embeddings)
- Batch Scheduler
  - Groups operations from multiple inferlets into efficient batches
  - Prioritizes requests based on resource availability
  - Implements continuous batching for high throughput
- Model Registry
  - Tracks available models and their configurations
  - Routes requests to appropriate model instances
  - Supports multi-model serving
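The "KV Cache Pages" component above can be sketched as a simple page pool with a free list. This is an illustrative sketch, not Pie's actual implementation; the type names, page size, and allocation policy are assumptions:

```rust
// Hypothetical sketch of paged KV-cache allocation with fine-grained
// alloc/dealloc. PAGE_SIZE and all names are illustrative, not Pie's API.

/// Fixed-size page of KV-cache slots (e.g. 16 tokens per page).
const PAGE_SIZE: usize = 16;

struct PagePool {
    free: Vec<usize>, // indices of free physical pages
}

impl PagePool {
    fn new(num_pages: usize) -> Self {
        Self { free: (0..num_pages).collect() }
    }

    /// Allocate enough pages to hold `num_tokens` tokens, or None if exhausted.
    fn alloc(&mut self, num_tokens: usize) -> Option<Vec<usize>> {
        let needed = (num_tokens + PAGE_SIZE - 1) / PAGE_SIZE;
        if self.free.len() < needed {
            return None; // caller must wait for deallocations or evict
        }
        Some(self.free.split_off(self.free.len() - needed))
    }

    /// Return pages to the pool when a context is dropped.
    fn dealloc(&mut self, pages: Vec<usize>) {
        self.free.extend(pages);
    }
}
```

Because allocation is at page rather than whole-sequence granularity, contexts of very different lengths can share one physical pool without fragmentation.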
APIs Exposed to Inferlets:
- create_context(): Allocate a new context with KV cache
- fill(): Add tokens to context (prefill)
- forward(): Run model forward pass
- sample(): Sample from logits distribution
- embed(): Compute embeddings
- KV cache management (split, merge, drop)
Inference Layer
The inference layer executes batched operations on GPUs. It’s implemented primarily in Python (using PyTorch) with Rust FFI for low-latency communication.
Key Components:
- Model Backends
  - Python Backend: PyTorch-based model execution
  - IPC Bridge: Fast communication between Rust runtime and Python workers
  - Multi-GPU Support: Tensor parallelism and data parallelism
- Batching Engine
  - Receives batched operations from control layer
  - Executes on GPU with optimized kernels
  - Returns results to control layer
- Memory Management
  - Paged KV cache allocation
  - GPU memory pooling
  - Automatic garbage collection
Communication Flow
1. Inferlet Submission
Client → HTTP/gRPC → Server → Wasm Runtime
↓
Instantiate Inferlet
↓
Execute main()

2. API Call Execution
Inferlet: ctx.generate(sampler, stop_cond)
↓
Control Layer: Batch scheduler queues request
↓
Inference Layer: Execute batched forward pass
↓
Control Layer: Return logits
↓
Inferlet: Sample and continue

3. Multi-Step Generation
Inferlet Loop:
1. ctx.forward() → Control → Inference → GPU
2. Receive logits
3. ctx.sample() → Control (CPU sampling)
4. Append token to context
5. Check stop condition
6. Repeat or return

Resource Management
KV Cache Paging
Pie uses paged attention inspired by vLLM, but with application-level control:
// Inferlet can explicitly manage KV cache
let ctx1 = model.create_context();
ctx1.fill("Translate to French: ");
// Split context to reuse prefix
let ctx2 = ctx1.split(); // Shares KV cache pages with ctx1
ctx2.fill("Hello");
let ctx3 = ctx1.split();
ctx3.fill("Goodbye");
// Both ctx2 and ctx3 reuse the "Translate to French: " prefix

Benefits:
- Reduces redundant prefill for shared prefixes
- Enables tree/graph-of-thought patterns
- Supports speculative decoding and beam search
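The split() semantics above can be modeled with reference-counted pages: forking a context clones handles to immutable prefix pages rather than copying the pages themselves. A minimal sketch, with types that are illustrative assumptions rather than Pie's real implementation:

```rust
use std::rc::Rc;

// Sketch of prefix sharing: a context holds Rc handles to immutable prefix
// pages, so split() is O(1) and shares physical storage. Names are
// illustrative, not Pie's actual types.

#[derive(Clone)]
struct Context {
    pages: Vec<Rc<Vec<u32>>>, // each page holds token ids whose KV entries are cached
}

impl Context {
    fn new() -> Self {
        Self { pages: Vec::new() }
    }

    fn fill(&mut self, tokens: Vec<u32>) {
        // A real implementation would pack tokens into fixed-size pages;
        // here one fill() call becomes one shared page.
        self.pages.push(Rc::new(tokens));
    }

    /// Fork the context; prefix pages are shared, not copied.
    fn split(&self) -> Context {
        self.clone() // clones only the Rc handles
    }
}
```

Dropping a forked context decrements the reference counts, so shared prefix pages are reclaimed only when the last context using them goes away.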
Batch Scheduling
The scheduler combines continuous batching with iteration-level batching:
- Continuous Batching: New requests join ongoing batches dynamically
- Iteration-Level Batching: Each decode step is a separate batch
- Priority Queues: Prefill (long) and decode (short) operations are scheduled separately
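The token-budgeted packing at the core of such a scheduler can be sketched as follows; the `Op` type, the budget rule, and the single shared queue are illustrative assumptions, not Pie's internals:

```rust
// Sketch of iteration-level batch packing: each iteration, pending ops are
// packed into one batch up to a token budget; ops that don't fit wait for
// the next iteration. All names here are illustrative.

enum Op {
    Prefill { tokens: usize },
    Decode, // one token per iteration
}

fn op_tokens(op: &Op) -> usize {
    match op {
        Op::Prefill { tokens } => *tokens,
        Op::Decode => 1,
    }
}

/// Pack pending ops into one batch without exceeding `max_batch_tokens`.
fn pack_batch(pending: &mut Vec<Op>, max_batch_tokens: usize) -> Vec<Op> {
    let mut batch = Vec::new();
    let mut budget = max_batch_tokens;
    let mut i = 0;
    while i < pending.len() {
        if op_tokens(&pending[i]) <= budget {
            budget -= op_tokens(&pending[i]);
            batch.push(pending.remove(i));
        } else {
            i += 1; // skip; this op waits for the next iteration
        }
    }
    batch
}
```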
Scheduling Algorithm:
While requests pending:
1. Collect ready operations (prefill or decode)
2. Group by operation type
3. Pack into batch (up to max_batch_tokens)
4. Execute on GPU
5. Return results to waiting inferlets
6. Repeat

Multi-GPU Support
Pie supports two parallelism strategies:
Tensor Parallelism (TP)
Splits a single model across multiple GPUs:
[[model]]
device = ["cuda:0", "cuda:1", "cuda:2", "cuda:3"]
tensor_parallel_size = 4  # All 4 GPUs share one model

Use case: Large models that don’t fit on a single GPU
Data Parallelism (DP)
Runs independent model replicas on separate GPUs:
[[model]]
device = ["cuda:0", "cuda:1", "cuda:2", "cuda:3"]
tensor_parallel_size = 1  # Each GPU runs a full model

Use case: High-throughput serving with smaller models
Hybrid Parallelism
Combines TP and DP:
[[model]]
device = ["cuda:0", "cuda:1", "cuda:2", "cuda:3"]
tensor_parallel_size = 2  # 2 DP groups, each with 2 TP GPUs

Use case: Balance between large model support and throughput
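The relationship between the device list and tensor_parallel_size in these configs can be made concrete: contiguous chunks of tp_size devices form one data-parallel replica. A sketch (the helper function is illustrative, not part of Pie):

```rust
// Sketch of how device list + tensor_parallel_size determine the parallel
// layout: devices are chunked into DP replicas of tp_size GPUs each.

fn dp_groups(devices: &[&str], tp_size: usize) -> Vec<Vec<String>> {
    assert!(
        tp_size > 0 && devices.len() % tp_size == 0,
        "device count must be a multiple of tensor_parallel_size"
    );
    devices
        .chunks(tp_size)
        .map(|g| g.iter().map(|d| d.to_string()).collect())
        .collect()
}
```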
IPC Architecture
Communication between the Rust runtime (control layer) and Python workers (inference layer) uses FFI-based IPC:
Two-Phase Initialization
Phase 1: Create IPC channels
Rust Runtime → Create FfiIpcBackend → Named IPC server

Phase 2: Python workers connect
Python Worker → Connect to IPC server → Handshake
↓
Backend registered

Request/Response Flow
Rust (Control) → IPC Queue → Python (Inference)
↓
Execute on GPU
↓
Rust (Control) ← IPC Queue ← Python (Inference)

Optimizations:
- Zero-copy tensor transfer (shared memory)
- Batch-level communication (not per-token)
- Async I/O to avoid blocking
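The batch-level request/response flow can be sketched with in-process channels standing in for the IPC queues; a worker thread plays the role of the Python inference process. The transport and type names are illustrative assumptions (the real system uses shared-memory IPC between processes, not channels):

```rust
use std::sync::mpsc;
use std::thread;

// Sketch of batch-level communication: the control side sends one message
// per batch (not per token) and the worker replies on a response queue.

struct BatchRequest { op_ids: Vec<u64> }
struct BatchResponse { op_ids: Vec<u64> }

/// Spawn a worker that drains the request queue and echoes completions.
fn spawn_worker(rx: mpsc::Receiver<BatchRequest>, tx: mpsc::Sender<BatchResponse>) {
    thread::spawn(move || {
        for req in rx {
            // "Execute on GPU" would happen here before responding.
            tx.send(BatchResponse { op_ids: req.op_ids }).unwrap();
        }
    });
}
```

Batching at the message level is what keeps queue overhead off the per-token critical path.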
WebAssembly Component Model
Pie uses the WebAssembly Component Model for inferlet APIs. This provides:
- Language Agnostic: Write inferlets in Rust, C++, Go, or any language that compiles to Wasm
- Versioned Interfaces: WIT definitions enable API evolution without breaking changes
- Type Safety: Strong typing at the boundary between Wasm and host
WIT Interface Example
// runtime/wit/inferlet.wit
interface model {
record context {
id: u64
}
create-context: func() -> context
fill: func(ctx: context, text: string) -> result<_, string>
forward: func(ctx: context) -> result<list<f32>, string>
sample: func(logits: list<f32>, temperature: f32) -> u32
}

Inferlets import this interface and call functions directly:
use inferlet::model::{create_context, fill, forward, sample};
let ctx = create_context();
fill(ctx, "Hello, world!")?;
let logits = forward(ctx)?;
let token = sample(logits, 0.8);

Telemetry and Observability
Pie integrates with OpenTelemetry for distributed tracing:
Inferlet Span
├─ create_context Span
├─ fill Span
│ └─ batch_prefill Span (GPU)
├─ generate Span
│ ├─ forward Span
│ │ └─ batch_decode Span (GPU)
│ └─ sample Span
└─ return Span

Collected Metrics:
- Request latency (end-to-end)
- GPU utilization
- Batch size distribution
- KV cache hit rate
- Token throughput
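The spans in the trace tree above boil down to named timers whose durations are exported on completion. A hand-rolled sketch for illustration only; a real deployment would use the OpenTelemetry SDK rather than this type:

```rust
use std::time::Instant;

// Minimal illustrative span: record a name and start time, report elapsed
// time when the span ends. Not the OpenTelemetry API.

struct Span {
    name: String,
    start: Instant,
}

impl Span {
    fn start(name: &str) -> Self {
        Span { name: name.to_string(), start: Instant::now() }
    }

    /// End the span, returning (name, elapsed microseconds) for export.
    fn end(self) -> (String, u128) {
        (self.name, self.start.elapsed().as_micros())
    }
}
```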
Design Principles
- Separation of Concerns: Application, control, and inference layers have clear boundaries
- Fine-Grained APIs: Inferlets control KV cache, sampling, and generation logic
- Efficient Batching: Continuous batching maximizes GPU utilization
- Resource Virtualization: Abstract GPU resources as high-level objects
- Sandboxed Execution: Wasm provides security and portability
- Observability First: Built-in tracing and metrics for production deployments
Comparison with Traditional Serving
| Feature | Traditional Serving | Pie |
|---|---|---|
| API | Fixed (text completion, chat) | Programmable (custom generation logic) |
| KV Cache | Global policies | Per-request control |
| Batching | System-level | Exposed to application |
| Extensibility | Fork and modify | Write inferlets |
| I/O Integration | External (round-trips) | Built-in (within inferlet) |
| Isolation | None (shared process) | Wasm sandboxing |
Further Reading
- Research Papers:
  - LLM Serving as an Operating System (HotOS 2025)
  - Pie: Programmable LLM Serving (SOSP 2025)
- Related Documentation:
  - Writing Inferlets — Learn to build custom serving logic
  - Configuration Reference — Tune performance and resource allocation
  - Deployment Guide — Production deployment patterns