Architecture Overview

Last Updated: 3/13/2026



Pie is designed as a three-layer system that separates application logic, resource management, and inference execution. This architecture enables programmable LLM serving while maintaining high performance and resource efficiency.

System Layers

```
┌─────────────────────────────────────────────┐
│          Application Layer (Wasm)           │
│   Inferlets run in sandboxed Wasm runtime   │
└─────────────────────────────────────────────┘
                ↓ API Calls
┌─────────────────────────────────────────────┐
│            Control Layer (Rust)             │
│ Resource virtualization & batch scheduling  │
└─────────────────────────────────────────────┘
                ↓ Batch Ops
┌─────────────────────────────────────────────┐
│        Inference Layer (Python/Rust)        │
│        GPU execution & model serving        │
└─────────────────────────────────────────────┘
```

Application Layer

The application layer hosts user-defined inferlets in a sandboxed WebAssembly runtime. Inferlets are compiled to Wasm and run in isolated instances with controlled access to system resources.

Key Components:

  • Wasm Runtime: Based on Wasmtime, provides secure sandboxing and component model support
  • Inferlet Instances: Each running inferlet gets its own isolated Wasm instance
  • API Bindings: WIT (WebAssembly Interface Types) define the contract between inferlets and the control layer

Security:

  • Memory isolation between inferlets
  • Capability-based security model
  • No direct access to GPU or network (except via controlled APIs)

Control Layer

The control layer virtualizes LLM resources and manages batch scheduling. It exposes high-level APIs to inferlets while coordinating efficient GPU utilization.

Key Components:

  1. Resource Virtualization

    • KV Cache Pages: Paged attention with fine-grained allocation and deallocation
    • Embedding Vectors: Pooled embedding storage for reuse across requests
    • Context Objects: Virtual handles to model state (KV cache + embeddings)
  2. Batch Scheduler

    • Groups operations from multiple inferlets into efficient batches
    • Prioritizes requests based on resource availability
    • Implements continuous batching for high throughput
  3. Model Registry

    • Tracks available models and their configurations
    • Routes requests to appropriate model instances
    • Supports multi-model serving

APIs Exposed to Inferlets:

  • create_context(): Allocate a new context with KV cache
  • fill(): Add tokens to context (prefill)
  • forward(): Run model forward pass
  • sample(): Sample from logits distribution
  • embed(): Compute embeddings
  • KV cache management (split, merge, drop)

Inference Layer

The inference layer executes batched operations on GPUs. It’s implemented primarily in Python (using PyTorch) with Rust FFI for low-latency communication.

Key Components:

  1. Model Backends

    • Python Backend: PyTorch-based model execution
    • IPC Bridge: Fast communication between Rust runtime and Python workers
    • Multi-GPU Support: Tensor parallelism and data parallelism
  2. Batching Engine

    • Receives batched operations from control layer
    • Executes on GPU with optimized kernels
    • Returns results to control layer
  3. Memory Management

    • Paged KV cache allocation
    • GPU memory pooling
    • Automatic garbage collection

Communication Flow

1. Inferlet Submission

```
Client → HTTP/gRPC → Server → Wasm Runtime
                                   ↓
                        Instantiate Inferlet
                                   ↓
                          Execute main()
```

2. API Call Execution

```
Inferlet:        ctx.generate(sampler, stop_cond)
Control Layer:   Batch scheduler queues request
Inference Layer: Execute batched forward pass
Control Layer:   Return logits
Inferlet:        Sample and continue
```

3. Multi-Step Generation

```
Inferlet Loop:
1. ctx.forward() → Control → Inference → GPU
2. Receive logits
3. ctx.sample() → Control (CPU sampling)
4. Append token to context
5. Check stop condition
6. Repeat or return
```

Resource Management

KV Cache Paging

Pie uses paged attention inspired by vLLM, but with application-level control:

```rust
// Inferlet can explicitly manage KV cache
let ctx1 = model.create_context();
ctx1.fill("Translate to French: ");

// Split context to reuse prefix
let ctx2 = ctx1.split(); // Shares KV cache pages with ctx1
ctx2.fill("Hello");

let ctx3 = ctx1.split();
ctx3.fill("Goodbye");

// Both ctx2 and ctx3 reuse the "Translate to French: " prefix
```

Benefits:

  • Reduces redundant prefill for shared prefixes
  • Enables tree/graph-of-thought patterns
  • Supports speculative decoding and beam search

Batch Scheduling

The scheduler implements continuous batching with iteration-level batching:

  1. Continuous Batching: New requests join ongoing batches dynamically
  2. Iteration-Level Batching: Each decode step is a separate batch
  3. Priority Queues: Prefill (long) and decode (short) operations are scheduled separately

Scheduling Algorithm:

```
While requests pending:
1. Collect ready operations (prefill or decode)
2. Group by operation type
3. Pack into batch (up to max_batch_tokens)
4. Execute on GPU
5. Return results to waiting inferlets
6. Repeat
```

Multi-GPU Support

Pie supports two parallelism strategies:

Tensor Parallelism (TP)

Splits a single model across multiple GPUs:

```toml
[[model]]
device = ["cuda:0", "cuda:1", "cuda:2", "cuda:3"]
tensor_parallel_size = 4  # All 4 GPUs share one model
```

Use case: Large models that don’t fit on a single GPU

Data Parallelism (DP)

Runs independent model replicas on separate GPUs:

```toml
[[model]]
device = ["cuda:0", "cuda:1", "cuda:2", "cuda:3"]
tensor_parallel_size = 1  # Each GPU runs a full model
```

Use case: High-throughput serving with smaller models

Hybrid Parallelism

Combines TP and DP:

```toml
[[model]]
device = ["cuda:0", "cuda:1", "cuda:2", "cuda:3"]
tensor_parallel_size = 2  # 2 DP groups, each with 2 TP GPUs
```

Use case: Balance between large model support and throughput

IPC Architecture

Communication between the Rust runtime (control layer) and Python workers (inference layer) uses FFI-based IPC:

Two-Phase Initialization

Phase 1: Create IPC channels

Rust Runtime → Create FfiIpcBackend → Named IPC server

Phase 2: Python workers connect

```
Python Worker → Connect to IPC server → Handshake → Backend registered
```

Request/Response Flow

```
Rust (Control) → IPC Queue → Python (Inference)
                                 Execute on GPU
Rust (Control) ← IPC Queue ← Python (Inference)
```

Optimizations:

  • Zero-copy tensor transfer (shared memory)
  • Batch-level communication (not per-token)
  • Async I/O to avoid blocking

WebAssembly Component Model

Pie uses the WebAssembly Component Model for inferlet APIs. This provides:

  1. Language Agnostic: Write inferlets in Rust, C++, Go, or any language that compiles to Wasm
  2. Versioned Interfaces: WIT definitions enable API evolution without breaking changes
  3. Type Safety: Strong typing at the boundary between Wasm and host

WIT Interface Example

```wit
// runtime/wit/inferlet.wit
interface model {
    record context {
        id: u64,
    }

    create-context: func() -> context;
    fill: func(ctx: context, text: string) -> result<_, string>;
    forward: func(ctx: context) -> result<list<f32>, string>;
    sample: func(logits: list<f32>, temperature: f32) -> u32;
}
```

Inferlets import this interface and call functions directly:

```rust
use inferlet::model::{create_context, fill, forward, sample};

let ctx = create_context();
fill(ctx, "Hello, world!")?;
let logits = forward(ctx)?;
let token = sample(logits, 0.8);
```

Telemetry and Observability

Pie integrates with OpenTelemetry for distributed tracing:

```
Inferlet Span
├─ create_context Span
├─ fill Span
│  └─ batch_prefill Span (GPU)
├─ generate Span
│  ├─ forward Span
│  │  └─ batch_decode Span (GPU)
│  └─ sample Span
└─ return Span
```

Collected Metrics:

  • Request latency (end-to-end)
  • GPU utilization
  • Batch size distribution
  • KV cache hit rate
  • Token throughput

Design Principles

  1. Separation of Concerns: Application, control, and inference layers have clear boundaries
  2. Fine-Grained APIs: Inferlets control KV cache, sampling, and generation logic
  3. Efficient Batching: Continuous batching maximizes GPU utilization
  4. Resource Virtualization: Abstract GPU resources as high-level objects
  5. Sandboxed Execution: Wasm provides security and portability
  6. Observability First: Built-in tracing and metrics for production deployments

Comparison with Traditional Serving

| Feature | Traditional Serving | Pie |
| --- | --- | --- |
| API | Fixed (text completion, chat) | Programmable (custom generation logic) |
| KV Cache | Global policies | Per-request control |
| Batching | System-level | Exposed to application |
| Extensibility | Fork and modify | Write inferlets |
| I/O Integration | External (round-trips) | Built-in (within inferlet) |
| Isolation | None (shared process) | Wasm sandboxing |

Further Reading