Deployment Guide

Last Updated: 3/13/2026



This guide covers deploying Pie in production environments, including Docker, Kubernetes, and bare-metal setups.

Docker Deployment

Pie includes a Dockerfile for containerized deployments.

Building the Docker Image

git clone https://github.com/pie-project/pie.git
cd pie
docker build -t pie-server:latest .

Running with Docker

docker run -d \
  --name pie-server \
  --gpus all \
  -p 8080:8080 \
  -v ~/.pie:/root/.pie \
  pie-server:latest

Flags:

  • --gpus all: Expose all GPUs to the container (requires nvidia-container-toolkit)
  • -p 8080:8080: Map host port 8080 to container port 8080
  • -v ~/.pie:/root/.pie: Mount configuration and cache directory

Docker Compose

Create a docker-compose.yml file:

version: '3.8'

services:
  pie-server:
    image: pie-server:latest
    container_name: pie-server
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
    ports:
      - "8080:8080"
    volumes:
      - ./config:/root/.pie
      - ./cache:/root/.pie/cache
      - ./adapters:/root/.pie/adapters
    restart: unless-stopped

Start the service:

docker-compose up -d

Kubernetes Deployment

Prerequisites

  • Kubernetes cluster with GPU support (NVIDIA GPU Operator or device plugin)
  • kubectl configured to access your cluster
  • Persistent storage for configuration and cache

Deployment Manifest

Create pie-deployment.yaml:

apiVersion: v1
kind: ConfigMap
metadata:
  name: pie-config
data:
  config.toml: |
    host = "0.0.0.0"
    port = 8080
    enable_auth = true

    [[model]]
    hf_repo = "Qwen/Qwen3-0.6B"
    device = ["cuda:0"]
    tensor_parallel_size = 1
    activation_dtype = "bfloat16"
    weight_dtype = "bfloat16"
    gpu_mem_utilization = 0.8

    [telemetry]
    enabled = true
    endpoint = "http://otel-collector:4317"
    service_name = "pie-k8s"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pie-cache
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pie-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: pie-server
  template:
    metadata:
      labels:
        app: pie-server
    spec:
      containers:
        - name: pie
          image: pie-server:latest
          ports:
            - containerPort: 8080
              name: http
          resources:
            limits:
              nvidia.com/gpu: 1
            requests:
              nvidia.com/gpu: 1
          volumeMounts:
            - name: config
              mountPath: /root/.pie/config.toml
              subPath: config.toml
            - name: cache
              mountPath: /root/.pie/cache
          env:
            - name: NVIDIA_VISIBLE_DEVICES
              value: "all"
      volumes:
        - name: config
          configMap:
            name: pie-config
        - name: cache
          persistentVolumeClaim:
            claimName: pie-cache
---
apiVersion: v1
kind: Service
metadata:
  name: pie-server
spec:
  type: LoadBalancer
  ports:
    - port: 8080
      targetPort: 8080
      protocol: TCP
      name: http
  selector:
    app: pie-server

Deploy to Kubernetes:

kubectl apply -f pie-deployment.yaml

Multi-GPU Deployment

For tensor parallelism across multiple GPUs:

resources:
  limits:
    nvidia.com/gpu: 4  # Request 4 GPUs
  requests:
    nvidia.com/gpu: 4

Update the config to use all GPUs:

[[model]]
hf_repo = "meta-llama/Llama-3.1-70B"
device = ["cuda:0", "cuda:1", "cuda:2", "cuda:3"]
tensor_parallel_size = 4
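As a back-of-envelope check for why a 70B model needs multiple GPUs: tensor parallelism shards the weight matrices roughly evenly across devices. The sketch below is illustrative arithmetic (the function name is ours, not part of Pie) and ignores activations and KV cache:

```python
def weight_bytes_per_gpu(n_params: float, bytes_per_param: int, tp_size: int) -> float:
    """Approximate per-GPU weight memory under tensor parallelism.

    Weights are sharded roughly evenly across the tensor-parallel
    group, so per-GPU weight memory is total size / tp_size.
    """
    return n_params * bytes_per_param / tp_size

# Llama-3.1-70B at bfloat16 (2 bytes/param), split across 4 GPUs:
per_gpu = weight_bytes_per_gpu(70e9, 2, 4)
print(f"{per_gpu / 1e9:.0f} GB per GPU")  # 35 GB per GPU, before KV cache
```

At ~35 GB of weights per GPU, four 80 GB devices leave headroom for KV cache; a single GPU (140 GB of weights) would not fit.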

Bare-Metal Deployment

System Requirements

  • GPU: NVIDIA GPU with CUDA 12.6+ support
  • RAM: 16GB+ (more for larger models)
  • Storage: 100GB+ for models and cache
  • OS: Ubuntu 20.04+ or similar Linux distribution

Installation

  1. Install CUDA and drivers:

    # Follow NVIDIA's official installation guide:
    # https://developer.nvidia.com/cuda-downloads
  2. Install Pie from source:

    git clone https://github.com/pie-project/pie.git
    cd pie/pie
    uv sync --extra cu128
  3. Create a systemd service:

Create /etc/systemd/system/pie-server.service:

[Unit]
Description=Pie LLM Serving System
After=network.target

[Service]
Type=simple
User=pie
Group=pie
WorkingDirectory=/opt/pie
Environment="PATH=/opt/pie/.venv/bin:/usr/local/bin:/usr/bin"
ExecStart=/opt/pie/.venv/bin/pie serve --config /etc/pie/config.toml
Restart=on-failure
RestartSec=10s

[Install]
WantedBy=multi-user.target
  4. Enable and start the service:

    sudo systemctl daemon-reload
    sudo systemctl enable pie-server
    sudo systemctl start pie-server
    sudo systemctl status pie-server

Reverse Proxy with Nginx

For production deployments, use a reverse proxy for TLS termination:

server {
    listen 443 ssl http2;
    server_name pie.example.com;

    ssl_certificate /etc/letsencrypt/live/pie.example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/pie.example.com/privkey.pem;

    location / {
        proxy_pass http://127.0.0.1:8080;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection 'upgrade';
        proxy_set_header Host $host;
        proxy_cache_bypass $http_upgrade;

        # Increase timeouts for long-running inferlets
        proxy_read_timeout 300s;
        proxy_send_timeout 300s;
    }
}

Performance Tuning

GPU Memory Optimization

Adjust gpu_mem_utilization based on your workload:

# Conservative (leaves more memory for model weights)
gpu_mem_utilization = 0.7

# Aggressive (maximizes KV cache capacity)
gpu_mem_utilization = 0.9
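To get intuition for what this knob trades off, here is an illustrative calculation (not Pie's internal accounting; the 80 GB GPU and 16 GB of weights are assumed numbers): the utilization fraction caps total usable GPU memory, and whatever the weights do not occupy is roughly what remains for KV cache.

```python
def kv_cache_budget_gb(total_gpu_gb: float, weights_gb: float,
                       gpu_mem_utilization: float) -> float:
    """Rough KV-cache budget: the utilization fraction of total GPU
    memory, minus what the model weights already occupy."""
    return total_gpu_gb * gpu_mem_utilization - weights_gb

# 80 GB GPU, ~16 GB of weights (e.g. an 8B model at bfloat16):
for util in (0.7, 0.9):
    print(util, round(kv_cache_budget_gb(80, 16, util), 1), "GB for KV cache")
```

Raising utilization from 0.7 to 0.9 grows the KV-cache budget from roughly 40 GB to 56 GB in this example, at the cost of less headroom for activations and other processes sharing the GPU.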

Batch Size Tuning

Increase max_batch_tokens for higher throughput:

# Default
max_batch_tokens = 10240

# High-throughput (requires more GPU memory)
max_batch_tokens = 32768
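A quick way to sanity-check a `max_batch_tokens` value against your KV-cache budget is to estimate KV bytes per token from the model shape. This is an illustrative sketch (the model dimensions below are assumed, Llama-8B-like values), not a Pie utility:

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """KV cache bytes per token: one K and one V vector per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes

# Assumed shape: 32 layers, 8 KV heads, head_dim 128, bfloat16
per_token = kv_bytes_per_token(32, 8, 128, 2)
print(per_token)  # 131072 bytes = 128 KiB per token

for batch in (10240, 32768):
    print(batch, "tokens ->", batch * per_token / 2**30, "GiB of KV cache")
```

At 128 KiB per token, the default 10240-token batch needs about 1.25 GiB of KV cache, while 32768 tokens needs 4 GiB, which must fit inside the budget left by `gpu_mem_utilization`.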

CUDA Graphs (Experimental)

Enable CUDA graphs for reduced kernel launch overhead:

use_cuda_graphs = true

:::caution
CUDA graphs may not work with all models. Test thoroughly before enabling in production.
:::

Monitoring and Observability

OpenTelemetry Integration

Enable telemetry to collect traces and metrics:

[telemetry]
enabled = true
endpoint = "http://otel-collector:4317"
service_name = "pie-production"

Set up an OpenTelemetry collector to forward data to your observability platform (Jaeger, Prometheus, Grafana, etc.).
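As a starting point, a minimal collector configuration that receives OTLP traffic on the endpoint configured above might look like the sketch below; in practice you would replace the `debug` exporter with exporters for your backend (Jaeger, Prometheus, etc.):

```yaml
# Illustrative OpenTelemetry Collector config (otel-collector-config.yaml)
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

exporters:
  debug: {}          # prints telemetry to the collector's own log

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [debug]
```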

Health Checks

Pie exposes a health check endpoint at /health:

curl http://localhost:8080/health

Use this for Kubernetes liveness and readiness probes:

livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5
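Outside Kubernetes (for example in a deploy script or smoke test), the same endpoint can be polled from Python. A small helper of our own, not part of Pie's API:

```python
import time
import urllib.error
import urllib.request


def wait_for_health(url: str, timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Poll a health endpoint until it answers HTTP 200 or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet; retry after a short sleep
        time.sleep(interval)
    return False
```

For example, `wait_for_health("http://localhost:8080/health")` blocks until the server is ready, which is useful before cutting traffic over after a restart.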

Logging

Configure log output via the log_dir parameter:

from pie import start_server, ServerConfig

config = ServerConfig(
    host="0.0.0.0",
    port=8080,
    enable_auth=True,
    cache_dir="/var/lib/pie/cache",
    verbose=True,
    log_dir="/var/log/pie",
    registry="/var/lib/pie/registry",
)
handle = start_server(config)

Logs include:

  • Request/response traces
  • Model loading and initialization
  • Batch scheduling decisions
  • Error and warning messages
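On bare-metal deployments, these logs should be rotated so they do not fill the disk. A typical logrotate sketch for the `log_dir` above, assuming Pie writes plain `.log` files there (adjust the pattern and policy to your setup):

```
# /etc/logrotate.d/pie  (illustrative)
/var/log/pie/*.log {
    daily
    rotate 14
    compress
    delaycompress
    missingok
    notifempty
    copytruncate
}
```

`copytruncate` is used here on the assumption that the server keeps its log files open; if Pie reopens logs on signal, a `postrotate` reload is the cleaner option.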

Security Best Practices

  1. Enable authentication in production:

    enable_auth = true
  2. Use TLS: Deploy behind a reverse proxy with TLS termination

  3. Restrict network access: Bind to 127.0.0.1 or use firewall rules

  4. Update regularly: Keep Pie and dependencies up to date for security patches

  5. Isolate workloads: Use containers or VMs for multi-tenant deployments

  6. Monitor resource usage: Set up alerts for GPU memory, CPU, and disk usage

Troubleshooting

Out of Memory (OOM) Errors

Symptoms: Server crashes with CUDA OOM errors

Solutions:

  • Reduce gpu_mem_utilization (e.g., from 0.8 to 0.7)
  • Decrease max_batch_tokens
  • Use a smaller model or increase tensor parallelism
  • Enable gradient checkpointing (if supported by the model)

Slow Inference

Symptoms: High latency for requests

Solutions:

  • Check GPU utilization (nvidia-smi)
  • Increase max_batch_tokens for better batching
  • Enable CUDA graphs (experimental)
  • Use smaller kv_page_size for better cache reuse
  • Profile with telemetry to identify bottlenecks

Model Loading Failures

Symptoms: Server fails to start or load models

Solutions:

  • Verify HuggingFace model repository name
  • Check internet connectivity (for downloading models)
  • Ensure sufficient disk space for model cache
  • Verify CUDA/GPU availability (nvidia-smi)

Production Checklist

Before deploying to production:

  • Enable authentication (enable_auth = true)
  • Configure TLS via reverse proxy
  • Set up monitoring and alerting
  • Configure log rotation
  • Test failover and restart behavior
  • Document authorized users and access policies
  • Set resource limits (CPU, memory, GPU)
  • Configure backups for configuration and adapters
  • Test with production-like workloads
  • Establish incident response procedures