Deployment Guide

Last Updated: 3/13/2026



This guide covers deploying Pie in production environments, including Docker, Kubernetes, and bare-metal setups.

Docker Deployment

Pie includes a Dockerfile for containerized deployments.

Building the Docker Image

git clone https://github.com/pie-project/pie.git
cd pie
docker build -t pie-server:latest .

Running with Docker

docker run -d \
  --name pie-server \
  --gpus all \
  -p 8080:8080 \
  -v ~/.pie:/root/.pie \
  pie-server:latest

Flags:

  • --gpus all: Expose all GPUs to the container (requires nvidia-container-toolkit)
  • -p 8080:8080: Map host port 8080 to container port 8080
  • -v ~/.pie:/root/.pie: Mount configuration and cache directory

Docker Compose

Create a docker-compose.yml file:

version: '3.8'

services:
  pie-server:
    image: pie-server:latest
    container_name: pie-server
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
    ports:
      - "8080:8080"
    volumes:
      - ./config:/root/.pie
      - ./cache:/root/.pie/cache
      - ./adapters:/root/.pie/adapters
    restart: unless-stopped

Start the service:

docker-compose up -d

Kubernetes Deployment

Prerequisites

  • Kubernetes cluster with GPU support (NVIDIA GPU Operator or device plugin)
  • kubectl configured to access your cluster
  • Persistent storage for configuration and cache

Deployment Manifest

Create pie-deployment.yaml:

apiVersion: v1
kind: ConfigMap
metadata:
  name: pie-config
data:
  config.toml: |
    host = "0.0.0.0"
    port = 8080
    enable_auth = true

    [[model]]
    hf_repo = "Qwen/Qwen3-0.6B"
    device = ["cuda:0"]
    tensor_parallel_size = 1
    activation_dtype = "bfloat16"
    weight_dtype = "bfloat16"
    gpu_mem_utilization = 0.8

    [telemetry]
    enabled = true
    endpoint = "http://otel-collector:4317"
    service_name = "pie-k8s"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pie-cache
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pie-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: pie-server
  template:
    metadata:
      labels:
        app: pie-server
    spec:
      containers:
        - name: pie
          image: pie-server:latest
          ports:
            - containerPort: 8080
              name: http
          resources:
            limits:
              nvidia.com/gpu: 1
            requests:
              nvidia.com/gpu: 1
          volumeMounts:
            - name: config
              mountPath: /root/.pie/config.toml
              subPath: config.toml
            - name: cache
              mountPath: /root/.pie/cache
          env:
            - name: NVIDIA_VISIBLE_DEVICES
              value: "all"
      volumes:
        - name: config
          configMap:
            name: pie-config
        - name: cache
          persistentVolumeClaim:
            claimName: pie-cache
---
apiVersion: v1
kind: Service
metadata:
  name: pie-server
spec:
  type: LoadBalancer
  ports:
    - port: 8080
      targetPort: 8080
      protocol: TCP
      name: http
  selector:
    app: pie-server

Deploy to Kubernetes:

kubectl apply -f pie-deployment.yaml

Multi-GPU Deployment

For tensor parallelism across multiple GPUs:

resources:
  limits:
    nvidia.com/gpu: 4  # Request 4 GPUs
  requests:
    nvidia.com/gpu: 4

Update the config to use all GPUs:

[[model]]
hf_repo = "meta-llama/Llama-3.1-70B"
device = ["cuda:0", "cuda:1", "cuda:2", "cuda:3"]
tensor_parallel_size = 4
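As a back-of-envelope check for why a 70B model needs multiple GPUs: tensor parallelism shards the weight matrices roughly evenly across devices. The sketch below is illustrative arithmetic (the function name is ours, not part of Pie) and ignores activations and KV cache:

```python
def weight_bytes_per_gpu(n_params: float, bytes_per_param: int, tp_size: int) -> float:
    """Approximate per-GPU weight memory under tensor parallelism.

    Weights are sharded roughly evenly across the tensor-parallel
    group, so per-GPU weight memory is total size / tp_size.
    """
    return n_params * bytes_per_param / tp_size

# Llama-3.1-70B at bfloat16 (2 bytes/param), split across 4 GPUs:
per_gpu = weight_bytes_per_gpu(70e9, 2, 4)
print(f"{per_gpu / 1e9:.0f} GB per GPU")  # 35 GB per GPU, before KV cache
```

At ~35 GB of weights per GPU, four 80 GB devices leave headroom for KV cache; a single GPU (140 GB of weights) would not fit.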

Bare-Metal Deployment

System Requirements

  • GPU: NVIDIA GPU with CUDA 12.6+ support
  • RAM: 16GB+ (more for larger models)
  • Storage: 100GB+ for models and cache
  • OS: Ubuntu 20.04+ or similar Linux distribution

Installation

  1. Install CUDA and drivers:

    # Follow NVIDIA's official installation guide:
    # https://developer.nvidia.com/cuda-downloads
  2. Install Pie from source:

    git clone https://github.com/pie-project/pie.git
    cd pie/pie
    uv sync --extra cu128
  3. Create a systemd service:

Create /etc/systemd/system/pie-server.service:

[Unit]
Description=Pie LLM Serving System
After=network.target

[Service]
Type=simple
User=pie
Group=pie
WorkingDirectory=/opt/pie
Environment="PATH=/opt/pie/.venv/bin:/usr/local/bin:/usr/bin"
ExecStart=/opt/pie/.venv/bin/pie serve --config /etc/pie/config.toml
Restart=on-failure
RestartSec=10s

[Install]
WantedBy=multi-user.target
  4. Enable and start the service:

    sudo systemctl daemon-reload
    sudo systemctl enable pie-server
    sudo systemctl start pie-server
    sudo systemctl status pie-server

Reverse Proxy with Nginx

For production deployments, use a reverse proxy for TLS termination:

server {
    listen 443 ssl http2;
    server_name pie.example.com;

    ssl_certificate /etc/letsencrypt/live/pie.example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/pie.example.com/privkey.pem;

    location / {
        proxy_pass http://127.0.0.1:8080;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection 'upgrade';
        proxy_set_header Host $host;
        proxy_cache_bypass $http_upgrade;

        # Increase timeouts for long-running inferlets
        proxy_read_timeout 300s;
        proxy_send_timeout 300s;
    }
}

Performance Tuning

GPU Memory Optimization

Adjust gpu_mem_utilization based on your workload:

# Conservative (leaves more memory for model weights)
gpu_mem_utilization = 0.7

# Aggressive (maximizes KV cache capacity)
gpu_mem_utilization = 0.9
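To get intuition for what this knob trades off, here is an illustrative calculation (not Pie's internal accounting; the 80 GB GPU and 16 GB of weights are assumed numbers): the utilization fraction caps total usable GPU memory, and whatever the weights do not occupy is roughly what remains for KV cache.

```python
def kv_cache_budget_gb(total_gpu_gb: float, weights_gb: float,
                       gpu_mem_utilization: float) -> float:
    """Rough KV-cache budget: the utilization fraction of total GPU
    memory, minus what the model weights already occupy."""
    return total_gpu_gb * gpu_mem_utilization - weights_gb

# 80 GB GPU, ~16 GB of weights (e.g. an 8B model at bfloat16):
for util in (0.7, 0.9):
    print(util, round(kv_cache_budget_gb(80, 16, util), 1), "GB for KV cache")
```

Raising utilization from 0.7 to 0.9 grows the KV-cache budget from roughly 40 GB to 56 GB in this example, at the cost of less headroom for activations and other processes sharing the GPU.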

Batch Size Tuning

Increase max_batch_tokens for higher throughput:

# Default
max_batch_tokens = 10240

# High-throughput (requires more GPU memory)
max_batch_tokens = 32768
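A quick way to sanity-check a `max_batch_tokens` value against your KV-cache budget is to estimate KV bytes per token from the model shape. This is an illustrative sketch (the model dimensions below are assumed, Llama-8B-like values), not a Pie utility:

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """KV cache bytes per token: one K and one V vector per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes

# Assumed shape: 32 layers, 8 KV heads, head_dim 128, bfloat16
per_token = kv_bytes_per_token(32, 8, 128, 2)
print(per_token)  # 131072 bytes = 128 KiB per token

for batch in (10240, 32768):
    print(batch, "tokens ->", batch * per_token / 2**30, "GiB of KV cache")
```

At 128 KiB per token, the default 10240-token batch needs about 1.25 GiB of KV cache, while 32768 tokens needs 4 GiB, which must fit inside the budget left by `gpu_mem_utilization`.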

CUDA Graphs (Experimental)

Enable CUDA graphs for reduced kernel launch overhead:

use_cuda_graphs = true

:::caution
CUDA graphs may not work with all models. Test thoroughly before enabling in production.
:::

Monitoring and Observability

OpenTelemetry Integration

Enable telemetry to collect traces and metrics:

[telemetry]
enabled = true
endpoint = "http://otel-collector:4317"
service_name = "pie-production"

Set up an OpenTelemetry collector to forward data to your observability platform (Jaeger, Prometheus, Grafana, etc.).
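As a starting point, a minimal collector configuration that receives OTLP traffic on the endpoint configured above might look like the sketch below; in practice you would replace the `debug` exporter with exporters for your backend (Jaeger, Prometheus, etc.):

```yaml
# Illustrative OpenTelemetry Collector config (otel-collector-config.yaml)
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

exporters:
  debug: {}          # prints telemetry to the collector's own log

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [debug]
```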

Health Checks

Pie exposes a health check endpoint at /health:

curl http://localhost:8080/health

Use this for Kubernetes liveness and readiness probes:

livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5
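Outside Kubernetes (for example in a deploy script or smoke test), the same endpoint can be polled from Python. A small helper of our own, not part of Pie's API:

```python
import time
import urllib.error
import urllib.request


def wait_for_health(url: str, timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Poll a health endpoint until it answers HTTP 200 or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet; retry after a short sleep
        time.sleep(interval)
    return False
```

For example, `wait_for_health("http://localhost:8080/health")` blocks until the server is ready, which is useful before cutting traffic over after a restart.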

Logging

Configure log output via the log_dir parameter:

from pie import start_server, ServerConfig

config = ServerConfig(
    host="0.0.0.0",
    port=8080,
    enable_auth=True,
    cache_dir="/var/lib/pie/cache",
    verbose=True,
    log_dir="/var/log/pie",
    registry="/var/lib/pie/registry",
)
handle = start_server(config)

Logs include:

  • Request/response traces
  • Model loading and initialization
  • Batch scheduling decisions
  • Error and warning messages
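On bare-metal deployments, these logs should be rotated so they do not fill the disk. A typical logrotate sketch for the `log_dir` above, assuming Pie writes plain `.log` files there (adjust the pattern and policy to your setup):

```
# /etc/logrotate.d/pie  (illustrative)
/var/log/pie/*.log {
    daily
    rotate 14
    compress
    delaycompress
    missingok
    notifempty
    copytruncate
}
```

`copytruncate` is used here on the assumption that the server keeps its log files open; if Pie reopens logs on signal, a `postrotate` reload is the cleaner option.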

Security Best Practices

  1. Enable authentication in production:

    enable_auth = true
  2. Use TLS: Deploy behind a reverse proxy with TLS termination

  3. Restrict network access: Bind to 127.0.0.1 or use firewall rules

  4. Update regularly: Keep Pie and dependencies up to date for security patches

  5. Isolate workloads: Use containers or VMs for multi-tenant deployments

  6. Monitor resource usage: Set up alerts for GPU memory, CPU, and disk usage

Troubleshooting

Out of Memory (OOM) Errors

Symptoms: Server crashes with CUDA OOM errors

Solutions:

  • Reduce gpu_mem_utilization (e.g., from 0.8 to 0.7)
  • Decrease max_batch_tokens
  • Use a smaller model or increase tensor parallelism
  • Enable gradient checkpointing (if supported by the model)

Slow Inference

Symptoms: High latency for requests

Solutions:

  • Check GPU utilization (nvidia-smi)
  • Increase max_batch_tokens for better batching
  • Enable CUDA graphs (experimental)
  • Use smaller kv_page_size for better cache reuse
  • Profile with telemetry to identify bottlenecks

Model Loading Failures

Symptoms: Server fails to start or load models

Solutions:

  • Verify HuggingFace model repository name
  • Check internet connectivity (for downloading models)
  • Ensure sufficient disk space for model cache
  • Verify CUDA/GPU availability (nvidia-smi)

Production Checklist

Before deploying to production:

  • Enable authentication (enable_auth = true)
  • Configure TLS via reverse proxy
  • Set up monitoring and alerting
  • Configure log rotation
  • Test failover and restart behavior
  • Document authorized users and access policies
  • Set resource limits (CPU, memory, GPU)
  • Configure backups for configuration and adapters
  • Test with production-like workloads
  • Establish incident response procedures