Last Updated: 3/13/2026
# Deployment Guide

This guide covers deploying Pie in production environments, including Docker, Kubernetes, and bare-metal setups.
## Docker Deployment

Pie includes a Dockerfile for containerized deployments.

### Building the Docker Image

```bash
git clone https://github.com/pie-project/pie.git
cd pie
docker build -t pie-server:latest .
```

### Running with Docker

```bash
docker run -d \
  --name pie-server \
  --gpus all \
  -p 8080:8080 \
  -v ~/.pie:/root/.pie \
  pie-server:latest
```

Flags:

- `--gpus all`: Expose all GPUs to the container (requires `nvidia-container-toolkit`)
- `-p 8080:8080`: Map host port 8080 to container port 8080
- `-v ~/.pie:/root/.pie`: Mount the configuration and cache directory
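Once the container is running, you can verify that the server is answering before sending traffic. A minimal poller, sketched with only the standard library (it assumes the `/health` endpoint described under Monitoring below; the interval and timeout values are arbitrary defaults):

```python
import time
import urllib.request


def wait_for_health(url: str, timeout: float = 60.0, interval: float = 2.0) -> bool:
    """Poll a health endpoint until it returns HTTP 200 or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            pass  # connection refused / not ready yet; retry
        time.sleep(interval)
    return False
```

Call it as `wait_for_health("http://localhost:8080/health")` after `docker run`; it returns `True` once the server responds with 200, or `False` if the deadline passes first.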
### Docker Compose

Create a `docker-compose.yml` file:

```yaml
version: '3.8'
services:
  pie-server:
    image: pie-server:latest
    container_name: pie-server
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
    ports:
      - "8080:8080"
    volumes:
      - ./config:/root/.pie
      - ./cache:/root/.pie/cache
      - ./adapters:/root/.pie/adapters
    restart: unless-stopped
```

Start the service:

```bash
docker-compose up -d
```

## Kubernetes Deployment
### Prerequisites

- Kubernetes cluster with GPU support (NVIDIA GPU Operator or device plugin)
- `kubectl` configured to access your cluster
- Persistent storage for configuration and cache

### Deployment Manifest

Create `pie-deployment.yaml`:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: pie-config
data:
  config.toml: |
    host = "0.0.0.0"
    port = 8080
    enable_auth = true

    [[model]]
    hf_repo = "Qwen/Qwen3-0.6B"
    device = ["cuda:0"]
    tensor_parallel_size = 1
    activation_dtype = "bfloat16"
    weight_dtype = "bfloat16"
    gpu_mem_utilization = 0.8

    [telemetry]
    enabled = true
    endpoint = "http://otel-collector:4317"
    service_name = "pie-k8s"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pie-cache
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pie-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: pie-server
  template:
    metadata:
      labels:
        app: pie-server
    spec:
      containers:
        - name: pie
          image: pie-server:latest
          ports:
            - containerPort: 8080
              name: http
          resources:
            limits:
              nvidia.com/gpu: 1
            requests:
              nvidia.com/gpu: 1
          volumeMounts:
            - name: config
              # Mount the single file, not the whole directory, so the
              # cache mount below can coexist under /root/.pie
              mountPath: /root/.pie/config.toml
              subPath: config.toml
            - name: cache
              mountPath: /root/.pie/cache
          env:
            - name: NVIDIA_VISIBLE_DEVICES
              value: "all"
      volumes:
        - name: config
          configMap:
            name: pie-config
        - name: cache
          persistentVolumeClaim:
            claimName: pie-cache
---
apiVersion: v1
kind: Service
metadata:
  name: pie-server
spec:
  type: LoadBalancer
  ports:
    - port: 8080
      targetPort: 8080
      protocol: TCP
      name: http
  selector:
    app: pie-server
```

Deploy to Kubernetes:

```bash
kubectl apply -f pie-deployment.yaml
```

### Multi-GPU Deployment
For tensor parallelism across multiple GPUs, raise the GPU request in the Deployment:

```yaml
resources:
  limits:
    nvidia.com/gpu: 4  # Request 4 GPUs
  requests:
    nvidia.com/gpu: 4
```

Update the config to use all GPUs:

```toml
[[model]]
hf_repo = "meta-llama/Llama-3.1-70B"
device = ["cuda:0", "cuda:1", "cuda:2", "cuda:3"]
tensor_parallel_size = 4
```

## Bare-Metal Deployment
### System Requirements

- GPU: NVIDIA GPU with CUDA 12.6+ support
- RAM: 16GB+ (more for larger models)
- Storage: 100GB+ for models and cache
- OS: Ubuntu 20.04+ or similar Linux distribution
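The requirements above can be sanity-checked with a short preflight script before installing. A sketch using only the standard library (the thresholds mirror the list above; the RAM check uses `os.sysconf`, which is Linux-specific):

```python
import os
import shutil


def preflight(min_ram_gb: float = 16, min_disk_gb: float = 100,
              path: str = "/") -> list[str]:
    """Return a list of problems found; an empty list means the host looks OK."""
    problems = []

    # GPU driver: nvidia-smi should be on PATH if the NVIDIA driver is installed.
    if shutil.which("nvidia-smi") is None:
        problems.append("nvidia-smi not found (NVIDIA driver not installed?)")

    # RAM: total physical memory via sysconf (Linux).
    ram_gb = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1e9
    if ram_gb < min_ram_gb:
        problems.append(f"only {ram_gb:.1f} GB RAM; {min_ram_gb} GB+ recommended")

    # Disk: free space where models and cache will live.
    free_gb = shutil.disk_usage(path).free / 1e9
    if free_gb < min_disk_gb:
        problems.append(f"only {free_gb:.1f} GB free on {path}; "
                        f"{min_disk_gb} GB+ recommended")

    return problems


if __name__ == "__main__":
    for problem in preflight():
        print("WARN:", problem)
```

Run it on the target host before installation; any printed warnings point at the requirement that fails.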
### Installation

1. Install CUDA and drivers:

   ```bash
   # Follow NVIDIA's official installation guide:
   # https://developer.nvidia.com/cuda-downloads
   ```

2. Install Pie from source:

   ```bash
   git clone https://github.com/pie-project/pie.git
   cd pie/pie
   uv sync --extra cu128
   ```

3. Create a systemd service at `/etc/systemd/system/pie-server.service`:

   ```ini
   [Unit]
   Description=Pie LLM Serving System
   After=network.target

   [Service]
   Type=simple
   User=pie
   Group=pie
   WorkingDirectory=/opt/pie
   Environment="PATH=/opt/pie/.venv/bin:/usr/local/bin:/usr/bin"
   ExecStart=/opt/pie/.venv/bin/pie serve --config /etc/pie/config.toml
   Restart=on-failure
   RestartSec=10s

   [Install]
   WantedBy=multi-user.target
   ```

4. Enable and start the service:

   ```bash
   sudo systemctl daemon-reload
   sudo systemctl enable pie-server
   sudo systemctl start pie-server
   sudo systemctl status pie-server
   ```
## Reverse Proxy with Nginx

For production deployments, use a reverse proxy for TLS termination:

```nginx
server {
    listen 443 ssl http2;
    server_name pie.example.com;

    ssl_certificate /etc/letsencrypt/live/pie.example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/pie.example.com/privkey.pem;

    location / {
        proxy_pass http://127.0.0.1:8080;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection 'upgrade';
        proxy_set_header Host $host;
        proxy_cache_bypass $http_upgrade;

        # Increase timeouts for long-running inferlets
        proxy_read_timeout 300s;
        proxy_send_timeout 300s;
    }
}
```

## Performance Tuning

### GPU Memory Optimization

Adjust `gpu_mem_utilization` based on your workload:

```toml
# Conservative (leaves more memory for model weights)
gpu_mem_utilization = 0.7

# Aggressive (maximizes KV cache capacity)
gpu_mem_utilization = 0.9
```

### Batch Size Tuning

Increase `max_batch_tokens` for higher throughput:

```toml
# Default
max_batch_tokens = 10240

# High-throughput (requires more GPU memory)
max_batch_tokens = 32768
```

### CUDA Graphs (Experimental)

Enable CUDA graphs for reduced kernel launch overhead:

```toml
use_cuda_graphs = true
```

:::caution
CUDA graphs may not work with all models. Test thoroughly before enabling in production.
:::
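To see how `gpu_mem_utilization` and `max_batch_tokens` interact, it helps to estimate the memory budget explicitly. A back-of-the-envelope sketch (the GPU size, weight size, and model dimensions below are illustrative placeholders, not measured Pie values):

```python
def kv_cache_budget_gb(total_vram_gb: float, weights_gb: float,
                       gpu_mem_utilization: float) -> float:
    """VRAM left for the KV cache after weights, within the utilization cap."""
    return total_vram_gb * gpu_mem_utilization - weights_gb


def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """KV cache bytes per token: K and V tensors, per layer, per KV head."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes


# Example: a 24 GB GPU holding ~2 GB of bf16 weights at gpu_mem_utilization = 0.8,
# with illustrative model dimensions (28 layers, 8 KV heads, head_dim 128).
budget_gb = kv_cache_budget_gb(24, 2, 0.8)
per_token = kv_bytes_per_token(layers=28, kv_heads=8, head_dim=128)

print(f"KV budget: {budget_gb:.1f} GB, {per_token} B/token, "
      f"~{int(budget_gb * 1e9 / per_token):,} tokens fit")
```

On these example numbers, `max_batch_tokens = 32768` alone would pin roughly 3.8 GB of KV cache, which is why raising it without headroom produces the OOM errors covered under Troubleshooting below.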
## Monitoring and Observability

### OpenTelemetry Integration

Enable telemetry to collect traces and metrics:

```toml
[telemetry]
enabled = true
endpoint = "http://otel-collector:4317"
service_name = "pie-production"
```

Set up an OpenTelemetry collector to forward data to your observability platform (Jaeger, Prometheus, Grafana, etc.).
### Health Checks

Pie exposes a health check endpoint at `/health`:

```bash
curl http://localhost:8080/health
```

Use this for Kubernetes liveness and readiness probes:

```yaml
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5
```

### Logging
Configure log output via the `log_dir` parameter:

```python
from pie import start_server, ServerConfig

config = ServerConfig(
    host="0.0.0.0",
    port=8080,
    enable_auth=True,
    cache_dir="/var/lib/pie/cache",
    verbose=True,
    log_dir="/var/log/pie",
    registry="/var/lib/pie/registry",
)
handle = start_server(config)
```

Logs include:

- Request/response traces
- Model loading and initialization
- Batch scheduling decisions
- Error and warning messages
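Because the logs land in plain files under `log_dir`, a short script can scan them for error lines, e.g. from a cron job that feeds an alert. The `*.log` layout and the `ERROR`/`WARNING` level strings are assumptions about the log format, not documented Pie behavior; adjust them to what your installation actually emits:

```python
from pathlib import Path


def count_levels(log_dir: str,
                 levels: tuple[str, ...] = ("ERROR", "WARNING")) -> dict[str, int]:
    """Count occurrences of each level string across *.log files in log_dir."""
    counts = {level: 0 for level in levels}
    for log_file in Path(log_dir).glob("*.log"):
        for line in log_file.read_text(errors="replace").splitlines():
            for level in levels:
                if level in line:
                    counts[level] += 1
    return counts


if __name__ == "__main__":
    print(count_levels("/var/log/pie"))
```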
## Security Best Practices

1. **Enable authentication** in production: `enable_auth = true`
2. **Use TLS**: Deploy behind a reverse proxy with TLS termination
3. **Restrict network access**: Bind to `127.0.0.1` or use firewall rules
4. **Update regularly**: Keep Pie and dependencies up to date for security patches
5. **Isolate workloads**: Use containers or VMs for multi-tenant deployments
6. **Monitor resource usage**: Set up alerts for GPU memory, CPU, and disk usage
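The "restrict network access" advice is easy to verify from the outside: when Pie is bound to `127.0.0.1` or firewalled off, port 8080 should refuse connections from other hosts. A minimal TCP reachability probe (the hostname is an example; run this from a machine other than the server, where it should return `False`):

```python
import socket


def is_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, `is_reachable("pie.example.com", 8080)` returning `True` from an untrusted network means the port is publicly exposed and should be locked down.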
## Troubleshooting

### Out of Memory (OOM) Errors

**Symptoms**: Server crashes with CUDA OOM errors

**Solutions**:

- Reduce `gpu_mem_utilization` (e.g., from 0.8 to 0.7)
- Decrease `max_batch_tokens`
- Use a smaller model or increase tensor parallelism
- Enable gradient checkpointing (if supported by the model)

### Slow Inference

**Symptoms**: High latency for requests

**Solutions**:

- Check GPU utilization (`nvidia-smi`)
- Increase `max_batch_tokens` for better batching
- Enable CUDA graphs (experimental)
- Use a smaller `kv_page_size` for better cache reuse
- Profile with telemetry to identify bottlenecks

### Model Loading Failures

**Symptoms**: Server fails to start or load models

**Solutions**:

- Verify the HuggingFace model repository name
- Check internet connectivity (for downloading models)
- Ensure sufficient disk space for the model cache
- Verify CUDA/GPU availability (`nvidia-smi`)
## Production Checklist

Before deploying to production:

- Enable authentication (`enable_auth = true`)
- Configure TLS via reverse proxy
- Set up monitoring and alerting
- Configure log rotation
- Test failover and restart behavior
- Document authorized users and access policies
- Set resource limits (CPU, memory, GPU)
- Configure backups for configuration and adapters
- Test with production-like workloads
- Establish incident response procedures