Last Updated: 3/13/2026
Configuration Reference
Pie uses a TOML configuration file to control server behavior, model settings, and resource allocation. The default configuration file is located at ~/.pie/config.toml.
Network Settings
```toml
host = "127.0.0.1"
port = 8080
```
- host: The network interface to bind to (default: `127.0.0.1`, localhost only)
- port: The TCP port for the server to listen on (default: `8080`)
Authentication
```toml
enable_auth = false
```
- enable_auth: Enable SSH-based authentication for client connections (default: `false`)
When authentication is enabled, authorized users are managed via ~/.pie/authorized_users.toml. See the Authentication Guide for details.
Model Configuration
Pie supports multiple model configurations. Each model is defined in a [[model]] section:
```toml
[[model]]
hf_repo = "Qwen/Qwen3-0.6B"
device = ["cuda:0"]
tensor_parallel_size = 1
activation_dtype = "bfloat16"
weight_dtype = "bfloat16"
```
Model Parameters
- hf_repo: HuggingFace model repository identifier (e.g., `"Qwen/Qwen3-0.6B"`)
- device: List of GPU devices for this model (e.g., `["cuda:0"]` or `["cuda:0", "cuda:1"]`)
- tensor_parallel_size: Degree of tensor parallelism (default: `1`)
  - Set to `1` for data parallelism only (each GPU runs the full model independently)
  - Set to `len(device)` for tensor parallelism only (all GPUs share one model)
  - Example: 4 GPUs with `tensor_parallel_size = 2` creates 2 data-parallel groups of 2 tensor-parallel GPUs each
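To make the grouping rule concrete, here is a small illustrative sketch (not part of Pie) of how a device list decomposes into data-parallel groups:

```python
def parallel_groups(devices, tensor_parallel_size):
    """Split a device list into data-parallel groups of
    tensor_parallel_size devices each (illustrative only)."""
    if len(devices) % tensor_parallel_size != 0:
        raise ValueError("device count must be divisible by tensor_parallel_size")
    return [devices[i:i + tensor_parallel_size]
            for i in range(0, len(devices), tensor_parallel_size)]

# 4 GPUs with tensor_parallel_size = 2 -> 2 groups of 2:
print(parallel_groups(["cuda:0", "cuda:1", "cuda:2", "cuda:3"], 2))
# [['cuda:0', 'cuda:1'], ['cuda:2', 'cuda:3']]
```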
Precision Settings
- activation_dtype: Data type for activations (`"float32"`, `"float16"`, or `"bfloat16"`)
- weight_dtype: Data type for model weights (`"float32"`, `"float16"`, or `"bfloat16"`)
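One practical consequence: a float32 element takes 4 bytes while float16 and bfloat16 take 2, so halving precision roughly halves weight memory. A quick illustrative calculation (not Pie code):

```python
BYTES_PER_ELEMENT = {"float32": 4, "float16": 2, "bfloat16": 2}

def weight_memory_gib(num_params, weight_dtype):
    """Approximate memory footprint of the model weights in GiB."""
    return num_params * BYTES_PER_ELEMENT[weight_dtype] / 2**30

# A 0.6B-parameter model in bfloat16 vs. float32:
print(round(weight_memory_gib(600_000_000, "bfloat16"), 2))  # 1.12
print(round(weight_memory_gib(600_000_000, "float32"), 2))   # 2.24
```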
KV Cache Configuration
```toml
kv_page_size = 16
```
- kv_page_size: Size of KV cache pages in tokens (default: `16`)
Pie uses paged attention with fine-grained KV cache management. Smaller page sizes provide more flexibility for cache reuse but increase metadata overhead.
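To see the tradeoff in numbers, here is a small illustrative calculation (not Pie code): smaller pages waste fewer token slots on the final partial page, but mean more pages to track per sequence.

```python
import math

def kv_pages_needed(seq_len_tokens, kv_page_size):
    """Number of KV cache pages a sequence occupies; the last page
    may be only partially filled (illustrative only)."""
    return math.ceil(seq_len_tokens / kv_page_size)

# A 1000-token sequence:
print(kv_pages_needed(1000, 16))   # 63 pages (8 unused slots on the last page)
print(kv_pages_needed(1000, 128))  # 8 pages (24 unused slots, but less metadata)
```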
Batch and Resource Limits
```toml
max_batch_tokens = 10240
max_dist_size = 32
max_num_embeds = 128
```
- max_batch_tokens: Maximum total tokens across all requests in a batch (default: `10240`)
- max_dist_size: Maximum vocabulary distribution size for sampling (default: `32`)
- max_num_embeds: Maximum number of embedding vectors per request (default: `128`)
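As a rough sketch of what the shared token budget means for batching (illustrative only, not Pie's actual scheduler):

```python
def fits_in_batch(pending_token_counts, new_request_tokens, max_batch_tokens):
    """Check whether a new request fits under the total-token budget
    shared by all requests in a batch (illustrative only)."""
    return sum(pending_token_counts) + new_request_tokens <= max_batch_tokens

# Two 4096-token requests already queued, budget of 10240:
print(fits_in_batch([4096, 4096], 2048, 10240))  # True  (exactly 10240 total)
print(fits_in_batch([4096, 4096], 2049, 10240))  # False (one token over budget)
```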
Adapter (LoRA) Settings
```toml
max_num_adapters = 32
max_adapter_rank = 8
adapter_path = "~/.pie/adapters/"
```
- max_num_adapters: Maximum number of concurrent LoRA adapters (default: `32`)
- max_adapter_rank: Maximum rank for LoRA adapters (default: `8`)
- adapter_path: Directory for storing adapter weights (default: `~/.pie/adapters/`)
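To get a feel for what the rank limit costs, recall that a rank-r LoRA pair (A: d_in × r, B: r × d_out) adds r·(d_in + d_out) parameters per adapted weight matrix. An illustrative calculation with hypothetical dimensions (not Pie code):

```python
def lora_params_per_matrix(d_in, d_out, rank):
    """Parameters added by one rank-r LoRA pair: A (d_in x r) plus
    B (r x d_out). Illustrative arithmetic only."""
    return rank * (d_in + d_out)

# For a 4096 x 4096 projection, doubling the rank doubles adapter memory:
print(lora_params_per_matrix(4096, 4096, 8))   # 65536
print(lora_params_per_matrix(4096, 4096, 16))  # 131072
```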
Memory Management
```toml
gpu_mem_utilization = 0.8
```
- gpu_mem_utilization: Fraction of GPU memory to allocate for KV cache (default: `0.8`)
Pie reserves the specified fraction of GPU memory for KV cache pages. Lower values leave more memory for model weights and activations.
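A back-of-the-envelope sketch of how the fraction translates into a page budget (the per-page byte size here is hypothetical; the real figure depends on model dimensions, layer count, and dtype):

```python
def kv_page_budget(total_gpu_mem_bytes, gpu_mem_utilization, page_bytes):
    """How many KV cache pages fit in the reserved memory fraction
    (illustrative; page_bytes is a made-up figure)."""
    return int(total_gpu_mem_bytes * gpu_mem_utilization) // page_bytes

# 24 GiB GPU, 80% reserved for KV cache, hypothetical 256 KiB pages:
print(kv_page_budget(24 * 2**30, 0.8, 256 * 2**10))
```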
CUDA Graphs (Experimental)
```toml
use_cuda_graphs = false
```
- use_cuda_graphs: Enable CUDA graph optimization for reduced kernel launch overhead (default: `false`)
:::caution Experimental Feature
CUDA graphs are experimental and may not work with all models or configurations.
:::
Telemetry Configuration
```toml
[telemetry]
enabled = false
endpoint = "http://localhost:4317"
service_name = "pie"
```
- enabled: Enable OpenTelemetry tracing (default: `false`)
- endpoint: OTLP collector endpoint (default: `"http://localhost:4317"`)
- service_name: Service name for telemetry data (default: `"pie"`)
Example Configuration
Here’s a complete example with two models: a small model on a single GPU and a large model with 4-way tensor parallelism:
```toml
# Pie Server Configuration
host = "0.0.0.0"  # Listen on all interfaces
port = 8080
enable_auth = true

# Memory and resource limits
# (top-level keys must appear before any [[model]] table;
# otherwise TOML attaches them to the preceding model)
gpu_mem_utilization = 0.85
max_num_embeds = 256
max_dist_size = 64

# Adapter settings
max_num_adapters = 64
max_adapter_rank = 16
adapter_path = "/data/pie/adapters/"

# Small model on a single GPU
[[model]]
hf_repo = "Qwen/Qwen3-0.6B"
device = ["cuda:0"]
tensor_parallel_size = 1
activation_dtype = "bfloat16"
weight_dtype = "bfloat16"
kv_page_size = 16
max_batch_tokens = 10240

# Large model with 4-way tensor parallelism
[[model]]
hf_repo = "meta-llama/Llama-3.1-70B"
device = ["cuda:1", "cuda:2", "cuda:3", "cuda:4"]
tensor_parallel_size = 4
activation_dtype = "bfloat16"
weight_dtype = "bfloat16"
kv_page_size = 16
max_batch_tokens = 8192

# Telemetry
[telemetry]
enabled = true
endpoint = "http://otel-collector:4317"
service_name = "pie-production"
```

Default Device Selection
If no configuration file exists, Pie automatically selects a default device:
- CUDA if available (`cuda:0`)
- Metal (macOS) if available (`mps`)
- CPU as fallback
The default model is Qwen/Qwen3-0.6B.
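That fallback order can be summarized as a tiny sketch (the availability flags stand in for runtime checks such as `torch.cuda.is_available()`; this is not Pie's actual code):

```python
def select_default_device(cuda_available, mps_available):
    """Mirror the documented fallback order: CUDA, then Metal, then CPU."""
    if cuda_available:
        return "cuda:0"
    if mps_available:
        return "mps"
    return "cpu"

# On a macOS machine with Metal but no CUDA:
print(select_default_device(False, True))  # mps
```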