Configuration Reference

Last Updated: 3/13/2026


Pie uses a TOML configuration file to control server behavior, model settings, and resource allocation. The default configuration file is located at ~/.pie/config.toml.

Network Settings

```toml
host = "127.0.0.1"
port = 8080
```
  • host: The network interface to bind to (default: 127.0.0.1 for localhost only)
  • port: The TCP port for the server to listen on (default: 8080)

Authentication

```toml
enable_auth = false
```
  • enable_auth: Enable SSH-based authentication for client connections (default: false)

When authentication is enabled, authorized users are managed via ~/.pie/authorized_users.toml. See the Authentication Guide for details.

Model Configuration

Pie supports multiple model configurations. Each model is defined in a [[model]] section:

```toml
[[model]]
hf_repo = "Qwen/Qwen3-0.6B"
device = ["cuda:0"]
tensor_parallel_size = 1
activation_dtype = "bfloat16"
weight_dtype = "bfloat16"
```

Model Parameters

  • hf_repo: HuggingFace model repository identifier (e.g., "Qwen/Qwen3-0.6B")
  • device: List of GPU devices for this model (e.g., ["cuda:0"] or ["cuda:0", "cuda:1"])
  • tensor_parallel_size: Degree of tensor parallelism (default: 1)
    • Set to 1 for data parallelism only (each GPU runs the full model independently)
    • Set to len(device) for tensor parallelism only (all GPUs share one model)
    • Example: 4 GPUs with tensor_parallel_size=2 creates 2 data-parallel groups of 2 tensor-parallel GPUs each
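The grouping rule above can be sketched in a few lines of Python. This is an illustrative helper, not Pie's internal code; the function name is hypothetical.

```python
def parallel_groups(devices, tensor_parallel_size):
    """Partition a device list into data-parallel groups of
    tensor_parallel_size GPUs each (illustrative, not Pie's code)."""
    if len(devices) % tensor_parallel_size != 0:
        raise ValueError("device count must be divisible by tensor_parallel_size")
    return [devices[i:i + tensor_parallel_size]
            for i in range(0, len(devices), tensor_parallel_size)]

# 4 GPUs with tensor_parallel_size = 2 -> 2 data-parallel groups of 2
print(parallel_groups(["cuda:0", "cuda:1", "cuda:2", "cuda:3"], 2))
# [['cuda:0', 'cuda:1'], ['cuda:2', 'cuda:3']]
```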

Precision Settings

  • activation_dtype: Data type for activations ("float32", "float16", or "bfloat16")
  • weight_dtype: Data type for model weights ("float32", "float16", or "bfloat16")

KV Cache Configuration

```toml
kv_page_size = 16
```
  • kv_page_size: Size of KV cache pages in tokens (default: 16)

Pie uses paged attention with fine-grained KV cache management. Smaller page sizes provide more flexibility for cache reuse but increase metadata overhead.
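To make the trade-off concrete, here is a rough sketch of the page count for a given sequence length. The function is illustrative only, assuming simple ceiling division as in paged attention generally:

```python
import math

def kv_pages_needed(seq_len_tokens, kv_page_size=16):
    """Number of KV cache pages a sequence occupies (illustrative)."""
    return math.ceil(seq_len_tokens / kv_page_size)

# A 1000-token sequence with 16-token pages occupies 63 pages; the last
# page holds only 8 tokens (internal fragmentation).
print(kv_pages_needed(1000))       # 63
# Larger pages mean fewer pages to track, but coarser-grained reuse.
print(kv_pages_needed(1000, 64))   # 16
```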

Batch and Resource Limits

```toml
max_batch_tokens = 10240
max_dist_size = 32
max_num_embeds = 128
```
  • max_batch_tokens: Maximum total tokens across all requests in a batch (default: 10240)
  • max_dist_size: Maximum vocabulary distribution size for sampling (default: 32)
  • max_num_embeds: Maximum number of embedding vectors per request (default: 128)
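The effect of max_batch_tokens can be illustrated with a toy greedy packer. This is not Pie's actual scheduler, just a sketch of how a token budget bounds what fits in one batch:

```python
def pack_batch(requests, max_batch_tokens=10240):
    """Greedily admit (request_id, token_count) pairs until the batch
    token budget is exhausted (illustrative, not Pie's scheduler)."""
    batch, total = [], 0
    for req_id, n_tokens in requests:
        if total + n_tokens <= max_batch_tokens:
            batch.append(req_id)
            total += n_tokens
    return batch, total

reqs = [("a", 4096), ("b", 4096), ("c", 4096), ("d", 1024)]
# "c" would push the total to 12288 > 10240, so it waits for the next batch.
print(pack_batch(reqs))  # (['a', 'b', 'd'], 9216)
```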

Adapter (LoRA) Settings

```toml
max_num_adapters = 32
max_adapter_rank = 8
adapter_path = "~/.pie/adapters/"
```
  • max_num_adapters: Maximum number of concurrent LoRA adapters (default: 32)
  • max_adapter_rank: Maximum rank for LoRA adapters (default: 8)
  • adapter_path: Directory for storing adapter weights (default: ~/.pie/adapters/)
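The rank cap directly bounds per-adapter memory. As background (standard LoRA math, not Pie-specific), an adapter for a single weight matrix stores two low-rank factors, A (d_in × r) and B (r × d_out); the 4096×4096 projection below is a hypothetical example:

```python
def lora_params(d_in, d_out, rank):
    """Parameter count of one LoRA factor pair:
    A is (d_in x rank), B is (rank x d_out)."""
    return rank * (d_in + d_out)

# With max_adapter_rank = 8 on a hypothetical 4096x4096 projection:
print(lora_params(4096, 4096, 8))  # 65536 parameters per adapted matrix
```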

Memory Management

```toml
gpu_mem_utilization = 0.8
```
  • gpu_mem_utilization: Fraction of GPU memory to allocate for KV cache (default: 0.8)

Pie reserves the specified fraction of GPU memory for KV cache pages. Lower values leave more memory for model weights and activations.
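A back-of-the-envelope capacity estimate follows from this fraction. The per-token KV size formula below is generic transformer math, and the model dimensions are hypothetical, not any specific model served by Pie:

```python
def max_kv_tokens(gpu_mem_bytes, gpu_mem_utilization, kv_bytes_per_token):
    """Rough KV-cache capacity in tokens (illustrative; ignores page
    fragmentation and allocator overhead).
    kv_bytes_per_token = 2 (K and V) * n_layers * n_kv_heads * head_dim * dtype_bytes
    """
    return int(gpu_mem_bytes * gpu_mem_utilization // kv_bytes_per_token)

# Hypothetical model: 32 layers, 8 KV heads, head_dim 128, bfloat16 (2 bytes)
per_token = 2 * 32 * 8 * 128 * 2   # 131072 bytes = 128 KiB per token
print(max_kv_tokens(80 * 1024**3, 0.8, per_token))  # 524288 tokens
```

Lowering gpu_mem_utilization shrinks this budget but leaves more headroom for weights and activations.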

CUDA Graphs (Experimental)

```toml
use_cuda_graphs = false
```
  • use_cuda_graphs: Enable CUDA graph optimization for reduced kernel launch overhead (default: false)

:::caution[Experimental Feature]
CUDA graphs are experimental and may not work with all models or configurations.
:::

Telemetry Configuration

```toml
[telemetry]
enabled = false
endpoint = "http://localhost:4317"
service_name = "pie"
```
  • enabled: Enable OpenTelemetry tracing (default: false)
  • endpoint: OTLP collector endpoint (default: "http://localhost:4317")
  • service_name: Service name for telemetry data (default: "pie")

Example Configuration

Here’s a complete example serving two models: a small model on a single GPU and a large model with 4-way tensor parallelism:

```toml
# Pie Server Configuration
host = "0.0.0.0"  # Listen on all interfaces
port = 8080
enable_auth = true

# Small model on single GPU
[[model]]
hf_repo = "Qwen/Qwen3-0.6B"
device = ["cuda:0"]
tensor_parallel_size = 1
activation_dtype = "bfloat16"
weight_dtype = "bfloat16"
kv_page_size = 16
max_batch_tokens = 10240

# Large model with 4-way tensor parallelism
[[model]]
hf_repo = "meta-llama/Llama-3.1-70B"
device = ["cuda:1", "cuda:2", "cuda:3", "cuda:4"]
tensor_parallel_size = 4
activation_dtype = "bfloat16"
weight_dtype = "bfloat16"
kv_page_size = 16
max_batch_tokens = 8192

# Memory and resource limits
gpu_mem_utilization = 0.85
max_num_embeds = 256
max_dist_size = 64

# Adapter settings
max_num_adapters = 64
max_adapter_rank = 16
adapter_path = "/data/pie/adapters/"

# Telemetry
[telemetry]
enabled = true
endpoint = "http://otel-collector:4317"
service_name = "pie-production"
```

Default Device Selection

If no configuration file exists, Pie automatically selects a default device:

  • CUDA if available (cuda:0)
  • Metal (macOS) if available (mps)
  • CPU as fallback

The default model is Qwen/Qwen3-0.6B.