Configuration Reference

Last Updated: 3/13/2026


Pie uses a TOML configuration file to control server behavior, model settings, and resource allocation. The default configuration file is located at ~/.pie/config.toml.

Network Settings

```toml
host = "127.0.0.1"
port = 8080
```
  • host: The network interface to bind to (default: 127.0.0.1 for localhost only)
  • port: The TCP port for the server to listen on (default: 8080)

Authentication

```toml
enable_auth = false
```
  • enable_auth: Enable SSH-based authentication for client connections (default: false)

When authentication is enabled, authorized users are managed via ~/.pie/authorized_users.toml. See the Authentication Guide for details.

Model Configuration

Pie supports multiple model configurations. Each model is defined in a [[model]] section:

```toml
[[model]]
hf_repo = "Qwen/Qwen3-0.6B"
device = ["cuda:0"]
tensor_parallel_size = 1
activation_dtype = "bfloat16"
weight_dtype = "bfloat16"
```

Model Parameters

  • hf_repo: HuggingFace model repository identifier (e.g., "Qwen/Qwen3-0.6B")
  • device: List of GPU devices for this model (e.g., ["cuda:0"] or ["cuda:0", "cuda:1"])
  • tensor_parallel_size: Degree of tensor parallelism (default: 1)
    • Set to 1 for data parallelism only (each GPU runs the full model independently)
    • Set to len(device) for tensor parallelism only (all GPUs share one model)
    • Example: 4 GPUs with tensor_parallel_size=2 creates 2 data-parallel groups of 2 tensor-parallel GPUs each
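The grouping rule above can be sketched in a few lines of Python. This is an illustrative helper, not Pie's internal code; the function name is hypothetical.

```python
def parallel_groups(devices, tensor_parallel_size):
    """Partition a device list into data-parallel groups of
    tensor_parallel_size GPUs each (illustrative, not Pie's code)."""
    if len(devices) % tensor_parallel_size != 0:
        raise ValueError("device count must be divisible by tensor_parallel_size")
    return [devices[i:i + tensor_parallel_size]
            for i in range(0, len(devices), tensor_parallel_size)]

# 4 GPUs with tensor_parallel_size = 2 -> 2 data-parallel groups of 2
print(parallel_groups(["cuda:0", "cuda:1", "cuda:2", "cuda:3"], 2))
# [['cuda:0', 'cuda:1'], ['cuda:2', 'cuda:3']]
```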

Precision Settings

  • activation_dtype: Data type for activations ("float32", "float16", or "bfloat16")
  • weight_dtype: Data type for model weights ("float32", "float16", or "bfloat16")

KV Cache Configuration

```toml
kv_page_size = 16
```
  • kv_page_size: Size of KV cache pages in tokens (default: 16)

Pie uses paged attention with fine-grained KV cache management. Smaller page sizes provide more flexibility for cache reuse but increase metadata overhead.
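To make the trade-off concrete, here is a rough sketch of the page count for a given sequence length. The function is illustrative only, assuming simple ceiling division as in paged attention generally:

```python
import math

def kv_pages_needed(seq_len_tokens, kv_page_size=16):
    """Number of KV cache pages a sequence occupies (illustrative)."""
    return math.ceil(seq_len_tokens / kv_page_size)

# A 1000-token sequence with 16-token pages occupies 63 pages; the last
# page holds only 8 tokens (internal fragmentation).
print(kv_pages_needed(1000))       # 63
# Larger pages mean fewer pages to track, but coarser-grained reuse.
print(kv_pages_needed(1000, 64))   # 16
```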

Batch and Resource Limits

```toml
max_batch_tokens = 10240
max_dist_size = 32
max_num_embeds = 128
```
  • max_batch_tokens: Maximum total tokens across all requests in a batch (default: 10240)
  • max_dist_size: Maximum vocabulary distribution size for sampling (default: 32)
  • max_num_embeds: Maximum number of embedding vectors per request (default: 128)
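The effect of max_batch_tokens can be illustrated with a toy greedy packer. This is not Pie's actual scheduler, just a sketch of how a token budget bounds what fits in one batch:

```python
def pack_batch(requests, max_batch_tokens=10240):
    """Greedily admit (request_id, token_count) pairs until the batch
    token budget is exhausted (illustrative, not Pie's scheduler)."""
    batch, total = [], 0
    for req_id, n_tokens in requests:
        if total + n_tokens <= max_batch_tokens:
            batch.append(req_id)
            total += n_tokens
    return batch, total

reqs = [("a", 4096), ("b", 4096), ("c", 4096), ("d", 1024)]
# "c" would push the total to 12288 > 10240, so it waits for the next batch.
print(pack_batch(reqs))  # (['a', 'b', 'd'], 9216)
```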

Adapter (LoRA) Settings

```toml
max_num_adapters = 32
max_adapter_rank = 8
adapter_path = "~/.pie/adapters/"
```
  • max_num_adapters: Maximum number of concurrent LoRA adapters (default: 32)
  • max_adapter_rank: Maximum rank for LoRA adapters (default: 8)
  • adapter_path: Directory for storing adapter weights (default: ~/.pie/adapters/)
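The rank cap directly bounds per-adapter memory. As background (standard LoRA math, not Pie-specific), an adapter for a single weight matrix stores two low-rank factors, A (d_in × r) and B (r × d_out); the 4096×4096 projection below is a hypothetical example:

```python
def lora_params(d_in, d_out, rank):
    """Parameter count of one LoRA factor pair:
    A is (d_in x rank), B is (rank x d_out)."""
    return rank * (d_in + d_out)

# With max_adapter_rank = 8 on a hypothetical 4096x4096 projection:
print(lora_params(4096, 4096, 8))  # 65536 parameters per adapted matrix
```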

Memory Management

```toml
gpu_mem_utilization = 0.8
```
  • gpu_mem_utilization: Fraction of GPU memory to allocate for KV cache (default: 0.8)

Pie reserves the specified fraction of GPU memory for KV cache pages. Lower values leave more memory for model weights and activations.
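A back-of-the-envelope capacity estimate follows from this fraction. The per-token KV size formula below is generic transformer math, and the model dimensions are hypothetical, not any specific model served by Pie:

```python
def max_kv_tokens(gpu_mem_bytes, gpu_mem_utilization, kv_bytes_per_token):
    """Rough KV-cache capacity in tokens (illustrative; ignores page
    fragmentation and allocator overhead).
    kv_bytes_per_token = 2 (K and V) * n_layers * n_kv_heads * head_dim * dtype_bytes
    """
    return int(gpu_mem_bytes * gpu_mem_utilization // kv_bytes_per_token)

# Hypothetical model: 32 layers, 8 KV heads, head_dim 128, bfloat16 (2 bytes)
per_token = 2 * 32 * 8 * 128 * 2   # 131072 bytes = 128 KiB per token
print(max_kv_tokens(80 * 1024**3, 0.8, per_token))  # 524288 tokens
```

Lowering gpu_mem_utilization shrinks this budget but leaves more headroom for weights and activations.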

CUDA Graphs (Experimental)

```toml
use_cuda_graphs = false
```
  • use_cuda_graphs: Enable CUDA graph optimization for reduced kernel launch overhead (default: false)

:::caution[Experimental Feature]
CUDA graphs are experimental and may not work with all models or configurations.
:::

Telemetry Configuration

```toml
[telemetry]
enabled = false
endpoint = "http://localhost:4317"
service_name = "pie"
```
  • enabled: Enable OpenTelemetry tracing (default: false)
  • endpoint: OTLP collector endpoint (default: "http://localhost:4317")
  • service_name: Service name for telemetry data (default: "pie")

Example Configuration

Here’s a complete example serving two models: a small model on a single GPU and a large model with 4-way tensor parallelism:

```toml
# Pie Server Configuration
host = "0.0.0.0"  # Listen on all interfaces
port = 8080
enable_auth = true

# Small model on single GPU
[[model]]
hf_repo = "Qwen/Qwen3-0.6B"
device = ["cuda:0"]
tensor_parallel_size = 1
activation_dtype = "bfloat16"
weight_dtype = "bfloat16"
kv_page_size = 16
max_batch_tokens = 10240

# Large model with 4-way tensor parallelism
[[model]]
hf_repo = "meta-llama/Llama-3.1-70B"
device = ["cuda:1", "cuda:2", "cuda:3", "cuda:4"]
tensor_parallel_size = 4
activation_dtype = "bfloat16"
weight_dtype = "bfloat16"
kv_page_size = 16
max_batch_tokens = 8192

# Memory and resource limits
gpu_mem_utilization = 0.85
max_num_embeds = 256
max_dist_size = 64

# Adapter settings
max_num_adapters = 64
max_adapter_rank = 16
adapter_path = "/data/pie/adapters/"

# Telemetry
[telemetry]
enabled = true
endpoint = "http://otel-collector:4317"
service_name = "pie-production"
```

Default Device Selection

If no configuration file exists, Pie automatically selects a default device:

  • CUDA if available (cuda:0)
  • Metal (macOS) if available (mps)
  • CPU as fallback

The default model is Qwen/Qwen3-0.6B.