Last Updated: 4/8/2026
Welcome to Pie 🥧
Pie is a programmable system for LLM serving that lets you control the serving loop itself—not just send prompts and wait for responses.
Why Pie?
Modern AI applications need more than simple text completion. They require:
- Fine-grained control over inference and caching
- Integrated I/O without round-trip latency
- Custom generation workflows for agentic patterns
Pie makes all of this possible through programmable inferlets.
⚡ Quick Example
Here’s a simple inferlet in Rust:
```rust
use inferlet::{Args, Result, Sampler, get_auto_model};
use inferlet::stop_condition::{max_len, ends_with_any};

#[inferlet::main]
async fn main(mut args: Args) -> Result<String> {
    // `args` holds the command-line arguments; this example does not read them.
    let model = get_auto_model();
    let mut ctx = model.create_context();

    // Fill the context with a system prompt and a user message.
    ctx.fill_system("You are a helpful assistant.");
    ctx.fill_user("How are you?");

    // Sample with top-p; stop at the length limit or at an end-of-sequence token.
    let sampler = Sampler::top_p(0.6, 0.95);
    let stop = max_len(100).or(ends_with_any(model.eos_tokens()));

    let response = ctx.generate(sampler, stop).await;
    Ok(response)
}
```
Compile to WebAssembly and run:
```
pie run my-inferlet -- --prompt "Hello!"
```
🚀 Key Features
Programmable Serving
Write custom decoding logic and resource management policies in your inferlets. Control KV cache, embeddings, and generation at a fine-grained level.
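To make "custom decoding logic" more concrete, here is a minimal sketch of how composable stop conditions, like the `max_len(100).or(ends_with_any(...))` expression in the Quick Example, could be structured. The trait and type names below (`StopCondition`, `MaxLen`, `EndsWithAny`, `Or`) are hypothetical and written for illustration only; they are not taken from the `inferlet` crate, whose actual definitions may differ.

```rust
/// Hypothetical trait for illustration; not the inferlet crate's definition.
pub trait StopCondition {
    /// Returns true when generation should stop, given the tokens produced so far.
    fn should_stop(&self, tokens: &[u32]) -> bool;

    /// Combines two conditions: stop as soon as either one fires.
    fn or<B: StopCondition>(self, other: B) -> Or<Self, B>
    where
        Self: Sized,
    {
        Or { a: self, b: other }
    }
}

/// Stops once at least `0` tokens have been generated.
pub struct MaxLen(pub usize);

impl StopCondition for MaxLen {
    fn should_stop(&self, tokens: &[u32]) -> bool {
        tokens.len() >= self.0
    }
}

/// Stops when the output ends with any of the given token sequences.
pub struct EndsWithAny(pub Vec<Vec<u32>>);

impl StopCondition for EndsWithAny {
    fn should_stop(&self, tokens: &[u32]) -> bool {
        self.0
            .iter()
            .any(|suffix| !suffix.is_empty() && tokens.ends_with(suffix))
    }
}

/// Logical OR of two stop conditions.
pub struct Or<A, B> {
    a: A,
    b: B,
}

impl<A: StopCondition, B: StopCondition> StopCondition for Or<A, B> {
    fn should_stop(&self, tokens: &[u32]) -> bool {
        self.a.should_stop(tokens) || self.b.should_stop(tokens)
    }
}
```

With these definitions, `MaxLen(4).or(EndsWithAny(vec![vec![2]]))` stops either after four tokens or as soon as the output ends with token `2`. Expressing conditions as small composable values keeps the decoding loop itself unchanged when the stopping policy changes.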
Application-Aware Optimization
Optimize for your specific use case—whether it’s tree-of-thought reasoning, multi-agent systems, or custom workflows.
Integrated Computation & I/O
Call external APIs, run code, and manage data directly in your serving loop without extra round-trips.
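The idea of resolving external calls inside the serving loop can be sketched with a mock decoder. The example below is an assumption-laden illustration, not Pie's API: `Step`, `run_loop`, and the two closures are hypothetical stand-ins for a model step and a tool invocation. It shows only the control flow, in which a tool result is appended to the context within the same loop instead of being returned to a client first.

```rust
/// One step of a mock decoder (hypothetical stand-in for a model call).
pub enum Step {
    /// Plain generated text to append to the context.
    Text(String),
    /// A request for an external tool, e.g. an expression to evaluate.
    ToolCall(String),
    /// Generation is finished.
    Done,
}

/// Drives generation and resolves tool calls in the same loop.
/// `next_step` sees the context so far; `tool` handles each tool request.
pub fn run_loop<M, T>(mut next_step: M, mut tool: T) -> String
where
    M: FnMut(&str) -> Step,
    T: FnMut(&str) -> String,
{
    let mut context = String::new();
    loop {
        match next_step(&context) {
            Step::Text(text) => context.push_str(&text),
            Step::ToolCall(request) => {
                // The tool runs here, inside the loop, and its result is
                // appended directly to the context for the next step.
                let result = tool(&request);
                context.push_str(&result);
            }
            Step::Done => return context,
        }
    }
}
```

In a client-driven design, each `ToolCall` would typically end the request and require a new one carrying the tool result; here the loop continues without leaving the process.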
Multi-Language Support
Write inferlets in Rust, C++, Go, or any language that compiles to WebAssembly.
Framework Agnostic
Client libraries for Python, JavaScript, and Rust make it easy to integrate with any stack.
📖 Getting Started
Ready to dive in? Here’s your path:
- Installation - Get Pie installed on your system
- Quickstart Tutorial - Build your first inferlet in 5 minutes
- Core Concepts - Understand Pie’s architecture and design
- Client SDKs - Integrate Pie into your applications
🎯 Use Cases
Pie excels at:
- Agentic Workflows - Multi-step reasoning with tool use
- Custom Decoding - Speculative decoding, beam search, MCTS
- Interactive Applications - Chat, code completion, real-time generation
- Research - Experiment with novel serving strategies
- Production Systems - High-throughput, low-latency inference
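As an illustration of the kind of custom decoding listed above, here is a minimal sketch of a single beam search step. It is self-contained and does not use Pie's API: `Beam`, `beam_step`, and the `next_logprobs` scoring closure are hypothetical names, and the closure stands in for whatever call would return next-token log-probabilities for a given prefix.

```rust
/// A partial hypothesis: the tokens so far and their cumulative log-probability.
#[derive(Clone, Debug, PartialEq)]
pub struct Beam {
    pub tokens: Vec<u32>,
    pub logprob: f64,
}

/// Expands every beam by each candidate next token, then keeps the `width`
/// candidates with the highest cumulative log-probability.
/// `next_logprobs` is a hypothetical scorer: prefix -> (token, log-probability) pairs.
pub fn beam_step<F>(beams: &[Beam], width: usize, mut next_logprobs: F) -> Vec<Beam>
where
    F: FnMut(&[u32]) -> Vec<(u32, f64)>,
{
    let mut candidates = Vec::new();
    for beam in beams {
        for (token, lp) in next_logprobs(&beam.tokens) {
            let mut tokens = beam.tokens.clone();
            tokens.push(token);
            candidates.push(Beam {
                tokens,
                logprob: beam.logprob + lp,
            });
        }
    }
    // Sort by cumulative log-probability, highest first, and prune.
    candidates.sort_by(|a, b| b.logprob.total_cmp(&a.logprob));
    candidates.truncate(width);
    candidates
}
```

Calling `beam_step` repeatedly, feeding each result back in as `beams`, gives a basic beam search; a real implementation would also handle end-of-sequence tokens and length normalization.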
🌟 Performance
Pie delivers significant performance improvements for complex workflows:
- Lower latency through application-aware KV cache management
- Higher throughput with fine-grained resource control
- Better efficiency by eliminating unnecessary round-trips
Benchmarked on Llama 3.2 1B using an L40 GPU.
🧑‍💻 Community
- GitHub Repository - Star us and contribute!
- Community - Join discussions and get help
- Roadmap - See what’s coming next
📚 Documentation Sections
For New Users
- Installation Guide - Set up Pie
- Tutorials - Step-by-step guides
- Core Concepts - Understand the fundamentals
For Developers
- Writing Inferlets - Build custom serving logic
- Client SDKs - Python, JavaScript, Rust APIs
- CLI Reference - Command-line tools
Advanced Topics
- SDK Development - Build with inferlet SDKs
- Supported Models - Compatible LLMs
- Standard Inferlets - Built-in inferlets
📄 Research
Pie is backed by academic research:
- HotOS 2025 Paper - Vision for LLM serving systems as operating systems
- SOSP 2025 Paper - Design and implementation details
🚦 Quick Links
| Resource | Description |
|---|---|
| Installation | Install Pie on your system |
| Quickstart | Build your first inferlet |
| Python Client | Python SDK documentation |
| CLI Reference | Command-line interface guide |
| Examples | Sample inferlets and code |
Ready to get started? Head to the Installation Guide or jump into the Quickstart Tutorial!