Last Updated: 4/8/2026

Welcome to Pie 🥧

Pie is a programmable system for LLM serving that lets you control the serving loop itself—not just send prompts and wait for responses.

Why Pie?

Modern AI applications need more than simple text completion. They require:

Fine-grained control over inference and caching
Integrated I/O without round-trip latency
Custom generation workflows for agentic patterns

Pie makes all of this possible through inferlets: programs you write, compile to WebAssembly, and run directly in the serving loop.

⚡ Quick Example

Here’s a simple inferlet in Rust:


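use inferlet::{Args, Result, Sampler, get_auto_model};
use inferlet::stop_condition::{max_len, ends_with_any};

#[inferlet::main]
async fn main(mut args: Args) -> Result<String> {
    // `args` is not read in this example; the prompt below is hardcoded.
    let model = get_auto_model();
    let mut ctx = model.create_context();

    // Fill the context with a system message and a user message.
    ctx.fill_system("You are a helpful assistant.");
    ctx.fill_user("How are you?");

    // Sampling strategy, plus a stop condition: stop at a maximum length
    // of 100 or when the output ends with one of the model's EOS tokens.
    let sampler = Sampler::top_p(0.6, 0.95);
    let stop = max_len(100).or(ends_with_any(model.eos_tokens()));

    // Generate until the stop condition fires, then return the text.
    let response = ctx.generate(sampler, stop).await;
    Ok(response)
}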

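Compile to WebAssembly and run. The example above hardcodes its prompt and never reads args, so it ignores the --prompt value passed here: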


pie run my-inferlet -- --prompt "Hello!"
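
The stop condition in the example composes two rules with or. As a rough illustration of how that kind of composition can work, here is a standalone sketch. It does not use the inferlet crate and is not its implementation: the StopCondition trait, the MaxLen, EndsWithAny, and Or types, and the choice of u32 token IDs are all hypothetical names chosen for this sketch.

```rust
// Standalone sketch of composable stop conditions.
// Hypothetical types; NOT the inferlet crate's API or implementation.

/// A stop condition inspects the tokens generated so far.
pub trait StopCondition {
    fn should_stop(&self, generated: &[u32]) -> bool;

    /// Combine two conditions: stop as soon as either one fires.
    fn or<B: StopCondition>(self, other: B) -> Or<Self, B>
    where
        Self: Sized,
    {
        Or { a: self, b: other }
    }
}

/// Fires once the number of generated tokens reaches the limit.
pub struct MaxLen(pub usize);

impl StopCondition for MaxLen {
    fn should_stop(&self, generated: &[u32]) -> bool {
        generated.len() >= self.0
    }
}

/// Fires when the output ends with any of the given token sequences.
pub struct EndsWithAny(pub Vec<Vec<u32>>);

impl StopCondition for EndsWithAny {
    fn should_stop(&self, generated: &[u32]) -> bool {
        self.0
            .iter()
            .any(|seq| !seq.is_empty() && generated.ends_with(seq))
    }
}

/// Logical OR of two stop conditions.
pub struct Or<A, B> {
    a: A,
    b: B,
}

impl<A: StopCondition, B: StopCondition> StopCondition for Or<A, B> {
    fn should_stop(&self, generated: &[u32]) -> bool {
        self.a.should_stop(generated) || self.b.should_stop(generated)
    }
}
```

With these definitions, MaxLen(3).or(EndsWithAny(vec![vec![7, 8]])) stops either after three tokens or as soon as the output ends with the sequence 7, 8. Expressing the combinator as a default trait method keeps each rule small and lets rules chain without allocation.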

🚀 Key Features

Programmable Serving

Write custom decoding logic and resource management policies in your inferlets. Control KV cache, embeddings, and generation at a fine-grained level.
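
To make "custom decoding logic" more concrete, the sketch below shows temperature scaling followed by top-p (nucleus) filtering, the general technique that a sampler such as the Sampler::top_p(0.6, 0.95) call in the quick example suggests. This is a standalone illustration, not Pie's implementation: the function name top_p_filter and its signature are hypothetical, and reading the two arguments as temperature and top-p threshold is an assumption that this page does not confirm.

```rust
// Standalone sketch of temperature scaling plus top-p (nucleus) filtering.
// Hypothetical function; NOT Pie's sampler implementation.

/// Returns (token_id, probability) pairs that survive top-p filtering,
/// sorted by descending probability and renormalized to sum to 1.
pub fn top_p_filter(logits: &[f32], temperature: f32, top_p: f32) -> Vec<(usize, f32)> {
    assert!(temperature > 0.0, "temperature must be positive");
    if logits.is_empty() {
        return Vec::new();
    }

    // Softmax over temperature-scaled logits; subtracting the max
    // before exponentiating keeps the computation numerically stable.
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits
        .iter()
        .map(|&l| ((l - max) / temperature).exp())
        .collect();
    let sum: f32 = exps.iter().sum();

    let mut probs: Vec<(usize, f32)> = exps
        .iter()
        .enumerate()
        .map(|(i, &e)| (i, e / sum))
        .collect();

    // Sort tokens from most to least likely.
    probs.sort_by(|a, b| b.1.total_cmp(&a.1));

    // Keep the smallest prefix whose cumulative probability reaches top_p.
    let mut kept = Vec::new();
    let mut cumulative = 0.0f32;
    for (id, p) in probs {
        kept.push((id, p));
        cumulative += p;
        if cumulative >= top_p {
            break;
        }
    }

    // Renormalize the surviving probabilities.
    let kept_sum: f32 = kept.iter().map(|&(_, p)| p).sum();
    kept.into_iter().map(|(id, p)| (id, p / kept_sum)).collect()
}
```

A real sampler would then draw one token at random from the returned distribution. Lowering the temperature sharpens the distribution, so fewer tokens are needed to reach the same top-p mass.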

Application-Aware Optimization

Optimize for your specific use case—whether it’s tree-of-thought reasoning, multi-agent systems, or custom workflows.

Integrated Computation & I/O

Call external APIs, run code, and manage data directly in your serving loop without extra round-trips.

Multi-Language Support

Write inferlets in Rust, C++, Go, or any language that compiles to WebAssembly.

Framework Agnostic

Client libraries for Python, JavaScript, and Rust make it easy to integrate with any stack.

📖 Getting Started

Ready to dive in? Here’s your path:

Installation - Get Pie installed on your system
Quickstart Tutorial - Build your first inferlet in 5 minutes
Core Concepts - Understand Pie’s architecture and design
Client SDKs - Integrate Pie into your applications

🎯 Use Cases

Pie excels at:

Agentic Workflows - Multi-step reasoning with tool use
Custom Decoding - Speculative decoding, beam search, MCTS
Interactive Applications - Chat, code completion, real-time generation
Research - Experiment with novel serving strategies
Production Systems - High-throughput, low-latency inference

🌟 Performance

Pie delivers significant performance improvements for complex workflows:

Lower latency through application-aware KV cache management
Higher throughput with fine-grained resource control
Better efficiency by eliminating unnecessary round-trips

Benchmarked on Llama 3.2 1B using an NVIDIA L40 GPU.

🧑‍💻 Community

GitHub Repository - Star us and contribute!
Community - Join discussions and get help
Roadmap - See what’s coming next

📚 Documentation Sections

For New Users

Installation Guide - Set up Pie
Tutorials - Step-by-step guides
Core Concepts - Understand the fundamentals

For Developers

Writing Inferlets - Build custom serving logic
Client SDKs - Python, JavaScript, Rust APIs
CLI Reference - Command-line tools

Advanced Topics

SDK Development - Build with inferlet SDKs
Supported Models - Compatible LLMs
Standard Inferlets - Built-in inferlets

📄 Research

Pie is backed by academic research:

HotOS 2025 Paper - Vision for LLM serving systems as operating systems
SOSP 2025 Paper - Design and implementation details

🚦 Quick Links

Resource	Description
Installation	Install Pie on your system
Quickstart	Build your first inferlet
Python Client	Python SDK documentation
CLI Reference	Command-line interface guide
Examples	Sample inferlets and code

Ready to get started? Head to the Installation Guide or jump into the Quickstart Tutorial!