Monty: A minimal, secure Python interpreter written in Rust for use by AI

Original link: https://github.com/pydantic/monty

## Monty: a secure Python interpreter for AI agents

Monty is a minimal, secure Python interpreter built in Rust, designed to safely execute code generated by large language models (LLMs). It avoids the overhead of a traditional sandbox (such as Docker), with startup times under 1 microsecond, noticeably faster than the alternatives. Monty supports a subset of Python, including type hints, and allows controlled access to host functions defined by the developer. It blocks direct access to the filesystem, environment variables, and the network. Key features include resource tracking (memory, time), stdout/stderr capture, and snapshotting for pausing and resuming execution.

Although its features are limited (no standard library beyond a few modules, and no classes or match statements yet), Monty excels at its specific purpose: letting LLMs write and execute Python code for tasks such as tool use, offering a faster, cheaper, and more reliable alternative to traditional tool calling. It is intended to power features such as code mode in Pydantic AI. Monty is available for Rust, Python, and JavaScript and can be installed via pip or npm. It is still experimental, but its performance looks promising compared with solutions such as Docker, Pyodide, and direct Python execution.

## Monty: a minimal Python interpreter designed for safe use by AI

Monty is a new, minimal, secure Python interpreter built in Rust and designed specifically for AI agents. Developed by the Pydantic team, it prioritizes speed (microsecond startup times) and a small attack surface, achieved by omitting the standard library. The core idea is to let LLMs write and execute Python code for tool use in a sandboxed environment, mitigating security risks. A WebAssembly build is available for experimentation ([https://simonw.github.io/research/monty-wasm-pyodide/demo.html](https://simonw.github.io/research/monty-wasm-pyodide/demo.html)).

The discussion centers on the trade-offs between a minimal interpreter and the full feature set of CPython, the need for sandboxing techniques (virtual machines versus OS features such as seccomp), and whether LLMs can adapt to the restrictions of a constrained Python environment. The Pydantic team plans to expand Monty's features to cover common use cases such as classes, dataclasses, and JSON handling, with a focus on enabling "code mode" to make LLM tool calling more efficient.

Original article

A minimal, secure Python interpreter written in Rust for use by AI.


Experimental - This project is still in development, and not ready for prime time.

Monty avoids the cost, latency, complexity and general faff of using a full container-based sandbox for running LLM-generated code.

Instead, it lets you safely run Python code written by an LLM embedded in your agent, with startup times measured in single-digit microseconds, not hundreds of milliseconds.

What Monty can do:

  • Run a reasonable subset of Python code - enough for your agent to express what it wants to do
  • Completely block access to the host environment: filesystem, env variables and network access are all implemented via external function calls the developer can control
  • Call functions on the host - only functions you give it access to
  • Run type checking - Monty supports full modern Python type hints and comes with ty included in a single binary to run type checking
  • Be snapshotted to bytes at external function calls, meaning you can store the interpreter state in a file or database, and resume later
  • Start up extremely fast (<1μs to go from code to execution result), with runtime performance similar to CPython (generally between 5x faster and 5x slower)
  • Be called from Rust, Python, or JavaScript - because Monty has no dependencies on CPython, you can use it anywhere you can run Rust
  • Control resource usage - Monty can track memory usage, allocations, stack depth, and execution time and cancel execution if it exceeds preset limits
  • Collect stdout and stderr and return it to the caller
  • Run async or sync code, calling async or sync functions on the host
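
For instance, the simplest possible flow, a minimal sketch using only the Monty constructor and run() method shown later in this README, looks like this:

import pydantic_monty

# Parse a one-line script; the host decides which input names the code can see
m = pydantic_monty.Monty('x * 2 + 1', inputs=['x'])

# Run it synchronously; the value of the final expression is returned
print(m.run(inputs={'x': 20}))
#> 41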

What Monty cannot do:

  • Use the standard library (except a few select modules: sys, typing, asyncio, dataclasses (soon), json (soon))
  • Use third-party libraries (like Pydantic); support for external Python libraries is not a goal
  • Define classes (support should come soon)
  • Use match statements (again, support should come soon)

In short, Monty is extremely limited and designed for one use case:

To run code written by agents.

For motivation on why you might want to do this, see:

In very simple terms, the idea of all the above is that LLMs can work faster, cheaper and more reliably if they're asked to write Python (or JavaScript) code, instead of relying on traditional tool calling. Monty makes that possible without the complexity of a sandbox or the risk of running code directly on the host.

Note: Monty will (soon) be used to implement codemode in Pydantic AI

Monty can be called from Python, JavaScript/TypeScript or Rust.

To install:

(Or pip install pydantic-monty for the boomers)

Usage:

from typing import Any

import pydantic_monty

code = """
async def agent(prompt: str, messages: Messages):
    while True:
        print(f'messages so far: {messages}')
        output = await call_llm(prompt, messages)
        if isinstance(output, str):
            return output
        messages.extend(output)

await agent(prompt, [])
"""

type_definitions = """
from typing import Any

Messages = list[dict[str, Any]]

async def call_llm(prompt: str, messages: Messages) -> str | Messages:
    raise NotImplementedError()

prompt: str = ''
"""

m = pydantic_monty.Monty(
    code,
    inputs=['prompt'],
    external_functions=['call_llm'],
    script_name='agent.py',
    type_check=True,
    type_check_stubs=type_definitions,
)


Messages = list[dict[str, Any]]


async def call_llm(prompt: str, messages: Messages) -> str | Messages:
    if len(messages) < 2:
        return [{'role': 'system', 'content': 'example response'}]
    else:
        return f'example output, message count {len(messages)}'


async def main():
    output = await pydantic_monty.run_monty_async(
        m,
        inputs={'prompt': 'testing'},
        external_functions={'call_llm': call_llm},
    )
    print(output)
    #> example output, message count 2


if __name__ == '__main__':
    import asyncio

    asyncio.run(main())

Iterative Execution with External Functions

Use start() and resume() to handle external function calls iteratively, giving you control over each call:

import pydantic_monty

code = """
data = fetch(url)
len(data)
"""

m = pydantic_monty.Monty(code, inputs=['url'], external_functions=['fetch'])

# Start execution - pauses when fetch() is called
result = m.start(inputs={'url': 'https://example.com'})

print(type(result))
#> <class 'pydantic_monty.MontySnapshot'>
print(result.function_name)  # fetch
#> fetch
print(result.args)
#> ('https://example.com',)

# Perform the actual fetch, then resume with the result
result = result.resume(return_value='hello world')

print(type(result))
#> <class 'pydantic_monty.MontyComplete'>
print(result.output)
#> 11
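
When the script makes several external calls, the same pattern extends to a loop: keep resuming until execution completes. This is a minimal sketch assuming only the MontySnapshot and MontyComplete types shown above; the script, URLs, and the use of string concatenation inside Monty are illustrative assumptions:

import pydantic_monty

code = """
a = fetch(base + '/a')
b = fetch(base + '/b')
len(a) + len(b)
"""

m = pydantic_monty.Monty(code, inputs=['base'], external_functions=['fetch'])

# Start execution, then service each paused fetch() call in turn
result = m.start(inputs={'base': 'https://example.com'})
while isinstance(result, pydantic_monty.MontySnapshot):
    url = result.args[0]
    result = result.resume(return_value=f'body of {url}')

print(result.output)
#> 58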

Both Monty and MontySnapshot can be serialized to bytes and restored later. This allows caching parsed code or suspending execution across process boundaries:

import pydantic_monty

# Serialize parsed code to avoid re-parsing
m = pydantic_monty.Monty('x + 1', inputs=['x'])
data = m.dump()

# Later, restore and run
m2 = pydantic_monty.Monty.load(data)
print(m2.run(inputs={'x': 41}))
#> 42

# Serialize execution state mid-flight
m = pydantic_monty.Monty('fetch(url)', inputs=['url'], external_functions=['fetch'])
progress = m.start(inputs={'url': 'https://example.com'})
state = progress.dump()

# Later, restore and resume (e.g., in a different process)
progress2 = pydantic_monty.MontySnapshot.load(state)
result = progress2.resume(return_value='response data')
print(result.output)
#> response data
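
Since a dumped snapshot is just bytes, it can also be restored more than once, which is one way to fork execution. This is a minimal sketch, assuming load() and resume() behave as above when called on independently restored copies:

import pydantic_monty

m = pydantic_monty.Monty('fetch(url)', inputs=['url'], external_functions=['fetch'])
progress = m.start(inputs={'url': 'https://example.com'})
state = progress.dump()

# Restore the same snapshot twice and resume each copy with a different
# return value - two independent executions branch from one paused state
branch_a = pydantic_monty.MontySnapshot.load(state).resume(return_value='cached')
branch_b = pydantic_monty.MontySnapshot.load(state).resume(return_value='fresh')
print(branch_a.output, branch_b.output)
#> cached fresh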

Monty can also be called directly from Rust:

use monty::{MontyRun, MontyObject, NoLimitTracker, StdPrint};

let code = r#"
def fib(n):
    if n <= 1:
        return n
    return fib(n - 1) + fib(n - 2)

fib(x)
"#;

let runner = MontyRun::new(code.to_owned(), "fib.py", vec!["x".to_owned()], vec![]).unwrap();
let result = runner.run(vec![MontyObject::Int(10)], NoLimitTracker, &mut StdPrint).unwrap();
assert_eq!(result, MontyObject::Int(55));

MontyRun and RunProgress can be serialized using the dump() and load() methods:

use monty::{MontyRun, MontyObject, NoLimitTracker, StdPrint};

// Serialize parsed code
let runner = MontyRun::new("x + 1".to_owned(), "main.py", vec!["x".to_owned()], vec![]).unwrap();
let bytes = runner.dump().unwrap();

// Later, restore and run
let runner2 = MontyRun::load(&bytes).unwrap();
let result = runner2.run(vec![MontyObject::Int(41)], NoLimitTracker, &mut StdPrint).unwrap();
assert_eq!(result, MontyObject::Int(42));

Monty will power code-mode in Pydantic AI. Instead of making sequential tool calls, the LLM writes Python code that calls your tools as functions and Monty executes it safely.

from pydantic_ai import Agent
from pydantic_ai.toolsets.code_mode import CodeModeToolset
from pydantic_ai.toolsets.function import FunctionToolset
from typing_extensions import TypedDict


class WeatherResult(TypedDict):
    city: str
    temp_c: float
    conditions: str


toolset = FunctionToolset()


@toolset.tool
def get_weather(city: str) -> WeatherResult:
    """Get current weather for a city."""
    # your real implementation here
    return {'city': city, 'temp_c': 18, 'conditions': 'partly cloudy'}


@toolset.tool
def get_population(city: str) -> int:
    """Get the population of a city."""
    return {'london': 9_000_000, 'paris': 2_100_000, 'tokyo': 14_000_000}.get(
        city.lower(), 0
    )


toolset = CodeModeToolset(toolset)

agent = Agent(
    'anthropic:claude-sonnet-4-5',
    toolsets=[toolset],
)

result = agent.run_sync(
    'Compare the weather and population of London, Paris, and Tokyo.'
)
print(result.output)
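
To make this concrete, here is a hypothetical sketch of the kind of script the model might write in code mode for the prompt above. The tool names mirror the FunctionToolset tools defined earlier; exactly how Pydantic AI exposes them inside Monty (sync vs async, naming) is an assumption:

# Hypothetical code-mode script written by the LLM; get_weather and
# get_population are the registered tools, exposed as external functions
report = {}
for city in ['London', 'Paris', 'Tokyo']:
    weather = get_weather(city)
    report[city] = {
        'temp_c': weather['temp_c'],
        'conditions': weather['conditions'],
        'population': get_population(city),
    }
report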

There are generally two responses when you show people Monty:

  1. Oh my god, this solves so many problems, I want it.
  2. Why not X?

Where X is some alternative technology. Oddly, these responses are often combined, suggesting people have not yet found an alternative that works for them, but are incredulous that there's really no good alternative to creating an entire Python implementation from scratch.

I'll try to run through the most obvious alternatives, and why they aren't right for what we wanted.

NOTE: all these technologies are impressive and have widespread uses; this commentary on their limitations for our use case should not be seen as criticism. Most of these solutions were not conceived with the goal of providing an LLM sandbox, which is why they're not necessarily great at it.

| Tech | Language completeness | Security | Start latency | Cost | Setup complexity | File mounting | Snapshotting |
|---|---|---|---|---|---|---|---|
| Monty | partial | strict | 0.06ms | free | easy | easy | easy |
| Docker | full | good | 195ms | free | intermediate | easy | intermediate |
| Pyodide | full | poor | 2800ms | free | intermediate | easy | hard |
| starlark-rust | very limited | good | 1.7ms | free | easy | not available? | impossible? |
| sandboxing service | full | strict | 1033ms | not free | intermediate | hard | intermediate |
| YOLO Python | full | non-existent | 0.1ms / 30ms | free | easy | easy / scary | hard |

See ./scripts/startup_performance.py for the script used to calculate the startup performance numbers.
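
That script is not reproduced here, but as a rough sketch, code-to-result latency can be measured with nothing more than the constructor and run() call shown earlier; the loop count and snippet are arbitrary:

import time

import pydantic_monty

# Time the full path from source code to result, averaged over many iterations
n = 10_000
start = time.perf_counter()
for i in range(n):
    pydantic_monty.Monty('x + 1', inputs=['x']).run(inputs={'x': i})
elapsed = time.perf_counter() - start
print(f'{elapsed / n * 1_000_000:.1f} µs per code-to-result cycle')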

Details on each row below:

Monty:

  • Language completeness: No classes (yet), limited stdlib, no third-party libraries
  • Security: Explicitly controlled filesystem, network, and env access, strict limits on execution time and memory usage
  • Start latency: Starts in microseconds
  • Setup complexity: just pip install pydantic-monty or npm install @pydantic/monty, ~4.5MB download
  • File mounting: Strictly controlled, see #85
  • Snapshotting: Monty's pause and resume functionality with dump() and load() makes it trivial to pause, resume and fork execution
Docker:

  • Language completeness: Full CPython with any library
  • Security: Process and filesystem isolation, network policies, but container escapes exist, memory limitation is possible
  • Start latency: Container startup overhead (~195ms measured)
  • Setup complexity: Requires Docker daemon, container images, orchestration, python:3.14-alpine is 50MB - docker can't be installed from PyPI
  • File mounting: Volume mounts work well
  • Snapshotting: Possible with durable execution solutions like Temporal, or snapshotting an image and saving it as a Docker image.
Pyodide:

  • Language completeness: Full CPython compiled to WASM, almost all libraries available
  • Security: Relies on browser/WASM sandbox - not designed for server-side isolation, python code can run arbitrary code in the JS runtime, only deno allows isolation, memory limits are hard/impossible to enforce with deno
  • Start latency: WASM runtime loading is slow (~2800ms cold start)
  • Setup complexity: Need to load WASM runtime, handle async initialization, pyodide NPM package is ~12MB, deno is ~50MB - Pyodide can't be called with just PyPI packages
  • File mounting: Virtual filesystem via browser APIs
  • Snapshotting: Possible with durable execution solutions like Temporal presumably, but hard

See starlark-rust.

  • Language completeness: Configuration language, not Python - no classes, exceptions, async
  • Security: Deterministic and hermetic by design
  • Start latency: Runs embedded in the process like Monty, hence the impressive startup time
  • Setup complexity: Usable in python via starlark-pyo3
  • File mounting: No file handling by design AFAIK?
  • Snapshotting: Impossible AFAIK?

Services like Daytona, E2B, Modal.

Setting up your own sandbox with k8s presents similar challenges: more setup complexity, but lower network latency.

  • Language completeness: Full CPython with any library
  • Security: Professionally managed container isolation
  • Start latency: Network round-trip and container startup time. I got ~1s cold start time with Daytona EU from London; Daytona advertises sub-90ms latency, presumably for an existing container, and it's not clear whether that includes network latency
  • Cost: Pay per execution or compute time
  • Setup complexity: API integration, auth tokens - fine for startups but generally a non-starter for enterprises
  • File mounting: Upload/download via API calls
  • Snapshotting: Possible with durable execution solutions like Temporal; the services also offer some solutions for this, I think based on Docker containers

Running Python directly via exec() (~0.1ms) or subprocess (~30ms).

  • Language completeness: Full CPython with any library
  • Security: None - full filesystem, network, env vars, system commands
  • Start latency: Near-zero for exec(), ~30ms for subprocess
  • Setup complexity: None
  • File mounting: Direct filesystem access (that's the problem)
  • Snapshotting: Possible with durable execution solutions like Temporal