Show HN: Xorq – open-source Python-first Pandas-style pipelines

Original link: https://github.com/xorq-labs/xorq

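xorq is a Python framework designed to bring reproducible, high-performance declarative pipelines to machine learning workflows. Built on Ibis and DataFusion, it lets data transformations be defined as engine-agnostic expressions. xorq supports multi-engine execution and can move data between systems such as Snowflake, DuckDB, and Python. Its features include automatic caching of intermediate results, serializable pipelines (via YAML), and portable UDFs backed by DataFusion. xorq uses Apache Arrow for efficient data transfer and works both as an interactive library and as a command-line tool. The CLI builds and serializes artifacts from expressions, which keeps results reproducible and consistent across environments. xorq uses deferred evaluation, so it can process large datasets without exhausting memory. Although still in pre-release, xorq offers a way to build complex, reproducible ML pipelines using familiar pandas-style syntax.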

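Xorq is an open-source, Python-first library that targets the pain points of taking pandas-style data pipelines from research to production. Built on Ibis and DataFusion, it addresses the SQL/pandas mismatch, debugging difficulties, redundant computation, and unreliable deployments. Key features include an Ibis-based expression system that makes streaming between engines easy; expression caching; a DataFusion-backed UDF engine with pandas DataFrame support; YAML serialization; and easy creation of Flight endpoints from UDFs. The creators encourage collaboration (Apache 2.0 license) and welcome feedback. Installation: `pip install xorq` or `nix run github:xorq-labs/xorq`. Demos include an MCP server + Flight + XGBoost, a DuckDB concurrency example, and an OpenAI UDF. The team is available to answer questions.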

Original text


xorq is a deferred computational framework that brings the replicability and performance of declarative pipelines to the Python ML ecosystem. It enables us to write pandas-style transformations that never run out of memory, automatically cache intermediate results, and seamlessly move between SQL engines and Python UDFs—all while maintaining replicability. xorq is built on top of Ibis and DataFusion.
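To make "deferred" concrete, here is a minimal standard-library sketch of the idea: transformations are recorded as a plan and rows are only read when the plan is executed. The `Deferred` class and its methods are hypothetical names chosen for illustration; this is not xorq's implementation, which builds on Ibis expressions and DataFusion.

```python
# Illustrative toy only: a hypothetical deferred pipeline, not xorq's API.
from typing import Callable, Iterable, Iterator


class Deferred:
    """Records transformation steps and runs them only on execute()."""

    def __init__(self, source: Callable[[], Iterable[dict]], steps=()):
        self._source = source
        self._steps = tuple(steps)

    def mutate(self, **columns: Callable[[dict], object]) -> "Deferred":
        # Adding a step returns a new plan; no rows are read here.
        def step(rows: Iterator[dict]) -> Iterator[dict]:
            for row in rows:
                yield {**row, **{name: fn(row) for name, fn in columns.items()}}

        return Deferred(self._source, self._steps + (step,))

    def execute(self) -> Iterator[dict]:
        # Rows stream through the steps one at a time via generators.
        rows = iter(self._source())
        for step in self._steps:
            rows = step(rows)
        return rows


reads = []  # records every row actually pulled from the source


def source():
    for i in range(3):
        reads.append(i)
        yield {"n": i}


plan = Deferred(source).mutate(double=lambda row: row["n"] * 2)
rows_read_before_execute = len(reads)  # the plan exists, but nothing has run
result = list(plan.execute())
```

Because each step is a generator, rows flow through the plan one at a time rather than being materialized between steps, which is the property that lets a deferred pipeline handle inputs larger than memory.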

Feature | Description
Declarative expressions | Express and execute complex data processing logic via declarative functions. Define transformations as Ibis expressions so that you are not tied to a specific execution engine.
Multi-engine | Create unified ML workflows that leverage the strengths of different data engines in a single pipeline. xorq orchestrates data movement between engines (e.g., Snowflake for initial extraction, DuckDB for transformations, and Python for ML model training).
Built-in caching | xorq automatically caches intermediate pipeline results, minimizing repeated work.
Serializable pipelines | All pipeline definitions, including UDFs, are serialized to YAML, enabling version control, reproducibility, and CI/CD integration. This ensures consistent results across environments and makes it easy to track changes over time.
Portable UDFs | Build pipelines as UDxFs: aggregates, windows, and transformations. The DataFusion-based xorq engine provides a portable runtime for UDF execution.
Arrow-native architecture | Built on the Apache Arrow columnar memory format and the Arrow Flight transport layer, xorq achieves high-performance data transfer without cumbersome serialization overhead.
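The caching feature can be pictured as memoizing results by the expression that produced them. The sketch below is a toy under stated assumptions: expressions are plain nested tuples and the cache is an in-memory dict, both invented for illustration. It does not reflect how xorq stores or keys its cache.

```python
# Illustrative toy only: memoizing sub-results by the expression that produced them.
cache = {}   # expression -> computed value
calls = []   # records which operations were actually evaluated


def run(expr):
    """Evaluate a nested-tuple expression such as ("add", ("lit", 1), ("lit", 2))."""
    if expr in cache:
        return cache[expr]  # identical sub-expression seen before: reuse it
    op, *args = expr
    calls.append(op)
    if op == "lit":
        value = args[0]
    elif op == "add":
        value = run(args[0]) + run(args[1])
    elif op == "mul":
        value = run(args[0]) * run(args[1])
    else:
        raise ValueError(f"unknown op: {op}")
    cache[expr] = value
    return value


shared = ("add", ("lit", 2), ("lit", 3))    # an intermediate result used twice
first = run(("mul", shared, ("lit", 10)))
calls_after_first = len(calls)
second = run(("mul", shared, ("lit", 4)))   # reuses the cached value of `shared`
new_calls = calls[calls_after_first:]
```

In the second run only the new multiplication and the new literal are evaluated; the shared addition is served from the cache.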

xorq functions as both an interactive library for building expressions and a command-line interface. This dual nature enables a seamless transition from exploratory research to production-ready artifacts. The steps below will guide you through using both the CLI and library components to get started.

Caution

This library does not currently have a stable release. Both the API and implementation are subject to change, and future updates may not be backward compatible.

xorq is available as xorq on PyPI:

pip install xorq

Note

We are changing the name from LETSQL to xorq.

# your_pipeline.py
import xorq as xo
import xorq.expr.datatypes as dt

@xo.udf.make_pandas_udf(
    schema=xo.schema({"title": str, "url": str}),
    return_type=dt.bool,
    name="url_in_title",
)
def url_in_title(df):
    return df.apply(
        lambda s: (s.url or "") in (s.title or ""),
        axis=1,
    )

# Connect to xorq's embedded engine
con = xo.connect()

# Reference to the parquet file
name = "hn-data-small.parquet"

expr = xo.deferred_read_parquet(
    con,
    xo.options.pins.get_path(name),
    name,
).mutate(**{"url_in_title": url_in_title.on_expr})

expr.execute().head()
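The lambda inside `url_in_title` can be checked in isolation. The following sketch applies the same per-row expression to a few hand-written rows; the helper name `url_in_title_row` and the sample data are made up for illustration, and missing values are assumed to arrive as `None`.

```python
# Illustrative only: the per-row logic of the UDF above, without pandas or xorq.
def url_in_title_row(title, url):
    # `x or ""` replaces a missing (None) value with an empty string.
    return (url or "") in (title or "")


sample_rows = [
    ("Show HN: example.com launch", "example.com"),  # URL appears in the title
    ("Ask HN: anything", "example.com"),             # URL absent from the title
    ("No link", None),                               # missing URL
    (None, "example.com"),                           # missing title
]
flags = [url_in_title_row(title, url) for title, url in sample_rows]
```

One detail worth noticing: the empty string is a substring of every string, so a row with a missing URL evaluates to `True` under this logic. A pipeline that wants to treat missing URLs as non-matches would need an explicit check.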

xorq provides a CLI that enables you to build serialized artifacts from expressions, making your pipelines reproducible and deployable:

# Build an expression from a Python script
xorq build your_pipeline.py -e "expr" --target-dir builds

This will create a build artifact directory named by its expression hash:

builds
└── fce90c2d4bb8
   ├── abe2c934f4fe.sql
   ├── cec2eb9706bc.sql
   ├── deferred_reads.yaml
   ├── expr.yaml
   ├── metadata.json
   ├── profiles.yaml
   └── sql.yaml

The CLI converts Ibis expressions into serialized artifacts that capture the complete execution graph, ensuring consistent results across environments. More info can be found in the tutorial Building with xorq.
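Naming a build directory by a hash of its expression means the same definition always lands in the same place. A rough sketch of that idea follows. The `build` function, the use of SHA-256, the 12-character truncation, and the `expr.json` file name are all assumptions made for this example (JSON stands in for YAML because it is in the standard library); the source does not say how xorq computes its hash.

```python
# Illustrative only: content-addressed build directories. Hypothetical scheme,
# not xorq's actual hashing or file layout.
import hashlib
import json
import tempfile
from pathlib import Path


def build(expr_definition: dict, target_dir: Path) -> Path:
    # Canonical JSON (sorted keys) so that key order does not change the digest.
    payload = json.dumps(expr_definition, sort_keys=True).encode("utf-8")
    digest = hashlib.sha256(payload).hexdigest()[:12]
    out = target_dir / digest
    out.mkdir(parents=True, exist_ok=True)
    (out / "expr.json").write_bytes(payload)
    return out


with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    a = build({"op": "mutate", "source": "hn-data-small.parquet"}, base)
    b = build({"source": "hn-data-small.parquet", "op": "mutate"}, base)
    c = build({"op": "filter", "source": "hn-data-small.parquet"}, base)
    same_dir = a == b        # same definition, same directory
    different_dir = a != c   # different definition, different directory
    name_length = len(a.name)
    artifact_written = (a / "expr.json").exists()
```

Rebuilding an unchanged expression is then idempotent, while any change to the definition produces a new directory alongside the old one.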

For more examples of how to use xorq, check the examples directory. To run some of the scripts there, you need to install the library with the examples extra:

pip install 'xorq[examples]'

Contributions are welcome and highly appreciated. To get started, check out the contributing guidelines.

This project heavily relies on Ibis and DataFusion.

This repository is licensed under the Apache License 2.0.
