Dataframe 1.0.0.0

Original link: https://discourse.haskell.org/t/ann-dataframe-1-0-0-0/13834

After two years of development, version 1.0 of a new Haskell dataframe library has been released. A key feature is **typed dataframes**, which provide compile-time schema checking to improve data integrity and smooth the transition between exploratory analysis and production pipelines. This was shaped by valuable community feedback from users maxigit and mcoady. The library now supports reading data from **Hugging Face datasets** and can efficiently process **larger-than-memory files**: a dataset of one billion rows takes roughly 10-30 minutes. **Ergonomics have improved**, with more intuitive numeric operations and null handling. Future development will focus on expanding **connectors** (BigQuery, Snowflake, S3) and **format support** (Parquet, Iceberg, DuckDB). The ultimate goal is to query large data lakes and to integrate with **AI agents** for type-guided data exploration. The author thanks the community, especially daikonradish, for their contributions.

## Dataframe 1.0.0.0 Release & DataHaskell Revival

Haskell's `dataframe` library has reached version 1.0.0.0, introducing the `DataFrame.Typed` API. The new API tracks the dataframe schema at compile time, catching errors in column names and operations *before* runtime, a significant advantage over Python's runtime-centric validation. The feature aims to simplify building complex dashboards and data pipelines. The release is part of a broader revival of the DataHaskell ecosystem, including ongoing work on reactive notebooks (found at [https://www.datahaskell.org/](https://www.datahaskell.org/)) that may provide dashboarding capabilities. Discussion also covered the library's versioning convention (Haskell's PVP uses four-part version numbers) and the challenges of adopting typed data-science tooling in organizations accustomed to R and Python. Users nevertheless emphasized the value of type safety for ensuring the data integrity of published research.

## Original post

It’s been roughly two years of work on this and I think things are in a good enough state that it’s worth calling this v1.

## Features

### Typed dataframes

We got there eventually, and I think we got there in a way that still looks nice. There is now a DataFrame.Typed API that tracks the entire schema of the dataframe: misspelled column names, misapplied operations, etc. are now compile-time failures, and you can easily move between exploratory and pipeline work. This is in large part thanks to maxigit and mcoady (GitHub user names) for their feedback.

```haskell
{-# LANGUAGE DataKinds #-}
{-# LANGUAGE TemplateHaskell #-}
{-# LANGUAGE TypeApplications #-}

import Data.Function ((&))
-- Qualified aliases follow the D./DT. prefixes used throughout this post.
import qualified DataFrame as D
import qualified DataFrame.Typed as DT

-- Generate a "Housing" schema type from the CSV header at compile time.
$(DT.deriveSchemaFromCsvFile "Housing" "./data/housing.csv")

main :: IO ()
main = do
    df <- D.readCsv "./data/housing.csv"
    -- Check the untyped frame against the Housing schema, failing loudly on mismatch.
    let df' = either (error . show) id (DT.freezeWithError @Housing df)
    let df'' =
            df'
                & DT.derive @"rooms_per_household" (DT.col @"total_rooms" / DT.col @"households")
                & DT.impute @"total_bedrooms" 0
                & DT.derive @"bedrooms_per_household"
                    (DT.col @"total_bedrooms" / DT.col @"households")
                & DT.derive @"population_per_household"
                    (DT.col @"population" / DT.col @"households")

    print df''
```
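The compile-time guarantees rest on GHC's type-level strings. The following is a minimal sketch of the general technique only — `Column`, `colName`, and `ratio` are hypothetical names for illustration, not this library's API — showing how a column can carry its name as a `Symbol` so that name mismatches fail at compile time:

```haskell
{-# LANGUAGE DataKinds #-}
{-# LANGUAGE KindSignatures #-}
{-# LANGUAGE ScopedTypeVariables #-}
{-# LANGUAGE TypeApplications #-}

import Data.Proxy (Proxy (..))
import GHC.TypeLits (KnownSymbol, Symbol, symbolVal)

-- A column tagged with its name at the type level.
newtype Column (name :: Symbol) a = Column [a]

-- Recover the type-level name at runtime.
colName :: forall name a. KnownSymbol name => Column name a -> String
colName _ = symbolVal (Proxy @name)

-- The result column's name is fixed in the type; passing a column with
-- the wrong name (or misspelling one) is rejected at compile time.
ratio :: Column "total_rooms" Double -> Column "households" Double
      -> Column "rooms_per_household" Double
ratio (Column xs) (Column ys) = Column (zipWith (/) xs ys)

main :: IO ()
main = do
  let rooms      = Column [6.0, 8.0] :: Column "total_rooms" Double
      households = Column [2.0, 4.0] :: Column "households" Double
      r          = ratio rooms households
  putStrLn (colName r)  -- prints "rooms_per_household"
  let Column vs = r
  print vs              -- prints [3.0,2.0]
```

Swapping the arguments to `ratio`, or deriving from a column name that does not exist, becomes a type error rather than a runtime failure; the `DataFrame.Typed` API scales this property up to whole schemas.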

### Calling dataframe from Python

There’s an implementation of Apache Arrow’s C Data interface, along with an example of how to pass dataframes between polars and Haskell.

Find that here

### Getting data from Hugging Face

You can explore Hugging Face datasets. Example:

```haskell
df <- D.readParquet "hf://datasets/Rafmiggonpaz/spain_and_japan_economic_data/data/train-00000-of-00001.parquet"
```

### Larger-than-memory files

The lazy, query-engine-like implementation is now pretty fast. It can complete the one-billion-row challenge in about 10 minutes on a Mac and about 30 minutes on a 12-year-old Dell (without running out of memory).

You’ll have to generate the data yourself but the code is here.
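For context, the one-billion-row challenge aggregates per-station minimum, mean, and maximum over `station;temperature` lines. A plain-Haskell sketch of the constant-size-accumulator idea behind that kind of workload (illustrative only, not the library's query engine; `parse` and `summarize` are hypothetical helpers) looks like:

```haskell
import Data.List (foldl')
import qualified Data.Map.Strict as M

-- Each input line has the form "station;temperature".
parse :: String -> (String, Double)
parse s = case break (== ';') s of
  (name, _ : val) -> (name, read val)
  _               -> error ("bad line: " ++ s)

-- Per-station accumulator: (min, max, sum, count). A strict left fold
-- keeps one small accumulator per station rather than the whole input.
summarize :: [String] -> M.Map String (Double, Double, Double, Int)
summarize = foldl' step M.empty . map parse
  where
    step m (k, v) = M.insertWith merge k (v, v, v, 1) m
    merge (mn1, mx1, s1, c1) (mn2, mx2, s2, c2) =
      (min mn1 mn2, max mx1 mx2, s1 + s2, c1 + c2)

main :: IO ()
main = mapM_ report (M.toList (summarize sample))
  where
    sample = ["Oslo;2.0", "Lima;18.5", "Oslo;-4.0", "Lima;21.5"]
    report (k, (mn, mx, s, c)) =
      putStrLn (k ++ ": min=" ++ show mn
                  ++ " mean=" ++ show (s / fromIntegral c)
                  ++ " max=" ++ show mx)
```

On the inline sample this prints `Lima: min=18.5 mean=20.0 max=21.5` followed by `Oslo: min=-4.0 mean=-1.0 max=2.0`.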

### Better ergonomics with numeric promotion and null awareness

Introduced more lenient operators that make happy-path computation much easier. E.g.:

```haskell
D.derive "bmi" (F.lift2 (\m h -> (/) <$> m <*> fmap ((^2) . (/100) . realToFrac) h) mass height) df
```

is now instead:

```haskell
D.derive "bmi" (mass ./ (height ./ 100) .^ 2) df
```
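The idea behind such null-aware operators can be sketched in plain Haskell: lift the arithmetic over an `Applicative` such as `Maybe`, so a missing value on either side propagates automatically. This `(./)` is an illustrative definition, not the library's actual one:

```haskell
import Control.Applicative (liftA2)

-- Null-propagating division: if either operand is Nothing, the result
-- is Nothing, with no explicit <$>/<*> plumbing at the call site.
(./) :: (Applicative f, Fractional a) => f a -> f a -> f a
(./) = liftA2 (/)
infixl 7 ./

main :: IO ()
main = do
  let mass   = [Just 70.0, Just 80.0, Nothing]   :: [Maybe Double]
      height = [Just 175.0, Nothing, Just 160.0] :: [Maybe Double]
      -- BMI = mass / (height in metres)^2, computed elementwise; any
      -- missing operand yields a missing result.
      bmi    = zipWith (\m h -> m ./ fmap (^ 2) (h ./ pure 100)) mass height
  print bmi
```

Only the first element has both operands present, so the other two come out as `Nothing` without any per-call null handling.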

## What’s next?

Connectors! BigQuery, Snowflake, S3 buckets, etc. Formats! Parquet, Iceberg, DuckDB, and a custom dataframe format with full data provenance. Moving from small in-memory demos to querying large data lakes is the goal.

Also, since a new era is upon us, some integration with AI agents to do type-guided data exploration.

A big thank you to everyone who has taken time to try the library and/or give advice, especially @daikonradish, who was the main voice in the direction of the architecture. A lot of the most important design decisions were made on community threads. Thank you all.
