Dataframe 1.0.0.0

Original link: https://discourse.haskell.org/t/ann-dataframe-1-0-0-0/13834

After two years of development, version 1.0 of a new Haskell dataframe library has been released. A key feature is **typed dataframes**, which provide compile-time schema checking for better data integrity and a smoother transition between exploratory analysis and production pipelines; this was shaped in large part by community feedback from users maxigit and mcoady. The library can now read **Hugging Face datasets** and efficiently process **larger-than-memory files**: a dataset of one billion rows takes roughly 10-30 minutes. **Ergonomics have also improved**, with more intuitive numeric operations and null handling. Future work will focus on expanding **connectors** (BigQuery, Snowflake, S3) and **format support** (Parquet, Iceberg, DuckDB), with the eventual goal of querying large data lakes and integrating with **AI agents** for type-guided data exploration. The author thanks the community, especially daikonradish, for their contributions.


Original post

It’s been roughly two years of work on this and I think things are in a good enough state that it’s worth calling this v1.

Features

Typed dataframes

We got there eventually, and I think we got there in a way that still looks nice. There is now a DataFrame.Typed API that tracks the entire schema of the dataframe: mistyped column names, misapplied operations, etc. are now compile-time failures, and you can easily move between exploratory and pipeline work. This is in large part thanks to maxigit and mcoady (GitHub user names) for their feedback.

$(DT.deriveSchemaFromCsvFile "Housing" "./data/housing.csv")
    
main :: IO ()
main = do
    df <- D.readCsv "./data/housing.csv"
    let df' = either (error . show) id (DT.freezeWithError @Housing df)
    let df'' =
            df'
                & DT.derive @"rooms_per_household" (DT.col @"total_rooms" / DT.col @"households")
                & DT.impute @"total_bedrooms" 0
                & DT.derive @"bedrooms_per_household"
                    (DT.col @"total_bedrooms" / DT.col @"households")
                & DT.derive @"population_per_household"
                    (DT.col @"population" / DT.col @"households")

    print df''
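One payoff of the tracked schema is worth spelling out. A minimal sketch, assuming the same `Housing` schema and `DT` API as above (the `ratio` column name is illustrative):

```haskell
-- With the typed API, a typo in a column name is a compile-time error.
-- "total_rooms" exists in the Housing schema; "total_room" does not:
--
--   df' & DT.derive @"ratio" (DT.col @"total_room" / DT.col @"households")
--
-- GHC rejects this with a missing-column type error at compile time,
-- instead of the pipeline failing at runtime halfway through a job.
```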

Calling dataframe from Python

There’s an implementation of Apache Arrow’s C Data Interface, along with an example of how to pass dataframes between Polars and Haskell.

Find that here

Getting data from hugging face

You can explore Hugging Face datasets. Example:

df <- D.readParquet "hf://datasets/Rafmiggonpaz/spain_and_japan_economic_data/data/train-00000-of-00001.parquet"

Larger than memory files

The lazy, query-engine-like implementation is now pretty fast. It can complete the one billion row challenge in about 10 minutes on a Mac and about 30 minutes on a 12-year-old Dell, without running out of memory.

You’ll have to generate the data yourself but the code is here.
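For context, the one billion row challenge boils down to a per-station min/mean/max aggregation over a two-column `measurements.txt`. As a rough sketch of the shape of that query (the grouping and aggregation names below are assumptions for illustration, not necessarily this library's actual API):

```haskell
-- Hypothetical sketch of the 1BRC aggregation; D.groupBy, D.aggregate,
-- F.minimum, F.mean and F.maximum are assumed names, not confirmed API.
main :: IO ()
main = do
    -- input rows look like "station;temperature" (semicolon-separated)
    df <- D.readCsv "./measurements.txt"
    print $ df
        & D.groupBy ["station"]
        & D.aggregate
            [ F.minimum (F.col @Double "temperature")
            , F.mean    (F.col @Double "temperature")
            , F.maximum (F.col @Double "temperature")
            ]
```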

Better ergonomics with numeric promotion and null awareness

Introduced more lenient operators that make happy-path computation much easier. For example, this:

D.derive "bmi" (F.lift2 (\m h -> (/) <$> m <*> fmap ((^2) . (/100). realToFrac) h) mass height) df

is now:

D.derive "bmi" (mass ./ (height ./ 100) .^ 2) df
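What the lenient operators are doing here, per the description above, is numeric promotion plus null propagation. A sketch of the intended semantics (my reading of the example, not the library's documented spec), assuming `mass` is a nullable Double column, `height` a nullable Int column in centimetres, and that `.^` binds tighter than `./`:

```haskell
-- mass ./ (height ./ 100) .^ 2
--
--   1. height (Int) is promoted to Double before the division by 100
--   2. (.^ 2) squares the result (height in metres)
--   3. (./) divides mass by the squared height
--   4. any row where mass or height is null yields a null "bmi",
--      without the explicit (<$>)/(<*>) plumbing of the old version
```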

What’s next?

Connectors! BigQuery, Snowflake, s3 buckets etc. Formats! Parquet, Iceberg, DuckDB, a custom dataframe format with full data provenance. Moving from small in memory demos to querying large data lakes is the goal.

Also, since a new era is upon us, some integration with ai agents to do type-guided data exploration.

A big thank you to everyone who has taken the time to try the library and/or give advice, especially @daikonradish, who was the main voice in the direction of the architecture. A lot of the most important design decisions were made on community threads. Thank you all.
