It’s been roughly two years of work on this, and I think things are in a good enough state that it’s worth calling this v1.
Features
Typed dataframes
We got there eventually, and I think we got there in a way that still looks nice. There is now a DataFrame.Typed API that tracks the entire schema of the dataframe: mistyped column names, misapplied operations, and the like are now compile-time failures, and you can easily move between exploratory and pipeline work. This is in large part thanks to maxigit and mcoady (GitHub usernames) for their feedback.
```haskell
$(DT.deriveSchemaFromCsvFile "Housing" "./data/housing.csv")

main :: IO ()
main = do
  df <- D.readCsv "./data/housing.csv"
  let df' = either (error . show) id (DT.freezeWithError @Housing df)
  let df'' =
        df'
          & DT.derive @"rooms_per_household"
              (DT.col @"total_rooms" / DT.col @"households")
          & DT.impute @"total_bedrooms" 0
          & DT.derive @"bedrooms_per_household"
              (DT.col @"total_bedrooms" / DT.col @"households")
          & DT.derive @"population_per_household"
              (DT.col @"population" / DT.col @"households")
  print df''
```
Calling dataframe from Python
There’s an implementation of Apache Arrow’s C Data interface, along with an example of how to pass dataframes between Polars and Haskell.
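The post doesn’t show dataframe’s actual binding, but to give a feel for what the C Data interface involves, here is the `ArrowSchema` struct from the Arrow C Data interface specification modelled in Haskell with `Foreign.Storable`. The field names and layout follow the published spec (offsets assume a 64-bit platform); this is an illustrative sketch, not the library’s code:

```haskell
{-# LANGUAGE RecordWildCards #-}
import Foreign
import Foreign.C.String (CString)

-- Mirrors `struct ArrowSchema` from the Arrow C Data interface spec.
data ArrowSchema = ArrowSchema
  { asFormat      :: CString                          -- type encoding, e.g. "l" = int64
  , asName        :: CString
  , asMetadata    :: CString
  , asFlags       :: Int64
  , asNChildren   :: Int64
  , asChildren    :: Ptr (Ptr ArrowSchema)
  , asDictionary  :: Ptr ArrowSchema
  , asRelease     :: FunPtr (Ptr ArrowSchema -> IO ()) -- producer-owned destructor
  , asPrivateData :: Ptr ()
  }

instance Storable ArrowSchema where
  sizeOf _    = 72          -- nine word-sized fields on 64-bit
  alignment _ = 8
  peek p = ArrowSchema
    <$> peekByteOff p 0  <*> peekByteOff p 8  <*> peekByteOff p 16
    <*> peekByteOff p 24 <*> peekByteOff p 32 <*> peekByteOff p 40
    <*> peekByteOff p 48 <*> peekByteOff p 56 <*> peekByteOff p 64
  poke p ArrowSchema{..} = do
    pokeByteOff p 0  asFormat
    pokeByteOff p 8  asName
    pokeByteOff p 16 asMetadata
    pokeByteOff p 24 asFlags
    pokeByteOff p 32 asNChildren
    pokeByteOff p 40 asChildren
    pokeByteOff p 48 asDictionary
    pokeByteOff p 56 asRelease
    pokeByteOff p 64 asPrivateData
```

The interface works by both sides agreeing on this plain C struct layout, so data can cross the Python/Haskell boundary as a pointer with no serialization.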
Getting data from Hugging Face
You can explore Hugging Face datasets via `hf://` URLs. Example:
```haskell
df <- D.readParquet "hf://datasets/Rafmiggonpaz/spain_and_japan_economic_data/data/train-00000-of-00001.parquet"
```
Larger-than-memory files
The lazy, query-engine-like implementation is now pretty fast. It can complete the one billion row challenge in about 10 minutes on a Mac and about 30 minutes on a 12-year-old Dell, without running out of memory.
You’ll have to generate the data yourself, but the code is here.
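For context, the challenge reduces to a per-station min/mean/max over a billion temperature readings. A minimal plain-Haskell sketch of that aggregation (independent of dataframe, using a strict `Map` as the accumulator) looks like:

```haskell
import Data.List (foldl')
import qualified Data.Map.Strict as M

-- Per-station accumulator: (min, max, sum, count).
type Acc = (Double, Double, Double, Int)

-- Fold one (station, temperature) reading into the running accumulators.
step :: M.Map String Acc -> (String, Double) -> M.Map String Acc
step m (station, t) = M.insertWith merge station (t, t, t, 1) m
  where
    merge (mnN, mxN, sN, nN) (mnO, mxO, sO, nO) =
      (min mnN mnO, max mxN mxO, sN + sO, nN + nO)

-- Reduce all readings to (min, mean, max) per station.
summarize :: [(String, Double)] -> M.Map String (Double, Double, Double)
summarize =
  M.map (\(mn, mx, s, n) -> (mn, s / fromIntegral n, mx))
    . foldl' step M.empty
```

The lazy implementation streams the file in chunks and folds each chunk into accumulators like these, which is why the whole billion rows never need to be resident in memory at once.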
Better ergonomics with numeric promotion and null awareness
Introduced more lenient operators that make happy-path computation much easier. For example, this:
```haskell
D.derive "bmi" (F.lift2 (\m h -> (/) <$> m <*> fmap ((^2) . (/100) . realToFrac) h) mass height) df
```
is now instead:
```haskell
D.derive "bmi" (mass ./ (height ./ 100) .^ 2) df
```
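The idea behind such null-aware operators can be sketched in plain Haskell: arithmetic lifted over columns of optional values, so missing data propagates as missing instead of erroring. The operator below is illustrative only, not the library’s actual definition:

```haskell
import Control.Applicative (liftA2)

infixl 7 ./
-- Hypothetical null-aware, element-wise division over columns of
-- optional doubles: a missing value on either side of a row yields
-- a missing result for that row.
(./) :: [Maybe Double] -> [Maybe Double] -> [Maybe Double]
(./) = zipWith (liftA2 (/))
```

So `[Just 70, Nothing] ./ [Just 1.75, Just 1.8]` gives `[Just 40.0, Nothing]`: the happy path divides straight through, and the null row stays null without any explicit `Maybe` plumbing at the call site.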
What’s next?
Connectors! BigQuery, Snowflake, S3 buckets, etc. Formats! Parquet, Iceberg, DuckDB, and a custom dataframe format with full data provenance. The goal is to move from small in-memory demos to querying large data lakes.
Also, since a new era is upon us, some integration with AI agents to do type-guided data exploration.
A big thank you to everyone who has taken the time to try the library and/or give advice, especially @daikonradish, who was the main voice in the direction of the architecture. A lot of the most important design decisions were made on community threads. Thank you all.