Smallpond – A lightweight data processing framework built on DuckDB and 3FS

jamesblonde · 2025-03-02T20:21:20 1740946880

We are seeing more and more specialized query engines. This is a query engine specialized for training pipelines. It is not general purpose - it is for providing batches of training data to workers. It uses Ray for parallelization. The kinds of queries you need are random reads (to implement shuffling across epochs), Arrow support (zero copy to Pandas DataFrames), and efficient checkpointing.
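
To make the "random reads for shuffling across epochs" point concrete, here is a minimal sketch in plain Python. It is not smallpond's API: `epoch_batches`, `read_rows`, and the in-memory `table` are hypothetical stand-ins, and a real pipeline would issue the random reads against storage rather than a list.

```python
import random

def epoch_batches(num_rows, batch_size, epoch, seed=0):
    """Yield batches of row indices for one epoch in a reproducible shuffled order."""
    # Derive the RNG state from (seed, epoch): every epoch gets its own order,
    # and the same (seed, epoch) pair always reproduces that order.
    rng = random.Random(seed * 1_000_003 + epoch)
    order = list(range(num_rows))
    rng.shuffle(order)
    for start in range(0, num_rows, batch_size):
        yield order[start:start + batch_size]

def read_rows(table, indices):
    """Hypothetical stand-in for random reads against the storage layer."""
    return [table[i] for i in indices]

table = [f"sample-{i}" for i in range(10)]
for epoch in range(2):
    for batch in epoch_batches(len(table), batch_size=4, epoch=epoch):
        rows = read_rows(table, batch)  # one shuffled batch for a worker
```

Because the order is a pure function of the seed and the epoch, a checkpoint in this sketch only needs to record the epoch and the batch number to resume. Whether smallpond does it this way is an assumption on my part.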

nyrikki · 2025-03-03T15:33:42 1741016022

Some of what they are doing is simply what was lost due to the ubiquitous nature of the relational model.

The hierarchical model is applicable to many problems, and it is part of why moving off mainframes is challenging: IMS is so much more efficient than the relational model for applications like airline tickets.

There have been several efforts to leverage object stores in the way they did that I am aware of but it was a hard sell.

The hierarchical model really only works for many-to-one relationships, and its integrity model differs and is not as DRY.

There are lessons to learn here but it requires some relearning.

When you have a shopping cart, having data local to the server handling the transaction is also a benefit.

Codd's relational model has advantages, but it has held back some efforts because we are so used to dealing with the painful parts that we often don't consider other options.

auxten · 2025-03-03T03:34:14 1740972854

Data operations are increasingly happening near the GPU side to boost efficiency, especially for compute-heavy workflows. Arrow file processing and zero-copy queries on DataFrames are becoming crucial for modern data pipelines. I think another option worth considering is chdb, which supports these features and fits well with this shift. (author of chdb here)

agilob · 2025-03-03T08:23:19 1740990199

I'm super impressed by how much effort DeepSeek put in and how much of it they open-sourced.

orlp · 2025-03-02T18:31:35 1740940295

One thing I found peculiar is that for the GraySort benchmark it dispatches to Polars by default to do the actual sorting, not DuckDB: https://github.com/deepseek-ai/smallpond/blob/ed112db42af4d0....

tomnipotent · 2025-03-03T02:03:34 1740967414

The function argument defaults to polars, but the actual benchmark code sets duckdb by default.

https://github.com/deepseek-ai/smallpond/blob/ed112db42af4d0...
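
A minimal sketch of how two layers of defaults can disagree like this. The names below (`sort_rows`, `build_parser`, `main`, `--engine`) are hypothetical and are not taken from the smallpond repo. The function-level default only applies when a caller passes nothing, while the script always passes its own value:

```python
import argparse

def sort_rows(rows, engine="polars"):
    """Library-level default: 'polars' is used only if the caller omits `engine`."""
    # Stand-in for the real work: report the chosen engine and sort with Python.
    return engine, sorted(rows)

def build_parser():
    parser = argparse.ArgumentParser(description="hypothetical benchmark script")
    # Script-level default: the script always forwards this value,
    # so the library default above is never reached on a normal run.
    parser.add_argument("--engine", default="duckdb", choices=["duckdb", "polars"])
    return parser

def main(argv=None):
    args = build_parser().parse_args(argv)
    return sort_rows([3, 1, 2], engine=args.engine)
```

Calling `main([])` goes through the parser and picks `duckdb`, while calling `sort_rows(...)` directly falls back to `polars`. That is why reading only the function signature gives the wrong impression.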

orlp · 2025-03-03T08:59:45 1740992385

I see, confusing multiple layers of defaults :)

dang · 2025-03-02T20:20:12 1740946812

Related ongoing thread:

Understanding Smallpond and 3FS - https://news.ycombinator.com/item?id=43232410

also:

DuckDB goes distributed? DeepSeek's smallpond takes on Big Data - https://news.ycombinator.com/item?id=43206964 (no comments there, but some people have been recommending that article)

rubenvanwyk · 2025-03-02T19:08:15 1740942495

May Data Engineering content keep on hitting front page HN!

HackerThemAll · 2025-03-02T18:19:20 1740939560

DuckDB itself is cool enough, especially when combined with SQLite and/or PostgreSQL, and now this. Thanks DeepSeek!

dcreater · 2025-03-03T07:23:33 1740986613

How is duckdb combined with SQLite? Aren't they alternatives to each other?

HackerThemAll · 2025-03-03T14:39:17 1741012757

They are complementary to each other. There's an SQLite extension for use within DuckDB [1], which gives you the great transactional capabilities of SQLite and the speed of analytical queries in DuckDB's columnar storage engine, all within a single database.

[1] https://duckdb.org/docs/stable/extensions/sqlite.html

jitl · 2025-03-03T09:11:47 1740993107

Not sure what the poster meant but DuckDB is an analytics DB, it doesn’t have a btree index - at least not last time I looked. You could consider it the OLAP embedded DB to SQLite’s OLTP embedded db.

DuckDB can read SQLite so you can even imagine using them side by side in the same system, serving point reads and writes from SQLite and using DuckDB for stuff like aggregates and searches that SQLite is slower at.

dcreater · 2025-03-03T07:28:07 1740986887

Confused by the example in the repo? What is the use case for this? Is it a replacement for dask, ray etc? (Not a professional swe)

fastasucan · 2025-02-28T13:26:51 1740749211

What does this do - what is the benefit over DuckDB, Polars etc?

articsputnik · 2025-02-28T15:06:19 1740755179

Mehdi just wrote about this. Mainly starting DAGs parallelism using Ray (core) and their filesystem 3FS. See https://mehdio.substack.com/p/duckdb-goes-distributed-deepse....

mritchie712 · 2025-03-02T14:39:53 1740926393

I don't think you get any real benefits over duckdb unless your data is 10tb+ or you spin up 3FS (which seems challenging).

RyanHamilton · 2025-03-02T19:35:45 1740944145

If you want to check out duckdb, try QStudio. It's a free sql client with duckdb integrated: https://www.timestored.com/qstudio/help/duckdb-sql-editor. Disclaimer: I'm the main author.

maximilianroos · 2025-03-02T19:39:36 1740944376

Big fan of QStudio! Thanks for building it!

dcreater · 2025-03-03T07:26:22 1740986782

What's with the win95 ui?

RyanHamilton · 2025-03-03T09:14:22 1740993262

There are many themes to choose from. I recorded the demo on that page and I like windows 95. I concede it may not be pretty but I've always found it functional. The default is darcula theme like shown on the main page: https://www.timestored.com/qstudio/

shipp02 · 2025-03-02T18:53:00 1740941580

Is the code written by the deepseek model?

I should probably give up on being a software engineer if it is.

cavisne · 2025-03-02T20:19:09 1740946749

There is a Chinese blog post from 2019 about 3FS, so it predates deepseek [1]. It will be interesting to see the benchmarks but I suspect without 3FS smallpond is not that useful (the bottleneck would move to the networked file system).

None of the big US clouds support Infiniband broadly (Azure & Oracle have some support) so 3FS itself is not very useful to US companies who want to use public clouds.

[1] https://www.high-flyer.cn/blog/3fs/

breadwinner · 2025-03-02T19:09:24 1740942564

Give up and become what? Most white collar jobs will be automated in the coming years. You think doctors' jobs are safe?

ezst · 2025-03-02T19:23:25 1740943405

Not OP, but, anything that actually physically affects the real world for the better? For instance, large infrastructure engineering and construction projects are not going to run themselves any time soon. The world doesn't revolve around ad and fin tech.

didntknowyou · 2025-03-02T20:07:48 1740946068

you can already google the information; the majority of a doctor's value is not in their information but their people and technical skills.

agilob · 2025-03-03T08:27:20 1740990440

>the majority of a doctor's value is not in their information but their people and technical skills.

Not even that, in the last years I haven't met a single doctor who would even care. Their value is now a necessary evil, they have the legal powers to recommend you to a hospital and give prescriptions. These legal powers will be much harder to change.

rscho · 2025-03-02T20:21:22 1740946882

Well, googling the info is one thing. But today, medicine is still mostly a know-how profession. Residency is there mostly to transmit know-how.

nurettin · 2025-03-03T04:13:10 1740975190

If your white collar job consists of simply using software, like copying numbers you see to an excel sheet, maybe. Otherwise they are pretty safe. People have been building tools and automation for thousands of years, yet nobody invented a fully automated cook for your fancy family dinner.

rscho · 2025-03-02T19:23:27 1740943407

Yes, doctors are safe. Because they do things. With their hands. That no one else does.

aragonite · 2025-03-02T19:39:38 1740944378

> Because they do things. With their hands. That no one else does

That's only true of surgeons :) What if your specialty is nonsurgical (internal medicine, pediatrics, psychiatry, etc)?

rscho · 2025-03-02T19:46:36 1740944796

Almost all specialties do various technical procedures that only they really know how to do. The extreme is psychoanalytic psychiatrists, who are the only ones really doing nothing with their hands (yes, interventional psychiatry is a thing...). Now, you could argue that 'yes, but most of the time it's done by techs/nurses'. Well, no. When things go south, and in all places where there is no one else to do the stuff (of which there are many), docs are on their own.

Regarding surgery, I expect it to be one of the easiest procedures to automate, actually (still quite hard, obviously). Because surgery is the only case where there's always advanced imaging available beforehand, and the environment is relatively fixed (OR).

menaerus · 2025-03-03T10:30:37 1740997837

Why do you think medical science wrt complexity is any different than applied math, which computer science essentially is? People already can use LLMs to assist them in diagnosing health issues so why would it be hard to believe that the doctors won't be using the same kind of assistance soon too?

rscho · 2025-03-03T15:25:30 1741015530

> Why do you think medical science wrt complexity is any different than applied math

I don't think I wrote that.

Doctors already use tech assistance. I just pointed out that while we've got efficient robots for applied math, we don't have those as agents in the physical world. People who do blue collar jobs are less replaceable. Well, believe it or not, but most doctors are actually blue collar workers.

menaerus · 2025-03-03T18:11:27 1741025487

You sort of implied that with your replies across the thread. And since AI already replaced part of the CS, I was wondering why do you think this would not be the case with doctors. I'm not sure I agree it's a blue collar profession. I can easily see diagnostics being replaced with AI models.

rscho · 2025-03-03T19:20:46 1741029646

I never wanted to imply that. But here, people frequently assume that because that's what they're used to. Diagnosis is the tip of the iceberg. Most people here aren't sick, so diagnostics are their only focus. If they get ill, they want a diagnosis. But many people are chronically ill already, and doctors spend most of their time treating, not diagnosing. Treating people is made in good part of technical procedures and practical assessments, and you need doctors for that because robots are still far behind for that kind of stuff. People actually have a completely skewed view of what a doctor is.

downrightmike · 2025-03-02T19:53:48 1740945228

Not even true of all surgeons, the ones that make the most money use machines to work on things their hands couldn't do

skeeter2020 · 2025-03-02T23:48:55 1740959335

pathologists are some of the highest paid doctors and they are right in the crosshairs of what AI is getting better at performing.

rscho · 2025-03-03T00:45:47 1740962747

Do you really know what pathologists do ? Apparently not...

geodel · 2025-03-03T15:51:03 1741017063

IT engineers thought the same. Until finally automation is setting them right.

rscho · 2025-03-03T17:31:55 1741023115

I'm sure going to be amazed if the LLMs of the next 10 years suddenly acquire the ability to physically cut just the right bit of a random surgical piece, with a precise idea of where, when and in what orientation the surgeon dug it out, all that with shitty documentation. Humans will be cheaper for a long time still.

rscho · 2025-03-02T19:55:36 1740945336

Haha. Have you actually ever seen a surgical robot yourself? Your claim is laughable. There is no automation whatsoever in any robot on the market currently.

downrightmike · 2025-03-02T19:56:17 1740945377

not automation, yet

nurettin · 2025-03-03T04:14:59 1740975299

Psychiatrists do that triangle shape with their hands.

ghc · 2025-03-02T19:47:02 1740944822

Uh, pediatricians do a lot with their hands. I don't think my kids (or future grandkids) will be seeing an AI/robot doctor.

mdaniel · 2025-03-02T19:45:02 1740944702

Also, a hallucination for 'SELECT mising_field FROM borgus_tuble' is one thing, hallucinating that taking a dose of Cl Na O along with CH3 CO2 H will cure covid is another thing entirely

sramam · 2025-03-02T20:00:10 1740945610

This is so funny!

However it can't even be called hallucinating. Imagine the incident "postmortem":

    But the AI was trained on White House press briefings

Made my day...

sharpshadow · 2025-03-03T05:02:41 1740978161

Is it really true that people drank bleach!? It always felt to me as if some idiot did it once and it was repeated by the media endlessly, probably for clicks because this story is so dumb. Nonetheless the actual thing which people take is ClO2.

delfinom · 2025-03-02T19:38:03 1740944283

Nope.

Healthcare megacorps are buying up independent practices like crazy. All because doctors can't keep up with the bullshit IT required for insurance, state mandates, etc and that's in addition to the insanity of even renting commercial real estate for an office these days.

These megacorps set quotas and push doctors to nickel and dime like crazy. They sure as shit will spend the money to find robots that can give you a prostate exam with a robot dildo.

mdaniel · 2025-03-02T19:47:06 1740944826

Sounds good; if all these pro-AI folks could get it to complete the insurance paperwork that'd be swell. Actually, come to think of it, do that for the paperwork from both sides, doctor and patient, and eliminate an entire class of leeches upon humanity

I'm going to laugh if DOGE eliminates the IRS, but also might be thankful

tyre · 2025-03-02T19:52:53 1740945173

Join us at https://www.camber.health/ if you want to help fix this.

We build software that automates insurance billing for clinics.

And yes, the sentiment is correct that the burden of insurance encourages consolidation in healthcare. Wrapping that away (i.e. Stripe for healthcare financial infra) lowers the barrier to entrepreneurship.

rscho · 2025-03-02T19:52:08 1740945128

Don't laugh too quickly, because what you describe is already happening: models are used to design processes allowing insurance corps to deny claims optimally, while on the other side models write your claims. If I were you, I wouldn't be laughing. If you are laughing, then you don't see where this is going to take us.

holdenk · 2025-03-03T07:15:15 1740986115

Admittedly just one part of the insurance paperwork but we're working on automating appeals at Fight Health Insurance / Fight Paperwork :) ( https://www.fighthealthinsurance.com/ / https://www.fightpaperwork.com/ )

rscho · 2025-03-02T19:48:54 1740944934

Except the tech to do that is not there, and we're quite far from it. It's one thing to have a robot write text, it's a whole other thing to have a robot perform at human level in medical procedures. Not happening tomorrow.

risyachka · 2025-03-03T00:22:56 1740961376

Why would you assume that?

menaerus · 2025-03-03T10:35:01 1740998101

It's not impossible to imagine. Not many LoC and it's a lightweight python layer on top of the stack that actually does the heavy lifting - DuckDB and 3FS.

lvl155 · 2025-03-02T19:39:54 1740944394

Looking forward to next few years when we can finally abstract away all the back-end techs.

BobbyJo · 2025-03-02T19:48:34 1740944914

We ain't even solved garbage collection yet, and you think "back end systems" are going to be abstracted away in the next few years?

purplerabbit · 2025-03-02T20:44:53 1740948293

Maybe they just mean for the type of projects they care about

BobbyJo · 2025-03-02T22:28:47 1740954527

Can't you already just use FaaS and managed persistence?

tarruda · 2025-03-02T19:59:02 1740945542

> We ain't even solved garbage collection yet

Can you elaborate on that?

BobbyJo · 2025-03-02T20:06:54 1740946014

People still write in languages that force you to manage your own memory.

Once performance starts to matter (either due to scale or time requirements) abstractions always have tradeoffs you can't accept.

pyrolistical · 2025-03-02T20:48:16 1740948496

So then, how can garbage collection ever be solved if it’s a trade-off

BobbyJo · 2025-03-02T22:30:16 1740954616

And how can backends be abstracted away if there is a trade off?

As long as compute is a meaningful percentage of spend, the trade off will matter.

pyrolistical · 2025-03-02T22:43:31 1740955411

Right. So what does it look like for garbage collection to be solved? You’re saying it’s not ever possible

BobbyJo · 2025-03-03T00:05:40 1740960340

I am saying it's not possible for the foreseeable future, yes. The same way backends becoming an abstraction most developers don't need to worry about is also not going to happen in the foreseeable future.

threeseed · 2025-03-02T21:43:02 1740951782

We've had this for at least a decade now.

If you use a cloud provider there are managed solutions for data engineering pipelines.

m2f2 · 2025-03-03T05:46:40 1740980800

Sure, and it's not cheap.

agilob · 2025-03-03T08:28:26 1740990506

AI isn't going to be any cheaper
