(comments)

Original link: https://news.ycombinator.com/item?id=40509987

A team is building a form-builder SaaS. For smaller forms, users can browse responses via infinite scrolling, loading data as needed. Larger forms, however, can contain more than 50,000 responses, which leads to heavy network usage and memory pressure. By storing these large response sets as Parquet files in Simple Storage Service (S3), the team uses DuckDB and DuckDB-WASM to load the entire dataset locally. Users then interact with a local database instead of fetching rows from the server as they scroll. This simplifies the flow, improves efficiency, and keeps performance consistent even with large data volumes. For services handling medium-sized data, consider trying this approach to save network bandwidth and reduce memory requirements.

Original text


This is the exact reason we brought duckdb and duckdb-wasm into our service.

Our team is currently building a form builder SaaS. Most forms have fewer than 1,000 responses, but some of them have more than 50,000.

So, when a user tries to explore all responses in our “response sheet” feature, they are usually loaded via infinite scrolling (load as they scroll).

This uses up to 100MB of network traffic in total if they have to fetch object arrays of 50,000 rows of data with 50 columns.

That was where duckdb kicked in: just store the responses in S3 as a parquet file (in our case Cloudflare R2).

Then, load the whole file into duckdb-wasm on the client. So when you scroll through the sheet, instead of fetching rows from the server, you query rows from the local db.
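
A rough sketch of the pattern in DuckDB SQL (the table name, column, and local file name here are invented for illustration; inside duckdb-wasm the parquet file would first be registered with the instance):

    -- one-time load of the downloaded parquet file into a local table
    CREATE TABLE responses AS
        SELECT * FROM read_parquet('responses.parquet');

    -- scrolling the sheet then becomes local pagination, with no network requests
    SELECT *
    FROM responses
    ORDER BY submitted_at
    LIMIT 100 OFFSET 2000;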

This made our sheet feature very efficient and consistent in terms of speed and memory usage.

If network speed and memory are your bottleneck when loading “medium” data into your client, you should definitely give it a try.

PS. If you have any questions, feel free to ask!

PS. Our service is called Walla, check it out at https://home.walla.my/en



> you query rows from local db.

But the data is still remote (in object storage), right? If I understand correctly, this works better than the first solution because parquet is a much more efficient format to query?



My comment was a bit ambiguous. So, for sheets where we have to load all data, we would load all columns at once as a parquet file. (I will explain the advantage of this approach in the next comment.)

On the other hand, let’s say we have to draw a chart from a column. The chart type can be changed by the user - it could be a pie chart, mean, time series chart, median, table or even dot products. To achieve this, we fetch just a single column from s3 using duckdb, apply sql queries on the client side, and render the appropriate ui.
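
For that case, a query along these lines can be run against the remote file directly, so only the byte ranges for the referenced column are fetched (the bucket path and column name are made up):

    -- native DuckDB needs the httpfs extension for s3:// paths:
    -- INSTALL httpfs; LOAD httpfs;
    SELECT rating, count(*) AS n
    FROM read_parquet('s3://example-bucket/form_123.parquet')
    GROUP BY rating
    ORDER BY n DESC;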



Long story short, you could either 1) query specific columns using the s3-parquet-duckdb stack, or 2) load the parquet file over the network and put it inside a local duckdb-wasm instance so that you can run queries from the client side.



Great point.

The advantages of loading “parquet” on the “client side” are that 1) you only have to load data once from the server and 2) parquet files are surprisingly well compressed.

1) If you load once from the server, there are no more small network requests while you are scrolling a table. Moreover, you can use the same duckdb table to visualize data or show raw data.

2) Sending the whole dataset as a parquet file over the network is faster than receiving the data as json responses.



I wonder how much of the benefit is from just not using json vs reducing round trips. I guess if you had a large table you could stream (smaller than normal) row groups from parquet? Not sure how much work that would be though.

I'm not sure what the optimal size is for an http response, but probably there are diminishing efficiency returns above a MB or two, and more of a latency hit for reading the whole file. So if you used row groups of a couple of MB and then streamed them in you'd probably get the best of both worlds.
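
If you control how the parquet is written, DuckDB does let you choose the row group size at export time, so something like this sketch seems plausible (note ROW_GROUP_SIZE is in rows, so the byte size per group depends on the schema; names are illustrative):

    -- write smaller row groups so a client could fetch them incrementally
    COPY (SELECT * FROM responses)
    TO 'responses.parquet'
    (FORMAT PARQUET, COMPRESSION ZSTD, ROW_GROUP_SIZE 50000);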



Exactly.

We also tried arrow js and parquet wasm, and they were much lighter than the duckdb wasm worker.

DuckDB, however, was useful in our case: given our nature as a form builder service, we had to provide statistics features. It was cool to have an OLAP engine inside a web worker that could handle (as far as we checked) more than 100,000 rows with ease.
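
The kind of statistics query this enables in the browser, sketched against a hypothetical locally loaded responses table:

    -- aggregate stats computed entirely in the web worker
    SELECT
        question_id,
        count(*)                            AS answers,
        avg(numeric_answer)                 AS mean_val,
        median(numeric_answer)              AS med,
        quantile_cont(numeric_answer, 0.9)  AS p90
    FROM responses
    GROUP BY question_id;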



Not saying that this technique doesn't have any use, but the example scenario is pretty weak. Why not just send the s3 link to the analysts:
    s3://robotaxi-inc/daily-ride-data/*.parquet
Or just the query:
    SELECT pickup_at, dropoff_at, trip_distance, total_amount
    FROM 's3://robotaxi-inc/daily-ride-data/*.parquet'
    WHERE fare_amount > 100 AND trip_distance < 10
does the intermediate database and view really provide much value?
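
(For reference, the intermediate layer in question is roughly a tiny DuckDB database that holds only view definitions over the parquet files, not the data itself - something like the sketch below, reusing the paths above and the article's weird_rides.db name.)

    -- build a small catalog database containing just the view, no data
    ATTACH 'weird_rides.db' AS shared;
    USE shared;
    CREATE VIEW weird_rides AS
        SELECT pickup_at, dropoff_at, trip_distance, total_amount
        FROM 's3://robotaxi-inc/daily-ride-data/*.parquet'
        WHERE fare_amount > 100 AND trip_distance < 10;
    -- then upload weird_rides.db to S3 and share that link instead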

Thinking more, I feel this boils down to ownership. Does it make sense for you to own this abstraction layer? Or does it make sense to shift the ownership towards the receiving end, and just provide the raw data.

To me, maintaining this sort of view over the data sounds more like the responsibility of data analysts than random devs. They are the experts in wrangling data, after all. But of course there is no single right or wrong answer either.



I recently took a big gamble on DuckDB. I've added it to the core of qStudio, a free SQL editor. Now within the editor you can save table results from any database to the local duckdb instance, or you can even perform joins amongst any server - and I do mean almost any server. Out of the box DuckDB supports mysql/postgresql, but I've added support for 20+ other databases, though it probably doesn't support all their types. If you want to get started with DuckDB, qStudio is a good IDE: https://www.timestored.com/qstudio/duckdb-file-viewer#qduckd...


qStudio looks very cool and I'm right in your target demo - and I hate to complain about something free, but just a bit of feedback: a jar on mac is a lot of friction to be able to try it. I just spent about 5 minutes downloading and trying to get it to work - mac says it can't open because it's from an unidentified developer, I went into security to get it to open anyway, it gave me another scary warning, and then it still couldn't launch because I don't have a jvm installed somehow?

I'll probably try again later because it really does look cool and as much as I absolutely love my intellij/datagrip I'm always looking for alternatives and have thought several times about building a tool with these automatic analysis tools built in.



Damn, that's disappointing, as I actually just recorded a mac demo this week. I got the false impression all macs have java, but I guess it was just the one I had. I'm happy to try creating an installer if you are willing to test it. If so please email me ryan A_T timestored.com.



Macs come with a weird thing where they have a Java in the path, but when you run it it takes you to a Java download page. It might be worth adding that to your instructions rather than bundling Java yourself?



Shameless plug - MotherDuck[0] is a serverless managed DuckDB data warehouse with some interesting properties:

- Managed storage with zero-copy clone (and upcoming time travel)

- Secure Sharing

- Hybrid Mode that allows folks to combine client (WASM, UI, CLI, etc) with cloud data

- Approved and supported ecosystem of third party vendors

- Since we're running DuckDB in production, we're working very closely with the DuckDB team to improve both our service and open source DDB in terms of reliability, semantics, and capabilities

[0]https://motherduck.com/

(co-founder at MotherDuck)



Why wouldn't you mail your co-worker the SQL text of this query? What is the advantage of the binary database file with no data?



You then also have to mail them every update to the SQL and hope they're handling versioning well

With this approach, next time they attach the DB, they automatically see the latest version
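
In other words, the consumer side stays a short snippet like this (the URL is hypothetical; remote DuckDB database files can be attached read-only when the httpfs extension is available):

    ATTACH 's3://robotaxi-inc/shared/weird_rides.db' AS rides (READ_ONLY);
    SELECT * FROM rides.weird_rides LIMIT 10;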



Oh I missed the part where they put that DB file on S3 too and sent a link. I thought they were attaching the DB to the email.

Still, I would more readily send people a link to a Gist or playground than a binary DB on S3.



Not a whole lot of operational difference between sending them a snowflake view name or sending an S3 URL but there's probably two orders of magnitude cost difference



Don’t know what you’re talking about. Used correctly, snowflake can be as cost effective or cheaper for most orgs’ data needs, compared to spinning up and down clusters of spark (and maintaining teams to do this work or paying databricks).



Not talking about spark clusters anywhere here. I wouldn't be surprised if that's similar in cost to snowflake.

What I'm talking about is snowflake vs a $50 EC2 instance running DuckDB reading data from S3.

Try it out sometime — the results might surprise you



I have, and haven’t come away impressed. I have in fact done this multiple times. Perhaps this doesn’t work with actual complex analytics and stuff? DuckDB doesn’t even do proper SIMD, for example.

To be clear, I have used Duckdb in some serverless ideas and it has worked well, but nothing has beat hyper optimized engines like snowflake and redshift.



The future is here, it's just not evenly distributed. Wait a decade (or less) and this will be in open source. Heck, it might already be.



Shameless plug - Data Tap[1] is a custom made Lambda function with embedded DuckDB and an AWS managed ingestion URL where you can HTTP post your data as much as you like. It will buffer your data and then use a DuckDB SQL clause to land the data as Parquet on S3.

- Deploy to your own account (BYOC)

- "S3 first": Partitioned and compressed Parquet on S3

- Secure sharing of write access to the Data Tap URL

- 50x more cost efficient than e.g. "Burnhose" whilst also having unmatched scalability of Lambda (1000 instances in 10s).

[1] https://www.taps.boilingdata.com/ (founder)



Up until this point in my career, I've always wanted to incorporate some of these new age SQL tools into my workflows, but I just never seem to realize the utility.

What's going on here that wouldn't warrant building a processing pipeline that places the data in a more permanent data warehouse and creating all the necessary views there?



They are similar in that you have some data stored somewhere, and you can query it as though it were data stored inside a 'conventional' database that has its own storage that it manages itself.

But whereas Presto/Trino/Bigquery/etc are server-based where queries execute on a cluster of compute nodes, duckdb is something you run locally, in-process.



Back in the day (early 2000’s), I worked with a DB2 instance that had similar functionality. At the time, they called this feature federated databases. If you had the appropriate wrapper, you could use any data source in a query. Even output from other programs. At the time I used it for including dynamic DNA sequence alignments in queries.

IIRC, SQLite can do similar things with virtual tables (with a more limited set of data file types).

I always liked this way of working, but I also wonder why it never really took off. Data discovery can be an issue, and I can see the lack of indexing as being a problem.

I guess that’s a long winded way to ask: as interesting as this is, what are the use cases where one would really want (or need) to use it?



> I always liked this way of working, but I also wonder why it never really took off.

In today's new fangled world, a lot of developers don't use a lot of the great stuff that RDBMS can provide - stored procedures, SQL constraints, even indexes. The modern mindset seems to be that that stuff belongs in the code layer, above the database. Sometimes it's justified as keeping the database vanilla, so that it can be swapped out.

In the old days you aimed to keep your database consistent as far as possible, no matter what client was using it. So of course you would use SQL constraints, otherwise people could accidentally corrupt the database using SQL tools, or just with badly written application code.

So it's not hard to see why more esoteric functions are not widely used.



I've heard people call the old approach "outdated" and even "wrong" before.

It's honestly the dumbest situation, where people eagerly use extremely complex databases as if they were indexed KV stores. Completely ignoring about 97% of the features in the process.

What's especially funny is that half the time a basic KV store would perform better given all the nonsense layered on top.

And then there's this whole mentality of "we can't interface with this database unless we insert a translation layer which converts between relationships between sets of tuples and a graph structure".

It's like people have unearthed an ancient technology and have no idea how it's intended to be used.



Stored procs, triggers and suchlike actually are outdated and mostly wrong. Have you considered that people moved away from them for good reasons, not just ignorance? Hard to source control, impossible to debug, written in weird/bad languages, massive blast radius, don’t scale. Usability matters! I have written more TSQL than I would care to admit, and it’s a hell you couldn’t pay me enough to go back to.

The “consistency must be enforced at all costs” idea just turned out not to be true. Worked at many places at moderate scale since my old dba days that don’t use fks. It just doesn’t actually matter, I can’t recall any serious bugs or outages related to dangling rows. Plus, you end up having multiple databases anyway for scale and organizational reasons, so fks are useless in that environment anyway.

On the other hand, I’m all for indexes and complex querying. not just k-v.



Most of the actual drawbacks you describe are caused by how databases are being used, not by their inherent design. (Lack of source control and "weird/bad languages" are just wrong: you can source control this stuff just fine, and if you don't like the language nothing stops you from transpiling.)

That's what I meant when I said it's weird people keep picking these particular databases for projects which don't end up using any of the features.

Stored procedures, triggers and all the other features people seem to refuse to use work mostly fine if you actually design the rest of the product around them and put a modicum of effort into them.

"It doesn't scale" - that can be said of the fundamental database design itself. If you need to scale hard then you need to pick a distributed database.



I don't really know why I'm arguing this; there are 100s of blog posts out there, going back decades, that explain why sprocs are considered bad. The marketplace of ideas has spoken, and we've moved on. I'll likely never convince you, but here goes my take expanded some more.

You can source control them, but then you have a deployment coordination problem. Often changes to business logic will be half in application code and half in the sproc, and you have to coordinate the deployment of that change. You have to somehow reason about what version of the sprocs are on your server. Add on that usually in environments that use sprocs, changes to them are gatekept by DBAs. Just write your business logic in application code.

Transpile? Why on earth would anyone bother with that, it just makes it even more complicated and impossible to debug. Without that, TSQL/PSQL/PGSQL are objectively very awkward languages for imperative business logic. People are not familiar with them, it makes it hard to jump from regular code to *SQL code to read the logic flow, never mind debugging them with breakpoints and suchlike. Splitting your business logic between database and app makes it much more awkward to test too. Just write your business logic in application code.

Scale - making the hardest to scale piece of your architecture, the traditional rdbms, an execution environment for arbitrarily complex business logic is going to add a lot of extra load. Meaning you're going to have scaling problems sooner than you otherwise would. Just write your business logic in application code; you can scale those pods/servers horizontally easy peasy, they're cheap.

Look you can make any technology work if you try hard enough, but sprocs are almost all downsides these days. The one minor upside is some atomicness, but you can get that with a client side transaction and a few more round trips. There's just IMO no reason to pick them anymore and pay these costs, given what we've learned.

BTW I could have made this clearer in the original comment, but I love RDBMSs - I love schemas, indexes, complex querying, and transactions. It is not weird to keep picking these databases to get those things.

I just will never again use them as a programming environment.



>[appeal to popularity]

I think you put too much weight on what people currently do. Every year it seems the previous year's trendy ideas are "terrible" and something new is "the way it should have always been done from the start".

>[stored procedures, business logic splits, and DBA gatekeeping]

You probably shouldn't be using stored procedures to implement more than you need to keep things internally consistent (i.e., almost never). I actually don't understand why you're so hung up on stored procedures. I've seen people try to implement whole applications on top of stored procedures, and that's certainly not anything I'm recommending.

Yes if you split your business logic like that then you're going to have trouble, so don't. If instead you treat your database as an isolated unit and keep it internally consistent, then it's not really that difficult to keep things always working even when migrations happen.

As for the DBAs, nobody says you have to have DBAs, but if you're going to use a technology, it's worth having someone who understands it. (or, you know, just pick a different database).

>[transpile?]

I mean, lots of people use query generators which let you write e.g. native python and generate equivalent SQL (SQLAlchemy Core).

Anyway, your complaint seems to be that people don't understand SQL databases well enough to use them properly. Sure, but this is like complaining that git is bad because nobody knows how to use it. People who don't know or want to learn git, SQL, or a tool they're using, should pick a different tool.

>[it scales worse if you try to implement an application in it]

well yes

>[more complaining about abuse of stored procedures]

sure

...

I think at some point you read something I wrote as: "And you should attempt to find out a way to shoehorn your entire application into the RDBMS such that your application is just a frontend which calls stored procedures on the RDBMS" but that's certainly not anything close to what I said.

Designing your database such that it stays internally consistent when you perform operations on it is a good goal, and doesn't require filling it with stored procedures and making it your backend.



I think the implementations had a bad DX - but the concepts were not incorrect. Not enforcing referential integrity, or foregoing schemas altogether (hello document stores, and then doing the equivalent of SQL joins and aggregations in nested loops within code), is just plain lazy / ignorant and comes to bite you in the ass later.



I think you touched on the real reason: we separate services now and they all have different databases. We shifted up to the application layer not because of technical reasons but social reasons.



Your timing is off, we as an industry ditched stored procs and friends before (micro)services became a trend. For the reasons I said.

I’m not sure why it’s so hard to accept that we stopped using them because they were bad, not for some other gotcha reason (social, ignorance, fads, etc)



Maybe because claiming they are bad without explaining why, giving explicit examples, or really digging into the nuance of the more than one feature mentioned makes the claim feel unsubstantiated.



Pretty sure I did say why. “ Hard to source control, impossible to debug, written in weird/bad languages, massive blast radius, don’t scale. Usability matters! ” and I forgot “generates non obvious non local changes” for triggers.



What is so hard to source control with stored procedures? I’ve been writing them in text files for over 30 years, version controlled in rcs, cvs, svn and git. No problem, it’s just the deployment automation that works a bit differently.

Also, oracle plsql is way better than ms t-sql, it mostly feels like pascal/modula2 with embedded sql statements.



Yes, you can hack something together that keeps them in source control, and then some automation to run them in a CD way. But now your business logic change is half in application code, and half in sprocs, and you have a versioning and deployment coordination problem. Which you can work around too. But why? Making deployments and debugging of business logic hard to what end? Just keep all the business logic in your app code, then it's deployed and tested as one artifact, and everyone's happier.

Also I hope you're not suggesting anyone pick up oracle for a new project these days :)



What do you mean hack to keep them in source control? They're just sql source code files, like any other source code you can simply commit to version control?

Wouldn't advise anyone to use Oracle, but neither would I advise them to use sql-server. Usually Postgres is good enough, although oracle plsql is still nicer than postgres plsql (e.g., packages).



I feel like I addressed this in a different subthread https://news.ycombinator.com/reply?id=40526231&goto=threads%...

> You can source control them, but then you have a deployment coordination problem. Often changes to business logic will be half in application code and half in the sproc, and you have to coordinate the deployment of that change. You have to somehow reason about what version of the sprocs are on your server. Add on that usually in enviroments that use sprocs, changes to them are gatekept by DBAs. Just write your business logic in application code.



I totally get that position. I used to be in that camp, but in my last 3 jobs at companies of >10mil users, we had to get rid of FKs for other reasons (perf, dist system stuff). I kind of expected bad data corruption issues.... but they just never happened ¯\_(ツ)_/¯. Not sure if that was strong cultures of automated testing, or what. So now I no longer think they're worth the tradeoffs.

YMMV with this one, I could see how it might pan out differently in other environments



Did these involve financial transactions? Other situations I will have dangling rows without issues, but whenever money comes into the picture I like to have as much consistency enforced as I can. But I'm aware that doesn't play nicely with distributed systems.



Yes, actually. Financial transactions are often used as an example that demands strong consistency. But it’s not true in the real world for the most part. For any given money movement (a debit and a credit) there’s a good chance one of those legs is over the network - you’re either interacting with a third party api, or a proprietary internal product, or similar. So you can’t have transactional consistency, or fks. Further, for every movement there are layers of reconciliation that happen hours or days later, that fix any inconsistency. Finance / fintech is actually among the most eventually consistent systems out there, so its use in a lot of transactional examples is sort of ironic.

Exception is true banking ledgers maintained at banks, but even then you’d be surprised. Entirely different systems handle loans and checking accounts



It makes me think that for many projects, recutils would be a better fit if you’re not gonna use the advanced functionality anyway. It should do just fine on medium sized DBs and is so much simpler.



Yep, that's me. I use sqlite as a key value store. Usually just two columns per table, a key and value, that I map to a python dict, and pickle the value if I need something more complex.



> In today's new fangled world, a lot of developers don't use a lot of the great stuff that RDBMS can provide - stored procedures, SQL constraints, even indexes. The modern mindset seems to be that that stuff belongs in the code layer, above the database.

Implementing constraints in application code is certainly a lot easier (and easier to test) than in the database. What I want is a database with far stronger constraint capabilities than e.g. MySQL or Postgres provide, so that my application-level constraints can live in the database where they belong, without compromising ease of development and maintenance.
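
(As a baseline, this is roughly what stock Postgres-style declarative constraints already cover - the table and columns below are invented purely for illustration; the wish above is for capabilities beyond this.)

    CREATE TABLE payments (
        id           BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        account_id   BIGINT NOT NULL REFERENCES accounts(id),
        amount_cents BIGINT NOT NULL CHECK (amount_cents > 0),
        currency     TEXT   NOT NULL CHECK (currency IN ('USD', 'EUR')),
        created_at   TIMESTAMPTZ NOT NULL DEFAULT now(),
        UNIQUE (account_id, created_at)
    );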



> Implementing constraints in application code is certainly a lot easier

I can’t imagine doing this. You’re basically saying that you will only ever have one program writing to your database.



This is, for many, considered a best practice.

The idea being that a single service/codebase is controlling interaction with the database. Any read/writes should go through that service's APIs. Basically a "CreateFoo(a, b, c)" API is a better way to enforce how data is written/read vs every service writing their own queries.



It incurs huge costs though:

1. Costs of API maintenance - rigid APIs need constant change. To mitigate that people create APIs like GraphQL which is a reimplementation of SQL (ie. a query language).

2. Costs of data integration - (micro)systems owning databases need data synchronisation and duplication. What's worse - most of the time data synchronisation bypasses service APIs (using tools like CDC etc.), so all efforts to decouple services via APIs are moot.

A single data platform (a database) with well governed structure enforced by a DBMS is a compelling alternative to microservices in this context.



> Costs of API maintenance - rigid APIs need constant change.

Rigidity is the point...

If you have a "CRUD a Foo" set of APIs, and how you create/read a "foo" is defined in that API. Sometimes change is necessary, sometimes the API contract changes, but sometimes just the internal implementation changes (eg new or refactored tables/columns). The rigidity of the API ensures that every downstream user of a foo creates/reads in the exact same way. It centralizes permissions, overrides (eg. a/b testing), rate limiting, transactions, etc. to be homogeneous for everyone. If you want to create/read a foo via database queries alone, and the database changes or the business logic changes, then the same issue occurs where all clients need to change, but now it needs to be coordinated everywhere, and you can't benefit from a centralized change hiding implementation details.

Many people prefer to keep all the logic around enforcing consistency and object lifecycle (application behavior) in the application layer. This allows a single codebase to manage it, and it can be uniformly guarded with tests. Exposing the database itself is really just an example of leaking implementation details.

> To mitigate that people create APIs like GraphQL

If you need raw flexible queries, then this is probably the wrong sort of solution. Ideally, the developer of a service already knows what queries will be made, and clients don't typically need detailed custom queries. Analytics (typically read-only) should already occur in an offline or read-replica version of the database to not impact production traffic.

> (micro)systems owning databases need data synchronisation and duplication.

What do you mean? Foo service exposes "CRUD-Foo" apis and is the only service that calls the Foo-storing database. If Foo service is horizontally scaled, then it's all the same code, and can safely call foo-db in parallel. If the database needs horizontal scaling, you can just use whatever primitives exist for the database you picked, and foo-service will call it as expected, and the transaction governs the lifecycle of the records.

> ...all efforts to decouple services via APIs are moot.

To be clear, different services wouldn't operate on their own duplicated version of the same shared data, in their own database. They'd call a defined API to get that data from another service. The whole point of this is to allow each service to define the interface for their data.



> Exposing the database itself is really just an example of a leaking implementation details.

This is a mistake that a lot of people make - treating database as an implementation detail. Relational model and relational calculus/algebra were invented for _sharing_ data (not storing). The relational model _is_ the interface. Access to data is abstracted away - swapping storage layer or introducing different indexes or even using foreign data wrappers is transparent to applications.

Security and access policies are defined not against API operations but against _data_ - because it is data that you want to protect regardless of what API is used to access it.

> To be clear, different services wouldn't operate on their own duplicated version of the same shared data, in their own database. They'd call a defined API to get that data from another service.

You mean synchronous calls? This actually leads to what industry calls "distributed spaghetti" and is the worst of both worlds.

> The whole point of this is to allow each service to define the interface for their data.

The point is that well defined relational database schema and SQL _is_ a very good interface for data. Wrapping it in JSON over HTTP is a step backwards.



Stored procedures are awful. There, I said it.

It's storing transform code in an inscrutable way, hiding pipeline components from downstream users. The rest of your alarms and dq has to be extended to look into these...

Oh, and PL/SQL requires a huge context switch from whatever you were doing.



I'm a bit surprised about the flak that stored procs have. Treat them just like any other, hopefully well documented, API, and they work perfectly fine. There are pros and cons, just like whether you decide to use react or htmx, python or golang, erlang or lisp, row based database vs column based vs kv, etc.

On the other hand, yes, going back to the days of poorly written and documented oracle pl/sql stored procs, yes, I shudder too, but then again, that can also be said of a number of public APIs that have been hacked/had side effects exposed.



I generate my pl/pgsql using Lisp. Which is another technology that has been dying for decades. The money it generates seems to work just as well as the money I earned writing js.



Sometimes you need to have atomic updates for a function. Stored procedures aren’t inherently evil, but are a hold over from when DB admins were a thing and they didn’t trust app developers.

Don’t worry, just wait another 5-10 years, they’ll be back in vogue again.



I don’t get this mentality; they’re just code that lives close to your data, and can do bulk processing on data in your database way more efficiently than anything living in the application layer.



>"can do bulk processing on data in your database way more efficiently than anything living in the application layer."

Not necessarily. It all depends on particulars. Very complex processing of a reasonably sized chunk of data, especially when it can be done in multiple threads / distributed, can be way more efficient.

I've had a very real case: a consulting company was hired by a TELCO to write the code that would take the content of the database and create one huge file containing invoices to clients. Said file would then be sent to a print house that can print the actual invoices, and mail those to clients.

They tried to do it with stored procedures and had failed miserably - it would run for a couple of days give or take and then crash.

I was called for help. Created 2 executables. One is a manager and the other is a multithreaded calculator that did all the math. The calculator was put on 300 client care workstations as a service. The dispatcher would do an initial query from the accounts table for the current bill cycle, split account ids into a few arrays and send those arrays to calculators. Calculators would suck the needed data in bulk from the database, do all the calculations and send the results back to the dispatcher for merging into the final file. TELCO people were shocked when they realized that they could have the print file in less than 2 hours.



People do still use indexes and the basic FK/UQ constraints because they add value even when using ORM mappers, etc. I don't think that the idea of changing databases is that prevalent any more (apart from perhaps unit-testing, but since db's can often be dockerized that's also solved now).

People don't hate stored-procedures (or other custom SQL constraints/triggers) per-se, but more the amount of headache they bring when it comes to debugging and keeping in sync with version control.



.sql files go into version control just as well as .java files. And because your build will be blowing all your stored procedures away and rebuilding them as part of the build, you'll know at compile time if Junior Dev Jimmy dropped a null constraint or renamed something but didn't fix one of the procs.

That's way easier to debug than if Jimmy also had his query living as a string somewhere in your backend code.

Honestly, if this stuff is hard, it's because you've made it hard. I can only assume most people griping about stored procedures don't have a /sql folder living next to their /python or /java folder in source control.



The approach in the article is for analytics / DW type use cases, not OLTP. So a dig about kids these days not understanding constraints is not particularly relevant.



> Sometimes it's justified as keeping the database vanilla, so that it can be swapped out.

I understand if it's a prototype, but I have seen developers of even mature products follow this mentality that they want the DB to be portable. It's insane. The reality is that it rarely happens, and when it does it's mostly because your business is growing. But not knowing (or caring to know) the features your DB offers you and not exploiting them is just laziness.



The reason a lot of more advanced database features aren't used is simply because a lot (most?) web developers don't know them and have a very light understanding of databases in general.



I liked this when I tried it out. The AWS API plugin was neat: query resources with SQL.

However, I like to limit my toolset to three things; these days that's duckdb, julia for the analysis side, and... OK, two things.

Oh yeah Trino for our distributed compute.



A related functionality is "SQL/MED", a SQL specification for federated databases that historically has some kind of relationship to medical data (I believe one of the use cases was allowing one site to query, in a limited way, data hosted at another site that may not be moved from it).



Similar timeline - Oracle had a feature called external tables where you could create a "table" on top of an external flat file and run queries against it. Prior to that we were using awk/perl etc., and found external tables to be more performant + all the benefits that SQL brings.



The article shows dynamically fetching a database, with database level views, that dynamically fetch data.

You're just showing a query that dynamically fetches data.



I am confused...

It's a cool example, but really an antipattern. Nowadays everyone gets that analysts want access to raw data, since they know best which aggregations they need, whereas data engineers stay away from pre-aggregating and focus on building self-service data access tooling. Win-win this way.

How about building a duckdb accessible catalog on top of s3? Like instead of read_parquet, you would select from tables, which themselves would be mapped to s3 paths aka external tables.



> Changes in format, different partitioning strategies, schema changes – through all of it the receiver’s view remains the same.

I don't understand this - if I start saving those files in a different format how will it continue to work? Why would the view remain the same if I just rename columns, even?



I think what was implied was that the user just references the view so as long as you update the view in concert with the data format change then the user is none the wiser about the change.



Oh - I see. Because the .db file is centralised, you can connect to it from somewhere and do stuff to it. I think I missed that the DB is like that.

Is it like dumping a SQLite database somewhere with a view in it, and connecting over that as well? Or does DuckDB have more magic to transfer less data in the query work?



Yeah, I guess you could equivalently put a SQLite database with a view or virtual table in S3 which would give you the same level of indirection and API abstraction provided by this mechanism.

Where DuckDB will have an advantage is in the processing speed for OLAP type queries. I don't know what the current state of SQLite Virtual Tables for parquet files on S3 is, but DuckDB has a number of optimisations for it like only reading the required columns and row groups through range queries. SQLite has a row oriented processing model so I suspect that it wouldn't do that in the same way unless there is a specific vtable extension for it.

You can get a comparable benefit for data in a sqlite db itself with the project from the following blogpost but that wouldn't apply to collections of parquet files: https://phiresky.github.io/blog/2021/hosting-sqlite-database...



This is a great feature. We've been able to significantly extend the scope and usefulness of our on-prem SQL Cluster for analytics and reporting with PolyBase by building new transactional systems with cheaper postgres, doing ETLs of third-party data to delta tables in azure storage, and then federating access to them with PolyBase so that nobody in the business has to change how they actually query the data. I'm sure in another decade we'll be fully migrated to some cloud platform but for now, federating the queries is a huge win.



How stable is the DuckDB file format nowadays? One annoying thing I've experienced (like a year ago) was when I upgraded the rust bindings that produced my DuckDB files, which then couldn't be opened with DBeaver because they hadn't released a new version yet.

I guess you could run into something similar with this solution?



For things like this I just export to a sqlite DB file and send that along. This is interesting for AWS/S3 datastores though.



I didn't know that (I don't use DuckDB), but SQLite is very standard and anyone can play with it. Query/UX interfaces exist for almost every imaginable platform even. So I'm still a SQLite recommender for stuff like this.



DuckDB is really for OLAP/Columnar (analytical workloads) whereas SQLite is more for OLTP/Row-based (transactional).

When you start running aggregations/windows over large amounts of data, you'll soon see the difference in performance.



Sure, but the OP is using DuckDB as a way to export some smaller amount of data for someone. That's a great use-case for SQLite, from whatever your real DB is, which for the OP is also DuckDB.



Is the `weird_rides.db` file python or some sort of duckdb script?

I'm quite intrigued by DuckDB's ability to read parquet files off of buckets. How good is it at simply ignoring files (e.g., filtering based on info in data file paths)?



To my understanding, it does not cache (but I haven't reviewed the code). "Expensive" here really just means expensive in time. If you're running DuckDB on an EC2 instance to query S3 on the same account, it's practically free of any other cost. I wouldn't bother doing it outside AWS unless it's a one-time thing.

Running a simple analytics query on ~4B rows across 6.6K parquet files in S3 on an m6a.xl takes around 7 minutes. And you can "index" these queries somewhat by adding dimensions in the path (s3://my-data/category=transactions/month=2024-05/rows1.parquet) which DuckDB will happily query on. So yeah, fairly expensive in time (but cheap for storage!). If you're just firehosing data into S3 and can add somewhat descriptive dimensions to your paths, you can optimize it a bit.
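
The path-dimension trick described above is DuckDB's hive partitioning support; a rough sketch (the bucket layout is taken from the comment, the measure column is assumed):

    -- partition columns are parsed out of the path and used to skip whole files
    SELECT count(*), sum(amount)
    FROM read_parquet('s3://my-data/*/*/*.parquet', hive_partitioning = true)
    WHERE category = 'transactions'
      AND month = '2024-05';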



If the parquet file includes any row group stats, then I imagine DuckDB might be able to use those to avoid scanning the entire file. It's definitely possible to request specific sections of a blob stored in S3. But I'm not familiar enough with DuckDB to know whether or not it does this.



DuckDB can do some pushdowns to certain file formats like parquet, but every release seems to be getting better and better at doing it.

Parquet pushdowns combined with Hive structuring is a pretty good combination.

There are some HTTP and metadata caching options in DuckDB, but I haven't really figured out how and when they actually make a difference.



I asked this some time ago on their Discord in relation to AWS lambda and the Python client and the answer was that you need to handle caching on your own but it is easy to do with fsspec. I haven’t tried it yet though.



Do you have any details on this?

Duckdb over vanilla S3 has latency issues because S3 is optimized for bulk transfers, not random reads. The new AWS S3 Express Zone supports low-latency but there's a cost.

Caching Parquet reads from vanilla S3 sounds like a good intermediate solution. Most of the time, Parquet files are Hive-partitioned, so it would only entail caching several smaller Parquet files on-demand and not the entire dataset.



I admit I'm not crazy deep in this space but I'm _pretty_ into this space and I've never heard of Carbon.

Parquet is kind of winning the OSS columnar format race right now.



Parquet is the most popular columnar format (owing to support in Spark and various other big data tools, as well as local tools like pandas, polars and duckdb).

It's technically not the very best format (ORC has some advantages), but it's so ubiquitous and good enough -- still far better than CSV or the next best competing format. I have not heard of Carbon -- it sounds like an interesting niche format; hopefully it's gaining ground.

It's the VHS, not the betamax.



"Good enough" makes it sound like barely a step up from a CSV file. I'd say its support for various encodings [1] including a great default (dictionary + run length encoding on the indices) and compression algorithms that can be set for each individual column, columnar access, partitioning, a parallelized reader out of the box, in-memory filtering and other ops concurrently with loading in the data (thanks to Arrow) etc. etc. are all really wonderful when working with medium-sized data.

[1] https://parquet.apache.org/docs/file-format/data-pages/encod...



Agreed. On a scale of 10 in terms of current technology, CSV is a 1 while Parquet is 7. ORC is maybe 7.2. But parquet is far more ubiquitous than ORC (I’ve never seen ORC in prod but I also have limited sample sizes)

I’m sure there are more advanced formats.



Conceptually is this sort of a database-agnostic SQL view? Seems like these could be stacked to simplify (or complicate?) an entire ETL pipeline.

I haven’t decided where I land on this. In some ways, stacking SQL views looks like it simplifies a bunch of ETL jobs, but I also fear a few things:

* It either breaks catastrophically when there’s a change in the source data

* Fails silently and just passes the incorrect data along

* More challenging to debug than an ETL pipeline where we have a clear point of error, can see the input and output of each stage, etc

* Source control of SQL views seems less great than code. Often when we have too many views, you can’t update one without dropping all of the dependencies and recreating them all

But I also wonder if I feel this way because I know programming better than SQL



No, you are right; views have a use case but they get abused as ersatz orchestration, creating a nested structure that multiplies the amount of needed compute.

edit: now I read your question more carefully. I think the s3 data is meant to be managed by other orchestration. This is a quick easy way to share a data source with an analyst, PM or end consumer.

I do not expect poster is advocating this as any intermediate stage in a data pipeline.



DuckDB has Swift bindings, but unfortunately, afaik, nothing official for Android. If anyone has gotten it working on Android I'd love to hear about it.



Quote: "How do you share that dataset with an analyst?"

..and blah blah about the sharing from a developer perspective.

In reality the analyst is some higher-up that only knows how to import/view CSV in Excel, so that's exactly what they will ask for ("hey zzz can you send me those daily parcels in a .csv file? thank you")



Sorry for the late answer. I get the following errors if I try to execute queries on delta tables (accessing them with spark works fine)

InvalidInputException: Invalid Input Error: Attempting to execute an unsuccessful or closed pending query result Error: IO Error: Hit DeltaKernel FFI error (from: kernel_scan_data_next in DeltaSnapshot GetFile): Hit error: 2 (ArrowError) with message (Invalid argument error: Incorrect datatype for StructArray field "partitionValues", expected Map(Field { name: "entries", data_type: Struct([Field { name: "keys", data_type: Utf8, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "values", data_type: Utf8, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }]), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, false) got Map(Field { name: "key_value", data_type: Struct([Field { name: "key", data_type: Utf8, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "value", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }]), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, false))
