(评论)
(comments)

原始链接: https://news.ycombinator.com/item?id=40949489

在客户端使用 SQLite 具有显着的优势,例如由于固态驱动器技术的进步而实现的快速数据检索、与云存储选项相比更低的成本以及增强的功能,包括全文搜索和设备上的光学字符识别(光学字符识别)。 利用这项技术的一个项目是“Cluttr”,这是一个旨在组织和搜索屏幕截图文件夹的网站。 Cluttr 使用 OCR 进行图像分析,并使用机器学习模型来添加上下文。 大部分托管费用由个人用户承担,降低了总体成本。 虽然云服务可以提供巨大的每秒输入/输出操作 (IOPS),但在 Lenovo ThinkPad X1 上实现同等 IOPS 的价格却是天文数字(每月 2.7 万美元)。 个人和企业使用的云要求差异很大,而且通常只有一小部分存储的数据需要频繁查询。 因此,与 IOPS 相关的较高成本可能不会引起许多用户的严重关注。 在比较 WASM SQLite 和索引数据库 (IndexedDB) 等浏览器内技术时,后者缺乏开箱即用的强大并发处理支持。 然而,存在一个用于提高并行化效率的多个读取器和写入器提案。 尽管在 Firefox 和 Chrome 等流行浏览器中被广泛采用,Webkit 仍然忽视它,为进一步的发展留下了空间。 总之,采用 SQLite 进行本地数据存储具有许多优势,允许开发人员将业务逻辑从服务器转移到客户端,从而减轻服务器负载、更好地分离关注点并简化性能。 此外,正在进行的研究旨在探索利用非渲染 DOM 元素作为潜在表表示的创新技术,从而实现基于 CSS 选择器的新颖查询语言。 虽然这个概念尚未经过测试,但初步想法表明,对于适度的数据集来说,结果是有希望的。

相关文章

原文


Overall I think using SQLite locally to offload database work is incredibly powerful. Given most laptops have an SSD now a days, you can scan an entire 30-100MiB SQLite db in miliseconds.

Meanwhile in AWS you would pay $27k a month to have the same IOPS as a Lenovo Thinkpad X1.

I just got done with a side project using WASM SQLite as well, it's incredibly powerful, even supports full text search.

My project "cluttr" is a local first site that focuses on cleaning up your screen shots folder and making it searchable via OCR. All in browser.

It also supports using Ollama to run a vision model against your images to provide more context.

Really helpful if you take a lot of screenshots for clients or PMs and need to find them later. The best part is 99% of the hosting cost is offloaded to the client, so I am not really worried about a large server bill.

https://cluttr.pages.dev/

(working on deployment to https://cluttr.ai later today)



This solutions scales to tens of thousands of images in my testing (probably ~200k or more before searching is too slow). Why use a nuclear bomb when a stick of dynamite is good enough :)



Yes, DuckDB is OLAP, SQLite is OLTP. I should have called that out.

But if you are doing aggregate or skip-scan analysis and if they're talking about read speeds and in-memory processing, well, SQLite leaves some performance on the table by being single-threaded, as far as I can tell.



I wanted to try SQLite in the browser for a side project, but the startup time was unacceptable to me. It takes a solid second or two to crunch through the WASM binary, init SQLite, and open a database. Gets worse on first boot when you need to create all the tables.

There was a web.dev post or something saying "Web SQL is finally here! Just use SQLite with WASM!" Like, sure, that does seem to work, but it requires a huge WASM blob and a huge heap of JS glue. Indexeddb--as horrible as it is--starts up and is usable almost instantly. The sad part is that Indexeddb's querying is so bad you're forced to build your own database on top of it anyway...

I get where the web standards folks were coming from when they didn't like WebSQL just being SQLite in every browser, but frankly that would be so much better than what we have now even 5 years after back-peddling.



SQLite is not SQL standards compliant. It doesn't even support data types from core SQL-92. One needs more than INTEGER, REAL, TEXT, BLOB for applications. Adopting it as WebSQL would have been a terrible mistake.



> Meanwhile in AWS you would pay $27k a month to have the same IOPS as a Lenovo Thinkpad X1.

Curious about how this number was arrived at and what products were involved



RDS with 256,000 IOPS (the max) on a single primary instance of MariaDB.

Of course its not realistic to run a primary DB on a laptop, but the IOPS are extremely expensive in AWS was the point.

Offloading expensive queries to a browser is a viable solution. If I had decided to make this some sort of SaaS offering I am certain running full text search at scale would cost thousands of dollars per month long term, with the data becoming increasingly irrelevant over time, and I would still be forced to host it.



> Meanwhile in AWS you would pay $27k a month to have the same IOPS as a Lenovo Thinkpad X1.

This is kind of an unfair comparison. Essentially nobody needs a million iops for their database. Even an extremely busy database doesn't need to scan all of the data it holds (or at least, if it does, you're using it very wrong—that's why we have indexes).

A fast disk is possible on a laptop because it's a tiny hop to RAM. And it's desirable because you probably have nowhere near 2TB of RAM handy, so you need it to be fast.

In the cloud you can get 2TB of ram for $11k/mo (4x r6g.16xlarge). Not that you need anything like that to run your database. Most of that data is never being queried.

It's also the case that a laptop workload is very different than a server workload. If I run a steam game, I want it open fast. My laptop isn't crunching numbers on all the bytes at that moment. When I run a table scan on a Postgres table, processing needs to happen on every single tuple. A million iops isn't useful if your CPU immediately becomes the bottleneck. A Thinkpad would simply never match the response times of a server with a tenth of the iops under load (if the workload required scanning huge amounts of data).

So yes, the iops are more expensive, but that's really not a metric that anyone in the target market is hurting over.



"Nobody needs IOPS until they need IOPS"

I've had to scrub a multi terabyte database of PII before moving to a staging environment, it hurts. With modern data architectures, you may write the same data 4-5 times in its life cycle, staging data, data warehouses, marketing, PowerBI, Looker, etc...

Especially reporting solutions, where they may aggregate massive amounts of data and write it to temp tables.

It will require IOPS, and you will pay handsomely for it.



I don't know about your specific example, but doing an operation where you rewrite most of a multi terabyte database online is almost certainly not best accomplished with SELECT/UPDATE. Even if you need multiple passes, that's N terabytes times M passes times two. That's... not a lot of reads and writes. Dumping the database to files on blob storage, rewriting them, then reading them into your destination is almost certainly the fastest and cheapest way to go about that.

And that's not an iops avoidance thing, that's a "this isn't what your database is built to do with the configuration your running it in" sort of thing.



Cloud providers have been delivering shit IOPS and charging unhinged prices for basic modern day performance that I really can’t agree with you at all. Nobody “needs” high IOPS the same way nobody needs a car that can drive faster than the 70mph. Why would anyone settle for an artificial ceiling?



Huh? Not really. AWS is a marvel of engineering in that it can be everyone's redundant IOPS but being your IOPS isn't such a big deal. Most folks are single region and rent-a-datacenter colo operations can get you two racks with separate power/uplink no issue.

I hope everyone at least once in their career gets to experience just how god damn fast hardware (especially networking speeds between your own servers) is. Sweet lord loading up three bare-metal dbs with 1TB ram each and bonded 10G nics where the app servers didn't even have to hit a router to talk to them. We initially thought the replication lag being pinned at 0 was a mistake.

Don't take this as any sort of condemnation of cloud offerings, being able to spin up comparable infra in my underwear and not having to think about purchasing and dealing with hardware vendors are truly a gift.



> I hope everyone at least once in their career gets to experience just how god damn fast hardware (especially networking speeds between your own servers) is.

THIS. I realized cloud disks were much slower than I thought when I ran the same tests in RDS – with a local NVMe cache – against my decade-old Dell R620, with its disks also being NVMe, but via Ceph over Mellanox Infiniband. My server matched or beat the many-generations-newer RDS instance on almost every query.

You can’t get around latency. Even at 1 msec, that’s a maximum of 1000 ops a single thread can do per second, modulo the various buffering and chunking strategies every layer does.



> OPFS doesn’t come with graceful handling of concurrency out of the box. Developers should be aware of this and design around it.

There's a multiple readers and writers proposal [0]. It's been "position: positive" by Firefox [1], implemented in Chrome [2], and ignored by Webkit [3] (of course).

    0: https://github.com/whatwg/fs/blob/main/proposals/MultipleReadersWriters.md
    1: https://github.com/mozilla/standards-positions/issues/861
    2: https://chromestatus.com/feature/5172892632875008
    3: https://github.com/WebKit/standards-positions/issues/238
Love the shoutout to Roy Hashimoto. He's been writing VFSs for SQLite-on-the-browser and perf testing them. He's recently wrote "IDBMirrorVFS", which "is a new example VFS that keeps all SQLite files in memory while persisting to IndexedDB". It has remarkable performance, of course. https://github.com/rhashimoto/wa-sqlite/discussions/189

Also, looks like Roy takes advantage of the aforementioned proposal when he wrote `OPFSPermutedVFS`, which takes 2nd pace behind `IDBMirrorVFS` w/r/t perf. https://github.com/rhashimoto/wa-sqlite/blob/master/src/exam...



It sucks that even the newest make-wadm-sqlite-fast FileAccessHandles are still an intermediated virtual file system hosted by the browser. I don't get why we have three different file system APIs on the web and none that just use files.

Meanwhile WASI has their own server-side file system APIs but neither WASI nor the browser side seem to have any effort to get on the same page. It'll be 2035 before component-model ends up being usable on the web, at this rate, before modules are modular.



The profusion of underpowered interfaces is frustrating but I don't see how browsers could

"just use files"

What does this mean? Providing the POSIX filesystem C language APIs with file descriptor numbers, etc? I guess the browser makers could go Cosmopolitan-C style and emulate the Linux filesystem interface on all platforms, but I think you'd find there's still a weird intermediary layer in between your code and the underlying OS filesystem. Even on Linux the browser would have a virtual filesystem to provide sandboxing & quality-of-service governance. On other systems you'd need a virtual filesystem to paper over those differences between operating systems. Windows exists!



Notion takes 15s to load to an empty page. Then another 5 to dismiss the popup about new AI features and the like.

I'm glad they are making their app faster, in the meantime I (browser user) have cancelled my team's subscription and will be using something else.



I was about to say, does Notion actually feel any faster to anyone? I stopped using it years ago but the last place I contracted at was invested in it heavily and the thing was crazy slow.



Developing a fast system requires an entire different mindset, unless it is critical it never happens. Most people think it is w/e we don’t need to optimize and when you have people complaining it is already too late



> we've all collectively decided desktop software is impossible

the customers are the ones who did.

Suppose there was a desktop app, but a competitor has a web version. It's much easier, faster to use the web version, because you dont need to download the desktop app, install it (which might require admin perms). Then your files are local, so if you have a second computer, there needs to be some way to share those files between.



One team works insanely hard and saves 2 seconds on app load. Another team takes the opportunity to add 3 seconds worth of crap that no one wants. Welcome to modern software development.



I also moved away from Notion a while ago for being crazy slow. For personal note taking I use Obsidian, big fan of it.

For my dev team I recently moved us to eraser.io. Still in the early stages with it but it's working well enough for us atm.



I’d love advice too - I’ve become heavily reliant on Notion’s database-lite features, which is what a lot of the markdown-based apps don’t invest in by default. I’m beginning to get frustrated with loading speeds that feel like spinning disks.



Will the tables in the Notion mobile app still be so painfully slow? Every time I scroll few rows down (in a table with few than 100 rows with dozen columns), the app hangs for several seconds while trying to fetch/render (who knows what it's actually doing) few additional rows. When I edit a row and return back to the table view, scroll position is lost and I have to suffer through the fetching once again.



They’re a very un-user focused company. They used the | cursor for any horizontally scrolling content for years, making a kanban board with non default states is nightmarishly complex, and there’s a billion odd UI slownesses and niggles.

But they’re recently added AI.



Notion is a great example of my hatred of the current "upgrade or die" delivery method of software. Notion 1.0, when it was mostly text with a unique way to view tabular data, was fast and enjoyable to use. Once it shifted to competing with Atlassian et al, it acquired all the bloat those products already have.



Sadly, I don’t think basing a business (with the tech investor desire for growth! growth! growth!) off one product, is best for an app that’s best when predictable.



Compared to a plain text editor, no. Compared to documentation + Access hybrids, I'd say it was good enough. The lack of true offline support has always been a big downside.

And by fast I mostly mean, when you did a / command, the menu came up quickly. Now it's stuffed with so many options as to be useless.



Seeing things like that excite me a lot, I can't wait for a future where all the code is written in whatever language you like and runs through WASM in the browser. The HTML and JS can go back doing that thing they are built for: Displaying the UI and handling the interactions.



Not necessarily. It's just that a lot of SPAs don't contain a lot of business logic because it's too tedious to do without a client side database; which means you end up offloading a lot of business logic to server side databases behind some kind of API.

Adding local persistence in the form of an actual database, allows you to do more of these things client side. Which means you end up with more business logic that you can cleanly separate from your data rendering logic and other cruft needed for e.g. form filling, data validation, and what not. You still need that of course but it makes web applications more similar to full desktop applications in the sense that the server might be a lot lighter or even be optional (other than serving the code and other assets).

I've actually been toying with building a Google Reader style application that stores its data in the browser recently. I use a few minimal server scripts to work around cors issues for fetching feeds and html previews. But aside from that, there is no need for a server. I can save local state (stored in indexdb) to a file and download it and then restore it from the same file as a backup strategy.

I've been adding search capabilities with tf/idf ranking, phrase matching. I'm using OpenAI to help summarize and tag content. And I'm currently adding a light weight vector search implementation. This is all running in the browser (except for openAI).

My goal with this is experimenting with RAG against news content. So, I've been piecing together things I need for this and raising the ambition level as I've progressed in the last few weeks. Most of this is probably not optimal and a big motivation for me is to just wrap my head around all the bits and pieces I need. But there's no good reason why most of those things could not be optimized with e.g. some wasm code that uses web-gpu for doing math and less memory intensive ways of storing stuff.

BTW, I'm using kotlin-js and kotlin-multiplatform which makes it easy to forget that I'm dealing with Javascript and very limited browser APIs under the hood. UI is still tedious to do but I have a growing amount of code that is pure business logic, algorithms, or other stuff you'd normally run on a server and implement in a language like Kotlin. Which is why it's nice to be using that in the browser.



I don’t see why they need to have a worker and a shared worker .. couldn’t they run the sqlite in the sharedworker directly?



All this wonderful optimization, and yet there's still no UI to warn that (perhaps due to a websocket connection being lost?) a comment won't be saved. I lost two Notion comments this way yesterday!



This is both silly and somewhat off-topic, but on the subject of doing database work on the client side ... has anyone experimented with using non-rendered DOM as 'table' and CSS selectors as 'query language'?

I'm not talking about 'using the DOM to store data' in the traditional sense. The idea is rather (roughly) to store each database 'table' as a child of a hidden DocumentFragment, with 'rows' as its children, etc. Then we'd query this data using CSS selectors (or XPath).

For example, instead of

  SELECT * FROM Employees WHERE gender = 'male' AND age > 30
you'd have something like:
  customersTbl.querySelectorAll('row:has(cell-gender[value="male"]):has(cell-age[value > "30"])')
And instead of
  SELECT * FROM Employees WHERE gender = 'male' OR age > 30
you'd have something like:
  employeesTbl.querySelectorAll('row:has(cell-gender[value="male"], cell-age[value > "30"])')
You can even have (truly) structured cells! Instead of
  SELECT * FROM Employees WHERE name->>'first_name' = 'John'
you can have:
  employeesTbl.querySelectorAll('row:has(cell-name:has(cell-first_name[value="John"]))')
And for querying array-like structures, instead of:
  SELECT * FROM employees WHERE skills::jsonb ? 'JavaScript';
you can have:
  employeesTbl.querySelectorAll('row:has(cell-skills > cell-skill[value="JavaScript"])')
I can certainly imagine the performance to be horrible, but having never tried out the idea (and having too meager a mental model of DOM performance to reason from) I can also imagine it being surprisingly decent for smaller datasets. Who knows? :-)

Edit: silly me indeed to think CSS attribute selectors already can do numeric comparisons! See here: https://github.com/w3c/csswg-drafts/issues/354



We've been using wasm sqlite with kotlin-js and the kotlin sqldelight framework. This is admittedly a bit of an exotic stack to be running in a browser. But it actually works surprisingly well. Unfortunately the failure modes with opfs are kind of ugly and need some attention. A key issue is controlling what happens when you have multiple tabs open interacting with the same DB.

One interesting thing is that opfs maximum disk usage is kind of browser specific but tends to be a percentage of the available diskspace measured in GB rather than some lowish number like 5MB as is typical for e.g. browser local storage. This makes it suitable for locally caching and syncing large amounts of remote database content.

We've had a few challenges with opfs and particularly limited support for this on Safari it kind of works but with some caveats. Chrome/Firefox are fine. Our web app is packaged up as a PWA for use on mobile.

Another issue is that opfs creates issues with loading resources from external websites. E.g. we have HTML previews that may include images on external domains that the browser will render fine without opfs enabled but will refuse to render when using opfs, unless you set crossOrigin=anonymous on the img tag.

Relative to indexed-db, which is supported by most browsers, you gain a more sane API to access data and more flexible querying and support for things like joins. I've had some exposure to indexed-db and IMHO it's a case study in bad API design gone horribly wrong. Querying is very limited and the APIs are poorly documented and have weird failure modes. Sqlite is an absolute pleasure to use in comparison and probably faster and way more capable.



Sadly, SharedWorkers still are not supported on Chrome+Android. Hope that changes soon because it would be a game-changer for many applications to have this API broadly available.



This is cool because I haven’t seen WASM SQlite in production. I wasn’t sure whether it’d end up being more performant than something built in like IndexedDB. Seeing that it’s practically feasible is definitely cool.



> Using SQLite improved page navigation times by 20 percent in all modern browsers.

Couldn't you cache the data in LocalStorage and get similar speed improvements?

I mean, how to run SQLite in the browser is a great topic and surely has a lot of very good use cases. But attributing the performance improvements to it sounds a bit misleading, when you had no client caching strategy before and you do now. Obviously loading data from a local cache is faster than loading data from a remote server.

Comparing WASM SQLite vs. other client caching solutions (LS, IndexDB, etc) could be interesting. In terms of perf, complexity, compatibility, storage capacity, query language, etc.



(I work at Notion, but didn't build the WASM sqlite thingy)

We've implemented this same cache using LocalStorage, IndexedDB, and SQLite. The Android & iOS apps used SQLite for this since 2020 (I built the Android version IIRC), and the desktop app used SQLite running on a native thread since 2021.

We migrated away from LocalStorage for two reasons:

1. LocalStorage is limited to 10mb file size. We also use LocalStorage for a bunch of less-durable state like "is this toggle open?" or "which view in this database was open last", and as our customer workspaces grew, we faced mounting errors as the record cache competed with the rest of the app for that space. We'd have a bunch of slowdowns under contention as we tried to delete keys in LocalStorage synchronously, which manifested as major UI lag. No good!

- LocalStorage loves to lose writes if you're writing from multiple tabs. It's just not a reliable or trustworthy API. For a cache that doesn't matter as much, but we'd still end up with cache misses for power users for pages that should really be totally locally loadable.

I implemented the IndexedDB version in 2019, to replace the earlier LocalStorage option. We used the IDB record cache in the desktop app until we switched over to native SQLite there in 2021 (https://www.notion.so/blog/faster-page-load-navigation) but we never shipped it for browser users for a few reasons:

- Performance and reliability problems with IDB in browsers is hard to debug; in the native app we can trust the version of Chromium we ship and remediate issues using Electron APIs, where as in the browser wild we're at the mercy of the user-agent

- Our testing in the browser showed limited performance improvements across all device categories: faster devices & scenarios were even faster with IndexedDB, but slower devices & scenarios could be even slower.

The reason I'd attribute for the performance challenge is that IndexedDB pays a high tax per row written and row read. It can be fine in terms of total throughput for a cache if you have large, coarse-grained cache rows, like caching all of a document as a single object, and you update the cache infrequently.

Notion's data model is tree/graph of very fine-grained records; each paragraph is its own database row. Our cache on IndexedDB would perform great for smaller workspace sizes and for a single tab, but with multiple tabs and medium-to-large workspaces, we'd hit contention in IndexedDB and get major slowdowns.

We should improve our cache architecture to have another layer of cache that does whole-pages, but need to weigh the improvement/complexity there versus other performance opportunities.



from my own experience what notion did is pretty much required to use wasm sqlite with persistence in production. on the one hand being used at notions scale really points to wasm sqlite being ready for primetime, on the other hand there are still a LOT of ugly parts that need to be worked around:

- tooling, support and debugging for opfs

- only really works with a single worker using sync api, but this is not available from service workers

- the official js releases are really not idiomatic js and require in depth understanding to pick out the right parts and then build a usable layer on top, afaik there is no existing npm package that just provides what is required



I'm so glad they didn't. We would all be stuck on some 14 year old baseline version of SQLite at this point.

SQLite as WASM is a much better solution IMO.



[..] Web SQL Database was a prior API developed by Apple.[12] But Firefox refused to add support for it and argued against it becoming a standard because it would codify the quirks of SQLite.[13][14] It was thus deprecated in favor of IndexedDB.[..]

Mozilla again.. The real joke is, Firefox is now the only(?) browser using sqlite out-of-the-box for internal databases.



I did, but seems I missed the actual subfolder where the databases are saved. But this makes it even worse. They all have sqlite installed, but offer not API for making use of it. Really missed chance.



Presumably GP meant Mozilla agitated for a Document/KV-store, yet for some inexlicable reason, in their own browser, they chose to implement IndexedDB with SQLite.

They could have used an actual KV store behind it, like chromium did (levelDB IIRC).



It's buggy depending on browser implementation, Firefox will corrupt DBs, it's a herculean task to debug things with users, it blocks the main thread for reads or writes (maybe wasm stuff does too).



It has a weird, verbose API that's more like a simple document store than a SQL database, so it's naturally going to be outperformed by SQLite, especially if you're performing complex operations. It's also heavily async which means operations have to spin the event loop.



yeah I gave up on the direct API, I used Dexie for my product.

I think it's funny you say a problem is that it's async while another commentor says a problem is it blocks the UI. Maybe it's the worst of both?



This is an optional cache so Notion continues to work the way it did before in browsers that don't support WASM SQLite.

As an alternative we have apps for iOS, Android, Mac, Windows. In those apps we already run SQLite in a native thread talking to the normal filesystem for years.



This honestly sounds like a nightmare: multimegabyte wasm downloads, data corruption in production and byzantine hacks to coordinate writes between tabs. I am very grateful that we chose IndexedDB instead for our application!



Honestly, we should just have native SQLite supported by the browser.

SQLite has become the defacto local single file database standard.

It has a very permissive license.

It has support contracts out to the 2050s (service lifetime of Airbus A350 airframe).

SQLite will probably have a longer lifetime than whatever browser standard we have now.



Unfortunately WebSQL was ahead of its time, I have no doubt it would have caught on more today with the resurgent popularity of SQLite.

IndexedDB got to ride the NoSQL hype of the 2010's



I was also disappointed, but this decision seems reasonable to me. SQLite is still evolving, if you pick a specific version then you're stuck on that version or risk breaking people that depend on specifics of that version. Also a vulnerability anywhere in SQLite would expose billions of people to exploits. This is not the same as applications using sqlite internally, the web is a much hotter environment: most apps may expect potentially hostile input in SQL params, but with WebSQL everything is potentially hostile: the queries, the database files, even the schema may conspire under confusing concurrency conditions to corrupt the state of the sqlite engine to break out and exploit the user. Whatever you think you have to agree that this is a LOT of new surface area.

I much prefer this approach where the site brings their own sandboxed SQLite WASM package.

联系我们 contact @ memedata.com