(comments)

Original link: https://news.ycombinator.com/item?id=39417503

The discussion around manipulating CSV/TSV data with SQL highlights the possibility of implementing a similar concept that allows querying CSV/TSV files with SQL commands. This approach removes the need for the extra data structures common in traditional databases, offering flexibility for smaller datasets. There are, however, limitations to consider, including the lack of a structured schema, the difficulty of managing inconsistent data under concurrent updates, limited support for advanced query features, and potential redundancy with existing tools. By contrast, a mature, proven database like SQLite offers richer functionality for more complex data-management needs, though it introduces different trade-offs around size limits, read/write capability, and synchronization. Ultimately, the choice between these approaches comes down to specific requirements and preferences.

Original text
Wddbfs – Mount a SQLite database as a filesystem (adamobeng.com)
268 points by vitplister 1 day ago | 40 comments

This seems really nice!

If this is posted by the author looking for feedback:

1) WebDAV is a much better choice than FUSE. FUSE is a good concept, but buggy and poorly implemented. Things like sshfs can break in very bad ways if e.g. there is a network connectivity issue. Not a hack.

2) Writes seem like a very bad idea. Keep those out unless you come up with a clean way to handle them (which seems difficult if not impossible given the differences in FS versus relational abstractions, especially with regards to data validation). Not a limitation.

In other words, the "hacks" seem like design choices a good architect would likely have made. Continuing:

3) The major use-case I have is if I have a small (

4) I think a major theoretical question is how to fuse the two models. I would like to be able to do 'generic' things like the above on databases, while still being able to be relational.

5) I don't have an answer to the above, but a natural first step might be to allow something like queries or virtual tables to sit on the file system:

    wddbfs --anonymous --db-path=/path/to/an/example/database/like/Chinook_Sqlite.sqlite

    wddbfs_query myjoin "SELECT * FROM table_1, table_2 WHERE table_1.id=table_2.id"

And voila! A /virtual/myjoin.csv file pops up.

(Even more) half-baked thoughts:

There might be more clever ways to do it too. I'm still thinking through how to make files and tab completion work; my thoughts are moving towards something like:

    wddbfs_SELECT * from Customer.tsv\, Employee.tsv WHERE

But I don't like all the potential bugs with escaping. I'm also thinking about when output wants to go to the console versus into a virtual table.



> WebDAV is a much better choice than FUSE.

That's a pretty spicy opinion. WebDAV is slow; FUSE works great. Use interrupt mounts for things like sshfs (or NFS, for that matter).

> Writes seem like a very bad idea. Keep those out unless you come up with a clean way to handle them (which seems difficult if not impossible given the differences in FS versus relational abstractions, especially with regards to data validation).

You could imagine supporting appending new records in json format to the json/jsonl files in a pretty clean way, but I agree it seems well out of scope. And removing records seems basically impossible.
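
A minimal sketch of that append-as-insert idea, in Python -- hypothetical, not part of wddbfs, and assuming each appended line is a complete JSON object whose keys are column names:

    import json
    import sqlite3

    def append_jsonl(conn: sqlite3.Connection, table: str, payload: str) -> None:
        # Treat bytes appended to <table>.jsonl as INSERTs, one per line.
        for line in payload.splitlines():
            if not line.strip():
                continue
            record = json.loads(line)
            cols = ", ".join(f'"{k}"' for k in record)
            marks = ", ".join("?" for _ in record)
            conn.execute(
                f'INSERT INTO "{table}" ({cols}) VALUES ({marks})',
                list(record.values()),
            )
        conn.commit()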



What's an "interrupt mount"?


lazy/async/autoreconnect


> WebDAV is a much better choice than FUSE. FUSE is a good concept, but buggy and poorly-implemented. Things like sshfs can break in very bad ways if e.g. there is a network connectivity issue. Not a hack.

I have had problems with particular fuse implementations (sshfs included, s3fs especially), but I've never had trouble writing FUSE implementations that behave correctly. Is there something deficient in the spec that you want to call out, or is the idea just that WebDAV carefully built networking into its core concepts, where you have to do a lot of extra work to squeeze those ideas correctly into a FUSE implementation?



> There is no way to query all tables for something,

Ha. I would bet anything you could adapt my MySQL findall() to SQLite for the purpose. https://stackoverflow.com/a/68915413/308851
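
A rough SQLite version of that idea -- an untested sketch that assumes ordinary rowid tables and leans on LIKE's text coercion:

    import sqlite3

    def findall(conn: sqlite3.Connection, needle: str):
        # Search every column of every table for a substring match.
        hits = []
        tables = [r[0] for r in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table'")]
        for table in tables:
            cols = [r[1] for r in conn.execute(f'PRAGMA table_info("{table}")')]
            where = " OR ".join(f'"{c}" LIKE ?' for c in cols)
            params = [f"%{needle}%"] * len(cols)
            for row in conn.execute(
                    f'SELECT rowid, * FROM "{table}" WHERE {where}', params):
                hits.append((table, row))
        return hits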



This SQL is definitely supervillain territory.

I just can't decide if you're a sci-fi supervillain, a political thriller supervillain, or a Despicable Me supervillain.

;-)



Nice! I hope I remember this when I next need it.


Huh. I guess I'm one of the few who's had generally good experiences with FUSE, then. I've used several different FUSE filesystems, ranging from disk-based, to network-oriented, to... well, let's just say a bit odd, kind of fun, but probably not particularly useful. I even wrote a couple of the latter myself, just for funsies.

Oh well, there's my FUSE rant for the day, I guess. Now, on to what I came to comment on....

I'm indifferent on the WebDAV vs FUSE thing, actually. But if the author wanted a nicer way to query a bunch of text files, where each line roughly represents what a db row would, and there's whatever random delimiter character in there (comma, tab, pipe, or even those weird ASCII control characters nobody ever knows or remembers exist for the purpose [0]), then I think he gave in and went to a database too soon. Besides the ubiquitous GNU textutils, there are tons of tools out there for working with record-oriented CSV and TSV files. [1] There's even `jq` if you're a real masochist and want to use json as your text-based data storage, query, and interchange format.

Now you've got me wondering if I was lucky in not having any significant issues with FUSE. But, as far as OP and the technical problems they were facing, I think it was a little premature to jump to `sqlite`. All those things they put down to "clunkiness" really just make me think that `sqlite` + WebDAV ended up being one of the members of the set of simple, elegant, and wrong solutions most problems seem to have. ¯\_(ツ)_/¯

---

[0]: https://stackoverflow.com/a/18782271

[1]: https://stackabuse.com/running-sql-on-csv-data-data-conversi...



I don't quite agree. There's a reason relational won, and continues to win more than a half-century after it was invented, despite no lack of wannabe usurpers.

For your model to work, what would be needed is the reverse: a system which allows me to interact with TSV files with SQL.

For 98% of uses of SQLite, performance and storage efficiency don't matter; we're talking about dozens or hundreds of records. It doesn't need to be well-implemented from that perspective. It does need to be robust. For example, databases solve a lot of problems especially around consistency in threaded systems which this would need to handle as well (although a KISS solution, such as handling all access from one thread / process, and simply blocking other requests during transactions, would be fine).

Something like that would likely eat SQLite's lunch.

q seems close. It'd need to support writes, table creation, and all the other stuff, perhaps be more actively maintained, provide documented API (not just command-line) access, and work with things like ORMs.
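
A sketch of the KISS option mentioned above -- serialize every request through one lock so consistency never comes up (names hypothetical):

    import sqlite3
    import threading

    _lock = threading.Lock()
    _conn = sqlite3.connect("data.db", check_same_thread=False)

    def run(sql: str, params=()):
        # One gate for all access: concurrent callers simply block.
        with _lock:
            return _conn.execute(sql, params).fetchall()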



> For your model to work, what would be needed is the reverse: a system which allows me to interact with TSV files with SQL.

Several such systems exist, one of which is SQLite. https://www.sqlite.org/csv.html

I don't know if there's a virtual table extension for tab separated values, but SQLite imports them just fine.
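
For example, with the csv extension from that page compiled to a loadable module (the ./csv path is an assumption about your build output):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.enable_load_extension(True)
    conn.load_extension("./csv")  # built from https://www.sqlite.org/csv.html
    conn.execute("CREATE VIRTUAL TABLE temp.t1 USING csv(filename='thefile.csv')")
    for row in conn.execute("SELECT * FROM t1 LIMIT 5"):
        print(row)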



> a system which allows me to interact with TSV files with SQL

ClickHouse and DuckDB can both do this. Pretty sure AWS has some built-in stuff for this too.
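
In DuckDB's Python client, for instance, a TSV is directly queryable; read_csv_auto sniffs the delimiter and column types (Customer.tsv is just a stand-in filename):

    import duckdb  # pip install duckdb

    print(duckdb.sql("SELECT * FROM read_csv_auto('Customer.tsv') LIMIT 3"))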



I think those are a very different use case. sqlite is used for things like app settings. Both of those are heavyweight monsters designed for rather large data.

That said, I'm reading their documentation right now, and I think one or the other might slot into a different use-case I have where I do have rather large data. I'd last looked at this space quite a number of years ago, and both are way ahead of what I'd recalled.



> Things like sshfs can break in very bad ways if e.g. there is a network connectivity issue.

That’s a peculiar critique of FUSE. You could write the same thing about NFS or SMB (but especially NFS).



Why not just add views?
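
E.g. a view created with plain SQL lives in the database file itself, so -- assuming wddbfs lists views alongside tables, which the article doesn't confirm -- the earlier myjoin example would just appear as a file:

    import sqlite3

    conn = sqlite3.connect("Chinook_Sqlite.sqlite")
    conn.execute("""
        CREATE VIEW IF NOT EXISTS myjoin AS
        SELECT * FROM Album JOIN Artist USING (ArtistId)
    """)
    conn.commit()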


Don't people use NFS instead of FUSE now? I'd assume it would have much better performance than WebDAV, and it should handle a bunch of the issues around writes.


I am moving outside of my zone of expertise, but when I last looked (decades ago), systems like NFS assumed a filesystem backing store and thought about things like byte ranges. There are a lot of operations which file systems support that work very poorly when this is not true, such as seeks, memory-mapped files, etc.

If I want the 50,004,123,121st byte of a file from a disk, that's very fast. If I want the same for a virtual object from an HTTP server, object store, virtual table, etc., it literally involves creating the whole object and stepping through it byte-by-byte until I get there.

If the next request is doing the same 1k ahead, on a disk, that's probably in cache, and if not, I can get there quickly. If this was a SQL query, I probably need to redo the whole thing.

You get a natural explosion from O(1) to O(n) in many common cases, and for something like a complex SQL query, it can be much, much worse.
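
Concretely: to serve bytes [start, start+length) of a virtual file, a naive server has to regenerate the whole rendering and slice it, O(n) per request (render_table is a hypothetical callback that runs the query and formats the result):

    def read_range(render_table, start: int, length: int) -> bytes:
        # A real file would be an O(1) seek; here every range read
        # pays for rendering the entire object first.
        blob = render_table()
        return blob[start:start + length]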



NFSv4 is stateful, and while it's true that the server has to support stateless reads for bad clients, most clients are ok with keeping the state ids that are returned after an open. Whether you have to step through the object byte-by-byte isn't inherent to the protocol, it depends on how you've stored the data.


I've implemented both in the last 6 months and the only reason I would expect anyone to use NFS over FUSE is for MacOS without macfuse/fuse-t. fuse-t is a cool project that provides (iirc) the high level fuse interface (libfuse) via NFS on MacOS. Compare to macfuse, which implements the low level interface (/dev/fuse) via a kext.

FUSE is much, much easier to work with. The protocol (even v4) should also be less chatty, and if performance is your goal there are projects like ExtFuse that can move some caching into the kernel via eBPF which I don't think would be possible with NFS.

If you poke around other user space file systems (sshfs, cephfs, objectivefs, etc) they're all FUSE.



I can implement a basic FUSE filesystem in a handful of lines of code in several languages. I'm not aware of any libraries that make it equally simple to write an NFS server.


I'm also working on something like this:

https://github.com/Airsequel/SQLiteDAV

My mapping is: table -> dir, row -> dir, cell -> file
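
A minimal sketch of how such a path could resolve to a read (hypothetical helper, not SQLiteDAV's actual code; real code would whitelist identifiers instead of interpolating them):

    import sqlite3

    def read_cell(conn: sqlite3.Connection, path: str):
        # /TableName/rowid/column  ->  the value of that single cell
        table, rowid, column = path.strip("/").split("/")
        row = conn.execute(
            f'SELECT "{column}" FROM "{table}" WHERE rowid = ?', (rowid,)
        ).fetchone()
        return None if row is None else row[0]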



I wonder if mapping index -> directory would be the best match; that way you could at least hypothetically reclaim some of the SQL benefits, and the directory entries could have more natural names. You'd need a slightly different structure for unique vs non-unique indices, but that seems like a minor issue.


I built a CouchDB WebDAV server back in the day; you could also edit JSON documents or file blobs directly. The problem I discovered was that all OSes have totally stalled their WebDAV support, and there are also enough differences between OSes to be annoying. In the end, to build something with great performance you would also need to control the WebDAV client side, and probably build a FUSE WebDAV client. I would have loved to see WebDAV maturing and becoming what the 9P vision was, just for the web, but this obviously never happened, as all the applications just went into the web and used REST instead of WebDAV, and everything else moved to sync protocols that sync to local folders.


I'm with you on wishing WebDAV continued its rollout. These days there are great low-drama server-side deployments like https://github.com/sigoden/dufs. It can be run per directory too - you could have multiple dufs processes serving up different directories in different ways. But for WebDAV, you can't simply mount that on the client side for every OS with equally low configuration. For that reason, I really like sshfs, as it can be initiated from the client side without a lot of config (just a mkdir of the mapped dir), and it's OK most of the time despite its lack of speed and multi-day uptime. I'm on a Chromebook now, and it turns out that Samba is the easiest client-side tech to use for remote file systems. DAV should've been ubiquitous.


Wrap in an SSH tunnel and you can do some fun stuff over the network, too.


Over the network (when using azure or gcp storage) there's also https://sqlite.org/cloudsqlite/doc/trunk/www/index.wiki which doesn't require loading everything in memory.


> Although for now, the whole table gets read into memory for every read so this won’t work well for very large database files. There’s also no write support… yet.

With that feature set, I would prefer a tool that converts databases to a directory of real csv and jsonl files; at least then there are no performance issues to worry about
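
Such a converter is only a few lines; a sketch for SQLite:

    import csv
    import sqlite3
    from pathlib import Path

    def dump_to_dir(db_path: str, out_dir: str) -> None:
        # Export every table to a real .csv file, header row included.
        conn = sqlite3.connect(db_path)
        out = Path(out_dir)
        out.mkdir(parents=True, exist_ok=True)
        tables = [r[0] for r in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table'")]
        for table in tables:
            cur = conn.execute(f'SELECT * FROM "{table}"')
            with open(out / f"{table}.csv", "w", newline="") as f:
                writer = csv.writer(f)
                writer.writerow([d[0] for d in cur.description])
                writer.writerows(cur)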



I really like the idea of running a watch on an sqlite table with my common cli tools. If I understand correctly, it is being re-read on every request. Does that mean that changes to the sqlite database will be visible on the next read of a csv file?


> the SQL syntax for selecting a few records is much more verbose than head -n or tail -n

I use DBeaver to inspect SQLite files, and to also work with Postgres databases.

I kind of miss MySQL Workbench, but MySQL is pretty dead to me. And SQL Server Management Studio is a relic that keeps being updated.

I also sometimes make dashboards from SQLite files using Grafana, but the time functions for SQLite are pretty bad.



I was expecting this to be a way to mount so-called SQL Archives (https://sqlite.org/sqlar.html) but this is just as cool.


Wasn't there an extension that let you mount a filesystem as a table or db in sqlite? I wonder how far you can inception that. Mount a db as a filesystem, then mount that filesystem as a db, etc.


Borrowing concepts from DB2 here


This looks pretty cool.


Cute, seems legitimately useful, succinct. Not everything has to be super technically challenging to be valuable. I can see how this would be really handy.


I think it's a neat proof of concept, but I struggle to see any case where this would be particularly useful. Or rather, when this would be more handy than what the sqlite CLI already offers.

Like, is this really a meaningful improvement

    $ tail -n 3 Chinook_Sqlite.sqlite/Album.tsv
over this?

    $ sqlite3 -tabs Chinook_Sqlite.sqlite 'select * from Album' | tail -n 3


Nice. I think exposing the tables as csv, tsv, json & jsonl all at once is too much clutter. Format should be a mount option.


It is


Why WebDAV instead of making short sqltocsv, sqltojson, etc. scripts? You could make completion work (with some enormous time investment).


Because those all exist already, and this is cool?


Does it support mounting a sqlar file?
