Large Feeds and RFC 5005

Original link: https://alexschroeder.ch/view/2025-09-10-large-feeds

The author is building Xobaque, a search engine for independent blogs that backs the search on `indieblog.page`, and ran into a performance bottleneck while importing nearly 5000 feeds into an SQLite database. The main problem is slow writes caused by having a single SQLite writer, made worse by the inability to use UPSERT on the full-text-search virtual table. A surprising discovery was the enormous size of some blog feeds (one has over 12,000 pages), which raises questions about what a feed is for (archive vs. updates) and whether feeds follow the paging rules of RFC 5005. Performance improved by processing feeds in batches of 1000 entries and using prepared statements for inserts and updates, bringing one pass down to about 12 hours. Future optimizations include global prepared statements to minimize SQL parsing and filtering feed pages by published/updated date to avoid unnecessary database activity. The author plans to store both the web server's date and the feed's publication date to enable more effective filtering.


Original text

I’ve been working on Xobaque because I’m using it to back the search on indieblog.page.

As I’m importing nearly 5000 feeds, I find that a single SQLite writer is in fact a bottleneck, and that my simple “select foo; if found, update foo; else insert foo” approach isn’t very fast. Sadly, I cannot use UPSERT, because full-text search is implemented by a virtual table which doesn’t allow constraints or indexes, and UPSERT requires a constraint to work.
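
For illustration, here is a minimal sketch of that select-then-update-or-insert fallback in Go, assuming the mattn/go-sqlite3 driver and an FTS table called pages with url, title and body columns; all of these names are made up, not the actual schema:

```go
package store

import (
	"database/sql"

	_ "github.com/mattn/go-sqlite3" // registers the "sqlite3" driver
)

// upsertPage selects the existing row and then updates or inserts,
// because the FTS virtual table cannot carry the UNIQUE constraint
// that a real UPSERT would need.
func upsertPage(db *sql.DB, url, title, body string) error {
	var oldTitle, oldBody string
	err := db.QueryRow(`SELECT title, body FROM pages WHERE url = ?`, url).
		Scan(&oldTitle, &oldBody)
	switch {
	case err == sql.ErrNoRows:
		// Not indexed yet: insert a new row.
		_, err = db.Exec(`INSERT INTO pages (url, title, body) VALUES (?, ?, ?)`,
			url, title, body)
		return err
	case err != nil:
		return err
	case oldTitle == title && oldBody == body:
		// Unchanged: skip the write entirely.
		return nil
	default:
		// Changed: overwrite title and body.
		_, err = db.Exec(`UPDATE pages SET title = ?, body = ? WHERE url = ?`,
			title, body, url)
		return err
	}
}
```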

I also discovered that some blogs come with a gazillion pages going back to the beginning of the century. There’s one blog with over 12000 pages in its feed. Short pages, for sure, but still. There’s one blog with over 4000 pages; one with over 2000 pages; six with over 1000 pages. What the hell!

I understand having a lot of pages on the blog. I don’t understand having a lot of pages in a feed. The feed is for updates. Do people expect readers to read the entire back catalogue? I mean, I have listened through entire podcasts, which are powered by a feed, so I understand that the feed for a blog and the feed for a podcast serve different needs. But still. 12000 pages in a feed feels like downloading the whole site, every single time somebody requests an update. These feeds are not “news” but “archives”, and there’s a place for both of them. Upload an archive for the initial load, then keep uploading news to keep the index up to date.

This is what RFC 5005 is for: feed paging and archiving.
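
As a rough sketch of what a consumer of RFC 5005 archived feeds could do (not what Xobaque does today), here is Go code that follows rel="prev-archive" links backwards, so the back catalogue is fetched once rather than shipped with every update; the package and struct names are made up:

```go
package archive

import (
	"encoding/xml"
	"io"
	"net/http"
)

// Minimal Atom structures: we only care about the <link> elements.
type atomLink struct {
	Rel  string `xml:"rel,attr"`
	Href string `xml:"href,attr"`
}

type atomFeed struct {
	Links []atomLink `xml:"link"`
}

// prevArchive returns the URL of the rel="prev-archive" link, or "".
func prevArchive(doc []byte) (string, error) {
	var f atomFeed
	if err := xml.Unmarshal(doc, &f); err != nil {
		return "", err
	}
	for _, l := range f.Links {
		if l.Rel == "prev-archive" {
			return l.Href, nil
		}
	}
	return "", nil
}

// walkArchives downloads the subscription document and then follows the
// chain of archive documents backwards, returning all of them.
func walkArchives(url string) ([][]byte, error) {
	var docs [][]byte
	for url != "" {
		resp, err := http.Get(url)
		if err != nil {
			return docs, err
		}
		doc, err := io.ReadAll(resp.Body)
		resp.Body.Close()
		if err != nil {
			return docs, err
		}
		docs = append(docs, doc)
		if url, err = prevArchive(doc); err != nil {
			return docs, err
		}
	}
	return docs, nil
}
```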

I guess I’m just confused because it all illustrates that I walked into this search engine stuff without understanding the huge variety out there.

The current architecture is this: 10 goroutines fetch feeds from the OPML; this uses If-Modified-Since and If-None-Match headers and skips feeds that return a 304. This is fast.
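
The conditional GET part might look roughly like this in Go; the cached struct and the function names are illustrative, not Xobaque’s actual code:

```go
package fetch

import "net/http"

// cached holds the validators stored the last time this feed was fetched.
type cached struct {
	ETag         string
	LastModified string
}

// fetchFeed does a conditional GET and returns (nil, nil) when the
// server answers 304 Not Modified, so the caller can skip the feed.
func fetchFeed(url string, c cached) (*http.Response, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	if c.ETag != "" {
		req.Header.Set("If-None-Match", c.ETag)
	}
	if c.LastModified != "" {
		req.Header.Set("If-Modified-Since", c.LastModified)
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	if resp.StatusCode == http.StatusNotModified {
		resp.Body.Close()
		return nil, nil // 304: nothing changed since the last fetch
	}
	return resp, nil
}
```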

Before a feed is written to disk, a lock is acquired so we only have one SQLite writer at a time. This is slow and cannot be changed without moving away from SQLite.
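
A minimal sketch of that single-writer rule, assuming one process-wide mutex around all writes (the names are illustrative):

```go
package store

import (
	"database/sql"
	"sync"
)

// writeMu serializes all writes: the fetchers stay concurrent, but only
// one goroutine at a time gets to write to the SQLite database.
var writeMu sync.Mutex

func withWriter(db *sql.DB, write func(*sql.DB) error) error {
	writeMu.Lock()
	defer writeMu.Unlock()
	return write(db)
}
```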

On fedi, somebody had some interesting suggestions for my use of SQLite: split selects, inserts and updates into chunks. And I did, and it helped!

Each feed is broken into groups of 1000 items (only very few feeds need more than one group); up to 1000 pages with matching URLs are read from disk and their title and content are compared to the data in the feed; the result is a number of updates, a number of inserts, a number of duplicate URLs that get ignored, and a number of unchanged pages. The updates are executed with a single prepared statement; then the inserts are executed with a single prepared statement. This already reduces the running time of one pass to about 12h. 😅 This is better than before!
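
Here is a hedged sketch of what one such batch write could look like with database/sql, with the updates and the inserts each going through one prepared statement inside a transaction; the page type and the SQL are illustrative, not the real schema:

```go
package store

import "database/sql"

// page is an illustrative stand-in for one entry of a feed batch.
type page struct {
	URL, Title, Body string
}

// writeBatch runs all updates through one prepared statement, then all
// inserts through another, inside a single transaction.
func writeBatch(db *sql.DB, updates, inserts []page) error {
	tx, err := db.Begin()
	if err != nil {
		return err
	}
	defer tx.Rollback() // no-op after a successful Commit

	upd, err := tx.Prepare(`UPDATE pages SET title = ?, body = ? WHERE url = ?`)
	if err != nil {
		return err
	}
	defer upd.Close()
	for _, p := range updates {
		if _, err := upd.Exec(p.Title, p.Body, p.URL); err != nil {
			return err
		}
	}

	ins, err := tx.Prepare(`INSERT INTO pages (url, title, body) VALUES (?, ?, ?)`)
	if err != nil {
		return err
	}
	defer ins.Close()
	for _, p := range inserts {
		if _, err := ins.Exec(p.URL, p.Title, p.Body); err != nil {
			return err
		}
	}

	return tx.Commit()
}
```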

I’m not sure what to do next to improve performance. The most obvious step would be to use three global variables for prepared select, insert and update to reduce SQL parsing. Not sure how much that will help. I have a log with 45 min of output. In this time window, 1782 feeds resulted in some SQL based processing and 2179 feeds were skipped due to HTTP caching.
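
That step might look something like this, with the three statements prepared once at startup and stored in package-level variables; again illustrative, and with the caveat that database/sql may still re-prepare per pooled connection behind the scenes:

```go
package store

import "database/sql"

// Prepared once at startup and reused for every feed, so the SQL text
// is not re-parsed for every batch.
var (
	selectPage *sql.Stmt
	insertPage *sql.Stmt
	updatePage *sql.Stmt
)

func prepareStatements(db *sql.DB) (err error) {
	if selectPage, err = db.Prepare(`SELECT title, body FROM pages WHERE url = ?`); err != nil {
		return err
	}
	if insertPage, err = db.Prepare(`INSERT INTO pages (url, title, body) VALUES (?, ?, ?)`); err != nil {
		return err
	}
	updatePage, err = db.Prepare(`UPDATE pages SET title = ?, body = ? WHERE url = ?`)
	return err
}
```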

If I use the timestamp of the Last-Modified header that I have on disk and filter the pages in the feed for published or updated dates greater than that, I might speed things up by excluding the older pages from the select statements. If such a date is missing, no big deal. For many feeds, however, this will result in less database activity. Looking at the numbers in the piece of log I have: 93119 pages processed, of which 160 were duplicates, 9534 resulted in inserts, 362 resulted in updates, and 83063 were unchanged. Clearly, skipping the unchanged ones would save database activity!
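
A sketch of that filter, assuming items parsed from the feed with optional published/updated dates and a cached Last-Modified string; all names here are hypothetical:

```go
package feed

import (
	"net/http"
	"time"
)

// item is an illustrative stand-in for one entry of a parsed feed; a
// zero time means the feed did not give that date.
type item struct {
	URL       string
	Published time.Time
	Updated   time.Time
}

// newerThan keeps the items whose published or updated date is after
// the Last-Modified value cached on disk; items without any date are
// kept, because a missing date is no big deal: process them as before.
func newerThan(items []item, lastModified string) []item {
	cutoff, err := http.ParseTime(lastModified)
	if err != nil {
		return items // no usable cutoff: process everything
	}
	var keep []item
	for _, it := range items {
		switch {
		case it.Published.IsZero() && it.Updated.IsZero():
			keep = append(keep, it)
		case it.Published.After(cutoff) || it.Updated.After(cutoff):
			keep = append(keep, it)
		}
	}
	return keep
}
```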

Actually, I’ll store the feed’s published/updated date in addition to the web server’s last-modified date, if available, and use that instead.

#Xobaque #Feeds #Blogs
