The Perils of ISBN

Original link: https://rygoldstein.com/posts/perils-of-isbn

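Inspired by the user-friendly design of the film platform Letterboxd, the author sets out to build a similar tracking and review platform for books. Goodreads and StoryGraph already exist, but their interfaces are not smooth enough. The core idea is a simple, intuitive system for logging reading and recommending books.

Building this “Goodreads 2.0” runs into a major challenge, however: reliably obtaining book data. The Google Books API is free, but because every edition and format has a different ISBN, it returns multiple entries for the same book. That level of detail is useless for simply recording that one has *read* a book.

The author discovers the library-science concept FRBR, which distinguishes a “work” (the book itself) from its various “expressions” and “manifestations”, and which highlights the need for a database centered on works rather than ISBNs. OpenLibrary offers a better model, but its data is still imperfect. Unlike Letterboxd, which relies on the well-maintained The Movie Database, no comparable high-quality open-source book database exists, and the sheer scale of book data (more than 40 million works in OpenLibrary) makes the project harder still. Despite these obstacles, the author remains determined to explore the possibility.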

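## The Perils of ISBN: Summary

A Hacker News discussion highlights the complexity and shortcomings of using ISBNs to uniquely identify books. Although the ISBN was originally intended as a simple product number, it often fails to distinguish accurately between different *versions* of a work: different editions, formats (hardcover vs. paperback vs. ebook), and even subtle changes such as a track listing.

The discussion draws a comparison with music databases such as MusicBrainz, which meticulously track “works”, “expressions”, and “carriers” to record fine differences. Users point to ISBN reuse, assignment errors, and the proliferation of ISBNs for digital formats, all of which cause confusion and make cataloguing a personal library difficult.

Several alternatives for logging books come up, including StoryGraph, Hardcover.app, Open Library, and Anna’s Archive. Ultimately, the discussion stresses that the ISBN was designed for *inventory management*, not as a comprehensive bibliographic identifier, so a more robust system is needed to track and distinguish different editions of a book accurately.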

Original text

Last year I got into using Letterboxd, to complement my goal of watching more (good) movies. It’s got a really clean interface, the social features are useful but unobtrusive, and it makes remembering what I’ve watched and when I watched it easy. So why isn’t there a Letterboxd for books?

Funnily enough, Letterboxd still describes itself as “like GoodReads for movies”. But GoodReads itself is a mess. Take this screenshot of my childhood GoodReads account as an example:

Where do I log and review a book I’ve read? (The search bar, but it takes half a dozen clicks and involves up to three different ways to log something). Where can I see the list of books I have read so I can recommend one to a friend? Where can I find books I plan to read? (Both under “My Books”, which by default shows them intermixed). Why is so much of the UI taken up with stuff like reading challenges and newsletters? StoryGraph, the leading independent alternative to GoodReads, has similar problems. These interfaces don’t lead me to log books; instead I just have some files in Obsidian that I sometimes remember to update.

So let’s build our own GoodReads, with a UI that’s convenient enough to actually use. First we gotta build a search function for books. Then–

Wait, a search function for books. How do we do that? Well, there’s the Google Books API, and it’s free, which is nice. But when I search for “The Last Unicorn” (and do a little munging of the contents with jq):

$ curl -X GET 'https://www.googleapis.com/books/v1/volumes?q=The+Last+Unicorn' | jq ".items | .[] | .volumeInfo | {title: .title, authors: .authors, isbns: .industryIdentifiers | map(.identifier)  }"

I get this mess:

{
  "title": "The Last Unicorn",
  "authors": [
    "Peter S. Beagle"
  ],
  "isbns": [
    "9780451450524",
    "0451450523"
  ]
}
{
  "title": "The Last Unicorn",
  "authors": [
    "Peter S. Beagle"
  ],
  "isbns": [
    "1417644931",
    "9781417644933"
  ]
}
{
  "title": "The Last Unicorn",
  "authors": [
    "Peter S. Beagle"
  ],
  "isbns": [
    "9780593547342",
    "0593547349"
  ]
}
{
  "title": "The Last Unicorn",
  "authors": [
    "Peter S. Beagle"
  ],
  "isbns": [
    "0345028929",
    "9780345028921"
  ]
}
{
  "title": "The Last Unicorn",
  "authors": [
    "Jane Elizabeth Cammack"
  ],
  "isbns": [
    "8853010932",
    "9788853010933"
  ]
}
{
  "title": "The Last Unicorn",
  "authors": [
    "Peter S. Beagle"
  ],
  "isbns": [
    "1596060832",
    "9781596060838"
  ]
}
{
  "title": "Last Unicorn",
  "authors": [
    "Peter S. Beagle"
  ],
  "isbns": [
    "1399606972",
    "9781399606974"
  ]
}
{
  "title": "Peter S. Beagle's “The Last Unicorn”",
  "authors": [
    "Timothy S. Miller"
  ],
  "isbns": [
    "9783031534256",
    "3031534255"
  ]
}
{
  "title": "The Last Unicorn the Lost Journey",
  "authors": [
    "Peter S. Beagle"
  ],
  "isbns": [
    "1616963085",
    "9781616963088"
  ]
}
{
  "title": "The Last Unicorn",
  "authors": [
    "Peter S. Beagle"
  ],
  "isbns": [
    "1616963182",
    "9781616963187"
  ]
}

Uh-oh. Why do we have so many distinct versions of The Last Unicorn? Well, each distinct format of a work has its own ISBN (so a hardcover, paperback, and eBook all have different ISBNs), even though the text may be identical. Then different editions (for example, a new foreword for a classic novel) all have their own set of unique ISBNs. Any given book may have dozens of ISBNs, each with their own unique entry in this API. That’s not going to work well for a search function: I just want to record that I read a book, not meticulously select which version of a book I read.
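
One detail in the output is easy to miss: each entry lists two identifiers, but they appear to be the same ISBN written in its 10-digit and 13-digit forms, so every entry is a single edition rather than two. A minimal Python sketch of the standard ISBN-10 to ISBN-13 conversion makes this checkable. The function name `isbn10_to_isbn13` is made up for this example, and the sketch assumes the `978` prefix used for converted ISBN-10s.

```python
def isbn10_to_isbn13(isbn10: str) -> str:
    """Convert an ISBN-10 to ISBN-13 (hypothetical helper, not from the post)."""
    digits = isbn10.replace("-", "").strip()
    if len(digits) != 10:
        raise ValueError(f"expected 10 characters, got {len(digits)}")
    # Drop the old check digit and prepend the 978 prefix.
    body = "978" + digits[:9]
    # ISBN-13 check digit: weights alternate 1, 3 across the first 12 digits.
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(body))
    check = (10 - total % 10) % 10
    return body + str(check)


# ISBN-10 values taken from the Google Books output above.
print(isbn10_to_isbn13("0451450523"))  # 9780451450524
print(isbn10_to_isbn13("1417644931"))  # 9781417644933
```

Both results match the 13-digit values listed next to them in the output, so the real duplication is across entries, not within them.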

Works, not ISBNs

I was complaining about my situation to my partner; xe informed me that librarians think about this through the FRBR (Functional Requirements for Bibliographic Records) model. In short, there’s a distinction between the work (the book The Last Unicorn), an expression (a given edition of the book), a manifestation (a given physical format for an expression, such as paperback or hardcover), and an item (an individual object in a collection).
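
The four levels can be sketched as nested Python dataclasses. This is only an illustration of the hierarchy as described here, not a real cataloguing schema: every class and field name is an assumption, and placing the ISBN on the manifestation is a simplification. The two ISBNs come from the Google Books output, but which edition or format each one belongs to is not known from that output, so the labels are placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    """One individual copy in a collection."""
    location: str


@dataclass
class Manifestation:
    """A given format of an expression; roughly the level an ISBN identifies."""
    format: str
    isbn13: str
    items: list[Item] = field(default_factory=list)


@dataclass
class Expression:
    """A given edition of the work."""
    label: str
    manifestations: list[Manifestation] = field(default_factory=list)


@dataclass
class Work:
    """The abstract book, independent of edition or format."""
    title: str
    author: str
    expressions: list[Expression] = field(default_factory=list)

    def isbns(self) -> list[str]:
        """Collect every ISBN that ultimately points at this work."""
        return [m.isbn13 for e in self.expressions for m in e.manifestations]


# Placeholder labels and formats: only the ISBNs come from the output above.
work = Work(
    title="The Last Unicorn",
    author="Peter S. Beagle",
    expressions=[
        Expression("edition A (placeholder)",
                   [Manifestation("unknown", "9780451450524")]),
        Expression("edition B (placeholder)",
                   [Manifestation("unknown", "9780593547342")]),
    ],
)
print(work.isbns())
```

A reading log only needs to point at `Work`; the lower levels exist so that many ISBNs can resolve to the same entry.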

I’m firmly working in the realm of the abstract, so items are irrelevant to me. Google Books’s API is giving back different expressions or manifestations (I’m not entirely clear on which), but we want works. How do we get our hands on those? There are some other book database options, most notably OpenLibrary, which have a model closer to what we want. Here’s the OpenLibrary work page for The Last Unicorn, for example. But the data’s still a little messy. Peek at the search results for Hotel Iris by Yoko Ogawa; the same work is duplicated four times. I’m still exploring ways to get data as clean as GoodReads or StoryGraph, but it turns out that there’s not a high-quality open-source database of books.
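
One crude way to approximate works from edition-level records is to group them on a normalized title and first author. The sketch below is a heuristic under stated assumptions, not how OpenLibrary or any other catalogue does it: `work_key` and `group_by_work` are hypothetical helpers, the normalization rules are guesses, and the records are a subset of the Google Books output with one ISBN kept per entry.

```python
import re
from collections import defaultdict


def work_key(title: str, authors: list[str]) -> tuple[str, str]:
    """Build a rough grouping key from a title and its first listed author."""
    t = title.lower()
    t = re.sub(r"[^\w\s]", "", t)         # drop punctuation
    t = re.sub(r"^(the|a|an)\s+", "", t)  # drop a leading English article
    t = re.sub(r"\s+", " ", t).strip()    # collapse whitespace
    first_author = authors[0].lower() if authors else ""
    return (t, first_author)


def group_by_work(records: list[dict]) -> dict[tuple[str, str], list[str]]:
    """Map each candidate work key to the ISBNs of the records that share it."""
    groups: dict[tuple[str, str], list[str]] = defaultdict(list)
    for record in records:
        groups[work_key(record["title"], record["authors"])].extend(record["isbns"])
    return dict(groups)


# Subset of the Google Books results shown earlier.
records = [
    {"title": "The Last Unicorn", "authors": ["Peter S. Beagle"], "isbns": ["9780451450524"]},
    {"title": "The Last Unicorn", "authors": ["Peter S. Beagle"], "isbns": ["9780593547342"]},
    {"title": "Last Unicorn", "authors": ["Peter S. Beagle"], "isbns": ["9781399606974"]},
    {"title": "The Last Unicorn", "authors": ["Jane Elizabeth Cammack"], "isbns": ["9788853010933"]},
    {"title": "The Last Unicorn the Lost Journey", "authors": ["Peter S. Beagle"], "isbns": ["9781616963088"]},
]

groups = group_by_work(records)
for key, isbns in groups.items():
    print(key, isbns)
```

On this sample the three Beagle entries titled “The Last Unicorn” or “Last Unicorn” collapse into one group, while the entry credited to a different author and the “Lost Journey” title stay separate. A key this simple will both over-merge and under-merge on real data (translations, subtitles, co-authors), which is part of why clean work-level data is hard to come by.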

Letterboxd benefits from The Movie Database, which serves as its canonical source for films and film metadata. I would almost venture to describe Letterboxd as a commercialization of the commons, though the slick UI and social features are undeniably added value. If you want to build a similar book-focused project, it turns out that no analogue really exists. It could be a chicken-and-egg problem (I’m sure having a large, commercial service attached drives contributions to TMDB), but there’s also a difference of scale: there are around a million movies in the database today. Having played around with the data, I can say that OpenLibrary currently has more than 40 million works in its (incomplete) catalogue. The problem is at least an order of magnitude harder and has much less money behind it.

Doesn’t mean I won’t try, though; look out for a potential future blogpost!
