The Semantic Web is now widely adopted

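The Semantic Web refers to machine-readable websites that make use of structured data. The concept, often called Web 3.0, is now popular and widely accepted, which makes it the present state of the web rather than its future. Unlike traditional websites, semantically rich content includes additional information about itself in a format that machines can easily understand. This structure is usually defined with JSON-LD (JavaScript Object Notation for Linked Data), which is plain JSON placed in a dedicated script element in the head of an HTML file to convey contextual details. Common metadata elements include the title, description, author, publication date and image. Including this data brings benefits such as richer social media previews, more relevant search results and greater visibility. Although it is not always read by humans, automated tools such as bots process this metadata frequently, which leads to more link clicks and more effective web crawling. Businesses, developers and individuals should therefore invest in adopting Semantic Web technologies to improve online engagement, discoverability and overall success.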

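Since its introduction, the Semantic Web has struggled to gain traction because of several factors, including bad luck and an incomplete vision. Providing better search metadata is not a "killer app", but neither is it the ultimate purpose of the Semantic Web. Rather, the Semantic Web aims to organise the vast amount of information on the web into structured formats that machines can understand and process. However, current approaches fall short on ease of use and lack concrete utility for everyday users. One challenge is the complexity of the technical tools needed to implement it, such as writing ontologies, querying with SPARQL and understanding reasoning systems. These tools are hard for most people to master and require a deep understanding of computer science principles. Another problem is that the Semantic Web has fragmented into various disconnected efforts, with no central organisation guiding development. Despite these challenges, the Semantic Web remains a promising direction for building an intelligent web that integrates seamlessly with AI systems. Key areas for improvement include developing more intuitive interfaces for ordinary users, easing the burden of creating and maintaining ontologies, and fostering greater collaboration between researchers and developers. In short, the Semantic Web offers great potential for improving the capability and usability of the modern web, but sustained research and innovation are needed to overcome the existing challenges and realise that potential.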

Original text

The Semantic Web is the old Web 3.0. Before "Web 3.0" meant crypto-whatnot, it meant "machine-readable websites".

I thought this concept hadn't really gone anywhere but it turns out that The Semantic Web (best read in a very deep voice) is now very widely adopted.

The Semantic Web is so widely adopted in fact that I think it's fair to say that we're already on Web 3.0. It's not the future, it's the present. I suppose that means the blockchain crowd will need to argue their case to get all that crypto stuff into the next major version. Good luck to them, really.

If Web 3.0 is already here, where is it, then? Mostly, it's hidden in the markup.

JSON-LD for a `BlogPosting`

A worked example might help. Here's how Semantic Web metadata is added for a blog post, for example the one you're reading now.

You include a special <script type="application/ld+json"> element inside the <head> of an HTML page. Inside that element is some JSON-LD.

JSON-LD (JSON "for Linked Data") is the leading format for encoding Semantic Web metadata. It's just JSON, really. There are schemas that lay out the "types" you can use to describe the stuff on your page. For a blog post, the right type is BlogPosting. Here's a (mildly truncated) example of a BlogPosting:

{
    "@context": "https://schema.org",
    "@type": "BlogPosting",
    "headline": "From Shell to Excel - with a little bit of HTTPS",
    "url": "https://csvbase.com/blog/10",
    "description": "Write once, read everywhere",
    "author": {
        "@type": "Person",
        "name": "Cal Paterson",
        "email": "[email protected]",
        "url": "https://calpaterson.com/about.html"
    },
    "image": "https://csvbase.com/blog-static/excel.png",
    "datePublished": "2024-08-12",
    "dateCreated": "2024-08-12",
    "dateModified": "2024-08-12"
}

It's all pretty self-explanatory:

- keys starting with @ are metadata (meta-metadata?)
    - @context is the namespace. Usually it's schema.org
    - @type is the class, BlogPosting
- the other keys are just things allowed in the BlogPosting type
- the values of keys can be other types, eg Person for the author key in this case
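
To make the embedding step concrete, here is a minimal Python sketch of how a site might turn a metadata dict into the script element described above. The function name `render_json_ld` and the post details are made up for illustration; this is not how csvbase itself does it, just one plausible approach.

```python
import json


def render_json_ld(metadata: dict) -> str:
    """Serialise a metadata dict into a JSON-LD <script> element."""
    body = json.dumps(metadata, indent=4)
    # Escape "</" so a value containing "</script>" cannot close the
    # element early. "\/" is a valid JSON escape for "/".
    body = body.replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{body}\n</script>'


# Hypothetical post details, for illustration only.
blog_posting = {
    "@context": "https://schema.org",
    "@type": "BlogPosting",
    "headline": "An example post",
    "url": "https://example.com/blog/1",
    "author": {"@type": "Person", "name": "Example Author"},
    "datePublished": "2024-01-01",
}

snippet = render_json_ld(blog_posting)
print(snippet)
```

The resulting string would go inside the page's <head>. Generating it from the same data that renders the visible page is one way to keep the two from drifting apart.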

What's in it for me?

Does anyone actually read all that stuff? Yes, a lot of bots are out there parsing JSON-LD metadata. Nice things tend to happen to blog posts that include the Semantic Web metadata:

Social media sites (Twitter/Discord/Facebook/WhatsApp/etc) start showing that nice link preview with an image for your links.

People do click on these link previews more often than on "bare urls", which is something that drives adoption of the Semantic Web.

Search engine web crawlers also make use of this metadata both to crawl a bit more intelligently but also to show more information in search results (for example: who the author is). Again, people are more likely to click on such "enhanced" search results, which is another thing that has driven adoption.

Automated link aggregators show your post to their users. I once woke to many many nice emails from people who normally wouldn't have read my blog. After being initially puzzled I eventually learned that Android had decided, all on its own, to show my blog post on its on-phone news screen. All possible because of Semantic Web metadata.

Many other, usually nice, things can happen. The joy of The Semantic Web is that it is permissionless as well as "vendor-neutral". Anyone can take the metadata and run with it. That's cool, that's the web spirit.
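
As a rough illustration of how little a consumer needs in order to "take the metadata and run with it", here is a sketch of a JSON-LD extractor using only the Python standard library. The class and function names are invented for this example, and real crawlers are certainly more robust than this.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed contents of every JSON-LD <script> element."""

    def __init__(self):
        super().__init__()
        self._in_json_ld = False
        self._chunks = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_json_ld = True
            self._chunks = []

    def handle_data(self, data):
        if self._in_json_ld:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self._in_json_ld = False
            try:
                self.documents.append(json.loads("".join(self._chunks)))
            except json.JSONDecodeError:
                pass  # skip malformed metadata rather than fail the whole page


def extract_json_ld(html: str) -> list:
    """Return every JSON-LD document found in an HTML string."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.documents


# A made-up page, for illustration only.
page = """<html><head>
<title>Example</title>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "BlogPosting", "headline": "An example post"}
</script>
</head><body><p>Hello</p></body></html>"""

for doc in extract_json_ld(page):
    print(doc["@type"], "-", doc.get("headline"))
```

No permission, API key or vendor SDK is involved: anything that can fetch a page and parse JSON can read the metadata.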

Is it really hard?

Not really, no. A bit of JSON, containing things already on the page anyway, just laid out in a way for computers to read. If people can write enormous frontend apps I think they can figure out JSON-LD.

JSON-LD for other stuff

There are other JSON-LD "types" than just BlogPosting. A lot more. Some well supported ones:

csvbase is a website full of table data (tinder-for-dogs style pitch: "Github, for data tables"), so it uses the Dataset type to describe each of its tables in a machine-readable way. Here's an example of the JSON-LD csvbase includes on each table page:

{
    "@context": [
        "https://schema.org",
        {
            "csvw": "https://www.w3.org/ns/csvw#"
        }
    ],
    "@type": "Dataset",
    "name": "stock-exchanges",
    "url": "https://csvbase.com/meripaterson/stock-exchanges",
    "isAccessibleForFree": true,
    "distribution": [
        {
            "@type": "DataDownload",
            "contentUrl": "https://csvbase.com/meripaterson/stock-exchanges.csv",
            "encodingFormat": "text/csv",
            "contentSize": "16222"
        },
        {
            "@type": "DataDownload",
            "contentUrl": "https://csvbase.com/meripaterson/stock-exchanges.parquet",
            "encodingFormat": "application/parquet",
            "contentSize": "10751"
        },
        {
            "@type": "DataDownload",
            "contentUrl": "https://csvbase.com/meripaterson/stock-exchanges.xlsx",
            "encodingFormat": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
            "contentSize": "15500"
        },
        {
            "@type": "DataDownload",
            "contentUrl": "https://csvbase.com/meripaterson/stock-exchanges.jsonl",
            "encodingFormat": "application/x-jsonlines",
            "contentSize": "38627"
        }
    ],
    "dateCreated": "2022-04-25T13:43:24.746075+01:00",
    "dateModified": "2023-04-02T20:27:33.255648+01:00",
    "maintainer": {
        "@type": "Person",
        "name": "meripaterson",
        "url": "https://csvbase.com/meripaterson"
    },
    "description": "The world's stock exchanges...",
    "mainEntity": {
        "@type": "csvw:Table",
        "csvw:tableSchema": {
            "csvw:columns": [
                {
                    "csvw:name": "csvbase_row_id",
                    "csvw:datatype": "integer"
                },
                {
                    "csvw:name": "Continent",
                    "csvw:datatype": "string"
                },
                {
                    "csvw:name": "Country",
                    "csvw:datatype": "string"
                },
                {
                    "csvw:name": "Name",
                    "csvw:datatype": "string"
                },
                {
                    "csvw:name": "MIC",
                    "csvw:datatype": "string"
                },
                {
                    "csvw:name": "Last changed",
                    "csvw:datatype": "date"
                }
            ]
        }
    }
}

There's a lot of stuff in there. From basic things like the last time the data was changed (personal bugbear of mine) to some more complicated stuff.

Several DataDownload objects are included, to help bots understand where people can go to just download the data, without having to visit the site. Hopefully that's nice for people viewing tables through data-specific search engines, like Google's Dataset Search. Save them a click.
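
A consumer might use those DataDownload objects along these lines. This is only a sketch: `find_download` is a hypothetical helper, and the dict is a trimmed copy of the example above rather than anything fetched live.

```python
def find_download(dataset: dict, encoding_format: str):
    """Return the contentUrl of the first download in the wanted format, or None."""
    distributions = dataset.get("distribution", [])
    # A property may hold a single object instead of a list; normalise to a list.
    if isinstance(distributions, dict):
        distributions = [distributions]
    for dist in distributions:
        if (
            dist.get("@type") == "DataDownload"
            and dist.get("encodingFormat") == encoding_format
        ):
            return dist.get("contentUrl")
    return None


# Trimmed copy of the Dataset example above.
dataset = {
    "@type": "Dataset",
    "name": "stock-exchanges",
    "distribution": [
        {
            "@type": "DataDownload",
            "contentUrl": "https://csvbase.com/meripaterson/stock-exchanges.csv",
            "encodingFormat": "text/csv",
        },
        {
            "@type": "DataDownload",
            "contentUrl": "https://csvbase.com/meripaterson/stock-exchanges.parquet",
            "encodingFormat": "application/parquet",
        },
    ],
}

print(find_download(dataset, "application/parquet"))
```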

There's also some more advanced stuff. The JSON-LD document actually includes two schemas. One is the ordinary "schema.org" one (Dataset) but the other is a schema called "CSV on the Web", which allows you to describe the columns and the types of those columns --- as well as other more detailed data I haven't sorted out yet.
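
Reading the column descriptions back out could look something like the sketch below. It assumes the keys carry the literal "csvw:" prefix exactly as in the example above; a proper JSON-LD processor would expand the context instead of relying on the prefix. The helper name `column_types` is made up.

```python
def column_types(dataset: dict) -> dict:
    """Map column name to datatype, using the csvw:-prefixed keys shown above."""
    table = dataset.get("mainEntity", {})
    if table.get("@type") != "csvw:Table":
        return {}
    columns = table.get("csvw:tableSchema", {}).get("csvw:columns", [])
    return {col["csvw:name"]: col["csvw:datatype"] for col in columns}


# Trimmed copy of the Dataset example above (two of the six columns).
dataset = {
    "@type": "Dataset",
    "name": "stock-exchanges",
    "mainEntity": {
        "@type": "csvw:Table",
        "csvw:tableSchema": {
            "csvw:columns": [
                {"csvw:name": "csvbase_row_id", "csvw:datatype": "integer"},
                {"csvw:name": "Last changed", "csvw:datatype": "date"},
            ]
        },
    },
}

for name, datatype in column_types(dataset).items():
    print(f"{name}: {datatype}")
```

With this, a bot could know the shape of a table before downloading a single row.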

I have no direct evidence that anyone is doing much with these CSVW types yet, but Google do document that they read them, so perhaps there is more to come on this from them in future. And of course, there is nothing to stop anyone else using it. It's all standardised stuff.

Is it really necessary? "Doesn't AI solve this?"

It would of course be possible to sic Chatty-Jeeps on the raw markup and have it extract all of this stuff automatically. But there are some good reasons why not.

The first is that large language models (LLMs) routinely get stuff wrong. If you want bots to get it right, provide the metadata to ensure that they do.

The second is that requiring an LLM to read the web is thoroughly disproportionate and exclusionary. Everyone parsing the web would need to be paying for pricy GPU time to parse out the meaning of the web. It would feel bizarre if "technological progress" meant that fat GPUs were required for computers to read web pages.

Alternatives

JSON-LD is not the only form of Semantic Web metadata, though it's increasingly got the most momentum. There is some overlap between the various schemes and many bots will parse whatever they can find. Here's a quick overview of the alternatives:

At some point Facebook created the Open Graph Protocol, a standard based on using special <meta> tags. It's widely used and supported but only covers the kinds of things you might post to Facebook: articles, songs, videos, etc. What zoomers call "Content".
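
For comparison with JSON-LD, here is a small sketch that generates Open Graph tags. The `og:` property names used (title, type, url, image) are the basic ones from the Open Graph Protocol; the helper name and the page details are hypothetical.

```python
from html import escape


def open_graph_tags(properties: dict) -> str:
    """Render a dict such as {"title": ...} as Open Graph <meta> tags."""
    lines = []
    for name, value in properties.items():
        # escape() also escapes quotes, so values are safe inside attributes.
        lines.append(
            f'<meta property="og:{escape(name)}" content="{escape(value)}" />'
        )
    return "\n".join(lines)


# Hypothetical page details, for illustration only.
tags = open_graph_tags(
    {
        "title": 'An "example" post',
        "type": "article",
        "url": "https://example.com/blog/1",
        "image": "https://example.com/static/preview.png",
    }
)
print(tags)
```

Where JSON-LD nests typed objects inside one another, Open Graph is a flat list of key-value pairs, which is part of why it only stretches to cover "Content".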

There is also "microdata". It's very simple but I think quite hard to parse out. Bits of it are well supported.

Twitter published a spec for "Twitter Cards" which pretty much just lets you describe how you want your thing to appear on Twitter. It's pretty limited - but Twitter is popular - so this is widely implemented.

Before JSON-LD there was a nest of other, more XMLy, standards emitted by the various web steering groups. These actually have very, very deep support in many places (for example in library and archival systems) but on the open web they are not a goer.

Boring technology

It is surprising how low-key The Semantic Web is. Thousands of sites are already setting this metadata. The Semantic Web is already widespread, it's just that there was no moment of victory.

Notes

Googlers, if you're reading this, JSON-LD could have the same level of public awareness as RSS if only you could release, and then shut down, some kind of app or service in this area. Please, for the good of the web: consider it.

Semantic Web information on websites is a bit of a "living document". You tend to publish something, then have a look to see how people have parsed (or failed to parse) it and then you try to improve it a bit. If you have ideas for ways that csvbase can improve its metadata, please open an issue on the github repo.