Show HN: Kanon 2 Enricher – the first hierarchical graphitization model

Original link: https://isaacus.com/blog/kanon-2-enricher

## Kanon 2 Enricher: Summary

Kanon 2 Enricher is a new hierarchical graphitization model that recently completed a beta program with 102 participants from leading law firms and technology companies. Built entirely from scratch, the model uses 58 task heads and 70 loss terms to deeply analyze legal documents while strictly respecting the constraints of the Isaacus Legal Graph Schema (ILGS).

Its key innovation is an efficient "single-shot" annotation process: unlike generative models, it analyzes an entire document at once, enriching even a text as long as *Dred Scott v. Sandford* (111,267 words) in under ten seconds. This enables rapid identification of entities such as people, locations, and cited documents.

Future releases will build new applications on top of Kanon 2 Enricher, including enhanced semantic chunking, a plain-text-to-Markdown converter, and a comprehensive public knowledge graph of legal information. The model's architecture prioritizes computational efficiency and avoids the "hallucinations" common in generative AI.

## Kanon 2 Enricher: A New AI Model for Knowledge Graphs

Kanon 2 Enricher is a new AI model designed to convert document collections into structured knowledge graphs. Unlike a typical LLM (large language model), it classifies document tokens "in a single shot" rather than processing them sequentially, enabling faster processing and a lower risk of hallucination.

The model performs entity extraction, hierarchical segmentation (down to the paragraph level), and text annotation, identifying and classifying elements such as headings and tables of contents. It outputs data conforming to the Isaacus Legal Graph Schema (ILGS), using 58 task heads optimized with 70 loss terms.

Applications span legal research, financial forensics, and regulatory analysis, demonstrated through use cases such as a knowledge graph of Canadian government law and a 3D map of High Court of Australia cases.

The model is now publicly available, following closed beta testing with companies including Harvey and KPMG. The developers plan to offer self-hostable, isolated versions on the AWS and Azure Marketplaces to address data-privacy concerns. Pricing information is available on their documentation site.

Original text

In total, there were 102 participants in the Isaacus Beta Program, including Harvey, KPMG Law, Clyde & Co, Cleary Gottlieb, Alvarez & Marsal, Khaitan & Co, Gilbert + Tobin, Smokeball, Moonlit, LawY, Lawpath, UniCourt, and AccuFind. We thank each and every one of them for being amongst the first to play with Kanon 2 Enricher and for providing critical early feedback that helped improve Kanon 2 Enricher ahead of its release.

Over the coming weeks and months, we will be releasing our own applications built atop Kanon 2 Enricher, such as a new LLM-powered semantic chunking mode in semchunk, a new Python package for automatically converting plain text into Markdown, and a first-of-its-kind public knowledge graph of laws, regulations, cases, and contracts from around the world, which can then be ingested into your own systems.

As the first hierarchical graphitization model, Kanon 2 Enricher was built entirely from scratch. Every single node, edge, and label representable in the Isaacus Legal Graph Schema (ILGS) corresponds to one or more bespoke task heads. Those task heads were trained jointly, with our Kanon 2 legal encoder foundation model producing shared representations that all other heads operate on. In total, we built 58 different task heads optimized with 70 different loss terms.
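As a rough mental model of this shared-encoder, many-heads setup, consider the toy sketch below. The "encoder", head logic, and loss are made-up illustrations under our own assumptions, nothing like the real Kanon 2 internals; the point is only the shape: representations are computed once and every head reads from them, each contributing its own loss term.

```python
def shared_encoder(tokens):
    # Stand-in for the Kanon 2 encoder: one feature dict per token,
    # computed once and shared by every downstream head.
    return [{"text": t, "is_title": t.istitle(), "is_num": t.isdigit()}
            for t in tokens]

def person_head(reps):
    # Toy head: flag Title-Case tokens as candidate entity mentions.
    return [r["is_title"] for r in reps]

def citation_head(reps):
    # Toy head: flag numeric tokens as candidate citation components.
    return [r["is_num"] for r in reps]

def zero_one_loss(pred, gold):
    # Stand-in for one loss term: fraction of token-level disagreements.
    return sum(p != g for p, g in zip(pred, gold)) / len(gold)

tokens = "Dred Scott v. Sandford , 60 U.S. 393".split()
reps = shared_encoder(tokens)        # encoded once
person_pred = person_head(reps)      # each head reads the shared reps
citation_pred = citation_head(reps)
```

In joint training the per-head losses would be summed (possibly weighted) and backpropagated through the shared encoder, which is how 58 heads can be optimized with 70 loss terms against one representation.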

In designing Kanon 2 Enricher, we had to work around several hard constraints of ILGS, such as the requirement that each entity be anchored to a document through character-level spans corresponding to entity references, and that all such spans be well-nested and globally laminar within a document (i.e., no two spans in a document can partially overlap). Wherever feasible, we enforced these schematic constraints architecturally, whether by using masks or joint scoring, otherwise resorting to custom regularizing losses.
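The laminarity constraint can be checked mechanically: any two spans must be either disjoint or fully nested, never partially overlapping. A minimal sketch of such a check (our own illustration, not Isaacus's code):

```python
def is_laminar(spans):
    """Return True if no two (start, end) character spans partially overlap.

    Spans use an exclusive end. Disjoint and fully nested spans are fine;
    a span that crosses the boundary of another violates laminarity.
    """
    # Sort by start ascending, then end descending, so an enclosing span
    # is visited before anything nested inside it.
    ordered = sorted(spans, key=lambda s: (s[0], -s[1]))
    open_ends = []  # stack of end offsets of currently open spans
    for start, end in ordered:
        while open_ends and open_ends[-1] <= start:
            open_ends.pop()  # those spans closed before this one starts
        if open_ends and end > open_ends[-1]:
            return False  # crosses the boundary of an enclosing span
        open_ends.append(end)
    return True
```

For example, `is_laminar([(0, 10), (2, 5), (6, 9)])` is `True` (nested and disjoint spans), while `is_laminar([(0, 5), (3, 8)])` is `False` (partial overlap).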

One of the trickiest problems we had to tackle was hierarchical document segmentation, where every heading, reference, chapter, section, subsection, table, figure, and so on is extracted from a document in a hierarchical fashion such that segments can be contained within other segments at any arbitrary level of depth. To solve this problem, we had to implement our own novel hierarchical segmentation architecture, decoding approach, and loss function.
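One useful consequence of the laminarity constraint above is that a flat list of labeled segment spans determines the hierarchy uniquely, so a single stack-based pass can recover the tree. The sketch below is our own illustrative decoder under that assumption, not the model's actual segmentation architecture or decoding approach:

```python
from dataclasses import dataclass, field

@dataclass
class Segment:
    label: str
    start: int  # character offset, inclusive
    end: int    # character offset, exclusive
    children: list = field(default_factory=list)

def build_tree(segments):
    """Assemble flat, well-nested labeled spans into a hierarchy, making
    each segment a child of the smallest segment that contains it."""
    # Sort by start, longest first, so parents precede their children.
    segs = sorted(segments, key=lambda s: (s.start, -(s.end - s.start)))
    root = Segment("document", 0, max(s.end for s in segs))
    stack = [root]
    for seg in segs:
        # Close any segment that ends before this one starts.
        while stack[-1].end <= seg.start:
            stack.pop()
        stack[-1].children.append(seg)
        stack.append(seg)
    return root
```

Given spans for a chapter, its two sections, and a heading inside the first section, this yields chapter → sections → heading, with nesting at arbitrary depth falling out of the same loop.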

Thanks to the many architectural innovations that have gone into Kanon 2 Enricher, it is extremely computationally efficient, far more so than a generative model. Indeed, instead of generating annotations token by token, which introduces the possibility of generative hallucinations, Kanon 2 Enricher directly annotates all the tokens in a document in a single shot. Thus, it takes Kanon 2 Enricher less than ten seconds to enrich the entirety of Dred Scott v. Sandford, the longest US Supreme Court decision, containing 111,267 words in total. In that time, Kanon 2 Enricher identifies 178 people referenced in the decision some 1,340 times, 99 locations referenced 1,294 times, and 298 documents referenced 940 times.
