
Original link: https://news.ycombinator.com/item?id=40456236



Long story short: We (Dataherald) just open-sourced our entire codebase, including the core engine, the client applications that interact with it, and the backend application layer for authentication and RBAC. You can now use the full solution to build text-to-SQL into your product.

The Problem: modern LLMs write syntactically correct SQL, but they struggle with real-world relational data. This is because real-world data and schemas are messy, natural language is often ambiguous, and LLMs are not trained on your specific dataset.

Solution: The core NL-to-SQL engine in Dataherald is an LLM-based agent that uses Chain-of-Thought (CoT) reasoning and a set of tools to generate high-accuracy SQL from a given user prompt. The engine achieves this by:

- Collecting context at configuration time from the database and from sources such as data dictionaries and unstructured documents; this context is stored in a data store or a vector DB and injected into the prompt when relevant

- Allowing users to upload sample NL <> SQL pairs (golden SQL), which can be used in few-shot prompting or to fine-tune an NL-to-SQL LLM for that specific dataset

- Executing the generated SQL against the DB to retrieve a few sample rows and to recover from errors

- Using an evaluator to assign a confidence score to the generated SQL
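The four steps above can be sketched as a small Python loop. Everything here (the `GOLDEN_SQL` pairs, `build_prompt`, the toy `score_sql` evaluator, the SQLite connection) is an illustrative stand-in, not Dataherald's actual API or prompt format:

```python
"""Minimal sketch of the NL-to-SQL agent loop: few-shot prompting with
golden SQL, execution against the DB with error recovery, and a
confidence score. All names are hypothetical stand-ins for illustration."""
import re
import sqlite3

# Step 2: user-supplied "golden" NL <> SQL pairs used for few-shot prompting.
GOLDEN_SQL = [
    ("How many users signed up last week?",
     "SELECT COUNT(*) FROM users WHERE signup_date >= DATE('now', '-7 days');"),
]

def build_prompt(question: str, context: str) -> str:
    """Assemble a few-shot prompt from schema context (step 1) and golden pairs."""
    shots = "\n".join(f"Q: {q}\nSQL: {s}" for q, s in GOLDEN_SQL)
    return f"Schema:\n{context}\n\nExamples:\n{shots}\n\nQ: {question}\nSQL:"

def score_sql(sql: str) -> float:
    """Step 4, as a toy evaluator: a real one would use an LLM or execution checks."""
    return 0.9 if re.match(r"(?is)^\s*select\b", sql) else 0.1

def generate_sql(question, context, llm, db, max_retries=2):
    """Generate SQL, execute it for sample rows, retry on errors, score it."""
    prompt = build_prompt(question, context)
    for _attempt in range(max_retries + 1):
        sql = llm(prompt)  # any callable: prompt -> SQL string
        try:
            rows = db.execute(sql).fetchmany(3)  # step 3: fetch sample rows
        except sqlite3.Error as err:
            # Step 3 (error recovery): feed the DB error back to the model.
            prompt += f"\n-- previous attempt failed: {err}\nSQL:"
            continue
        return sql, rows, score_sql(sql)
    raise RuntimeError("no valid SQL after retries")
```

With an in-memory SQLite DB and a stub in place of the LLM, `generate_sql("How many users?", "users(id, signup_date)", stub, db)` returns the SQL, up to three sample rows, and the confidence score in one call.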

The repo includes four services (https://github.com/Dataherald/dataherald/tree/main/services):

1- Engine: The core service which includes the LLM agent, vector stores and DB connectors.

2- Admin Console: a Next.js front-end for configuring the engine and for observability.

3- Enterprise Backend: Wraps the core engine, adding authentication, caching, and APIs for the frontend.

4- Slackbot: Integrates Dataherald directly into your Slack workflow for on-the-fly data exploration.
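One of the things a backend layer like the Enterprise Backend can add in front of the engine is caching. A minimal sketch of prompt-keyed caching, assuming a normalize-then-hash key scheme and an in-process dict (a real deployment would likely use Redis or similar); this is an illustration, not Dataherald's implementation:

```python
"""Sketch of prompt-keyed caching wrapped around an NL-to-SQL engine.
The key scheme and cache store are assumptions for illustration."""
import hashlib

class CachingEngine:
    def __init__(self, engine):
        self.engine = engine  # any callable: question -> SQL string
        self.cache = {}       # in-process dict; swap for Redis in production

    @staticmethod
    def _key(question: str) -> str:
        # Normalize whitespace and case so trivially rephrased questions hit.
        norm = " ".join(question.lower().split())
        return hashlib.sha256(norm.encode()).hexdigest()

    def __call__(self, question: str) -> str:
        k = self._key(question)
        if k not in self.cache:
            self.cache[k] = self.engine(question)  # cache miss: call the engine
        return self.cache[k]
```

Repeated questions then skip the (slow, token-billed) LLM call entirely, which matters once the same dashboard questions get asked daily in Slack.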

Would love to hear from the community on building natural language interfaces to relational data. Is anyone live in production without a human in the loop? Any thoughts on how to improve performance without spending weeks on model training?
