The Problem: modern LLMs write syntactically correct SQL, but they struggle with real-world relational data. Real-world data and schemas are messy, natural language is often ambiguous, and LLMs are not trained on your specific dataset.
Solution: The core NL-to-SQL engine in Dataherald is an LLM-based agent that uses Chain of Thought (CoT) reasoning and a number of tools to generate high-accuracy SQL from a given user prompt. The engine achieves this by:
- Collecting context at configuration time from the database and from sources such as data dictionaries and unstructured documents; this context is stored in a data store or vector DB and injected into the prompt when relevant
- Allowing users to upload sample NL <> SQL pairs (golden SQL) that can be used in few-shot prompting or to fine-tune an NL-to-SQL LLM for that specific dataset (a few-shot sketch follows this list)
- Executing the SQL against the DB to fetch a few sample rows and to recover from execution errors (see the retry-loop sketch below)
- Using an evaluator to assign a confidence score to the generated SQL (a minimal evaluator sketch is below as well)
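To make the few-shot step concrete, here is a minimal sketch of how golden SQL pairs might be folded into a prompt. The hard-coded examples and schema stand in for what the engine would pull from its vector store at runtime; this is not Dataherald's internal code.

```python
# Minimal few-shot prompting sketch. In the real engine, the golden SQL
# examples and schema context come from a vector-store lookup; here they
# are hard-coded so the snippet runs standalone.

def build_prompt(question: str, schema: str, golden_examples: list[dict]) -> str:
    """Assemble a few-shot NL-to-SQL prompt from retrieved context."""
    shots = "\n\n".join(
        f"Question: {ex['nl']}\nSQL: {ex['sql']}" for ex in golden_examples
    )
    return (
        "You translate natural language questions into SQL.\n\n"
        f"Schema and business context:\n{schema}\n\n"
        f"Examples:\n{shots}\n\n"
        f"Question: {question}\nSQL:"
    )

golden_examples = [
    {
        "nl": "Total revenue by month in 2023",
        "sql": "SELECT strftime('%m', created_at) AS month, SUM(amount) "
               "FROM orders WHERE created_at >= '2023-01-01' GROUP BY month;",
    },
]
prompt = build_prompt(
    "Which 5 customers spent the most last quarter?",
    schema="CREATE TABLE orders (id INTEGER, customer_id INTEGER, amount REAL, created_at TEXT);",
    golden_examples=golden_examples,
)
print(prompt)
```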
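The execute-and-recover step can be sketched as a simple retry loop. Here sqlite3 stands in for the target database, and `regenerate_sql` is a hypothetical helper that asks the LLM to repair the query given the engine's error message:

```python
# Sketch of execute-and-recover: run the generated SQL, and on failure feed
# the database error back to the LLM for a corrected query.
import sqlite3

def run_with_recovery(conn: sqlite3.Connection, question: str, sql: str,
                      max_retries: int = 3) -> list[tuple]:
    """Execute generated SQL; on failure, ask the LLM to repair it."""
    for attempt in range(max_retries):
        try:
            # fetchmany keeps the sanity check cheap: a few sample rows are
            # enough to confirm the query runs and returns plausible data.
            return conn.execute(sql).fetchmany(5)
        except sqlite3.Error as err:
            # Hypothetical LLM call that rewrites the query given the error.
            sql = regenerate_sql(question, sql, str(err))
    raise RuntimeError(f"SQL still failing after {max_retries} attempts")
```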
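The evaluator can be as simple as a second LLM pass that grades the generated query. The sketch below assumes a hypothetical `call_llm` client and is not how Dataherald's evaluator is actually implemented:

```python
# Sketch of a grading-pass evaluator: ask an LLM how well the SQL answers
# the question and parse a 0-1 confidence score from its reply.

def confidence_score(question: str, sql: str, sample_rows: list[tuple]) -> float:
    """Ask a grading LLM how confident it is in the SQL (0.0-1.0)."""
    grading_prompt = (
        "On a scale from 0 to 1, how confident are you that this SQL "
        f"answers the question?\nQuestion: {question}\nSQL: {sql}\n"
        f"Sample rows: {sample_rows}\nReply with a single number."
    )
    reply = call_llm(grading_prompt)  # hypothetical LLM client call
    try:
        return max(0.0, min(1.0, float(reply.strip())))
    except ValueError:
        return 0.0  # unparseable reply -> treat as no confidence
```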
The repo includes four services (https://github.com/Dataherald/dataherald/tree/main/services):
1- Engine: the core service, which includes the LLM agent, vector stores, and DB connectors.
2- Admin Console: a NextJS front-end for configuring the engine and for observability.
3- Enterprise Backend: wraps the core engine, adding authentication, caching, and APIs for the front-end.
4- Slackbot: integrates Dataherald directly into your Slack workflow for on-the-fly data exploration.
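If you want to drive the engine programmatically rather than through the Admin Console or Slackbot, it exposes an HTTP API. The endpoint path, host, and payload shape below are illustrative assumptions, not the documented contract, so check the repo's docs for the real API:

```python
# Hypothetical example of calling the engine's HTTP API with `requests`.
import requests

resp = requests.post(
    "http://localhost/api/v1/sql-generations",  # assumed route; see repo docs
    json={
        "db_connection_id": "YOUR_CONNECTION_ID",  # placeholder ID
        "prompt": "Which 5 customers spent the most last quarter?",
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json())
```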
Would love to hear from the community on building natural language interfaces to relational data. Anyone live in production without a human in the loop? Thoughts on how to improve performance without spending weeks on model training?