
Original link: https://news.ycombinator.com/item?id=40441945

Original text
Hey HN, we've just finished building a dynamic router for LLMs, which takes each prompt and sends it to the most appropriate model and provider. We'd love to know what you think!

Here is a quick(ish) screen recording explaining how it works: https://youtu.be/ZpY6SIkBosE

Best results when training a custom router on your own prompt data: https://youtu.be/9JYqNbIEac0

The router balances user preferences for quality, speed and cost. The end result is higher quality and faster LLM responses at lower cost.

The quality for each candidate LLM is predicted ahead of time using a neural scoring function, which is a BERT-like architecture conditioned on the prompt and a latent representation of the LLM being scored. The different LLMs are queried across the batch dimension, with the neural scoring architecture taking a single latent representation of the LLM as input per forward pass. This makes the scoring function very modular to query for different LLM combinations. It is trained in a supervised manner on several open LLM datasets, using GPT4 as a judge. The cost and speed data is taken from our live benchmarks, updated every few hours across all continents. The final "loss function" is a linear combination of quality, cost, inter-token-latency and time-to-first-token, with the user effectively scaling the weighting factors of this linear combination.
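
For intuition, here is a minimal Python sketch of what such a user-weighted linear combination over candidate models could look like. The metric values, the min-max normalization, and the default weights are illustrative assumptions, not the actual implementation (which predicts the quality term per-prompt with the neural scorer described above):

    import numpy as np

    # Illustrative per-model metrics. In the real system, quality is predicted
    # per-prompt by the neural scoring function and the cost/latency figures
    # come from live benchmarks; the numbers here are made up for demonstration.
    candidates = {
        #                quality  cost($/1M)  ITL(ms)  TTFT(ms)
        "large-model":  (0.95,    30.0,       35.0,    600.0),
        "medium-model": (0.82,     0.6,       15.0,    250.0),
        "small-model":  (0.70,     0.2,       10.0,    180.0),
    }

    def route(candidates, w_quality=2.0, w_cost=0.5, w_itl=0.2, w_ttft=0.2):
        """Pick the model maximizing a linear combination of the four metrics:
        quality is rewarded; cost, inter-token latency and time-to-first-token
        are penalized. Columns are min-max normalized so the user-supplied
        weights are comparable across units (an assumed normalization choice).
        """
        names = list(candidates)
        m = np.array([candidates[n] for n in names], dtype=float)
        span = m.max(axis=0) - m.min(axis=0)
        norm = (m - m.min(axis=0)) / np.where(span == 0, 1.0, span)
        scores = (w_quality * norm[:, 0]
                  - w_cost * norm[:, 1]
                  - w_itl * norm[:, 2]
                  - w_ttft * norm[:, 3])
        return names[int(scores.argmax())]

    print(route(candidates))                             # -> "large-model"
    print(route(candidates, w_quality=0.3, w_cost=2.0))  # -> "medium-model"

Raising w_cost relative to w_quality pushes routing toward cheaper models; the post describes exposing exactly these weighting factors to the user.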

Smaller LLMs are often good enough for simple prompts, but knowing exactly how and when they might break is difficult. Simple perturbations of the phrasing can cause smaller LLMs to fail catastrophically, making them hard to rely on. For example, Gemma-7B converts numbers to strings and returns the "largest" string when asking for the "largest" number in a set, but works fine when asking for the "highest" or "maximum".
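
That specific failure is the classic lexicographic-vs-numeric comparison trap, easy to reproduce in plain Python:

    numbers = [9, 10, 2]

    print(max(numbers))                  # 10  (numeric comparison)
    print(max(str(n) for n in numbers))  # '9' (lexicographic: '9' > '1')

A model that silently casts the numbers to strings makes exactly this mistake.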

The router is able to learn these quirky distributions, and ensure that the smaller, cheaper and faster LLMs are only used when there is high confidence that they will get the answer correct.

Pricing-wise, we charge the same rates as the backend providers we route to, without taking any margins. We also give $50 in free credits to all new signups.

The router can be used off-the-shelf, or it can be trained directly on your own data for improved performance.
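
As a rough usage sketch, assuming the router is exposed behind an OpenAI-compatible chat endpoint (the post doesn't specify the client API, so the base URL and model name below are hypothetical placeholders):

    from openai import OpenAI

    # Hypothetical endpoint and model string -- placeholders, not the real API.
    client = OpenAI(
        base_url="https://router.example.com/v1",
        api_key="YOUR_ROUTER_KEY",
    )

    resp = client.chat.completions.create(
        model="router",  # the router, not the caller, picks the backend LLM
        messages=[{"role": "user",
                   "content": "What is the largest number in {9, 10, 2}?"}],
    )
    print(resp.choices[0].message.content)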

What do people think? Could this be useful?

Feedback of all kinds is welcome!
