Nvidia's New Open-Source AI Model Beats GPT-4o On Benchmarks

原始链接: https://www.zerohedge.com/technology/nvidias-new-open-source-ai-model-beats-gpt-4o-benchmarks

Nvidia has released a new AI model called Llama-3.1-Nemotron-70B-Instruct, built on Meta's Llama-3.1-70B-Instruct. Nvidia claims Nemotron outperforms state-of-the-art models such as GPT-4o and Claude-3 in helpfulness. Nemotron reportedly scored 85 on the "Hard" test of the Chatbot Arena leaderboard, which would make it the top-scoring model there. Notably, Llama-3.1-70B is Meta's mid-tier model, suggesting Nemotron's gains come from Nvidia's optimizations. There is no single objective way to measure AI model performance, but Nemotron's results are based on comparative testing and human evaluation. Nvidia claims Nemotron improves on accuracy and helpfulness relative to other leading models.


Original Article

Authored by Tristan Greene via CoinTelegraph.com,

Nvidia unceremoniously launched a new artificial intelligence model on Oct 15 that’s purported to outperform state-of-the-art AI systems including GPT-4o and Claude-3. 

According to a post on the X.com social media platform from the Nvidia AI Developer account, the new model, dubbed Llama-3.1-Nemotron-70B-Instruct, “is a leading model” on lmarena.AI’s Chatbot Arena. 

Nvidia AI announces the benchmarks score for Nemotron. Source: Nvidia AI

Nemotron

Llama-3.1-Nemotron-70B-Instruct is, essentially, a modified version of Meta’s open-source Llama-3.1-70B-Instruct.

The “Nemotron” portion of the model’s name encapsulates Nvidia’s contribution to the end result. 

The Llama “herd” of AI models, as Meta refers to them, are meant to be used as open-source foundations for developers to build on.

In the case of Nemotron, Nvidia took up the challenge and developed a system designed to be more “helpful” than popular models such as OpenAI’s ChatGPT and Anthropic’s Claude-3. 

Nvidia used specially curated datasets, advanced fine-tuning methods, and its own state-of-the-art AI hardware to turn Meta’s vanilla model into what might be the most “helpful” AI model on the planet. 
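For readers who want to try the model themselves, a minimal sketch of loading it with the Hugging Face transformers library might look like the following. The repo id and prompt are assumptions for illustration, not details taken from Nvidia's announcement, and a 70B-parameter checkpoint requires substantial GPU memory or quantization:

```python
# Minimal sketch: loading and prompting a Nemotron-style model via the
# Hugging Face transformers library. The repo id below is an assumption;
# check Nvidia's model card for the exact name and hardware requirements.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nvidia/Llama-3.1-Nemotron-70B-Instruct-HF"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # shard weights across available GPUs
)

messages = [{"role": "user", "content": "Write a Python function that merges two sorted lists."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```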

An engineer’s post on X.com expressing excitement for Nemotron’s capabilities. Source: Shayan Taslim

“I asked it a few coding questions I usually ask to compare LLMs and got some of the best answers from this one. lol, holy shit.”

Benchmarking

When it comes to determining which AI model is “the best,” there’s no clear-cut methodology. Unlike, for example, measuring the ambient temperature with a mercury thermometer, there isn’t a single “truth” that exists when it comes to AI model performance. 

Developers and researchers have to gauge how well an AI model performs much the same way humans are evaluated: through comparative testing. 

AI benchmarking involves giving different AI models the same queries, tasks, questions, or problems and then comparing the usefulness of the results. Often, due to the subjectivity of what is and isn’t considered useful, human proctors are used to determine a machine’s performance through blind evaluations. 
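In broad strokes, arena-style leaderboards aggregate those blind, pairwise human votes into a rating. The toy sketch below illustrates the general idea with an Elo-style update; it is not LMArena's actual scoring code, and the starting ratings and K-factor are arbitrary assumptions:

```python
# Toy illustration of pairwise blind comparison feeding an Elo-like rating,
# the general mechanism behind arena-style leaderboards (not LMArena's code).
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under an Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one blind human vote."""
    exp_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b

# Example: two models start at 1200; model A wins one blind comparison.
a, b = update_elo(1200, 1200, a_won=True)
print(round(a), round(b))  # 1216 1184
```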

In Nemotron’s case, it appears that Nvidia is claiming the new model outperforms existing state-of-the-art models such as GPT-4o and Claude-3 by a fairly wide margin.

The top of the Chatbot Arena leaderboards. Source: LMArena.AI

The image above depicts the ratings on the automated “Hard” test on the Chatbot Arena Leaderboards. While Nvidia’s Llama-3.1-Nemotron-70B-Instruct doesn’t appear to be listed anywhere on the boards, if the developer’s claim that it scored an 85 on this test is valid, it would be the de facto top model in this particular section. 

What makes the achievement perhaps even more interesting is that Llama-3.1-70B is Meta’s middle-tier open-source AI model.

There’s a much larger version of Llama-3.1, the 405B version (where the number refers to how many billion parameters the model was trained with).

By comparison, GPT-4o is estimated to have been developed with over one trillion parameters.
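To give those parameter counts some scale, the back-of-the-envelope sketch below estimates the raw memory needed just to hold each model's weights at 16-bit precision (ignoring activations and other runtime overhead); the GPT-4o figure is an external estimate, not a published number:

```python
# Rough weight-storage estimates for the parameter counts mentioned above.
# GPT-4o's size is an unconfirmed estimate; all figures are approximate.
GIB = 1024 ** 3

def weight_memory_gib(params_billions: float, bytes_per_param: int = 2) -> float:
    """Approximate weight storage in GiB (2 bytes per parameter = fp16/bf16)."""
    return params_billions * 1e9 * bytes_per_param / GIB

for name, size_b in [("Llama-3.1-70B", 70), ("Llama-3.1-405B", 405), ("GPT-4o (est.)", 1000)]:
    print(f"{name:>16}: ~{weight_memory_gib(size_b):,.0f} GiB in bf16")
```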
