Jina AI Launches Open-Source 8K Text Embedding

Original link: https://jina.ai/news/jina-ai-launches-worlds-first-open-source-8k-text-embedding-rivaling-openai/

Jina AI has released jina-embeddings-v2, its second-generation text embedding model. With an 8K context length, it matches OpenAI's text-embedding-ada-002 in both capability and performance, making it well suited to fields such as medical research, literary analysis, financial forecasting, and conversational AI. Jina AI also offers the model in two sizes to cover both heavy-duty and lightweight applications, and plans to expand its language offerings and build an API platform. With this open-source release, the company aims to put state-of-the-art embedding technology in the hands of practitioners worldwide. For more information, contact [email protected], visit www.jina.ai, or find their offices in Berlin, Beijing, and Shenzhen.

To fine-tune a pre-trained model such as text-embedding-ada-002, or to fine-tune an LLM, you typically need a larger GPU, depending on the model architecture and the amount of data to be processed; larger datasets generally require more compute. As a rough estimate: text-embedding-ada-002 uses roughly 10 billion parameters and has been tested on an Nvidia V100 with a batch size of 16k, and in practice GPUs such as the Quadro RTX 5000, Quadro RTX 6000, or Tesla T4 can suffice. Fine-tuning an LSTM- or Transformer-based LLM requires substantial memory to hold intermediate activations during training or inference, which means a card with more memory, such as a V100 PCIe passthrough, or an A100 PCIe passthrough combined with a Quadro RTX 6000 or 5000. Some references for further insight: Intel's NLP performance benchmarks on the Broadwell architecture (https://software.intel.com/content/www/us/en/attachments/white_paper-314433.pdf) cover model size, parameter counts, and hardware accelerator requirements, and NVIDIA's guide to optimizing deep learning pipelines for its accelerators (https://developer.nvidia.com/optimize-dl-training-inference-gpus-nvidias-accelerated-compute-technology) offers recommendations for tuning training and inference on the NVIDIA hardware stack. Additionally, consider reading NLP: An Introduction to Natural Language Processing for practical advice on selecting machine learning algorithms that suit your specific task or set of tasks. Good luck!
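As a rough illustration of the sizing arithmetic above, the sketch below estimates fine-tuning memory from a parameter count using the commonly cited rule of thumb of about 16 bytes per parameter for mixed-precision training with Adam (fp16 weights and gradients plus fp32 master weights and optimizer moments), with an extra multiplier standing in for activations. The parameter counts and the overhead factor are illustrative assumptions, not figures from the article.

```python
# Back-of-envelope GPU memory estimate for full fine-tuning.
# Rule of thumb (assumption): mixed-precision Adam needs ~16 bytes/parameter
# (2 fp16 weights + 2 fp16 grads + 4 fp32 master weights + 8 fp32 Adam moments);
# activations come on top and depend on batch size and sequence length.

def finetune_memory_gb(num_params: float,
                       bytes_per_param: float = 16.0,
                       activation_overhead: float = 1.3) -> float:
    """Estimate GPU memory (GB) needed to fully fine-tune a model."""
    return num_params * bytes_per_param * activation_overhead / 1e9


if __name__ == "__main__":
    examples = {
        "~0.1B-parameter embedding model (illustrative)": 0.1e9,
        "10B-parameter model (figure quoted above, unverified)": 10e9,
    }
    for name, params in examples.items():
        print(f"{name}: ~{finetune_memory_gb(params):.0f} GB")
```

Such an estimate only bounds the optimizer and weight state; real memory use also scales with batch size, sequence length, and whether activation checkpointing is enabled.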

Original Article

Berlin, Germany - October 25, 2023 – Jina AI, the Berlin-based artificial intelligence company, is thrilled to announce the launch of its second-generation text embedding model: jina-embeddings-v2. This cutting-edge model is now the only open-source offering that supports an impressive 8K (8192 tokens) context length, putting it on par with OpenAI's proprietary model, text-embedding-ada-002, in terms of both capabilities and performance on the Massive Text Embedding Benchmark (MTEB) leaderboard.

Benchmarking Against the Best 8K Model from OpenAI

When directly compared with OpenAI's 8K model text-embedding-ada-002, jina-embeddings-v2 showcases its mettle. Below is a performance comparison table, highlighting areas where jina-embeddings-v2 particularly excels:

| Rank | Model | Model Size (GB) | Embedding Dimensions | Sequence Length | Average (56 datasets) | Classification Average (12 datasets) | Reranking Average (4 datasets) | Retrieval Average (15 datasets) | Summarization Average (1 dataset) |
|------|-------|-----------------|----------------------|-----------------|-----------------------|--------------------------------------|--------------------------------|---------------------------------|-----------------------------------|
| 15 | text-embedding-ada-002 | Unknown | 1536 | 8191 | 60.99 | 70.93 | 84.89 | 56.32 | 30.8 |
| 17 | jina-embeddings-v2-base-en | 0.27 | 768 | 8192 | 60.38 | 73.45 | 85.38 | 56.98 | 31.6 |

Notably, jina-embeddings-v2 outperforms its OpenAI counterpart in Classification Average, Reranking Average, Retrieval Average, and Summarization Average.

Features and Benefits

Jina AI’s dedication to innovation is evident in this latest offering:

  • From Scratch to Superiority: The jina-embeddings-v2 was built from the ground up. Over the last three months, the team at Jina AI engaged in intensive R&D, data collection, and tuning. The outcome is a model that marks a significant leap from its predecessor.
  • Unlocking Extended Context Potential with 8K: jina-embeddings-v2 isn’t just a technical feat; its 8K context length opens doors to new industry applications:
    • Legal Document Analysis: Ensure every detail in extensive legal texts is captured and analyzed.
    • Medical Research: Embed scientific papers holistically for advanced analytics and discovery.
    • Literary Analysis: Dive deep into long-form content, capturing nuanced thematic elements.
    • Financial Forecasting: Attain superior insights from detailed financial reports.
    • Conversational AI: Improve chatbot responses to intricate user queries.

Benchmarking shows that in several datasets, this extended context enabled jina-embeddings-v2 to outperform other leading base embedding models, emphasizing the practical advantages of longer context capabilities.

  • Size Options for Different Needs: Understanding the diverse needs of the AI community, Jina AI offers two versions of the model:
    • Base Model (0.27G) - Designed for heavy-duty tasks requiring higher accuracy, like academic research or business analytics.
    • Small Model (0.07G) - Crafted for lightweight applications such as mobile apps or devices with limited computing resources.
  • Availability: Both models are freely available for download on Hugging Face:

jinaai/jina-embeddings-v2-base-en · Hugging Face

jinaai/jina-embeddings-v2-small-en · Hugging Face
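For readers who want to try the released checkpoints, here is a minimal sketch of embedding long documents with the base model via the Hugging Face transformers library. The trust_remote_code flag and the mean-pooling step follow the model card's described usage at the time of writing; treat the exact calls as an illustration rather than official documentation.

```python
# Minimal sketch: embed long texts with the open-source base model.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "jinaai/jina-embeddings-v2-base-en"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
# trust_remote_code loads the model's custom (ALiBi-based) architecture code.
model = AutoModel.from_pretrained(MODEL_ID, trust_remote_code=True)
model.eval()


def embed(texts, max_length=8192):
    """Return one mean-pooled embedding per input text (up to 8192 tokens each)."""
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=max_length, return_tensors="pt")
    with torch.no_grad():
        token_states = model(**batch).last_hidden_state        # [B, T, 768]
    mask = batch["attention_mask"].unsqueeze(-1).float()        # [B, T, 1]
    return (token_states * mask).sum(dim=1) / mask.sum(dim=1)   # [B, 768]


docs = ["A very long legal contract ...", "A query about termination clauses"]
vecs = embed(docs)
similarity = torch.nn.functional.cosine_similarity(vecs[0], vecs[1], dim=0)
print(vecs.shape, float(similarity))
```

Swapping MODEL_ID for jinaai/jina-embeddings-v2-small-en should work the same way for lighter-weight deployments.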

In reflecting on the journey and significance of this launch, Dr. Han Xiao, CEO of Jina AI, shared his thoughts:

"In the ever-evolving world of AI, staying ahead and ensuring open access to breakthroughs is paramount. With jina-embeddings-v2, we've achieved a significant milestone. Not only have we developed the world's first open-source 8K context length model, but we have also brought it to a performance level on par with industry giants like OpenAI. Our mission at Jina AI is clear: we aim to democratize AI and empower the community with tools that were once confined to proprietary ecosystems. Today, I am proud to say, we have taken a giant leap towards that vision."

This pioneering spirit is evident in Jina AI's forward-looking plans.

A Glimpse into the Future

Jina AI is committed to remaining at the forefront of innovation in AI. Here’s what’s next on their roadmap:

  • Academic Insights: An academic paper detailing the technical intricacies and benchmarks of jina-embeddings-v2 will soon be published, allowing the AI community to gain deeper insights.
  • API Development: The team is in the advanced stages of developing an OpenAI-like embeddings API platform. This will provide users with the capability to effortlessly scale the embedding model according to their needs.
  • Language Expansion: Venturing into multilingual embeddings, Jina AI is setting its sights on launching German-English models, further expanding its repertoire.

About Jina AI GmbH:
Located at Ohlauer Str. 43 (1st floor), zone A, 10999 Berlin, Germany, Jina AI is at the vanguard of reshaping the landscape of multimodal artificial intelligence. For inquiries, please reach out at [email protected].
