Llama3 took the world by storm, outperforming GPT3.5 on almost all benchmarks and GPT4 on several. Then GPT4o came out, reclaiming the throne with its multimodal finesse. Today, we’re releasing something to change that: Llama3-V, the first-ever multimodal model built on top of Llama3. As a bonus, we train everything for under $500.
How are the benchmarks, you ask? We’ll let the tables speak for themselves. We see a 10–20% boost over LLaVA, the current SOTA and most popular model for multimodal understanding. Additionally, we fare very comparably with closed-source models roughly 100x the size on all metrics except MMMU.
• 🤗: https://huggingface.co/mustafaaljadery/llama3v/
• Github: https://github.com/mustafaaljadery/llama3v
- Model Architecture
- Training Framework
- Systems Optimizations
- Summary
The bulk of our engineering effort goes into making Llama3 understand visual information. To do so, we take an input image and embed it into a series of patch embeddings using the SigLIP model. These embeddings are then aligned with the textual tokens via a projection block, which applies two self-attention blocks to bring the visual embeddings into the same space as the textual embeddings. Finally, the visual tokens from the projection block are prepended to the textual tokens, and the joint representation is passed into Llama3 just as a normal text sequence would be.
The diagram above illustrates at a high level how everything works. Now, let’s dive into each stage in detail.
SigLIP: SigLIP (Sigmoid Loss for Language Image Pre-Training) is an image embedding model similar to CLIP, as we see in the figure below. However, unlike CLIP, which uses a contrastive loss with softmax normalization, SigLIP utilizes a pairwise sigmoid loss, which allows the model to operate independently on each image-text pair, without requiring a global view across all pairs in a batch. At a high level, SigLIP’s vision encoder splits the image into a sequence of non-overlapping image patches and projects them into a lower-dimensional linear embedding space, producing a sequence of patch embeddings. These patch embeddings then go through a transformer encoder, which applies self-attention to capture long-range dependencies and extract higher-level visual features. For our purposes, we directly use the original SigLIP model trained by Google DeepMind.
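For concreteness, here is a minimal sketch of pulling patch embeddings out of an off-the-shelf SigLIP model with Hugging Face transformers. The specific checkpoint name is our assumption for illustration, not necessarily the exact one used here.

```python
# Minimal sketch: extract SigLIP patch embeddings for one image.
import torch
from PIL import Image
from transformers import AutoProcessor, SiglipVisionModel

CHECKPOINT = "google/siglip-so400m-patch14-384"  # assumed SigLIP checkpoint

processor = AutoProcessor.from_pretrained(CHECKPOINT)
vision_encoder = SiglipVisionModel.from_pretrained(CHECKPOINT).eval()

image = Image.open("example.jpg").convert("RGB")
pixel_values = processor(images=image, return_tensors="pt").pixel_values

with torch.no_grad():
    outputs = vision_encoder(pixel_values=pixel_values)

# One embedding per non-overlapping image patch: [batch, num_patches, hidden_dim]
patch_embeddings = outputs.last_hidden_state
print(patch_embeddings.shape)
```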
Alignment with textual embeddings: To save computational resources, we keep SigLIP fixed. However, to align the output image embeddings with the textual embeddings used in Llama3, we use an extra projection module. Unlike LLaVA, which applies a single linear layer to the original image embeddings, we train two self-attention blocks to better capture patterns in the input embeddings, producing the final image embedding vector.
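A sketch of what such a projection block could look like in PyTorch; the hidden sizes, number of heads, and block internals are assumptions for illustration, not the exact configuration.

```python
# Sketch of a projection block: two self-attention blocks mapping SigLIP
# patch embeddings into Llama3's embedding space.
import torch
import torch.nn as nn

class ProjectionBlock(nn.Module):
    def __init__(self, siglip_dim=1152, llama_dim=4096, num_heads=8):
        super().__init__()
        # Lift SigLIP embeddings to Llama3's hidden size first.
        self.input_proj = nn.Linear(siglip_dim, llama_dim)
        # Two self-attention (transformer encoder) blocks over the patch tokens.
        self.blocks = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model=llama_dim, nhead=num_heads,
                                       batch_first=True)
            for _ in range(2)
        ])

    def forward(self, patch_embeddings):
        # patch_embeddings: [batch, num_patches, siglip_dim]
        x = self.input_proj(patch_embeddings)
        for block in self.blocks:
            x = block(x)
        return x  # [batch, num_patches, llama_dim] "visual tokens"
```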
Prepending image tokens: For the textual inputs, we first tokenize the text using a Byte Pair Encoding (BPE) vocabulary, producing a sequence of textual tokens. We demarcate these tokens by enclosing them within special <text> and </text> tags. As for the image embeddings from the projection block, we treat each vector as a separate “visual token” and demarcate them using <image> and </image> tags. Finally, we prepend the sequence of visual tokens to the sequence of textual tokens, forming the joint input representation that is passed into Llama3 for processing.
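Putting this together, here is a hedged sketch of how the joint input could be assembled and fed to Llama3 via input embeddings. The checkpoint name, the handling of the special tags, and the placeholder visual-token tensor are assumptions for illustration.

```python
# Sketch: prepend visual tokens to embedded text tokens and run Llama3.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

LLAMA_CHECKPOINT = "meta-llama/Meta-Llama-3-8B"  # assumed base checkpoint

tokenizer = AutoTokenizer.from_pretrained(LLAMA_CHECKPOINT)
llama = AutoModelForCausalLM.from_pretrained(LLAMA_CHECKPOINT,
                                             torch_dtype=torch.bfloat16)

# In practice the <text>/<image> tags would be registered as special tokens;
# here they are simply tokenized as plain text for illustration.
text = "<text>Describe this image.</text>"
text_ids = tokenizer(text, return_tensors="pt").input_ids
text_embeds = llama.get_input_embeddings()(text_ids)      # [1, T, 4096]

# Stand-in for the projected SigLIP patch embeddings ("visual tokens").
visual_tokens = torch.randn(1, 729, 4096, dtype=text_embeds.dtype)

# Prepend the visual tokens so the image context precedes the text.
joint_embeds = torch.cat([visual_tokens, text_embeds], dim=1)
outputs = llama(inputs_embeds=joint_embeds)
```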
Training these models is expensive. To save computational resources, we make two major optimizations: the first is a simple caching mechanism, and the second is on the MPS/MLX front.
Caching: The SigLIP model is much smaller than Llama3, so if we run everything serially, GPU utilization is very low while SigLIP is running. We also can’t push utilization up by increasing SigLIP’s batch size, since Llama3 then runs into OOM errors. Instead, we observe that the SigLIP model stays fixed throughout training, so we pre-compute the image embeddings once. Then, for both pre-training and SFT, we pass these precomputed image embeddings in directly instead of re-running the SigLIP module. Not only does this let us increase the batch size and fully utilize our GPUs while running the SigLIP module, it also saves training/inference time, since the two parts of the pipeline can run separately.
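A minimal sketch of what this caching step could look like, assuming a plain list of (image_id, PIL image) pairs; the batch size, output path, and storage format are illustrative choices, not the exact setup.

```python
# Sketch: run SigLIP once over the dataset and cache patch embeddings to disk,
# so pre-training and SFT never need to run the vision tower again.
import torch

@torch.no_grad()
def cache_image_embeddings(vision_encoder, processor, items, out_path,
                           batch_size=256, device="cuda"):
    """items: list of (image_id, PIL.Image) pairs."""
    vision_encoder = vision_encoder.to(device).eval()
    cached = {}
    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]
        ids = [image_id for image_id, _ in batch]
        images = [image for _, image in batch]
        pixel_values = processor(images=images, return_tensors="pt").pixel_values
        embeds = vision_encoder(pixel_values=pixel_values.to(device)).last_hidden_state
        for image_id, embed in zip(ids, embeds):
            cached[image_id] = embed.cpu()
    # Training later loads this file instead of re-running SigLIP.
    torch.save(cached, out_path)
```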
MPS/MLX Optimizations: Our second optimization was again driven by SigLIP’s smaller size relative to Llama3. Specifically, since SigLIP fits on our MacBooks, we ran inference with an MPS-optimized SigLIP model, which allowed us to reach a throughput of 32 images/second and kept the caching step relatively quick.
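On the Mac side, the same pass can simply be pointed at the MPS backend; the sketch below reuses the names from the SigLIP snippet above, and actual throughput will of course depend on the machine.

```python
# Sketch: run the cached SigLIP pass on Apple Silicon via the MPS backend.
import torch

device = "mps" if torch.backends.mps.is_available() else "cpu"
vision_encoder = vision_encoder.to(device).eval()

with torch.no_grad():
    patch_embeddings = vision_encoder(
        pixel_values=pixel_values.to(device)
    ).last_hidden_state
```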
Precomputing the embeddings from SigLIP: Let’s now dive into the first step of our pre-training process, precomputing the image embeddings via SigLIP. In this step, our goal is to pass images into the SigLIP embedding model to obtain a vector representation, or embedding, of each image. One technical detail: to handle higher resolutions, we follow the approach taken by LLaVA-UHD and perform image-splitting, dividing the image into variable-sized patches or segments for more efficient encoding. These split images are processed concurrently in batches.
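The exact splitting scheme follows LLaVA-UHD, which picks a variable grid per image; as a simplified illustration, here is a fixed-grid variant where the tile size and the extra global overview are assumptions.

```python
# Simplified sketch of image-splitting before SigLIP encoding.
from PIL import Image

def split_image(image: Image.Image, tile_size: int = 384):
    """Split a high-resolution image into non-overlapping tiles,
    plus a resized copy of the full image as a global view."""
    width, height = image.size
    tiles = [image.resize((tile_size, tile_size))]  # global overview
    for top in range(0, height, tile_size):
        for left in range(0, width, tile_size):
            box = (left, top,
                   min(left + tile_size, width),
                   min(top + tile_size, height))
            tiles.append(image.crop(box).resize((tile_size, tile_size)))
    return tiles  # each tile is encoded by SigLIP in the same batch
```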
Now let’s dive into how exactly we use the SigLIP embedding. We first load the SigLIP model and its processor/tokenizer, preprocess the provided input image using the processor, and pass the preprocessed image to the model. The model then outputs logits for the image-text pairs, and applying the sigmoid activation function to these logits gives the pairwise probabilities. The image embedding extracted in this forward pass is what captures the visual information in the image.
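In code, that flow might look like the following sketch using the full SigLIP model; the checkpoint, the text prompt, and which tensor is kept as the cached image embedding are our assumptions.

```python
# Sketch: full SigLIP forward pass over an image-text pair.
import torch
from PIL import Image
from transformers import AutoProcessor, SiglipModel

CHECKPOINT = "google/siglip-so400m-patch14-384"  # assumed checkpoint
processor = AutoProcessor.from_pretrained(CHECKPOINT)
model = SiglipModel.from_pretrained(CHECKPOINT).eval()

image = Image.open("example.jpg").convert("RGB")
inputs = processor(text=["a photo of a cat"], images=image,
                   padding="max_length", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Pairwise sigmoid over the logits, rather than a softmax across the batch.
probs = torch.sigmoid(outputs.logits_per_image)

# The vision tower's hidden states hold the per-patch image embeddings.
patch_embeddings = outputs.vision_model_output.last_hidden_state
```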
Following the computation of the image embedding via SigLIP, we proceed to learn a projection matrix; you can also think of this as the projection layer, which is typically a linear or feed-forward layer. As described in the architecture section above, the projection layer maps the vision embedding from its original space into a joint multimodal embedding space. Specifically, the projection layer applies a learned weight matrix W_v to the vision embedding v to get the projected multimodal vision embedding W_v * v. After this projection step, the vision and text embeddings are aligned into a common multimodal embedding space, allowing their representations to interact and be combined for multimodal tasks such as visual question answering and image captioning. The output of the projection layer is what we refer to as the “latents.”
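As a toy illustration of the shapes involved (the dimensions here are assumptions, and in our case the projection is the two-self-attention-block module described earlier rather than a single matrix):

```python
# Toy shape check for the projection step W_v * v.
import torch

num_patches, d_vision, d_llm = 729, 1152, 4096   # assumed sizes
v = torch.randn(num_patches, d_vision)            # SigLIP patch embeddings
W_v = torch.randn(d_llm, d_vision)                # learned projection matrix
latents = v @ W_v.T                               # [num_patches, 4096] "latents"
```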
Once the latents are computed, we prepend them as image tokens before the text tokens. We prepend because placing the image before the text makes it easier for the model to learn during pretraining: think of it as tokens representing the actual image followed by tokens describing the contents of the image in text, almost like an image paired with its caption. Our architecture is nearly identical to that of LLaVA-UHD (they choose CLIP-ViT while we use SigLIP, and they build on Vicuna-13B while we build on Llama3), so we provide their illustration as a reference below:
Now that we’ve established the data needed for pretraining, we can dive into what the process actually looks like. In pre-training, we use 600,000 examples of images prepended to text. In this step, we keep the main weights of the Llama3 architecture frozen and update only the gradients of the projection matrix; all other weights stay untouched. That wraps up the intuition and process for the pretraining step: align the embedded images (latents) with their text in a joint representation, then pretrain while updating only the projection matrix on the examples encountered.
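Concretely, the freeze pattern for this stage might look like the sketch below, reusing the llama and projection objects from the earlier snippets; the optimizer settings, learning rate, and dataloader format are assumptions.

```python
# Sketch of the pre-training loop: Llama3 frozen, projection block trained,
# cached SigLIP embeddings used in place of the vision tower.
import torch

for param in llama.parameters():
    param.requires_grad = False            # language model stays frozen
for param in projection.parameters():
    param.requires_grad = True             # only the projection block is trained

optimizer = torch.optim.AdamW(projection.parameters(), lr=1e-4)  # assumed lr

for batch in pretrain_loader:              # cached image embeddings + text
    visual_tokens = projection(batch["image_embeds"])              # [B, P, 4096]
    text_embeds = llama.get_input_embeddings()(batch["input_ids"])
    joint = torch.cat([visual_tokens, text_embeds], dim=1)
    # labels are assumed to mark the visual positions with -100 so that
    # only the text tokens contribute to the loss.
    loss = llama(inputs_embeds=joint, labels=batch["labels"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```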
Supervised Finetuning
Following pretraining, we perform supervised finetuning to enhance the performance of our model. In this step, the computed embeddings (from the projection layer) are kept fixed, and everything except the vision and projection matrices stays frozen: in the image below, the red components are unfrozen while the blue components are frozen. This serves as “instruction” finetuning, making the model stronger at multimodal text output. In this stage, we use 1M examples (7M split images).
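One common convention for the loss setup in this kind of multimodal finetuning, which the description above does not spell out and is therefore an assumption on our part, is to mask the visual-token positions in the labels so only the text tokens are scored. A minimal sketch:

```python
# Sketch: build a label tensor that ignores the prepended visual tokens.
import torch

IGNORE_INDEX = -100  # positions with this label are skipped by the loss

def build_labels(num_visual_tokens: int, text_ids: torch.Tensor) -> torch.Tensor:
    """text_ids: [batch, T] token ids of the instruction + response text."""
    batch_size = text_ids.shape[0]
    visual_part = torch.full((batch_size, num_visual_tokens), IGNORE_INDEX,
                             dtype=text_ids.dtype)
    return torch.cat([visual_part, text_ids], dim=1)  # lines up with the joint embeds
```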
- We add a vision encoder to Llama3 8B
- Our model offers a 10–20% performance boost over LLaVA, the current open-source SOTA vision-language model.
- We offer vision capabilities comparable to models close to 100x* larger in size, such as GPT4v, Gemini Ultra, and Claude Opus.
- We describe an efficient pipeline to pretrain and instruction-finetune the model for under $500.