Llama3 took the world by storm, outperforming GPT3.5 on almost all benchmarks and GPT4 on several. Then GPT4o came out, reclaiming the throne with its multimodal finesse. Today, we’re releasing something to change that: Llama3-V, the first-ever multimodal model built on top of Llama3. As a bonus, we train everything for under $500.
How are the benchmarks, you ask? We’ll let the tables speak for themselves. We see a 10–20% boost over LLaVA, the current SOTA and most popular model for multimodal understanding. Additionally, we fare very comparably with closed-source models roughly 100x the size on all metrics except MMMU.
• 🤗: https://huggingface.co/mustafaaljadery/llama3v/
• Github: https://github.com/mustafaaljadery/llama3v
- Model Architecture
- Training Framework
- Systems Optimizations
- Summary
The bulk of our engineering effort goes into making Llama3 understand visual information. To do so, we take an input image and embed it into a series of patch embeddings using the SigLIP model. These embeddings are then aligned with the textual tokens via a projection block, which applies two self-attention blocks to bring the visual embeddings into the same space as the textual embeddings. Finally, the visual tokens from the projection block are prepended to the textual tokens, and the joint representation is passed into Llama3 just as a normal text sequence would be.
The diagram above illustrates at a high level how everything works. Now, let’s dive into each stage in detail.
SigLIP: SigLIP (Sigmoid Loss for Language Image Pre-Training) is an image embedding model similar to CLIP, as we see in the figure below. However, unlike CLIP, which uses a contrastive loss with softmax normalization, SigLIP utilizes a pairwise sigmoid loss, which allows the model to operate independently on each image-text pair, without requiring a global view across all pairs in a batch. At a high level, SigLIP’s vision encoder splits the image into a sequence of non-overlapping image patches and projects them into a lower-dimensional linear embedding space, producing a sequence of patch embeddings. These patch embeddings then go through a transformer encoder, which applies self-attention to capture long-range dependencies and extract higher-level visual features. For our purposes, we directly use the original SigLIP model trained by Google DeepMind.
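For concreteness, here is a minimal sketch of pulling patch embeddings out of an off-the-shelf SigLIP model with Hugging Face transformers. The specific checkpoint name is our assumption for illustration, not necessarily the exact one used here.

```python
# Minimal sketch: extract SigLIP patch embeddings for one image.
import torch
from PIL import Image
from transformers import AutoProcessor, SiglipVisionModel

CHECKPOINT = "google/siglip-so400m-patch14-384"  # assumed SigLIP checkpoint

processor = AutoProcessor.from_pretrained(CHECKPOINT)
vision_encoder = SiglipVisionModel.from_pretrained(CHECKPOINT).eval()

image = Image.open("example.jpg").convert("RGB")
pixel_values = processor(images=image, return_tensors="pt").pixel_values

with torch.no_grad():
    outputs = vision_encoder(pixel_values=pixel_values)

# One embedding per non-overlapping image patch: [batch, num_patches, hidden_dim]
patch_embeddings = outputs.last_hidden_state
print(patch_embeddings.shape)
```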
Alignment with textual embeddings: To save computational resources, we keep SigLIP fixed. However, to align the output image embeddings with the textual embeddings used in Llama3, we use an extra projection module. Unlike LLaVA, which applies a single linear layer to the original image embeddings, we train two self-attention blocks to better capture patterns in the input embeddings, producing the final image embedding vector.
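A sketch of what such a projection block could look like in PyTorch; the hidden sizes, number of heads, and block internals are assumptions for illustration, not the exact configuration.

```python
# Sketch of a projection block: two self-attention blocks mapping SigLIP
# patch embeddings into Llama3's embedding space.
import torch
import torch.nn as nn

class ProjectionBlock(nn.Module):
    def __init__(self, siglip_dim=1152, llama_dim=4096, num_heads=8):
        super().__init__()
        # Lift SigLIP embeddings to Llama3's hidden size first.
        self.input_proj = nn.Linear(siglip_dim, llama_dim)
        # Two self-attention (transformer encoder) blocks over the patch tokens.
        self.blocks = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model=llama_dim, nhead=num_heads,
                                       batch_first=True)
            for _ in range(2)
        ])

    def forward(self, patch_embeddings):
        # patch_embeddings: [batch, num_patches, siglip_dim]
        x = self.input_proj(patch_embeddings)
        for block in self.blocks:
            x = block(x)
        return x  # [batch, num_patches, llama_dim] "visual tokens"
```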
Prepending image tokens: For the textual inputs, we first tokenize the text using a Byte Pair Encoding (BPE) vocabulary, producing a sequence of textual tokens. We demarcate these tokens by enclosing them within special <text> and </text> tags. As for the image embeddings from the projection block, we treat each vector as a separate “visual token” and demarcate them using <image> and </image> tags. Finally, we prepend the sequence of visual tokens to the sequence of textual tokens, forming the joint input representation that is passed into Llama3 for processing.
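Putting this together, here is a hedged sketch of how the joint input could be assembled and fed to Llama3 via input embeddings. The checkpoint name, the handling of the special tags, and the placeholder visual-token tensor are assumptions for illustration.

```python
# Sketch: prepend visual tokens to embedded text tokens and run Llama3.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

LLAMA_CHECKPOINT = "meta-llama/Meta-Llama-3-8B"  # assumed base checkpoint

tokenizer = AutoTokenizer.from_pretrained(LLAMA_CHECKPOINT)
llama = AutoModelForCausalLM.from_pretrained(LLAMA_CHECKPOINT,
                                             torch_dtype=torch.bfloat16)

# In practice the <text>/<image> tags would be registered as special tokens;
# here they are simply tokenized as plain text for illustration.
text = "<text>Describe this image.</text>"
text_ids = tokenizer(text, return_tensors="pt").input_ids
text_embeds = llama.get_input_embeddings()(text_ids)      # [1, T, 4096]

# Stand-in for the projected SigLIP patch embeddings ("visual tokens").
visual_tokens = torch.randn(1, 729, 4096, dtype=text_embeds.dtype)

# Prepend the visual tokens so the image context precedes the text.
joint_embeds = torch.cat([visual_tokens, text_embeds], dim=1)
outputs = llama(inputs_embeds=joint_embeds)
```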
Training these models is expensive. To save computational resources, we make two major optimizations: the first is a simple caching mechanism, and the second is on the MPS/MLX front.
Caching: The SigLIP model is much smaller than Llama3, so if we run everything serially, GPU utilization is very low while SigLIP is running. We also can’t push utilization up by increasing SigLIP’s batch size, since Llama3 then runs into OOM errors. Instead, we observe that the SigLIP model stays fixed throughout training, so we pre-compute the image embeddings once. Then, for both pre-training and SFT, we pass these precomputed image embeddings in directly instead of re-running the SigLIP module. Not only does this let us increase the batch size and fully utilize our GPUs while running the SigLIP module, it also saves training/inference time, since the two parts of the pipeline can run separately.
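A minimal sketch of what this caching step could look like, assuming a plain list of (image_id, PIL image) pairs; the batch size, output path, and storage format are illustrative choices, not the exact setup.

```python
# Sketch: run SigLIP once over the dataset and cache patch embeddings to disk,
# so pre-training and SFT never need to run the vision tower again.
import torch

@torch.no_grad()
def cache_image_embeddings(vision_encoder, processor, items, out_path,
                           batch_size=256, device="cuda"):
    """items: list of (image_id, PIL.Image) pairs."""
    vision_encoder = vision_encoder.to(device).eval()
    cached = {}
    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]
        ids = [image_id for image_id, _ in batch]
        images = [image for _, image in batch]
        pixel_values = processor(images=images, return_tensors="pt").pixel_values
        embeds = vision_encoder(pixel_values=pixel_values.to(device)).last_hidden_state
        for image_id, embed in zip(ids, embeds):
            cached[image_id] = embed.cpu()
    # Training later loads this file instead of re-running SigLIP.
    torch.save(cached, out_path)
```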
MPS/MLX Optimizations: Our second optimization was again driven by SigLIP’s smaller size relative to Llama3. Specifically, since SigLIP fits on our MacBooks, we ran inference with an MPS-optimized SigLIP model, which allowed us to reach a throughput of 32 images/second and kept the caching step relatively quick.
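On the Mac side, the same pass can simply be pointed at the MPS backend; the sketch below reuses the names from the SigLIP snippet above, and actual throughput will of course depend on the machine.

```python
# Sketch: run the cached SigLIP pass on Apple Silicon via the MPS backend.
import torch

device = "mps" if torch.backends.mps.is_available() else "cpu"
vision_encoder = vision_encoder.to(device).eval()

with torch.no_grad():
    patch_embeddings = vision_encoder(
        pixel_values=pixel_values.to(device)
    ).last_hidden_state
```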
Precomputing the embeddings from SigLIP: Let’s now dive into the first step of our pre-training process, precomputing the image embeddings via SigLIP. In this step, our goal is to pass images into the SigLIP embedding model to obtain a vector representation, or embedding, of each image. One technical detail: to handle higher resolutions, we follow the approach taken by LLaVA-UHD and perform image-splitting, dividing the image into variable-sized patches or segments for more efficient encoding. These split images are processed concurrently in batches.
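The exact splitting scheme follows LLaVA-UHD, which picks a variable grid per image; as a simplified illustration, here is a fixed-grid variant where the tile size and the extra global overview are assumptions.

```python
# Simplified sketch of image-splitting before SigLIP encoding.
from PIL import Image

def split_image(image: Image.Image, tile_size: int = 384):
    """Split a high-resolution image into non-overlapping tiles,
    plus a resized copy of the full image as a global view."""
    width, height = image.size
    tiles = [image.resize((tile_size, tile_size))]  # global overview
    for top in range(0, height, tile_size):
        for left in range(0, width, tile_size):
            box = (left, top,
                   min(left + tile_size, width),
                   min(top + tile_size, height))
            tiles.append(image.crop(box).resize((tile_size, tile_size)))
    return tiles  # each tile is encoded by SigLIP in the same batch
```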
Now let’s dive into how exactly we use the SigLIP embedding. We first load the SigLIP model and its processor/tokenizer, preprocess the provided input image using the processor, and pass the preprocessed image to the model. The model then outputs logits for the image-text pairs, and applying the sigmoid activation function to these logits gives the pairwise probabilities. The image embedding extracted in this forward pass is what captures the visual information in the image.
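In code, that flow might look like the following sketch using the full SigLIP model; the checkpoint, the text prompt, and which tensor is kept as the cached image embedding are our assumptions.

```python
# Sketch: full SigLIP forward pass over an image-text pair.
import torch
from PIL import Image
from transformers import AutoProcessor, SiglipModel

CHECKPOINT = "google/siglip-so400m-patch14-384"  # assumed checkpoint
processor = AutoProcessor.from_pretrained(CHECKPOINT)
model = SiglipModel.from_pretrained(CHECKPOINT).eval()

image = Image.open("example.jpg").convert("RGB")
inputs = processor(text=["a photo of a cat"], images=image,
                   padding="max_length", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Pairwise sigmoid over the logits, rather than a softmax across the batch.
probs = torch.sigmoid(outputs.logits_per_image)

# The vision tower's hidden states hold the per-patch image embeddings.
patch_embeddings = outputs.vision_model_output.last_hidden_state
```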
Following the computation of the image embedding via SigLIP, we proceed to learn a projection matrix; you can also think of this as the projection layer, which is typically a linear or feed-forward layer. As described in the architecture section above, the projection layer maps the vision embedding from its original space into a joint multimodal embedding space. Specifically, the projection layer applies a learned weight matrix W_v to the vision embedding v to get the projected multimodal vision embedding W_v * v. After this projection step, the vision and text embeddings are aligned into a common multimodal embedding space, allowing their representations to interact and be combined for multimodal tasks such as visual question answering and image captioning. The output of the projection layer is what we refer to as the “latents.”
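As a toy illustration of the shapes involved (the dimensions here are assumptions, and in our case the projection is the two-self-attention-block module described earlier rather than a single matrix):

```python
# Toy shape check for the projection step W_v * v.
import torch

num_patches, d_vision, d_llm = 729, 1152, 4096   # assumed sizes
v = torch.randn(num_patches, d_vision)            # SigLIP patch embeddings
W_v = torch.randn(d_llm, d_vision)                # learned projection matrix
latents = v @ W_v.T                               # [num_patches, 4096] "latents"
```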
Once the latents are computed, we prepend them as image tokens before the text tokens. We prepend because placing the image before the text makes it easier for the model to learn during pretraining: think of it as tokens representing the actual image followed by tokens describing the contents of the image in text, almost like an image paired with its caption. Our architecture is nearly identical to that of LLaVA-UHD (they choose CLIP-ViT while we use SigLIP, and they build on Vicuna-13B while we build on Llama3), so we provide their illustration as a reference below:
Now that we’ve established the data needed for pretraining, we can dive into what the process actually looks like. In pre-training, we use 600,000 examples of images prepended to text. In this step, we keep the main weights of the Llama3 architecture frozen and update only the gradients of the projection matrix; all other weights stay untouched. That wraps up the intuition and process for the pretraining step: align the embedded images (latents) with their text in a joint representation, then pretrain while updating only the projection matrix on the examples encountered.
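Concretely, the freeze pattern for this stage might look like the sketch below, reusing the llama and projection objects from the earlier snippets; the optimizer settings, learning rate, and dataloader format are assumptions.

```python
# Sketch of the pre-training loop: Llama3 frozen, projection block trained,
# cached SigLIP embeddings used in place of the vision tower.
import torch

for param in llama.parameters():
    param.requires_grad = False            # language model stays frozen
for param in projection.parameters():
    param.requires_grad = True             # only the projection block is trained

optimizer = torch.optim.AdamW(projection.parameters(), lr=1e-4)  # assumed lr

for batch in pretrain_loader:              # cached image embeddings + text
    visual_tokens = projection(batch["image_embeds"])              # [B, P, 4096]
    text_embeds = llama.get_input_embeddings()(batch["input_ids"])
    joint = torch.cat([visual_tokens, text_embeds], dim=1)
    # labels are assumed to mark the visual positions with -100 so that
    # only the text tokens contribute to the loss.
    loss = llama(inputs_embeds=joint, labels=batch["labels"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```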
Supervised Finetuning
Following pretraining, we perform supervised finetuning to enhance the performance of our model. In this step, the computed embeddings (from the projection layer) are kept fixed, and everything except the vision and projection matrices stays frozen: in the image below, the red components are unfrozen while the blue components are frozen. This serves as “instruction” finetuning, making the model stronger at multimodal text output. In this stage, we use 1M examples (7M split images).
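One common convention for the loss setup in this kind of multimodal finetuning, which the description above does not spell out and is therefore an assumption on our part, is to mask the visual-token positions in the labels so only the text tokens are scored. A minimal sketch:

```python
# Sketch: build a label tensor that ignores the prepended visual tokens.
import torch

IGNORE_INDEX = -100  # positions with this label are skipped by the loss

def build_labels(num_visual_tokens: int, text_ids: torch.Tensor) -> torch.Tensor:
    """text_ids: [batch, T] token ids of the instruction + response text."""
    batch_size = text_ids.shape[0]
    visual_part = torch.full((batch_size, num_visual_tokens), IGNORE_INDEX,
                             dtype=text_ids.dtype)
    return torch.cat([visual_part, text_ids], dim=1)  # lines up with the joint embeds
```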
- We add a vision encoder to Llama3 8B
- Our model offers a 10–20% performance boost over LLaVA, the current open-source SOTA vision-language model.
- We offer vision capabilities comparable to models close to 100x* larger in size, such as GPT4v, Gemini Ultra, and Claude Opus.
- We describe an efficient pipeline to pretrain and instruction-finetune the model for under $500.