Image

| Capability | Gemini Ultra (pixel only*) | Comparison score | Comparison model |
| --- | --- | --- | --- |
| Multi-discipline college-level reasoning problems | 59.4% (0-shot, pass@1) | 56.8% (0-shot, pass@1) | GPT-4V |
| Natural image understanding | 77.8% (0-shot) | 77.2% (0-shot) | GPT-4V |
| OCR on natural images | 82.3% (0-shot) | 78% (0-shot) | GPT-4V |
| Document understanding | 90.9% (0-shot) | 88.4% (0-shot) | GPT-4V (pixel only) |
| Infographic understanding | 80.3% (0-shot) | 75.1% (0-shot) | GPT-4V (pixel only) |
| Mathematical reasoning in visual contexts | 53% (0-shot) | 49.9% (0-shot) | GPT-4V |
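Video

| Capability | Gemini Ultra | Comparison score | Comparison model |
| --- | --- | --- | --- |
| English video captioning (CIDEr) | 62.7 (4-shot) | 56 (4-shot) | DeepMind Flamingo |
| Video question answering | 54.7% (0-shot) | 46.3% (0-shot) | SeViLA |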
Audio

| Capability | Gemini Pro | Comparison score | Comparison model |
| --- | --- | --- | --- |
| Automatic speech translation (BLEU score) | 40.1 | 29.1 | Whisper v2 |
| Automatic speech recognition (word error rate, lower is better) | 7.6% | 17.6% | Whisper v3 |
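
For readers unfamiliar with the speech recognition metric, word error rate is commonly computed as the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the model output, divided by the number of reference words, which is why lower is better. The sketch below is a minimal illustration under that standard definition; the function name `word_error_rate` and the whitespace tokenization are assumptions made here, not the evaluation code behind the figures above, which may normalize text differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Illustrative only: tokenizes on whitespace and applies no text
    normalization (casing, punctuation), which real evaluations often do.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

Because insertions count as errors, this ratio can exceed 1.0 when the hypothesis is much longer than the reference.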