Multidisciplinary university-level reasoning problems

Post by **Rina7RS** » Wed Feb 05, 2025 10:11 am

| Capability | Benchmark | Description | Gemini | GPT-4 |
| --- | --- | --- | --- | --- |
| | DROP | Reading comprehension (F1 score) | 82.4 (variable shots) | 80.9 (3-shot, reported) |
| | HellaSwag | Commonsense reasoning for everyday tasks | 87.8% (10-shot*) | 95.3% (10-shot*, reported) |
| Math | GSM8K | Basic arithmetic (including grade-school math problems) | 94.4% (maj1@32) | 92.0% (5-shot, reported) |
| | MATH | Challenging math problems (including algebra, geometry, pre-calculus, etc.) | 53.2% (4-shot) | 52.9% (4-shot, API) |
| Coding | HumanEval | Python code generation | 74.4% (0-shot (IT)*) | 67.0% (0-shot*, reported) |
| | Natural2Code | Python code generation on a new HumanEval-like dataset that has not leaked online | 74.9% (0-shot) | 73.9% (0-shot, API) |

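
For readers unfamiliar with the scoring conventions in the rows above, here is a rough Python sketch of two of them: a token-overlap F1 (the kind of score reported for DROP) and majority voting over several sampled answers (the idea behind the 32-sample GSM8K setting). Both functions are simplified illustrations with hypothetical names (`token_f1`, `majority_vote`); they are not the official DROP scorer or the evaluation code behind the reported numbers.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Simplified SQuAD-style token-overlap F1 (assumption: whitespace tokens,
    lowercased; the official DROP scorer does more normalization)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def majority_vote(answers: list[str]) -> str:
    """Return the most frequent answer among the sampled answers.
    Ties go to the answer that appeared first."""
    return Counter(answers).most_common(1)[0][0]
```

With 32 sampled answers per question, `majority_vote` would pick the single answer that is then compared against the reference, which is why a majority-vote score is not directly comparable to a plain 5-shot score.
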
Multimodal performance
Gemini is a natively multimodal model that can transform any type of input into any type of output. For example, Gemini can generate code based on different inputs.

| Capability | Benchmark | Description | Gemini | GPT-4V |
| --- | --- | --- | --- | --- |
| Image | MMMU | | 59.4% (0-shot), Gemini Ultra (pixel only)* | 56.8% (0-shot), OCR+PA |
| | VQAv2 | Natural image understanding | 77.8% (0-shot), Gemini Ultra (pixel only)* | 77.2% (0-shot), OCR+PA |
| | TextVQA | OCR on natural images | 82.3% (0-shot), Gemini Ultra (pixel only)* | 78.0% (0-shot), OCR+PA |
| | DocVQA | Document understanding | 90.9% (0-shot), Gemini Ultra (pixel only)* | 88.4% (0-shot), OCR+PA |
| | Infographic VQA | Infographic understanding | 80.3% (0-shot), Gemini Ultra (pixel only)* | 75.1% (0-shot), OCR+PA |
| | MathVista | Mathematical reasoning in a visual context | 53.0% (0-shot), Gemini Ultra (pixel only)* | 49.9% (0-shot), OCR+PA |
| Video | VATEX | English video captioning (CIDEr) | 62.7, Gemini Ultra | 56.0, DeepMind Flamingo |