Google's Gemini 2.5 Tops Leaderboards

Good morning. It’s Wednesday, March 26th.

On this day in tech history: In 1976, Shugart Associates introduced the SA-801, an 8-inch floppy disk drive.

In today’s email:

  • Google’s Gemini 2.5 Tops Leaderboards

  • OpenAI’s New Image Generator Solves Text In Images

  • 4 New AI Tools

  • Latest AI Research Papers

You read. We listen. Let us know what you think by replying to this email.

Today’s trending AI news stories

Google’s Gemini 2.5 Tops Leaderboards, Supports 1M Token Inputs

Google has introduced Gemini 2.5, its most advanced reasoning model, pushing the boundaries of AI-driven problem-solving in math, coding, and multimodal analysis. The first in this new model family, Gemini 2.5 Pro Experimental, is now available in Google AI Studio and the Gemini Advanced subscription. Equipped with a 1 million-token context window—soon expanding to 2 million—the model handles vast datasets, technical documents, and full code repositories with improved reasoning capabilities. It leads in key benchmarks, scoring 68.6% on Aider Polyglot for code editing and 18.8% on Humanity’s Last Exam, though it lags behind Claude 3.7 Sonnet on SWE-bench Verified for AI-assisted software development.

At the same time, Google is rolling out real-time vision to Gemini Live, letting users point their phone cameras at objects or share on-screen content for instant AI analysis. Read more.
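
If you want to kick the tires yourself, the model can be called from Google AI Studio with a few lines of Python. The sketch below uses Google's Gen AI SDK (google-genai); the experimental model identifier is our assumption based on the naming at launch, so check AI Studio for the current ID.

```python
# Minimal sketch: prompting Gemini 2.5 Pro Experimental via the
# Google Gen AI Python SDK (pip install google-genai).
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")  # key from Google AI Studio

response = client.models.generate_content(
    model="gemini-2.5-pro-exp-03-25",  # assumed experimental model ID
    contents="Outline an approach to reviewing a large code repository "
             "that exceeds a typical context window.",
)
print(response.text)
```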

OpenAI’s New Image Generator Solves The ‘Text Problem’

OpenAI has integrated GPT-4o’s native image generation into ChatGPT, making it the default across free and paid tiers. Unlike previous DALL-E implementations, GPT-4o processes text and images together, improving spatial accuracy and object consistency. The model can handle up to 20 objects at once while maintaining spatial relationships, making it more precise in rendering text and complex scenes.

Users can refine images through conversation, leveraging in-context learning to iteratively improve results. While the system offers greater creative flexibility than DALL-E 3, OpenAI still enforces restrictions on explicit content, deepfakes, and unauthorized likenesses. All generated images include C2PA metadata for transparency.
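
The new generator lives inside ChatGPT for now, with no dedicated API endpoint announced at press time. For readers who want a programmatic starting point, OpenAI's existing Images API follows the pattern below; treat it as a sketch using the currently available dall-e-3 model, not the GPT-4o-native generator.

```python
# Sketch: generating an image with OpenAI's existing Images API.
# The GPT-4o-native generator described above is ChatGPT-only at the
# time of writing, so this example uses the dall-e-3 model instead.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

result = client.images.generate(
    model="dall-e-3",
    prompt="A hand-lettered cafe menu board listing three drinks with prices",
    size="1024x1024",
    n=1,
)
print(result.data[0].url)  # temporary URL of the generated image
```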

OpenAI has also refined ChatGPT’s Advanced Voice Mode, making conversations smoother by reducing interruptions. Free users now experience more natural dialogue, while paying subscribers gain enhanced voice interactions.

On the leadership front, OpenAI is undergoing restructuring, with CEO Sam Altman stepping back from daily operations to focus on research and product strategy. COO Brad Lightcap will now oversee operations, partnerships, and international growth.

However, OpenAI’s models are struggling with a new test of artificial general intelligence. The Arc Prize Foundation’s ARC-AGI-2 benchmark, designed to assess adaptive reasoning, has exposed significant gaps. OpenAI’s o1-pro and DeepSeek’s R1 barely scored above 1%, while human test groups averaged 60%. Even o3 (low), which previously dominated ARC-AGI-1, now manages just 4% accuracy—despite a staggering $200 per task compute cost.

The Arc Prize Foundation has launched a contest challenging developers to hit 85% accuracy on ARC-AGI-2 while keeping compute costs to roughly $0.42 per task, marking a new frontier in AI’s pursuit of true reasoning capabilities. Read more.
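
To put that challenge in perspective, here is a quick back-of-the-envelope comparison of the reported per-task costs (figures taken straight from the reporting above):

```python
# Reported per-task compute costs for ARC-AGI-2
o3_low_cost_per_task = 200.00   # USD, reported for o3 (low)
contest_cost_ceiling = 0.42     # USD per task, Arc Prize contest target

gap = o3_low_cost_per_task / contest_cost_ceiling
print(f"Contest entrants must be roughly {gap:.0f}x more cost-efficient")
# -> Contest entrants must be roughly 476x more cost-efficient
```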

4 new AI-powered tools from around the web

Latest AI Research Papers

arXiv is a free online library where researchers share pre-publication papers.

Thank you for reading today’s edition.

Your feedback is valuable. Respond to this email and tell us how you think we could add more value to this newsletter.

Interested in reaching smart readers like you? To become an AI Breakfast sponsor, reply to this email or DM us on 𝕏!