Gemini 3 Leak?

Good morning. It’s Monday, October 13th.

On this day in tech history: In 2010, Google announced its acquisition of BlindType, a tiny startup working on machine-learning–based text-entry prediction for touchscreens. The tech became part of what evolved into Gboard’s autocorrection and swipe-typing models. Instead of forcing users to tap perfectly on tiny keys, BlindType’s system guessed the intended letters based on patterns and context, even if the touches were way off.

In today’s email:

  • Gemini 3 leak

  • OpenAI Takes Heat

  • Anthropic’s battle with poison data

  • 5 New AI Tools

  • Latest AI Research Papers

You read. We listen. Let us know what you think by replying to this email.

In partnership with Wall Street Prep

8 Weeks. Actionable AI Skills. MBA-Style Networking.

  • Build AI confidence with role-specific use cases

  • Learn how leaders are implementing AI strategies at top financial firms

  • Secure a lasting network that supports your career growth

Earn your certificate from Columbia Business School Executive Education—program starts November 10.

Enroll by Oct. 13 to get $200 off tuition + use code AIBREAKFAST for an additional $300 off.

Thank you for supporting our sponsors!

Today’s trending AI news stories

Gemini 3 leak gains steam with caveats, as DeepMind drops ‘Vibe Checker’

A leaked memo circling Reddit and 𝕏 now pegs Gemini 3 for an October 22 launch, bumping past the earlier October 9 rumor. The doc claims upgrades in multimodal reasoning, latency, inference cost, and even original music generation. Early testers say it is already edging out Gemini 2.5 and Anthropic’s Sonnet 4.5 on coding and SVG work. The drop may also bundle Veo 3.1 and a Nano Banana variant of Gemini 3 Pro. On the interface side, Google is testing a “My Stuff” asset hub, browser-level Agent Mode, and a refreshed take on “connected apps.” None of it is confirmed, so take it with a grain of salt.

Meanwhile, DeepMind and several U.S. researchers are pushing back on pass@k benchmarks that only verify whether code runs and ignore what developers actually scrutinize: style, docstrings, API limits, and error handling. Their response is Vibe Checker, powered by VeriCode, a curated set of 30 rule types pulled from more than 800 Ruff linter checks and paired with deterministic verifiers. They also turned BigCodeBench and LiveCodeBench into BigVibeBench and LiveVibeBench, now covering more than 2,100 tasks.

Both methods test for functional correctness and instruction following. | Image: Zhong et al.

When they ran 31 leading models, pass@1 scores fell about 6 percent with only five added instructions. Once three or more constraints were introduced, none of the models broke 50 percent. Comparing results with more than 800,000 human ratings confirmed what no benchmark had quantified until now: accuracy plus instruction-following tracks real developer preference far better than any single metric in circulation.

Image: Google DeepMind

Google is also touting a new flex: 1.3 quadrillion tokens processed per month. The total is up 320 trillion since June, but the spike traces back to heavier models, not user growth. Gemini 2.5 Flash alone consumes 17 times more tokens per query and can cost 150 times more to run. Even a trivial prompt can spin into multiple reasoning passes, and multimodal inference inflates the count further.

The number signals infrastructure strain, not adoption. It also clashes with Google’s sustainability pitch of 0.24 watt-hours per Gemini request, a figure that only fits lightweight text use and ignores video workloads, agent chaining, and long-context reasoning. Impressive on a slide, less so on a utility bill. Read more.

OpenAI is getting squeezed from multiple fronts. In New York, the company is staring down a potential multibillion-dollar copyright fight after authors and publishers uncovered internal emails about scrubbing a dataset packed with pirated books. They’re now asking the court to force disclosure of OpenAI’s legal communications, arguing the company knew what it was doing and may have destroyed evidence. If a judge agrees, damages could explode past a billion dollars, especially after Anthropic already paid $1.5 billion to make a similar lawsuit go away. Insurers are reportedly balking at underwriting either company.

At the same time, OpenAI is alienating the very policy advocates it claims to collaborate with. Encode and The Midas Project, both tiny nonprofits that backed California’s new AI transparency law, SB 53, say OpenAI sent sheriffs to serve subpoenas demanding their private emails with lawmakers, journalists, students, and former OpenAI staff. The company insists it’s all part of its lawsuit against Elon Musk and aimed at sniffing out undisclosed backing. Both groups say they’ve never taken Musk’s money and view the move as legal intimidation timed to ongoing reviews of OpenAI’s $500 billion reorganization.

To blunt growing distrust, the company is also trying to show progress on AI bias. In a new internal audit, OpenAI stress-tested GPT-4o, OpenAI o3, and the newer GPT-5 models with 500 prompts across hot-button political topics, from immigration to reproductive rights. Another model judged the responses using rules against emotive escalation, one-sided framing, personal opinions, and dismissive language.

OpenAI tested ChatGPT’s objectivity in responding to prompts about divisive topics from varying political perspectives. Image screenshot: OpenAI

The company claims GPT-5 instant and GPT-5 thinking cut biased replies by 30 percent compared to older models and were harder to push off balance with slanted prompts. Most failures still emerged under aggressively liberal framing. Read more.

Anthropic battles poisoned data, branding blitz, and legal bills

In collaboration with the UK AI Security Institute and the Alan Turing Institute, the company showed that just 250 poisoned documents, 0.00016 percent of a training corpus, can reliably backdoor large language models from 600 million to 13 billion parameters. Across 72 models, a trigger word, “SUDO,” caused the model to output gibberish. Fewer samples failed; more offered no additional effect, revealing a threshold effect rather than proportional scaling. While low-risk, the results underscore how even minimal data contamination can silently alter model behavior.

Image: Anthropic

Concurrently, Anthropic is accelerating its consumer push. Its New York “Zero Slop Zone” pop-up, a screen-free newsstand offering coffee, books, and “thinking” caps, drew 5,000 visitors and 10 million social impressions. Access required the Claude app, reinforcing product adoption. This anchors the multimillion-dollar “Keep Thinking” campaign, spanning streaming, sports, and print media. Anthropic, now valued at $18.3 billion, projects $5 billion revenue in 2025, primarily from Claude Code, while launching its strongest code model yet, Claude 4.5 Sonnet.

Both Anthropic and OpenAI are now confronting escalating legal and financial exposure. Insurers are shying away from AI-related coverage, forcing firms to consider using investor funds as self-insurance. OpenAI has $300 million in coverage, far short of potential multibillion-dollar liabilities, while Anthropic has already tapped internal capital for a $1.5 billion settlement. These developments reveal a new reality. As AI scales, the industry faces inseparable technical, legal, and financial pressures that test both innovation and resilience. Read more.

arXiv is a free online library where researchers share pre-publication papers.

Thank you for reading today’s edition.

Your feedback is valuable. Respond to this email and tell us how you think we could add more value to this newsletter.

Interested in reaching smart readers like you? To become an AI Breakfast sponsor, reply to this email or DM us on 𝕏!