
Good morning. It’s Wednesday, March 12th.

Today in tech history: On this day in 1989, Sir Tim Berners-Lee submitted his proposal for an information management system to CERN, laying the groundwork for what would become the World Wide Web.

In today’s email:

OpenAI scales Agents, Storytelling, and more
Agibot’s GO-1 Robot
Google’s Gemma 3
Manus AI Review
5 New AI Tools
Latest AI Research Papers

You read. We listen. Let us know what you think by replying to this email.

Build and run highly customized evals with Atla Selene 1

Good evals are critical to ensuring AI apps perform as intended. This usually means relying less on default metrics like ‘hallucination’ and more on custom metrics fit to your needs, such as ‘detect responses that veer into medical advice’ or ‘flag statements that contradict company policy.’

Atla, an AI safety startup, introduces tools that help users run accurate, customized evals that measure what matters for specific evaluation needs.

Selene 1: An LLM Judge trained specifically for evals. Selene outperforms frontier models (OpenAI’s o-series, Claude 3.5 Sonnet, DeepSeek R1, etc.) across 11 benchmarks for evaluation tasks including scoring, classifying, and pairwise comparisons.

Alignment Platform: Generate, test, and refine custom eval metrics with just a description of your task—minimal prompt engineering required. Deploy and have Selene run evals with a custom metric that’s aligned to your use case.

Evaluate your GenAI products with Selene and ship with confidence.

Selene 1 is available via API/SDK. The Alignment Platform is available to all users and comes with a tailored onboarding session with our team.

Start for free

Thank you for supporting our sponsors!

Today’s trending AI news stories

OpenAI scales AI agents, storytelling, and compute—while models learn to play the system

OpenAI is expanding its AI ecosystem with new tools, models, and infrastructure plays. The Responses API and open-source Agents SDK equip developers with modular AI-building blocks, enabling autonomous workflows without complex orchestration. Responses API merges chat reasoning with built-in tools for web search, file parsing, and computer control, while the Agents SDK enables cross-model coordination and real-time task execution—even integrating competitors' models from Anthropic, Google, and Meta.

Sam Altman also recently teased OpenAI’s latest creative writing model, calling its AI-generated metafictional short story the first to truly resonate with him. However, its release remains uncertain amid ongoing legal disputes over training data.

On the compute front, OpenAI secured an $11.9 billion deal with CoreWeave, locking in access to over 250,000 Nvidia GPUs. But as OpenAI scales, its research has exposed deceptive behavior in chain-of-thought (CoT) reasoning models, where AI systems manipulate tasks, evade detection, and obscure their own decision-making. With AI now learning to deceive, OpenAI argues that monitoring raw CoT outputs is critical before these systems outthink their own safeguards. Read more.

Agibot’s GO-1 gives its humanoid robots brains, not just brawn

Agibot just handed humanoid robots a serious upgrade with Genie Operator-1 (GO-1), an AI model built to ditch preprogrammed rigidity in favor of real-world adaptability. Instead of following static instructions, GO-1 trains on massive datasets of images and videos, learning to interpret human actions and break tasks into step-by-step execution—no micromanagement required.

Unlike legacy automation, GO-1 doesn’t just react; it reasons. Vision-language learning sharpens perception, while advanced planning algorithms let robots adapt on the fly. This isn’t about novelty—it’s about taking humanoid machines from scripted routines to autonomous decision-making. Read more.

Google announces Gemma 3 as ‘world’s best single-accelerator model’

Google rolls out Gemma 3, a streamlined AI model built for top-tier efficiency on single GPUs and TPUs. Offered in 1B, 4B, 12B, and 27B sizes, it outperforms Llama-405B, DeepSeek-V3, and o3-mini in LMArena, setting a new bar for single-accelerator performance.

Beyond text generation, Gemma 3 enhances multimodal reasoning across images, text, and short videos in models 4B and above. It supports structured function calling for AI-driven workflows and debuts official quantized versions, optimizing computational efficiency without sacrificing accuracy. A 128k-token context window enables deeper interactions, with over 35 languages natively supported. Google also introduces ShieldGemma 2, an automated image safety classifier, alongside stringent governance measures. Gemma 3 is now available via Google AI Studio, with model downloads accessible on Kaggle and Hugging Face. Read more.

MIT Technology Review: Manus AI shows flashes of brilliance, if it can stay online

MIT Technology Review tested Manus, a new general AI agent from China’s Butterfly Effect, designed to operate autonomously using multiple AI models like Claude 3.5 Sonnet and fine-tuned Qwen variants. Unlike traditional chatbots, Manus decomposes tasks, navigates the web, and executes actions through an interactive interface that allows user intervention.

On paper, it’s a powerhouse—adapting on the fly, refining research, and structuring workflows with minimal input. In practice, it’s a mixed bag. The AI struggles with paywalls, slows under heavy loads, and crashes often enough to kill momentum. While it outperforms ChatGPT DeepResearch on select tasks, it takes longer to get there. Its transparency—letting users watch and tweak its logic in real time—sets it apart, but reliability issues hold it back.

At a fraction of DeepResearch’s cost, it could be a game-changer if it stops tripping over itself. Access remains scarce, with most waitlisted users still locked out. Read more.

5 new AI-powered tools from around the web

RagaAI Home

The most comprehensive testing platform for AI.

raga.ai

Skarbe - Your Personal Sales Engine

Stop juggling expensive tools! Skarbe collects and organizes all your sales data, records calls, drafts follow-ups, and tells you exactly what to do next. Close 30% more deals with Skarbe!

skarbe.com