Inside GPT-4o Voice

Good morning. It’s Wednesday, July 31st.

Did you know: On this day in 1971, Apollo 15 astronauts David Scott and James B. Irwin took the four-wheeled, battery-powered Lunar Roving Vehicle on its first drive, greatly extending their exploration of the Moon’s surface.

In today’s email:

  • Meta’s SAM 2: Advanced model for image and video segmentation.

  • Nvidia’s Microservices: Expanded support for 3D and robotics.

  • OpenAI’s GPT-4o Long Output: Experimental model with outputs of up to 64,000 tokens.

  • OpenAI’s Advanced Voice Mode: New voice feature for ChatGPT Plus with natural conversation capabilities.

  • 5 New AI Tools

  • Latest AI Research Papers

You read. We listen. Let us know what you think by replying to this email.

In partnership with CREATE

Build websites and AI apps in seconds for free with just English, no code needed!

Introducing Create: a new tool that turns your ideas into reality. Whether you’re a founder, PM, designer, marketer, or engineer, Create empowers you to build bespoke apps, MVPs, prototypes, designs, embeddable AI tools, and landing pages quickly and easily. 

Why choose Create?

Effortless: Simply describe your vision in English (or any language), or paste a screenshot.
Instant Results: Watch Create build your project in real time.
AI-Powered: Leverage multiple foundation models, including GPT-4, Claude 3.5 Sonnet, and more.
Full-stack: Add user accounts, databases, backend functions, and more.
Extensible: Access hundreds of built-in integrations, plus connect to any external API.
Runs on code: Enjoy fast, powerful performance with the option to edit the code directly.
Community: We're growing fast, with 140k projects and 1k+ new projects per day.

Thank you for supporting our sponsors!

Today’s trending AI news stories

Meta's new open-source model SAM 2 could be the "GPT-4 moment" for computer vision

Meta has launched SAM 2, an advanced foundation model for image and video segmentation, open-sourcing its model, code, and dataset. While its predecessor SAM was trained on 11 million images primarily for image segmentation, SAM 2 extends its capabilities to video segmentation.

It is trained on SA-V, the largest video segmentation dataset available, comprising 50,900 videos and 642,600 masklets (spatio-temporal mask annotations), which together contain 35.5 million individual frame-level masks. The dataset was created using Meta’s “Data Engine,” which pairs SAM models with human annotators for rapid, accurate labeling.

Architecturally, SAM 2 builds on a Transformer-based framework with a novel memory module that tracks objects across video frames, enhancing object tracking in longer sequences. SAM 2 achieves better segmentation accuracy than previous methods, with three times fewer interactions, and performs six times faster in image segmentation than SAM. Although effective in various conditions, SAM 2 faces limitations in accurately tracking fine-grained elements or multiple identical objects in motion. Read more.
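SAM 2’s actual memory module is a Transformer component, but the core idea it implements, carrying an object’s mask forward frame by frame by matching new candidates against a memory of past predictions, can be caricatured in a few lines of pure Python. This is a toy illustration of the tracking concept, not Meta’s architecture:

```python
# Toy sketch of memory-based mask propagation across video frames.
# Illustrates the *idea* behind SAM 2's memory module, not Meta's
# actual Transformer implementation. Masks are sets of pixel coords.

def iou(a, b):
    """Intersection-over-union of two masks given as pixel sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def track(frames, initial_mask):
    """Propagate a mask through frames by matching each frame's
    candidate masks against a memory of the last confirmed mask."""
    memory = initial_mask
    tracked = []
    for candidates in frames:  # each frame: a list of candidate masks
        best = max(candidates, key=lambda m: iou(m, memory), default=memory)
        memory = best          # update memory with the newest match
        tracked.append(best)
    return tracked

# Example: a 2x2 pixel object drifting right one pixel per frame,
# with an unrelated distractor mask present in every frame.
obj0 = {(0, 0), (0, 1), (1, 0), (1, 1)}
frames = [
    [{(0, 1), (0, 2), (1, 1), (1, 2)}, {(5, 5)}],   # frame 1 candidates
    [{(0, 2), (0, 3), (1, 2), (1, 3)}, {(9, 9)}],   # frame 2 candidates
]
result = track(frames, obj0)
print(result[-1])  # the candidate that stayed consistent with memory
```

The distractor masks are rejected simply because they never overlap the remembered object, which is the intuition behind why a memory of past frames helps in longer sequences.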

OpenAI launches experimental GPT-4o Long Output model with 16X token capacity

OpenAI has introduced the GPT-4o Long Output model, raising the output ceiling to 64,000 tokens, 16 times the previous model's limit. The total context window remains 128,000 tokens, shared between prompt and completion, so a request can pair a 64,000-token input with an equally long output. This addresses demand for more detailed, extended responses and is especially useful for applications such as code editing and long-form writing assistance.

Priced at $6 per million input tokens and $18 per million output tokens, the model is slightly more expensive than the standard GPT-4o but offers considerable value for its extended capabilities. Initially available to select partners for alpha testing, this model's effectiveness in real-world applications is being evaluated. If successful, OpenAI intends to expand access, potentially altering how developers utilize AI for complex problem-solving. Read more.
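The budget and pricing arithmetic above can be sketched in a few lines, using only the figures quoted in this story (a back-of-envelope illustration, not an API client):

```python
# Token-budget and cost arithmetic for GPT-4o Long Output,
# using the figures quoted in the story above.

CONTEXT_WINDOW = 128_000        # total tokens shared by prompt + completion
MAX_OUTPUT = 64_000             # new output ceiling (16x the old limit)
PRICE_IN = 6.00 / 1_000_000     # dollars per input token
PRICE_OUT = 18.00 / 1_000_000   # dollars per output token

def max_prompt(desired_output):
    """Largest prompt that still leaves room for the desired output."""
    return CONTEXT_WINDOW - min(desired_output, MAX_OUTPUT)

def cost(input_tokens, output_tokens):
    """Dollar cost of one request at the quoted per-token prices."""
    return input_tokens * PRICE_IN + output_tokens * PRICE_OUT

# A maximal request: 64k tokens in, 64k tokens out.
print(max_prompt(64_000))              # 64000
print(round(cost(64_000, 64_000), 2))  # 1.54
```

In other words, filling both halves of the window costs roughly a dollar and a half per request, which is why the extended output is pitched at use cases where a single long response replaces many shorter ones.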

OpenAI begins alpha testing new AI voice feature for ChatGPT Plus

OpenAI has commenced alpha testing of its "Advanced Voice Mode" for ChatGPT Plus users. This feature, designed to facilitate more fluid and natural conversations, allows users to interrupt the AI at any time. The rollout is gradual, with initial access granted to a select group via email and in-app notifications, and a broader launch planned for the fall.

The voice capabilities are powered by GPT-4o and have been tested across 45 languages with more than 100 external red teamers. To protect privacy, the mode is limited to four preset voices, with safeguards in place to block deviations and inappropriate content.

The feature’s introduction follows a delay from its planned June release due to safety concerns and controversy over one voice’s resemblance to that of actress Scarlett Johansson. Insights from this alpha phase will help refine the voice capabilities and address concerns raised since the initial announcement. Read more.

NYT slams OpenAI's request for reporter notes as "unprecedented" and "harassing"

OpenAI is requesting internal documents, including research notes, from The New York Times as part of a copyright lawsuit. This request, which the Times decries as “unprecedented” and “harassing,” is seen by the newspaper as an attempt to intimidate journalists and undermine intellectual property rights. OpenAI argues that these documents are crucial to assessing the validity of the Times' copyright claims, which center on allegations that ChatGPT reproduced the newspaper’s content verbatim. The Times maintains that the copyright status should be evaluated based on published works rather than private research. Read more.

Agents might be the next frontier of AI, and OpenDevin wants to open-source them

Developed by a consortium of academic and commercial researchers, OpenDevin features a flexible architecture that includes an agent abstraction, an event stream for monitoring actions and observations, and a runtime environment for executing tasks. The platform supports a secure sandbox for running code, integrating tools like bash shells, Jupyter notebooks, and web browsers, facilitating complex software development and web-based tasks.

OpenDevin includes pre-built agents such as CodeAct and a web browsing agent, showing competitive performance in initial benchmarks. It also supports the creation of "micro-agents" and allows for collaboration among agents, such as task delegation. The AgentSkills library, extendable with new functionalities, enhances the platform's capabilities. OpenDevin is community-driven, with its source code available on GitHub under the MIT license. Read more.
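The event-stream pattern the OpenDevin summary describes, where an agent emits actions, the runtime executes them, and observations land back on one shared history, can be sketched minimally as below. All names here are illustrative, not OpenDevin's actual API:

```python
# Minimal sketch of an event-stream agent loop in the style described
# above: the agent reads the full stream, emits an action, the runtime
# executes it, and the resulting observation is appended to the stream.
# Names are illustrative, not OpenDevin's actual API.

from dataclasses import dataclass

@dataclass
class Event:
    kind: str      # "action" or "observation"
    payload: str

def run_agent(agent, runtime, max_steps=10):
    stream = []                        # the shared event stream
    for _ in range(max_steps):
        action = agent(stream)         # agent decides from full history
        stream.append(Event("action", action))
        if action == "finish":
            break
        result = runtime(action)       # sandboxed execution (bash, etc.)
        stream.append(Event("observation", result))
    return stream

# Toy agent: run one command, observe the result, then finish.
def toy_agent(stream):
    return "echo hello" if not stream else "finish"

def toy_runtime(action):
    return f"ran: {action}"

events = run_agent(toy_agent, toy_runtime)
print([e.payload for e in events])
```

Keeping every action and observation on one append-only stream is what makes monitoring, replay, and delegation between agents straightforward, since any agent can be handed the same history.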

Etcetera: Stories you may have missed

5 new AI-powered tools from around the web

Jamie provides human-like AI-powered meeting summaries across all platforms, supporting 15+ languages, ensuring a seamless and privacy-first experience.

Billy is an AI copilot for WordPress, generating blog content, coding custom widgets, and assisting with site analysis and configuration.

Rodin Gen-1 uses Stable Diffusion and ControlNet to rapidly convert text or images into detailed, production-ready 3D models.

GitStart AI Ticket Studio generates precise engineering tickets by analyzing your codebase, reducing communication errors and improving workflow efficiency in software projects.

table AI offers an AI-first approach to personal CRM, centralizing and enriching network connections while integrating with multiple tools and platforms.

arXiv is a free online library where researchers share pre-publication papers.

Thank you for reading today’s edition.

Your feedback is valuable. Respond to this email and tell us how you think we could add more value to this newsletter.

Interested in reaching smart readers like you? To become an AI Breakfast sponsor, reply to this email!