AI Video Has Crossed The Uncanny Valley

Good morning. It’s Friday, February 16th.

Did you know: 13 years ago today, IBM's Watson supercomputer beat the two best Jeopardy! players in a three-day event, finishing with a score greater than the two human champions' combined?

In today’s email:

  • Groundbreaking AI Announcements from OpenAI and Google

  • AI Hardware and Infrastructure

  • AI Applications and Business

  • 5 New AI Tools

  • Latest AI Research Papers

  • ChatGPT Creates Comics

You read. We listen. Let us know what you think by replying to this email.

In partnership with SCRIBE

Automatically create step-by-step guides with Scribe

Scribe just bagged $25M in Series B funding to give you a well-deserved break from answering people’s questions all day. Instead, just create step-by-step guides (automatically, thanks to AI) to share with your team.

• Capture any process using the Chrome extension

• Easily customize steps and redact sensitive info

• Share with colleagues – and get back to your work

Thank you for supporting our sponsors!

Today’s trending AI news stories

Groundbreaking AI Announcements

> OpenAI introduces Sora, an astonishingly powerful new generative video model that produces strikingly realistic footage. Key to Sora's innovation is a spatiotemporal video encoding scheme that compresses clips into "spacetime patches" fed into a transformer architecture, enabling the production of minute-long, high-definition videos from text descriptions. This approach significantly improves visual fidelity and consistency across diverse styles.
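The spacetime-patch idea is straightforward to sketch. Below is a minimal NumPy illustration of cutting a clip into flattened spatiotemporal patches that a transformer could consume as tokens; the patch sizes are illustrative assumptions, not Sora's published parameters:

```python
import numpy as np

def spacetime_patchify(video, pt=4, ph=16, pw=16):
    """Cut a video into flattened spacetime patches ("tokens").

    video: (T, H, W, C) array. pt/ph/pw are illustrative patch sizes
    along time, height, and width -- not Sora's actual values.
    Returns (num_patches, pt*ph*pw*C).
    """
    T, H, W, C = video.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0
    v = video.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    v = v.transpose(0, 2, 4, 1, 3, 5, 6)      # gather the patch grid first
    return v.reshape(-1, pt * ph * pw * C)    # one flat vector per patch

clip = np.random.rand(16, 128, 128, 3)        # 16-frame 128x128 RGB clip
print(spacetime_patchify(clip).shape)         # (256, 3072) patch tokens
```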

Prioritizing responsible development, OpenAI is initially limiting Sora's availability to safety experts and select artists. This cautious strategy will facilitate feedback, the development of safeguards, and a careful exploration of ethical considerations before any potential public release.

Here are some examples from X that demonstrate Sora's impressive video generation.

> Google has unveiled Gemini 1.5, a major upgrade to its large language model. Leveraging a "Mixture of Experts" architecture, Gemini 1.5 Pro outperforms Gemini 1.0 Pro on 87% of benchmark tests and approaches the high-end Gemini Ultra, while demonstrating improved speed and efficiency. A standout feature is the model's massive context window, capable of ingesting up to 1 million tokens (roughly an hour of video or tens of thousands of lines of code). This enhancement paves the way for complex queries across large data sets. CEO Sundar Pichai highlights the potential for analyzing films, financial records, and other vast bodies of information. Google will initially offer Gemini 1.5 to developers and businesses within Vertex AI and AI Studio, later extending it to broader consumer use with safety and ethical considerations in place.
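For context, a Mixture-of-Experts layer replaces one dense feed-forward block with many smaller "expert" networks plus a router that activates only a few per token, so capacity grows without per-token compute growing with it. A toy top-k routing sketch in NumPy; the expert count, sizes, and ReLU experts are illustrative, as Google hasn't published Gemini 1.5's configuration:

```python
import numpy as np

def moe_layer(x, experts, router_w, k=2):
    """Toy top-k Mixture-of-Experts forward pass for one token.

    x: (d,) token activation; experts: list of (W, b) pairs;
    router_w: (d, num_experts) routing matrix. Only the k highest-
    scoring experts run, so per-token compute stays roughly flat
    as the expert count grows.
    """
    logits = x @ router_w
    top = np.argsort(logits)[-k:]                    # pick the k best experts
    gates = np.exp(logits[top]) / np.exp(logits[top]).sum()
    out = np.zeros_like(x)
    for gate, idx in zip(gates, top):
        W, b = experts[idx]
        out += gate * np.maximum(x @ W + b, 0.0)     # gate-weighted expert output
    return out

d, n_experts = 64, 8
rng = np.random.default_rng(0)
experts = [(rng.normal(size=(d, d)) * 0.05, np.zeros(d)) for _ in range(n_experts)]
router = rng.normal(size=(d, n_experts)) * 0.05
print(moe_layer(rng.normal(size=d), experts, router).shape)  # (64,)
```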

AI Hardware and Infrastructure

> Nvidia unveiled Eos, its newest enterprise AI supercomputer and the company's fastest to date. Ranked 9th globally in FP64 performance, Eos is purpose-built for advanced AI development and scalability. The system leverages 576 DGX H100 systems, integrating 4,608 Nvidia H100 GPUs and 1,152 Intel Xeon Platinum 8480C CPUs. This configuration, alongside Nvidia's Quantum-2 InfiniBand with In-Network Computing (400Gb/s), yields impressive performance metrics: 121.4 FP64 PetaFLOPS (Rmax) and 18.4 FP8 ExaFLOPS for AI workloads. Notably, Nvidia outfits Eos with a comprehensive software stack optimized for AI development, deployment, and orchestration, addressing applications ranging from generative AI to large-scale model training.
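The headline component counts follow directly from the node count, since each DGX H100 system packs eight H100 GPUs and two Xeon CPUs:

```python
# Eos component counts, derived from the published per-node
# DGX H100 layout (8 GPUs and 2 CPUs per system).
nodes = 576
gpus = nodes * 8   # 4,608 H100 GPUs
cpus = nodes * 2   # 1,152 Xeon Platinum 8480C CPUs
print(f"{gpus} GPUs, {cpus} CPUs")  # 4608 GPUs, 1152 CPUs
```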

> XPeng Motors has developed a groundbreaking AI framework called "Anything in Any Scene." This system can insert objects into videos with unprecedented photorealism. The framework meticulously considers elements like lighting, shadows, and occlusion, ensuring the added objects blend seamlessly with the original footage. This technology has potential applications in film production and autonomous vehicle training, where accurately simulating diverse scenarios is crucial.

AI Applications and Business

> Leaked documents have exposed an internal Google project codenamed "Goose," a large language model aimed at boosting staff productivity. Trained on Google's vast engineering knowledge, Goose aids in product development, answers complex code-related questions, and can even write new code. This aligns with Google's efficiency push, using AI to streamline operations. While Google maintains that AI won't eliminate jobs, tools like Goose could significantly enhance developer capabilities and optimize Google's product creation process.

> Microsoft has developed an AI agent called UFO (UI-Focused Agent) that can operate Windows applications on a user's behalf. Powered by GPT-4V, OpenAI's vision-capable model, UFO can understand what's on your screen and navigate within apps to complete tasks for you. In tests, UFO was remarkably successful, completing 86% of complex tasks while taking fewer steps and applying more safeguards than competing agents. Microsoft intends to further improve UFO, potentially changing how we interact with our computers.
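Conceptually, this kind of agent runs an observe-decide-act loop: screenshot the app, ask the vision model which control to act on, execute, and repeat. A hedged sketch follows; `screen` and `model` are hypothetical interfaces standing in for UFO's actual components:

```python
from dataclasses import dataclass

@dataclass
class Action:
    control: str   # UI element to act on, e.g. a button label
    op: str        # operation such as "click" or "type"
    text: str = ""

def run_agent(task, screen, model, max_steps=20):
    """Generic observe-decide-act loop in the style UFO describes.

    `screen` and `model` are hypothetical stand-ins: screen.capture()
    returns a screenshot plus detected controls; model.decide() maps
    (task, observation, history) to the next Action, or None when done.
    """
    history = []
    for _ in range(max_steps):
        observation = screen.capture()              # screenshot + control list
        action = model.decide(task, observation, history)
        if action is None:                          # model judges task complete
            break
        screen.execute(action)                      # click/type on the control
        history.append(action)
    return history
```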

> OpenAI is rumored to be creating a web search product aimed at challenging Google's dominance. Sources told The Information the tool may leverage Microsoft's Bing search technology, potentially offering a conversational and knowledge-oriented search experience. While details remain limited, this move aligns with OpenAI's focus on extending ChatGPT's capabilities and Microsoft's substantial investments in the company. It's uncertain whether this AI-powered approach will lure users away from established search engines like Google. However, OpenAI's potential entry into the market highlights the increasing focus on AI-driven search experiences.

> Google expands its Gemini large language model offerings within the Vertex AI platform. Gemini 1.0 Pro, previously in preview, enters general availability. Gemini 1.0 Ultra becomes generally available behind an allowlist (an atypical pairing of terms on Google's part). Further, Google unveils Gemini 1.5 Pro, claiming performance parity with the flagship Gemini 1.0 Ultra and the capacity to process a massive one-million-token context (video, code, text). Adapter-based tuning arrives in Vertex, with further AI techniques expected. Developers gain API access via the Dart SDK (for Flutter apps), Project IDX integration, and an extension within the Firebase mobile development platform.
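For orientation, calling a Gemini model through the Vertex AI Python SDK looked roughly like this at the time; the project ID is a placeholder, and module paths have shifted between google-cloud-aiplatform releases, so treat this as a sketch rather than canonical usage:

```python
import vertexai
from vertexai.preview.generative_models import GenerativeModel

# Placeholder project/region; requires gcloud authentication and the
# google-cloud-aiplatform package installed.
vertexai.init(project="your-gcp-project", location="us-central1")

model = GenerativeModel("gemini-1.0-pro")  # the newly GA model
response = model.generate_content("Explain adapter-based tuning in one sentence.")
print(response.text)
```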

5 new AI-powered tools from around the web

Newsprint scans thousands of news stories every day to help communications professionals track discussions of their companies, clients and industries.

AutoCut is an Adobe Premiere Pro plugin that leverages AI to automate silence removal, caption generation, zoom management, and stock footage integration, streamlining video editing workflows.

Studio Neiro AI generates human-like video avatars with accurate lip-syncing and emotional expressions, customizable from your script or audio.

ChatAvatar utilizes generative AI to transform text descriptions and images into production-ready, customizable 3D avatars with advanced export options for 3D workflows.

MagiScan is an AI-powered mobile 3D scanning app that transforms real-world objects into high-quality 3D models (in formats like USDZ, glTF, GLB, OBJ, STL, FBX, PLY) for diverse applications.

Latest AI Research Papers

arXiv is a free online library where researchers share pre-publication papers.

Magic-Me is a framework for identity-specific video generation built on diffusion models. It features a specialized ID module that disentangles identity information from background noise in reference images, improving identity preservation in generated videos. A 3D Gaussian Noise Prior enhances inter-frame correlation during video generation for increased stability and consistency. Magic-Me also includes V2V modules (Face VCD and Tiled VCD) that refine facial details and upscale video resolution. A unique prompt-to-segmentation training technique helps mitigate background noise during identity learning. Overall, Magic-Me offers improved ID preservation compared to baselines and can work with publicly available, finetuned text-to-image models.
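The noise-prior idea is easy to illustrate: rather than sampling independent Gaussian noise per frame, mix a shared base noise into every frame so the initial latents are correlated in time. The sketch below uses a generic correlated-noise construction, not Magic-Me's exact formulation:

```python
import numpy as np

def correlated_video_noise(frames, h, w, c, rho=0.5, rng=None):
    """Sample per-frame Gaussian noise whose frames share a common
    component, so adjacent initial latents are correlated.

    rho controls the shared fraction (rho=0 gives independent noise);
    the sqrt mixing keeps each frame approximately N(0, 1).
    """
    rng = rng or np.random.default_rng()
    shared = rng.standard_normal((1, h, w, c))          # one base noise map
    independent = rng.standard_normal((frames, h, w, c))
    return np.sqrt(rho) * shared + np.sqrt(1 - rho) * independent

noise = correlated_video_noise(16, 32, 32, 4, rho=0.5)
print(noise.shape)  # (16, 32, 32, 4)
```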

NeRF Analogies focuses on editing the visual appearance of Neural Radiance Fields (NeRFs) without changing their underlying geometry. To achieve this, it leverages a concept called "semantic affinity" derived from pre-trained image models (specifically, DiNO-ViT). The method first extracts dense feature descriptors from the source and target NeRFs. Then, it establishes correspondences between the source NeRF's appearance and the target NeRF's geometry based on how similar these features are. Finally, a new NeRF is trained to combine the target geometry with the transferred appearance from the source. This approach provides semantic control over NeRF editing while ensuring that the results are consistent across multiple viewpoints. User studies have shown that NeRF Analogies outperforms traditional stylization methods.
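The core matching step reduces to nearest-neighbor search in feature space: each point on the target geometry borrows the color of the source point with the most similar descriptor. A minimal sketch, assuming the DiNO-ViT descriptors have already been extracted:

```python
import numpy as np

def transfer_appearance(src_feats, src_colors, tgt_feats):
    """For each target descriptor, copy the color of the most
    semantically similar source descriptor (cosine similarity).

    src_feats: (N, D) source descriptors; src_colors: (N, 3);
    tgt_feats: (M, D) target descriptors. Returns (M, 3) colors.
    """
    s = src_feats / np.linalg.norm(src_feats, axis=1, keepdims=True)
    t = tgt_feats / np.linalg.norm(tgt_feats, axis=1, keepdims=True)
    sim = t @ s.T                      # (M, N) cosine similarities
    match = sim.argmax(axis=1)         # best source point per target point
    return src_colors[match]
```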

L3GO is a system that leverages large language models (LLMs) to generate 3D objects from text descriptions. Unlike diffusion models, which often struggle with precise object configurations, L3GO decomposes object creation into iterative steps. It uses LLMs to identify object parts, determine their size and placement, and solicit feedback from a 3D simulation environment (SimpleBlenv). L3GO outperforms other LLM-based generation methods on benchmarks and excels at creating objects with unconventional attributes.
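The iterative procedure can be outlined as a propose-build-critique loop: the LLM names the next part, proposes its size and placement, and simulator feedback drives revisions. In this sketch, `llm` and `env` are hypothetical stand-ins for the paper's LLM calls and the Blender-based SimpleBlenv:

```python
def build_object(description, llm, env, max_retries=3):
    """Propose-build-critique loop in the spirit of L3GO.

    llm.propose_parts / llm.propose_spec / llm.revise_spec and
    env.try_place are hypothetical interfaces, not the paper's API.
    """
    placed = []
    for part in llm.propose_parts(description):         # e.g. "seat", "leg"
        spec = llm.propose_spec(part, placed)           # size + placement
        for _ in range(max_retries):
            ok, feedback = env.try_place(part, spec)    # simulator check
            if ok:
                placed.append((part, spec))
                break
            spec = llm.revise_spec(part, spec, feedback)  # use error feedback
    return placed
```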

This paper introduces HiSS (Hierarchical State-Space Models), a novel architecture for continuous sequence-to-sequence prediction designed to address the challenges of real-world sensor data. HiSS leverages a temporal hierarchy of structured state-space models to efficiently learn features and improve prediction accuracy. Evaluated on CSP-Bench, a new benchmark consisting of six real-world sensor datasets, HiSS outperforms state-of-the-art sequence models like LSTMs, Transformers, and flat SSMs (including S4 and Mamba) by at least 23% on MSE. Notably, HiSS demonstrates increased sample efficiency for smaller datasets and compatibility with standard data-filtering techniques, making it a promising solution for a wide range of sensor-based prediction tasks.
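The hierarchy is simple to picture: a low-level model summarizes short chunks of raw sensor samples, and a high-level model runs over those summaries. A shape-level sketch; the chunk size and last-output summarization are illustrative simplifications of HiSS:

```python
import numpy as np

def hierarchical_ssm(x, low_ssm, high_ssm, chunk=16):
    """Two-level temporal hierarchy in the spirit of HiSS.

    x: (T, d) raw sensor sequence. The low-level model processes each
    chunk of `chunk` samples; its last output summarizes the chunk, and
    the high-level model runs over those summaries. low_ssm/high_ssm
    are stand-ins for any sequence model.
    """
    T, d = x.shape
    chunks = x[: T - T % chunk].reshape(-1, chunk, d)
    summaries = np.stack([low_ssm(c)[-1] for c in chunks])  # one vector/chunk
    return high_ssm(summaries)                              # coarse sequence

# Identity "models" just to show the shapes flowing through:
out = hierarchical_ssm(np.random.rand(128, 6), lambda c: c, lambda s: s)
print(out.shape)  # (8, 6): 128 samples -> 8 chunk summaries
```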

This paper by researchers at Google DeepMind investigates whether large language models (LLMs) can perform chain-of-thought (CoT) reasoning without explicit prompts. Surprisingly, the authors find that CoT reasoning paths naturally emerge when examining alternative top-k tokens during decoding, instead of relying solely on greedy decoding. They observe that LLMs demonstrate higher confidence in the final answer when a CoT path is present. This leads to the development of 'CoT-decoding', which selects more reliable decoding paths and significantly outperforms standard greedy decoding on various reasoning tasks. This study challenges the reliance on prompting for reasoning and highlights the intrinsic reasoning abilities of pre-trained LLMs.
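The technique can be approximated in a few lines around any Hugging Face causal LM: branch on the top-k first tokens, continue each branch greedily, and keep the branch the model is most confident in. The scoring below (mean top-token probability) is a simplified stand-in for the paper's answer-token confidence margin:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def cot_decode(prompt, model, tok, k=5, max_new_tokens=64):
    """Branch on the k most likely first tokens, continue each branch
    greedily, and return the branch with the highest mean top-token
    probability (simplified vs. the paper's answer-span margin)."""
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]
    best = None
    for t in torch.topk(logits, k).indices:        # k alternative first tokens
        ids = torch.cat([inputs.input_ids, t.view(1, 1)], dim=1)
        out = model.generate(ids, max_new_tokens=max_new_tokens,
                             do_sample=False, output_scores=True,
                             return_dict_in_generate=True,
                             pad_token_id=tok.eos_token_id)
        conf = sum(s.softmax(-1).max().item() for s in out.scores) / len(out.scores)
        text = tok.decode(out.sequences[0], skip_special_tokens=True)
        if best is None or conf > best[0]:
            best = (conf, text)
    return best[1]

# Hypothetical usage (any causal LM works):
# tok = AutoTokenizer.from_pretrained("gpt2")
# model = AutoModelForCausalLM.from_pretrained("gpt2")
# print(cot_decode("Q: I have 3 apples and eat one. How many are left? A:", model, tok))
```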

ChatGPT Creates Comics

Thank you for reading today’s edition.

Your feedback is valuable. Respond to this email and tell us how you think we could add more value to this newsletter.

Interested in reaching smart readers like you? To become an AI Breakfast sponsor, apply here.