AI Breakfast

GPT-4V and Spotify Podcast Translation with AI

Plus a tool for generating royalty-free music with AI

Good morning. It’s Wednesday, September 26th.

Did you know: 40 years ago today, Lt. Col. Stanislav Petrov prevented nuclear war by disobeying orders when the USSR’s early warning system erroneously detected nuclear missile launches from the United States.

In today’s email:

  • AI Capabilities and Features

  • AI Hardware and Manufacturing

  • AI Business & Partnerships

  • AI in National Security

  • AI in Entertainment

  • AI in Science

  • 5 New AI Tools

  • Latest AI Research Papers

You read. We listen. Let us know what you think of this edition by replying to this email, or DM us on Twitter.

Today’s edition is brought to you by:

Fed up with the endless hunt for royalty-free music for your content?

Splash Pro is the most powerful AI music engine on the planet, letting you describe the song you want to hear, customize it, download it and deploy it in any project you choose!


Use the code OGHMUXXD for 50% off our MAX tier for 2 months!*


- Unlimited generative songs
- 15 x AI singers & rappers
- Production-quality mastering
- Unlimited commercial license

Try it here

*Offer applies to ‘MAX’ subscription tier & expires 30 Dec 2023

Today’s trending AI news stories

AI Capabilities and Features

ChatGPT can now hear, speak, see, and understand multimodal prompts: OpenAI is introducing new features to ChatGPT that enhance its multimodal capabilities. Users can soon expect voice output in addition to the existing voice input feature, making interactions more intuitive. OpenAI has developed its own text-to-speech model, competing with companies specializing in synthetic voices like ElevenLabs. This model can generate human-like voices from text with just a few seconds of sample audio. Furthermore, ChatGPT is gaining image recognition abilities, allowing users to combine voice and images for various applications, from describing landmarks during travel to suggesting recipes based on images of the fridge and pantry. These enhancements promise a more interactive and versatile ChatGPT experience.

Getty Images Launches Commercially Safe Generative AI, a tool powered by NVIDIA’s Edify model architecture. It allows customers to create visuals with generative AI while respecting creators’ intellectual property rights. This commercially safe offering provides royalty-free licenses, uncapped indemnification, and perpetual, worldwide, nonexclusive use rights. The generated content won’t be added to existing Getty Images libraries for licensing. Creators will be compensated for their content used in training sets. Customers can enable Generative AI on GettyImages.com or integrate it into existing workflows via an API. Customization options and additional features will be added later.

LLMs are surprisingly great at compressing images and audio, DeepMind researchers find. LLMs, typically used for predicting text, were repurposed to drive arithmetic coding, a lossless compression technique. This works because LLMs are trained with log-loss (cross-entropy), so they output a probability distribution over sequences, and any such predictive model can be turned into a compressor. The study found LLMs excelling in text compression and achieving remarkable compression rates on image and audio data. However, their size and speed compared to traditional compression algorithms limit their practicality for data compression. This perspective also offers insights into how LLMs scale and how they can be evaluated.
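
To make the link between prediction and compression concrete, here is a minimal Python sketch. It is not DeepMind's code. A simple adaptive byte-frequency model (a hypothetical stand-in for the LLM's next-symbol predictor) assigns each byte a probability, and the function sums -log2(p), which is the size an ideal arithmetic coder would approach when driven by that model. The name ideal_code_length_bits is ours, chosen for illustration.

```python
import math


def ideal_code_length_bits(data: bytes) -> float:
    """Bits an ideal arithmetic coder would need for `data`.

    Assumption: the predictor is an adaptive order-0 byte model with
    add-one smoothing, standing in for an LLM's next-symbol distribution.
    """
    counts = [1] * 256   # every byte value starts with a pseudo-count of 1
    total = 256
    bits = 0.0
    for byte in data:
        p = counts[byte] / total   # model's probability for the byte that occurred
        bits += -math.log2(p)      # cost of coding that byte under the model
        counts[byte] += 1          # update the model after seeing the byte
        total += 1
    return bits


if __name__ == "__main__":
    sample = b"abababababababababababababababab"
    print(f"raw size:   {8 * len(sample)} bits")
    print(f"model size: {ideal_code_length_bits(sample):.1f} bits")
```

The better the model predicts the bytes that actually occur, the higher the probabilities and the fewer bits are needed. Swapping in a stronger predictor, such as an LLM, is the idea the researchers explore.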

Spotify Will Translate Podcasts Into Other Languages Using AI. They’re introducing a “Voice Translation” feature that matches the translated voice and style to the original speaker’s. This innovation leverages OpenAI’s voice transcription tool Whisper. Initially, Spanish translations of select podcast episodes are available, with French and German translations coming soon. Spotify’s “Voice Translation Hub” will host these translated podcasts, catering to its 100 million regular podcast listeners. This move follows the trend of companies using generative AI for various products.

OpenAI’s GPT-4 with vision still has flaws, paper reveals: GPT-4V faces challenges in making accurate inferences and avoiding biases in image analysis. While the model includes safeguards to prevent misuse, such as identifying dangerous substances from images, it still exhibits issues like hallucination, missing text, and misunderstandings in certain contexts.

AI Hardware and Manufacturing

China to build AI chip factory as global semiconductor race intensifies: The planned factory, which will rely on particle accelerators, is meant to bolster China’s position in the global semiconductor industry. These accelerators will enable the creation of high-quality light sources needed for on-site manufacturing of AI semiconductor chips. China’s pursuit of AI chip production may help bypass current U.S. sanctions. This move reflects China’s determination to compete in the semiconductor race, while the U.S. seeks to tighten its grip on AI manufacturing, with European regulators also considering their stance on export controls and restrictions on China.

AI Business & Partnerships

Snap partners with Microsoft on ads in its ‘My AI’ chatbot feature. My AI will suggest sponsored links related to users’ conversations with the AI chatbot. For example, if a user asks the chatbot for dinner recommendations, it may respond with a link sponsored by a local restaurant. Microsoft’s Ads for Chat API powers those sponsored links, allowing advertisers to reach users at moments of interest. This move follows similar experiments by Bing and Google, where ads were inserted into AI chatbots to drive traffic and engagement.

Microsoft seeks Plan B for more cost-effective AI, sidestepping OpenAI’s GPT-4: Microsoft, which owns nearly half of OpenAI, is pursuing an alternative AI strategy due to the high operational costs of OpenAI’s models. It is working on smaller, more cost-effective conversational AI systems, acknowledging they may be less powerful than GPT-4. Microsoft is currently testing its AI models in products like Bing Chat. While it remains a significant OpenAI shareholder, tensions have arisen as both companies compete for the same target audience with products like ChatGPT Enterprise and Bing Chat Enterprise.

AI in National Security

CIA Builds Its Own Artificial Intelligence Tool in Rivalry With China: The Central Intelligence Agency (CIA) is developing its own artificial intelligence tool, similar to OpenAI’s ChatGPT, to analyze vast amounts of public data. The CIA’s Open-Source Enterprise division aims to enhance intelligence analysts’ access to open-source information using AI. This move reflects the agency’s effort to keep pace with technological advancements in intelligence analysis. The development is part of the ongoing rivalry between US and Chinese intelligence agencies.

AI in Entertainment

The writers’ strike is over; here’s how AI negotiations shook out: After a nearly five-month strike, the Writers Guild of America (WGA) has reached an agreement with Hollywood studios to end the strike, allowing writers to return to work. The negotiations reportedly covered the use of AI in writing, although specific details about the AI terms are not provided here. The development highlights the growing importance of AI as an issue in labor negotiations in entertainment and other industries.

AI in Science

New AI algorithm can detect signs of life with 90% accuracy. Scientists want to send it to Mars: This AI system, designed to identify subtle molecular patterns indicative of biological signals, could be integrated into advanced sensors on robotic space explorers for missions to Mars, the moon, and potentially habitable worlds like Enceladus and Europa. The technology leverages the fundamental differences between the chemistry of life and abiotic molecules, presenting an innovative approach to astrobiology research.

5 new AI-powered tools from around the web

AppyHigh Prime provides exclusive access to three flagship generative AI apps: ImagineGO Text to Image Generator, PixelGO Photo & Video Enhancer, and AI Avatar Maker. Additionally, get PRO access to a wide range of productivity apps, social media tools, and utilities, all in one membership.

LogicBalls is an AI-powered platform with over 150 tools for generating high-quality, SEO-friendly content quickly. Say goodbye to writer’s block and save time with instant creativity. LogicBalls is your ultimate companion for writing compelling marketing copy, blogs, and more.

ProdPad introduces its AI Assistant for Product People, transforming product management with AI capabilities. Automate tasks, save time and benefit from AI guidance to excel in product management. The AI Assistant handles grunt work, generates content, refines your backlog, and provides real-time coaching and advice.

Circum AI introduces its AI Icon Generation feature, expanding its library of 1621 icons. Create icons effortlessly in five distinct styles, with more to come in future updates.

Opinly.ai: Save time and enhance your product with an efficient competitor research tool. Input a YouTube link, generate a comprehensive report, and gain valuable insights into your competitor’s performance.

arXiv is a free online library where scientists share their research papers before they are published. Here are the top AI papers for today.

The paper presents a method called AUTOCALIBRATE that aims to calibrate large language models (LLMs) as reference-free evaluators for natural language generation. LLMs have shown potential but often lack alignment with human preferences. AUTOCALIBRATE is a gradient-free approach that automatically aligns LLM-based evaluators with human preferences by drafting, revisiting, and applying scoring criteria. Through experiments on various text quality evaluation datasets, the method shows significant improvements in correlation with expert evaluation after calibration, offering insights into effective scoring criteria alignment.

This research paper explores the cognitive moral development of large language models (LLMs) by introducing a novel evaluation framework based on the Defining Issues Test (DIT) from moral psychology. While LLMs are extensively used in various applications requiring ethical judgment, assessing their moral reasoning has been challenging due to cultural and ethical diversity. The proposed framework aims to categorize LLMs’ ethical reasoning abilities in terms of moral consistency and Kohlberg’s moral development stages. This innovative approach bridges the gap between AI and human psychology, providing insights into LLMs’ moral decision-making processes and their alignment with human values.

MosaicFusion is a novel diffusion-based data augmentation technique for large vocabulary instance segmentation. It generates synthetic labeled data without relying on extensive manual labeling. By dividing an image canvas into regions and using diffusion models with text prompts, it creates multiple object instances simultaneously. Corresponding masks are generated by aggregating cross-attention maps and refining them. This method significantly improves instance segmentation models, especially for rare and novel categories. MosaicFusion is training-free and versatile, capable of generating diverse synthetic data, addressing the data scarcity challenge in instance segmentation. It outperforms existing augmentation techniques, making it a valuable tool for researchers.

In this study, researchers aimed to assess the ethical reasoning abilities of large language models (LLMs) by bridging the fields of human psychology and AI. They proposed an evaluation framework using the Defining Issues Test (DIT) to measure moral consistency and Kohlberg’s moral development stages in LLMs. The authors discussed the significance of ethics and moral judgment in AI applications, emphasizing the challenge of pluralism and cultural differences. They criticized binary classification approaches for oversimplifying ethical reasoning and presented the DIT as an alternative. They conducted an experiment with GPT-3 and observed limitations in its responses, highlighting the need for aligning LLMs with human intent. Another variant, text-davinci-002, also failed to provide meaningful responses. Further experiments and considerations are needed to improve the ethical reasoning capabilities of LLMs.

The paper presents V-PTR, a novel approach to enhance robotic offline reinforcement learning (RL) using large-scale human video datasets. V-PTR consists of three key phases. First, Video Pre-Training via TD-Learning leverages video data to train an intent-conditioned value function through temporal-difference learning. Second, Multi-Task Robot Pre-Training via Offline RL refines the learned representation using robot data with actions and rewards, bridging the domain gap between video and robot data. Finally, Fine-Tuning to a Target Task adapts the system to specific tasks using a limited target dataset. V-PTR significantly improves zero-shot generalization and robustness across manipulation tasks, highlighting its potential in utilizing internet-scale video data to advance robotic learning.
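
Temporal-difference learning, the technique named in the first phase, can be illustrated with a toy example. The sketch below is tabular TD(0) on a hypothetical five-state chain with a reward of 1 on reaching the last state. It only illustrates the update rule and is not V-PTR itself, which the summary describes as training an intent-conditioned value function from video. The function name, the chain environment, and the hyperparameters are all assumptions made for illustration.

```python
def td0_chain_values(n_states=5, gamma=0.9, alpha=0.5, sweeps=200):
    """Estimate state values on a left-to-right chain with tabular TD(0).

    Assumed toy setup: the agent always steps from state s to s + 1,
    the last state is terminal, and entering it yields reward 1.
    """
    values = [0.0] * n_states          # terminal state keeps value 0
    terminal = n_states - 1
    for _ in range(sweeps):
        for s in range(terminal):
            next_s = s + 1
            reward = 1.0 if next_s == terminal else 0.0
            # Bootstrap from the current estimate of the next state's value
            bootstrap = 0.0 if next_s == terminal else values[next_s]
            td_target = reward + gamma * bootstrap
            # Move the estimate a fraction alpha toward the TD target
            values[s] += alpha * (td_target - values[s])
    return values


if __name__ == "__main__":
    print([round(v, 3) for v in td0_chain_values()])
```

Each state's value converges to gamma raised to the number of steps remaining before the rewarding transition, which is the discounted return the update is estimating.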

Thank you for reading today’s edition.

Your feedback is valuable.


Respond to this email and tell us how you think we could add more value to this newsletter.