Try The Open-Source GPT-4V Image Interpreter

...even if you don't have GPT-4V access

Good morning. It’s Monday, October 9th.

Did you know there is an open-source version of the GPT-4V image interpreter? It's called LLaVA (Large Language and Vision Assistant), and you can try it here.

In today’s email:

  • Advances and Releases in AI Technology

  • AI Policies and Ethical Considerations

  • Impact of Sociopolitical Conditions on AI Industry Events

  • 5 New AI Tools

  • Latest AI Research Papers

You read. We listen. Let us know what you think of this edition by replying to this email, or DM us on Twitter.

Today’s edition is brought to you by:

Our book, Decoding AI: A Non-Technical Explanation of Artificial Intelligence is on sale for just $2.99 today only!

(with a 100% money-back guarantee)

Decoding AI breaks down the complexities of AI into digestible concepts, walking you through its history, evolution, and real-world applications.

We'll introduce you to the key players in the AI field, as well as explain the underlying algorithms, data, and machine learning concepts that power AI systems. You'll gain a deeper understanding of deep learning, neural networks, and reinforcement learning, and we'll explore various types of AI, from rule-based systems to probabilistic networks and beyond.

The goal was to make this book an approachable guide to how AI works.

It discusses a wide range of applications AI has in areas like natural language processing, computer vision, robotics, and predictive analytics. It also delves into the regulatory landscape and policy issues surrounding AI, as well as the potential future developments in AI, such as its applications in healthcare, education, transportation, and even space exploration.

You'll also learn the difference between narrow AI and Artificial General Intelligence (AGI), and how to get started with using AI through tips and resources.

Price goes back to $9.99 after today’s sale. 100% money-back guarantee if not satisfied.

Today’s trending AI news stories

Advances and Releases in AI Technology

Samsung unveils the Exynos 2400, its next-generation flagship mobile processor, at the Samsung System LSI Tech Day 2023 event in San Jose, California. The chip features the Xclipse 940 graphics processing unit, built on AMD's RDNA 3 architecture, and promises 1.7x faster processing speed and a 14.7x improvement in AI performance compared to its predecessor, the Exynos 2200. The Exynos 2400 aims to elevate gaming experiences with enhanced ray tracing capabilities for realistic optical effects. While it is speculated that the chip will power Samsung's Galaxy S24 smartphones, the company has yet to confirm this detail.

Nvidia is reportedly working on an AI-focused graphics card, potentially named RTX 4080 Super or RTX 4080 Ti, slated for release in early 2024. This card is expected to cater to the growing demand for AI-related tasks and applications. If the rumors hold true, it could be priced similarly to the RTX 4080, potentially prompting a price drop of $100-200. Nvidia’s move aims to address concerns about high pricing while boosting AI capabilities, making it more competitive in the market.

A military metaverse, described as a "massive multiplayer video game," will train soldiers using augmented reality and AI. The digital combat space offers a more cost-effective alternative to traditional training methods, and pilots will receive real-time training experiences that let them adapt and learn from their actions quickly. This innovation could significantly improve the efficiency of military training, cut costs, and better prepare soldiers for real-life combat scenarios.

AI Startup Reka Challenges ChatGPT with Multimodal AI Assistant 'Yasa-1' which can comprehend not only text but also images, short videos, and audio snippets. Yasa-1, currently in private preview, allows customization on private datasets and supports multiple languages. It competes directly with OpenAI’s ChatGPT in providing responses with context from the internet and handling multimedia inputs. However, Reka has cautioned about Yasa-1’s limitations, such as hallucinations, and plans to improve its capabilities while aiming to create a world where superintelligent AI collaborates with humans to address significant challenges.

Stable Signature: A new method for watermarking images created by open-source generative AI. Unlike traditional watermarks, this one is invisible to the naked eye but can be detected by algorithms, even if images are edited or transformed. Stable Signature addresses concerns about image misuse and deception in the age of AI-generated content. Meta's FAIR team has shared the research with the AI community, emphasizing responsible AI use, and hopes to extend the approach to more generative AI modalities in the future.

Prompt transformation makes ChatGPT OpenAI's covert moderator for DALL-E 3. OpenAI has implemented safety measures, including “prompt transformations,” to rewrite prompts that may lead to content violations. An image classifier has also been trained to detect and block sexist or offensive content. While these measures have reduced risks, DALL-E 3 still exhibits cultural biases and challenges related to copyright issues. The integration aims to balance generating high-quality images and avoiding inappropriate or harmful content.
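
The item above describes two moderation layers: an LLM that rewrites risky prompts before image generation, and a classifier that screens the generated images. Below is a minimal, hypothetical sketch of that kind of pipeline; rewrite_with_llm, classify_image, and image_model are placeholders, not OpenAI's actual components.

# Illustrative sketch of a prompt-transformation + output-filter pipeline
# (hypothetical placeholders; not OpenAI's actual implementation).

def rewrite_with_llm(prompt: str) -> str:
    # Ask a chat model to rewrite the prompt so it steers clear of policy
    # violations while preserving the user's intent.
    raise NotImplementedError("call your chat-completion API here")

def classify_image(image) -> bool:
    # Return True if the generated image should be blocked as inappropriate.
    raise NotImplementedError("call your image classifier here")

def generate_safely(prompt: str, image_model):
    safe_prompt = rewrite_with_llm(prompt)           # prompt transformation before generation
    image = image_model(safe_prompt)                 # text-to-image generation
    return None if classify_image(image) else image  # output-side filtering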

Spotify spotted prepping a $19.99/mo ‘Superpremium’ service with lossless audio, AI playlists, and more. References to Superpremium were discovered in the app’s code, indicating a broader feature set than expected. This may include advanced mixing tools, extra hours of audiobook listening, and a personalized offering called “Your Sound Capsule.” While Spotify declined to comment, the code references suggest the company is expanding its premium offerings to cater to a broader audience. The pricing is rumored to be $19.99 per month, but Spotify has not officially confirmed these details.

AI Policies and Ethical Considerations

BackerKit, a crowdfunding platform, has banned AI-generated content, including art for board games, from its platform. This decision contrasts with its rival Kickstarter, which allows AI-generated content. BackerKit’s policy aims to ensure that all content and assets on its platform are created by humans rather than generated solely by AI tools. The move comes amid growing concerns that AI-generated content draws on the work of human creators without proper compensation or permission.

The European Union is finding common ground with Japan in its approach to generative AI, according to Vera Jourova, European Commission Vice-President for Values and Transparency. The EU has been active in regulating AI with its AI Act, while Japan is pursuing more flexible guidelines to stimulate economic growth. The two regions are enhancing collaboration in areas such as AI, cybersecurity, and chips, recognizing their importance for economic security. Discussions are underway among the Group of Seven industrial powers on guidelines for generative AI; consultations on an AI framework are on track, but a code of conduct for companies involved in AI still requires further development.

Impact of Sociopolitical Conditions on AI Industry Events

Nvidia has canceled its two-day AI Summit in Israel, originally scheduled for October 15-16, due to safety concerns amid escalating violence in the region. The event, featuring a keynote by Nvidia CEO Jensen Huang, aimed to explore the latest developments in accelerated computing and AI applications. With the Israeli death toll from Hamas attacks rising and ongoing hostilities, Nvidia chose to prioritize the safety and well-being of participants. It’s unclear whether the event will go virtual or be rescheduled.

5 new AI-powered tools from around the web

Hummingbird is a lightweight, fast, and free native macOS AI personal assistant with features like quick answers, web browsing, real-time search, and customizable responses.

Airparser revolutionizes data extraction with GPT-4 AI. Extract structured data from various sources and export in real time to numerous apps. It’s a versatile tool for automating data extraction from emails, PDFs, and more, offering secure and efficient solutions for various use cases.

Danelfin is an AI-powered stock analytics platform that empowers investors with data-driven insights, rankings, and AI Scores for US-listed stocks and ETFs, aiding in making informed investment decisions and improving portfolio performance. It offers a range of use cases, including identifying prime investment opportunities and tracking portfolio AI Score evolution.

Code Companion is a GPT-4-powered programming tutor that offers real-time help and feedback within an online code editor. It provides guidance and assistance for programming problems, making it a valuable resource for learners and students.

Kino AI is an AI assistant designed for efficient media asset management, offering inferred metadata, keyword indexing, and AI transcription. It streamlines video editing workflows by automating essential tasks.

arXiv is a free online library where scientists share their research papers before they are published. Here are the top AI papers for today.

The paper introduces HeaP (Hierarchical Policies for Web Actions using LLMs), a framework that leverages large language models (LLMs) to decompose web tasks into modular sub-tasks, each solved by a low-level policy, enabling LLMs to generalize across tasks and interfaces with fewer demonstrations. HeaP outperforms prior work across various web task benchmarks while requiring significantly less training data. The paper discusses the challenges of teaching LLMs to perform web tasks and presents experimental results showcasing the effectiveness of the HeaP framework.
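
As a rough illustration of the hierarchical decomposition described above, the sketch below has a high-level LLM call break a task into sub-task lines that are dispatched to low-level action policies. All names (call_llm, FILL_TEXT, CLICK) are hypothetical placeholders, not the authors' code.

# Illustrative sketch of hierarchical web policies (hypothetical, not the HeaP implementation).

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def fill_text(field: str, value: str) -> None:
    print(f"typing '{value}' into '{field}'")

def click(element: str) -> None:
    print(f"clicking '{element}'")

# Low-level policies: each handles one kind of web action.
LOW_LEVEL_POLICIES = {"FILL_TEXT": fill_text, "CLICK": click}

def run_web_task(task: str) -> None:
    # High-level policy: ask the LLM to decompose the task into sub-task calls,
    # one per line, formatted as NAME|arg1|arg2.
    plan = call_llm(f"Break this web task into FILL_TEXT/CLICK steps, "
                    f"one per line as NAME|arg1|arg2: {task}")
    for line in plan.splitlines():
        name, *args = [part.strip() for part in line.split("|")]
        LOW_LEVEL_POLICIES[name](*args)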

In this study, improved baselines are introduced for Large Multimodal Models (LMMs) with visual instruction tuning. The authors demonstrate the effectiveness of the fully-connected vision-language cross-modal connector within the LLaVA framework. By making simple modifications, such as employing CLIP-ViT-L-336px with an MLP projection and integrating academic-task-oriented Visual Question Answering (VQA) data, stronger baselines are established, surpassing existing models on 11 benchmark tasks. The approach uses only about 1.2M publicly available training examples and can be trained in approximately one day on a single node with eight A100 GPUs, aiming to make state-of-the-art LMM research more accessible. The authors have made their code and models publicly available.
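
For readers curious about the fully-connected vision-language connector mentioned above, here is a minimal PyTorch sketch of the idea: a small MLP that projects frozen CLIP-ViT patch features into the LLM's token-embedding space. The layer sizes are illustrative assumptions, not the paper's exact configuration.

# Minimal sketch of an MLP vision-language connector (illustrative sizes, not the paper's config).
import torch
import torch.nn as nn

class VisionLanguageConnector(nn.Module):
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # Two-layer MLP projecting vision-encoder patch features into the LLM embedding space.
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim) from a frozen image encoder
        return self.proj(patch_features)  # (batch, num_patches, llm_dim), used as visual tokens

# Example: 576 patches from a 336px image with 14px patches
tokens = VisionLanguageConnector()(torch.randn(1, 576, 1024))
print(tokens.shape)  # torch.Size([1, 576, 4096])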

This study introduces a novel training paradigm called ITIT (InTegrating Image Text) that leverages cycle consistency to enable generative vision-language training on unpaired image and text data. The approach unifies text-to-image and image-to-text generation within a single framework, enabling bidirectional generation. During training, a small set of paired image-text data is used to ensure reasonable performance in both directions, while cycle consistency is enforced between unpaired samples and their cycle-generated counterparts. Experiments show that ITIT with unpaired data achieves performance similar to models trained with orders of magnitude more paired data, making it a promising approach for scaling generative vision-language models with limited paired data.
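
A conceptual sketch of the cycle-consistency objective described above, assuming placeholder text_to_image and image_to_text generators (this is not the paper's code): unpaired text is mapped to an image and back to text, unpaired images are mapped to a caption and back to an image, and each reconstruction is penalized against the original.

# Conceptual sketch of cycle-consistency losses on unpaired data (placeholders, not ITIT's code).

def cycle_losses(unpaired_text, unpaired_image, text_to_image, image_to_text,
                 text_loss, image_loss):
    # Text -> generated image -> reconstructed text: penalize drift from the original text.
    reconstructed_text = image_to_text(text_to_image(unpaired_text))
    loss_text_cycle = text_loss(reconstructed_text, unpaired_text)

    # Image -> generated caption -> reconstructed image: penalize drift from the original image.
    reconstructed_image = text_to_image(image_to_text(unpaired_image))
    loss_image_cycle = image_loss(reconstructed_image, unpaired_image)

    # The full objective also includes a standard supervised term on the small paired set (omitted here).
    return loss_text_cycle + loss_image_cycle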

In this paper, the authors introduce DragView, an interactive framework for generating novel views of unseen scenes. Unlike previous methods, DragView does not rely on explicit camera poses and instead leverages a user-selected starting view to create a relative coordinate system for view synthesis. The approach includes pixel-aligned features, view-dependent modulation layers to handle occlusion, and an OmniView Transformer for feature aggregation. DragView outperforms existing pose-free and generalizable NeRF methods in terms of view synthesis quality and robustness against camera pose noise. The framework is evaluated on synthetic and real-world datasets, demonstrating superior performance.

UniAudio, a groundbreaking audio generation model, offers a universal solution by harnessing Large Language Models (LLMs) to produce various audio types like speech, music, and sounds based on diverse input conditions. This system tokenizes all audio types, alongside other input modalities, and utilizes a multi-scale Transformer architecture to handle the challenges posed by neural codec-based tokenization, achieving impressive scalability with 165K hours of audio data and 1B parameters. UniAudio not only excels in its trained tasks but also seamlessly adapts to new audio generation tasks through fine-tuning, demonstrating its potential as a foundation model for versatile audio generation. Experiments showcase its competitive or superior performance across 11 tasks compared to task-specific models.

Thank you for reading today’s edition.

Your feedback is valuable.


Respond to this email and tell us how you think we could add more value to this newsletter.