At the Google I/O 2024 developer conference held on Tuesday, Google unveiled advancements to its AI-driven chatbot, Gemini. The enhancements aim to improve both the chatbot’s understanding of real-world context and its interactions with users.
A key highlight was the introduction of Gemini Live, a new feature enabling “in-depth” voice interactions with Gemini via smartphones. Users can interrupt the chatbot mid-response to ask clarifying questions, and Gemini adapts to their speech patterns in real time. It can also analyze and respond to a user's surroundings through photos or videos captured with the smartphone camera.
“With Live, Gemini gains a better understanding of the user,” stated Sissie Hsiao, General Manager for Gemini experiences at Google, during a press briefing. “It is fine-tuned to engage in a back-and-forth, natural conversation with the underlying AI model.”
Gemini Live represents an evolution of Google’s existing technologies like Google Lens, which performs image and video analysis, and Google Assistant, known for its speech recognition and generation abilities across various devices.
While Live may at first seem like an incremental upgrade, Google asserts that it draws on cutting-edge generative AI techniques to deliver more accurate image analysis, and that it integrates an advanced speech engine for more consistent, emotionally expressive, and realistic dialogue.
“It’s a real-time voice interface with exceptionally powerful multimodal capabilities and long-context understanding,” explained Oriol Vinyals, Principal Scientist at DeepMind, Google’s AI research subsidiary, in an interview with TechCrunch. “This combination will make interactions feel significantly more powerful.”
The innovative strides powering Live are partly a result of Project Astra, a DeepMind-led initiative focused on developing AI-powered applications and agents for real-time, multimodal comprehension.
“We’ve always aspired to create a universal agent that enhances everyday life,” noted Demis Hassabis, CEO of DeepMind, during the briefing. “Imagine agents that perceive and comprehend our environment, and respond swiftly in conversations, making interactions feel naturally paced and qualitatively superior.”
Scheduled to launch later this year, Gemini Live can address questions about objects in the smartphone camera’s view, such as identifying a neighborhood or naming a part of a broken bicycle. It can interpret and explain portions of computer code and even recall the last known location of objects like a pair of glasses.
Through these advancements, Google aims to redefine how users interact with AI, making digital assistants more practically useful, and more natural to talk to, in everyday scenarios.
Gemini Live is engineered to function as a virtual coach, assisting users in areas such as event rehearsal, idea brainstorming, and more. For example, it can recommend which skills to emphasize in an upcoming job or internship interview and provide guidance on public speaking.
“Gemini Live can communicate more concisely and interact more naturally than traditional text-based interfaces,” Hsiao noted. “We envision an AI assistant that not only tackles complex problems but also offers a seamless and fluid user experience.”
Gemini Live’s ability to “remember” earlier interactions comes from the underlying model: Gemini 1.5 Pro, the flagship of Google’s family of generative AI models. Its extended context window lets it take in and reason over large amounts of data (roughly an hour of video) before formulating a response.
“That equates to hours of video interactions with the model, retaining all prior activities,” Vinyals explained.
Live invites comparison to the generative AI in Meta’s Ray-Ban glasses, which can likewise analyze camera-captured images in near real time. And judging from Google’s pre-recorded demonstration reels, Live bears a notable resemblance to the recently updated version of OpenAI’s ChatGPT.
A significant distinction is that Gemini Live will not be freely available. Upon its release, it will be part of Gemini Advanced, a more sophisticated edition accessible through the Google One AI Premium Plan, which costs $20 per month.
In a potential critique of Meta, one Google demonstration featured AR glasses with a Gemini Live-like application. However, Google, perhaps wary of repeating past missteps in the wearable tech sector, did not confirm whether such glasses or any AI-powered eyewear would be commercially available soon.
Vinyals remained open to the idea, commenting, “We are still in the prototyping phase and presenting [Astra and Gemini Live] to the public. Feedback from early users will help shape our future direction.”
Other Gemini updates
Beyond Live, Gemini is set to receive a series of enhancements aimed at increasing its utility in everyday tasks.
Gemini Advanced users across more than 150 countries and over 35 languages will gain access to the expanded capabilities of Gemini 1.5 Pro. This version boasts an extended context feature that allows the chatbot to analyze, summarize, and respond to questions regarding extensive documents, accommodating up to 1,500 pages. While Live will become available later in the year, Gemini Advanced users can begin utilizing Gemini 1.5 Pro immediately. Documents can be imported directly from Google Drive or uploaded from mobile devices.
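For developers, the same long-context model is exposed through the google-generativeai Python SDK. The sketch below shows one plausible way to ask questions about a lengthy document via that API; this is the developer route, distinct from the Drive-import flow in the consumer Gemini app described above, and the file name, API key placeholder, and prompt are hypothetical.

```python
# Minimal sketch: querying a long document with Gemini 1.5 Pro via the
# google-generativeai Python SDK (pip install google-generativeai).
# "report.pdf" and the prompt text are illustrative placeholders.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # assumes a key from Google AI Studio

# Upload the document once, then reference it alongside the prompt.
document = genai.upload_file(path="report.pdf")

model = genai.GenerativeModel("gemini-1.5-pro")
response = model.generate_content(
    [document, "Summarize the key findings and list any open questions."]
)
print(response.text)
```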
Further advancements are scheduled for later this year for Gemini Advanced users, including an even larger context window that will support up to 2 million tokens. This update will also introduce features such as uploading and analyzing videos up to two hours in length and handling expansive codebases exceeding 30,000 lines of code.
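Relating those figures back to the window sizes gives a rough sense of the per-unit token budgets involved. The Python sketch below derives them from the numbers quoted above; the resulting rates are back-of-envelope estimates implied by the article’s figures, not published specifications, and real tokenization will vary.

```python
# Back-of-envelope arithmetic relating the quoted context-window sizes
# to the document, video, and code figures cited above.
TOKENS_1M = 1_000_000   # current Gemini 1.5 Pro window
TOKENS_2M = 2_000_000   # expanded window announced for later this year

pages_1m = 1_500        # "up to 1,500 pages" on the 1M-token window
video_hours_2m = 2      # "videos up to two hours in length" on the 2M window
code_lines_2m = 30_000  # "codebases exceeding 30,000 lines of code"

print(f"~{TOKENS_1M / pages_1m:.0f} tokens per page of text")
print(f"~{TOKENS_2M / (video_hours_2m * 3600):.0f} tokens per second of video")
print(f"~{TOKENS_2M / code_lines_2m:.0f} tokens per line of code (upper bound)")
```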
According to Google, the expanded context window will enhance Gemini’s image recognition capabilities. For instance, when provided with a photo of a fish dish, Gemini will be able to recommend a similar recipe. Additionally, Gemini will offer step-by-step solutions to math problems.
Trip planning is also getting more comprehensive support.
In the upcoming months, Gemini Advanced is set to introduce a new “planning experience” designed to create custom travel itineraries based on user prompts. This feature will leverage flight information (extracted from a user’s Gmail inbox), meal preferences, and local attractions data sourced from Google Search and Maps, while also considering the distances between these points of interest. Gemini will dynamically update the itinerary to reflect any changes in real time.
In the near term, users of Gemini Advanced will have the capability to develop Gems—custom chatbots powered by Google’s Gemini models. Similar to OpenAI’s GPTs, these Gems can be created from natural language inputs, such as “You’re my running coach. Provide me with a daily running plan,” and can be either shared with others or kept private. While there’s no current information on whether Google intends to launch a marketplace for Gems akin to OpenAI’s GPT Store, we anticipate more details will emerge during the ongoing I/O event.
Additionally, both Gems and the core Gemini platform will soon benefit from enhanced integrations with various Google services including Google Calendar, Tasks, Keep, and YouTube Music. These integrations are aimed at automating a variety of time-saving tasks to enhance user productivity.
Hsiao explained that if you receive a flyer from your child’s school with various events you’d like to add to your personal calendar, you can simply take a photo of the flyer. Using the Gemini app, this information can be seamlessly converted into calendar entries.
While this feature promises significant time savings, it is important to approach Google’s claims with a degree of skepticism, especially given that generative AI models often produce inaccurate summaries and have received mixed reviews in their early stages. However, should the enhanced versions of Gemini and Gemini Advanced function as described by Hsiao, they could indeed offer considerable efficiency benefits.