Google’s annual developer conference, Google I/O, kicked off in the early hours of May 15, Taiwan time, and this year the spotlight was firmly on AI: over the course of the event, the term “AI” was mentioned a total of 122 times.
One of the major updates was the deeper integration of Gemini, Google’s “multimodal” model, into Search and its assistant experiences. Starting this year, Google Search can take videos as queries. Google also introduced AI Overview, a feature that uses AI to summarize search results, and previewed Astra, an intelligent assistant that can recognize objects and actions in video and answer related questions in real time. Additionally, it unveiled the new Gemini 1.5 Flash model and the Veo video generation model.
During the event, Demis Hassabis, the head of Google DeepMind, made his first appearance at Google I/O.
AI Revolution 1: Search Engine! Search with videos and understand complex queries!
Search, the product that cemented Google’s leading position, received a fundamental update built on Gemini’s new capabilities. It can now not only recognize audiovisual content but also understand longer, more complex queries.
Google can now search using videos!
Google Search has long relied primarily on text and images; it has now finally advanced to “video search.” Users can shoot a video, ask a simple question by voice or text, and the search engine will automatically analyze the video content and return a relevant answer.
In a demonstration, a user whose needle kept skating erratically across a vinyl record filmed the problem and asked, “Why is this happening?” Google ran the search automatically and returned a summary through the AI Overview feature.
AI Overview: Understanding longer and more complex queries
AI Overview, a technology Google introduced last year, summarizes and organizes search results at the top of the results page. With the Gemini model’s new “multi-step reasoning” capability, AI Overview can now handle complex questions: even when a question is long, packed with details, or asks about several specific things at once, there is no need to run multiple searches.
For example, a user looking for a new yoga or Pilates studio in Boston can simply search, “Find the best yoga or Pilates studios in Boston and show me their new-member offers and the walking time from Beacon Hill.” Despite the multiple requirements, AI Overview completes the task in a single query.
AI Revolution 2: Astra Assistant! Real-time video analysis, reasoning, and response
Demis Hassabis, the head of Google DeepMind, took the stage at Google I/O for the first time this year to showcase Google’s “future AI assistant,” Astra, which Google claims can understand the dynamic, complex world much as a human does.
Thanks to its multimodal capabilities, Astra can analyze video in real time, including instant analysis of dynamic scenes, and it even retains a memory of what it has seen. This feature drew thunderous applause during the presentation.
In a demonstration, a user filmed their surroundings while walking and asked Astra, “Where do you think I am right now?” They then pointed the camera at a computer screen, circled a section of code with an on-screen brush, and asked, “What here could be improved?” Before the video ended, the user asked, “Do you remember where I put my glasses?” Astra scanned the frames from the previous few minutes, found the one showing the glasses, analyzed it, and answered: “Next to an apple.”
AI Revolution 3: Google Photos Search! AI helps you find photos and document your life
Google also introduced the Ask Photos with Gemini feature in Google Photos. It uses image analysis to classify the objects in photos and add keyword tags, so users can, for example, quickly pull up photos showing their car’s license plate or trace the progress of a daughter learning to swim. Asked, “When did my daughter learn the backstroke?” Gemini can quickly find the relevant photos and answer with the date.
AI Revolution 4: Android! Gemini spans all experiences, including conversations and videos
Android is expected to become the best platform for experiencing Google’s AI capabilities, with Gemini always on hand to help on the phone. Based on the applications demonstrated at the conference, Gemini can generate memes mid-conversation, answer questions about a sports video, and, through the Gemini Advanced app, instantly answer questions about PDF files of more than 80 pages.
Gemini’s ability to ingest a huge number of tokens at once lets it read through an entire economics textbook in seconds and then summarize it or answer questions.
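For developers curious what the multimodal, video-question side of this looks like in code, here is a minimal sketch using Google’s google-generativeai Python SDK and its File API; the clip name, question, and API key are placeholder assumptions, not anything shown at the conference.

```python
import time

import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder key

# Upload the clip through the File API, then wait until the
# service has finished processing it.
video = genai.upload_file("rally.mp4")  # hypothetical local clip
while video.state.name == "PROCESSING":
    time.sleep(5)
    video = genai.get_file(video.name)

# Ask a question about the uploaded video in a single request.
model = genai.GenerativeModel("gemini-1.5-pro")
response = model.generate_content(
    [video, "What happened in the final rally of this match?"]
)
print(response.text)
```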
AI Revolution 5: Gemini Update! New Flash model, lighter and able to process a million tokens at once
Large language model technology is the foundation for all the new features introduced this time. Gemini, Google’s core AI model, has evolved around two core abilities, “multimodality” and “massive processing”: it can now take in millions of tokens of text, images, and video in a single pass.
The new member of the Gemini family, Gemini 1.5 Flash, falls between Gemini 1.5 Pro and Gemini Nano in size, yet is lighter and more efficient while offering capability comparable to Gemini 1.5 Pro. Its context window can hold a million tokens, enough to analyze documents of roughly 1,500 pages or code exceeding 30,000 lines. The lighter model was produced through “knowledge distillation” and is better suited to developers who need speed and cost efficiency.
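As a sense of what that developer focus means in practice, here is a minimal sketch of calling Gemini 1.5 Flash through the same google-generativeai Python SDK; the document path, prompt, and API key are placeholder assumptions.

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder key

# Flash shares the long-context design described above, so a very
# large document can be sent in one request.
model = genai.GenerativeModel("gemini-1.5-flash")

with open("long_report.txt", encoding="utf-8") as f:  # hypothetical file
    document = f.read()

response = model.generate_content(
    ["Summarize the key points of this document:", document]
)
print(response.text)
```

Since requests are billed per token, the lighter Flash model is the natural choice when summarization at this scale has to be fast and cheap.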
Gemini 1.5 Pro Update
Gemini 1.5 Pro, announced only this February, is also being upgraded: its token capacity will double to 2 million, enough to process 2 hours of video, 22 hours of audio, more than 60,000 lines of code, or more than 1.4 million words of text in a single request.
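To get a feel for that 2-million-token budget, here is a minimal sketch of measuring an input against the window with the SDK’s count_tokens call; the file path is a placeholder assumption.

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder key

model = genai.GenerativeModel("gemini-1.5-pro")

with open("codebase_dump.txt", encoding="utf-8") as f:  # hypothetical file
    text = f.read()

# count_tokens reports how many tokens the input would occupy,
# without actually running a generation request.
count = model.count_tokens(text)
print(f"{count.total_tokens:,} of ~2,000,000 tokens used")
```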
AI Revolution 6: Veo Video Model! Text-based video generation
On the video generation front, Google presented Veo, a direct challenge to OpenAI’s Sora. Veo generates high-quality 1080p video from natural-language text instructions and understands the vocabulary of filmmaking and visual effects, allowing it to apply techniques such as time-lapse during the creative process.
Sora, for its part, can generate complex scenes with multiple characters, specific actions, and abundant detail. The model not only understands the various objects mentioned in a prompt but also how those objects exist in the real world, producing strikingly realistic results.
OpenAI Releases GPT-4o a Day Before Google I/O
Additionally, a day before Google I/O, OpenAI unveiled its new model, GPT-4o. It matches GPT-4’s level of intelligence while adding stronger voice and vision capabilities, making interactions feel much closer to talking with a real person.
GPT-4o can translate in real time during a conversation, enabling smooth communication between two people speaking different languages. Asked to tell a bedtime story, it narrates in an expressive, lively voice, and it can walk someone through a simple math problem in speech that closely resembles a human’s.
According to OpenAI, GPT-4o can “read” the user’s facial expressions and tone of voice, judge when and how to respond, and switch quickly between styles of delivery, from a robotic monotone to lively singing.
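For developers, the text side of GPT-4o is reachable through OpenAI’s official Python SDK (v1.x); the real-time voice and vision behavior shown in the demos runs through other interfaces. A minimal sketch, with the prompt as a placeholder:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": "Translate 'Where is the train station?' into Japanese.",
        }
    ],
)
print(response.choices[0].message.content)
```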
With the two major AI powerhouses releasing their latest technologies within two days of each other, this AI revolution will continue to reshape people’s lives.
Editor: Lin Mei-Xin