Gemini, Google’s suite of generative AI models, has expanded its capabilities to process more extensive documents, codebases, videos, and audio recordings than previously possible.
In a keynote at the Google I/O 2024 developer conference on Tuesday, Google unveiled the private preview of an upgraded Gemini 1.5 Pro, the current premier model within the Gemini lineup. This new version can accommodate inputs of up to 2 million tokens, which is twice the capacity of its predecessor.
With this enhancement, the upgraded Gemini 1.5 Pro supports the largest input of any commercially available model, ahead of Anthropic’s Claude 3, which tops out at 1 million tokens.
In AI terms, “tokens” are small segments of raw data, such as the syllables “fan,” “tas,” and “tic” in the word “fantastic.” Two million tokens correspond to roughly 1.4 million words, two hours of video, or 22 hours of audio.
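The conversions above are easy to sanity-check with back-of-the-envelope arithmetic. The ratios below are rough averages implied by the figures in this article, not exact values for any particular tokenizer:

```python
# Approximate conversions between a 2M-token context and other media,
# using the rough ratios cited above (illustrative averages only).
TOKENS = 2_000_000

WORDS_PER_TOKEN = 0.7                    # ~1.4M words per 2M tokens
TOKENS_PER_HOUR_VIDEO = 1_000_000        # ~2 hours of video per 2M tokens
TOKENS_PER_HOUR_AUDIO = 2_000_000 / 22   # ~22 hours of audio per 2M tokens

words = TOKENS * WORDS_PER_TOKEN
video_hours = TOKENS / TOKENS_PER_HOUR_VIDEO
audio_hours = TOKENS / TOKENS_PER_HOUR_AUDIO

print(f"~{words:,.0f} words")            # ~1,400,000 words
print(f"~{video_hours:.0f} hours of video")
print(f"~{audio_hours:.0f} hours of audio")
```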
A larger token capacity can yield notable performance gains, particularly when analyzing large datasets. Models with expansive context windows, such as the 2-million-token Gemini 1.5 Pro, are less likely than their small-context counterparts to lose track of material from earlier in a conversation or document, which helps them stay on topic. Large-context models are also often better at capturing the flow and nuances of the data they ingest.
Developers eager to experiment with the 2-million-token context of Gemini 1.5 Pro can join the waitlist via Google AI Studio, a platform dedicated to generative AI development by Google. Meanwhile, the version with a 1-million-token context is set to become widely available through Google’s developer services in the coming month.
In addition to the expanded context window, Google has significantly “enhanced” Gemini 1.5 Pro over the past few months. These enhancements include advancements in code generation, logical reasoning and planning, multi-turn conversation capabilities, and audio and image comprehension. According to Google, the Gemini API and AI Studio now allow 1.5 Pro to process audio alongside images and videos, and introduce a new feature called system instructions that enables the model to be more effectively “steered” during interactions.
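To illustrate how a system instruction “steers” a model, here is a sketch of what a request body along these lines might look like. The snake_case field names mirror the shape of Gemini’s public REST API at the time of writing, but treat them as illustrative rather than definitive:

```python
import json

# Sketch of a request body that pairs a system instruction (persistent
# steering text) with a user turn. Field names are illustrative.
def build_request(system_text: str, user_text: str) -> dict:
    return {
        "system_instruction": {"parts": [{"text": system_text}]},
        "contents": [{"role": "user", "parts": [{"text": user_text}]}],
    }

payload = build_request(
    "You are a terse assistant. Answer in one sentence.",
    "Explain what a context window is.",
)
print(json.dumps(payload, indent=2))
```

The system instruction travels with every request, so the model keeps the persona or constraints without the user having to restate them each turn.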
Gemini 1.5 Flash, a faster model
For applications with lower demands, Google is introducing Gemini 1.5 Flash in public preview. This “distilled” version of Gemini 1.5 Pro is a compact and efficient model designed for “narrow,” “high-frequency” generative AI tasks. Like Gemini 1.5 Pro, Flash is multimodal, allowing it to analyze audio, video, and images as well as text, though its output is text-only. Flash supports a context window of up to 2 million tokens.
According to Josh Woodward, VP of Google Labs, one of Google’s experimental AI divisions, “Gemini Pro is intended for more complex, often multi-step reasoning tasks. However, developers focused on model output speed will find Flash to be the optimal choice.”
Woodward emphasized that Flash is especially advantageous for applications such as summarization, chat functionalities, image and video captioning, as well as data extraction from extensive documents and tables.
Flash appears to be Google’s answer to compact, cost-effective API-served models such as Anthropic’s Claude 3 Haiku. Alongside Gemini 1.5 Pro, Flash is generally available in more than 200 countries and territories, including the European Economic Area, the U.K., and Switzerland. However, the version supporting a 2-million-token context is restricted to users on a waitlist.
In a recent update targeting budget-conscious developers, all Gemini models, not just the Flash variant, will soon gain access to a feature known as context caching. This enables developers to store substantial amounts of data, such as a knowledge base or a database of research papers, in a cache that Gemini models can efficiently and cost-effectively access.
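The idea behind context caching is upload-once, reference-many-times: the client stores a large body of text with the service, receives a handle, and references that handle in later requests instead of resending the data. The toy class below is a conceptual sketch of that pattern, not the Gemini SDK’s actual interface:

```python
import hashlib

# Conceptual sketch of context caching: store a large context once,
# then reference it by handle in subsequent requests.
class ContextCache:
    def __init__(self):
        self._store: dict[str, str] = {}

    def put(self, text: str) -> str:
        # A content hash serves as the cache handle in this sketch.
        handle = hashlib.sha256(text.encode()).hexdigest()[:12]
        self._store[handle] = text
        return handle

    def build_prompt(self, handle: str, question: str) -> str:
        # Server-side, the cached context is prepended without the
        # client paying to retransmit it on every request.
        return self._store[handle] + "\n\nQuestion: " + question

cache = ContextCache()
handle = cache.put("knowledge base text " * 1000)  # stand-in for a large corpus
prompt = cache.build_prompt(handle, "Summarize the key findings.")
```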
The complementary Batch API, now available in public preview through Vertex AI, Google’s enterprise-level generative AI development platform, provides a more economical approach to managing workloads like classification, sentiment analysis, data extraction, and description generation. It allows multiple prompts to be sent to Gemini models within a single request.
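The workload shape batching targets can be sketched as follows: many independent prompts are packaged into one request, and results are fanned back out by ID. The wire format of the actual Batch API will differ; this only illustrates the pattern:

```python
# Conceptual sketch of a batch request: many prompts, one submission.
# The structure here is illustrative, not the Batch API's real format.
def make_batch(prompts: list[str]) -> list[dict]:
    return [
        {"id": i, "contents": [{"parts": [{"text": p}]}]}
        for i, p in enumerate(prompts)
    ]

batch = make_batch([
    "Classify the sentiment of: 'Great battery life.'",
    "Extract the dates in: 'Shipped May 14, arrived May 16.'",
])
```

Submitting one batch instead of many single-prompt requests cuts per-request overhead, which is where the cost savings for classification and extraction workloads come from.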
Additionally, another feature set to launch later this month in preview on Vertex AI, termed controlled generation, could yield further cost reductions. Woodward suggests that this feature will enable users to dictate Gemini model outputs according to specific formats or schemas, such as JSON or XML.
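The payoff of constraining output to a format like JSON is that responses can be validated and parsed mechanically instead of scraped from free text. The sketch below shows the consumer side of that contract; the schema keys are invented for illustration and are not part of the Vertex AI feature itself:

```python
import json

# Sketch of consuming schema-constrained output: parse the model's
# JSON reply and check it has the fields we asked for. The schema
# keys below are hypothetical examples.
SCHEMA_KEYS = {"title", "summary", "sentiment"}

def validate(reply: str) -> dict:
    data = json.loads(reply)
    missing = SCHEMA_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return data

reply = '{"title": "Gemini 1.5", "summary": "Bigger context.", "sentiment": "positive"}'
record = validate(reply)
```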
“You’ll be able to transmit all your files to the model once, eliminating the need for repeated transmissions,” Woodward stated. “This should significantly enhance the utility of long contexts and make them more cost-effective.”