On Thursday, Google and Meta unveiled new models that mark notable advances in artificial intelligence (AI). The search giant debuted Gemini 1.5, an upgraded AI model with multimodal long-context understanding, while Meta announced its non-generative Video Joint Embedding Predictive Architecture (V-JEPA) model, which teaches machines to understand the world by watching video. Both releases offer new ways to explore what AI can do. Notably, OpenAI also unveiled Sora, its first text-to-video generation model, on the same day.
Google Gemini 1.5 model
The release of Gemini 1.5 was announced in a blog post by Google DeepMind CEO Demis Hassabis. The new model is built on a Transformer and Mixture-of-Experts (MoE) architecture. Although several versions are planned, only Gemini 1.5 Pro has been made available for early testing so far. According to Hassabis, the mid-size multimodal model performs on par with the company’s largest generative model, Gemini 1.0 Ultra, which is accessible through the Gemini Advanced subscription included in the Google One AI Premium plan.
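For readers unfamiliar with the term, a Mixture-of-Experts layer routes each token to a small subset of specialised sub-networks (“experts”) instead of pushing everything through one large feed-forward block. The sketch below is a generic, illustrative top-1 routing layer in PyTorch; the class name, sizes, and routing scheme are invented for illustration and say nothing about how Gemini 1.5 actually implements MoE.

```python
# Toy Mixture-of-Experts feed-forward layer with top-1 routing.
# Purely illustrative; not Gemini's architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    def __init__(self, d_model=64, n_experts=4):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts)   # router: scores each expert per token
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, 4 * d_model),
                           nn.GELU(),
                           nn.Linear(4 * d_model, d_model))
             for _ in range(n_experts)]
        )

    def forward(self, x):                            # x: (tokens, d_model)
        scores = F.softmax(self.gate(x), dim=-1)     # routing probabilities
        top_p, top_idx = scores.max(dim=-1)          # pick one expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            sel = top_idx == i                       # tokens routed to expert i
            if sel.any():
                out[sel] = top_p[sel].unsqueeze(-1) * expert(x[sel])
        return out

tokens = torch.randn(10, 64)
print(TinyMoE()(tokens).shape)   # torch.Size([10, 64])
```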
The biggest improvement in Gemini 1.5 is its ability to process long-context information. The standard Pro version ships with a 128,000-token context window, whereas Gemini 1.0 offered 32,000 tokens. Tokens are entire parts or subsections of words, images, audio, video, or code that serve as the building blocks a foundation model uses to process information. According to Hassabis, the larger the context window, the more information a model can take in from a given prompt, making its output more consistent, relevant, and useful.
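To make those numbers concrete, the snippet below is a rough back-of-the-envelope check of whether a piece of text would fit inside a 128,000-token window. The four-characters-per-token ratio is a common rule of thumb, not Gemini’s actual tokeniser, so treat the result purely as an estimate.

```python
# Rough estimate of token usage against a 128,000-token context window.
CONTEXT_WINDOW = 128_000      # Gemini 1.5 Pro's standard window, per Google
CHARS_PER_TOKEN = 4           # assumed average; real tokenisers vary

def estimated_tokens(text: str) -> int:
    return max(1, len(text) // CHARS_PER_TOKEN)

sample = "Gemini 1.5 Pro accepts very long prompts. " * 5_000   # invented sample text
needed = estimated_tokens(sample)
print(f"~{needed:,} estimated tokens; fits in window: {needed <= CONTEXT_WINDOW}")
```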
Alongside the standard Pro version, Google is also releasing a special model with a context window of up to 1 million tokens. It is available in a private preview to a limited number of developers and enterprise clients. There is no dedicated platform for it, but it can be tested through AI Studio, Google’s cloud console tool for experimenting with generative AI models, as well as through Vertex AI. According to Google, this version can process over 700,000 words, 11 hours of audio, or an hour of video in one go.
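For developers with preview access, querying the model looks roughly like the sketch below, which assumes the google-generativeai Python SDK and the “gemini-1.5-pro-latest” model identifier used in AI Studio during the preview; the exact model name, quotas, and access requirements may differ for your account.

```python
# Minimal sketch of calling the long-context Gemini 1.5 Pro preview via AI Studio.
import google.generativeai as genai

genai.configure(api_key="YOUR_AI_STUDIO_KEY")            # placeholder key
model = genai.GenerativeModel("gemini-1.5-pro-latest")   # preview model name (assumed)

# A long-context prompt would go here, e.g. an entire transcript or codebase.
long_prompt = "Summarise the following transcript:\n" + "..."

print(model.count_tokens(long_prompt))    # check usage against the large context window
response = model.generate_content(long_prompt)
print(response.text)
```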
Meta V-JEPA
Meta announced V-JEPA in a post on X (formerly known as Twitter). It is not a generative AI model; rather, it is a method for teaching machine learning (ML) systems to understand and model the real world by watching videos. The company called it an important step towards advanced machine intelligence (AMI), the goal championed by Yann LeCun, one of the three “Godfathers of AI”.
Essentially, it is a predictive model that learns entirely from visual data. It can understand what is happening in a video and also anticipate what will happen next. To train it, the company says, parts of the videos were masked in both space and time using a new masking technique: some frames had portions blacked out, while other frames were removed entirely, forcing the model to predict both the missing parts of the current frame and the frames that follow. The company claims the model handles both tasks effectively. Notably, V-JEPA can analyse and make predictions about videos up to ten seconds long.
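The masking idea can be pictured with a small toy example: hide whole frames (temporal masking) and rectangular blocks within the remaining frames (spatial masking), then ask a model to fill in what is missing. The numpy sketch below only illustrates that masking step; it is not Meta’s V-JEPA code, and the clip size and masking ratios are invented.

```python
# Toy illustration of spatio-temporal masking over a grid of video patches.
import numpy as np

rng = np.random.default_rng(0)
T, H, W = 16, 8, 8                      # a clip of 16 frames, each an 8x8 grid of patches
video_patches = rng.random((T, H, W))   # stand-in for patch embeddings

mask = np.zeros((T, H, W), dtype=bool)  # True = hidden from the model, must be predicted

# Temporal masking: drop a few whole frames.
dropped_frames = rng.choice(T, size=3, replace=False)
mask[dropped_frames] = True

# Spatial masking: black out a random 4x4 block in each remaining frame.
for t in range(T):
    if t in dropped_frames:
        continue
    y, x = rng.integers(0, H - 4), rng.integers(0, W - 4)
    mask[t, y:y + 4, x:x + 4] = True

visible = np.where(mask, np.nan, video_patches)   # what the predictor would actually see
print(f"masked {mask.mean():.0%} of all patches")
```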
Meta wrote in a blog post that V-JEPA is “quite good compared to previous methods for that high-grade action recognition task, for example, if the model needs to be able to distinguish between someone putting down a pen, picking up a pen, and pretending to put down a pen but not actually doing it.”
Audio input is currently absent from the V-JEPA model, since it solely utilises visual data. Meta now plans to incorporate audio into the ML model alongside video. Improving the model’s performance on longer videos is another of the company’s goals.