Google launches new multimodal Gemini Embedding 2 model

What's new? Gemini Embedding 2 supports text, image, video, audio, and document embeddings in a unified space; it is available via the Gemini API and Vertex AI with adjustable output dimensions.

Google has introduced Gemini Embedding 2, a fully multimodal embedding model now available in public preview through the Gemini API and Vertex AI. The model targets developers, AI researchers, and enterprise teams, providing a single, unified embedding space for text, images, videos, audio, and documents, with support for more than 100 languages. The release is live immediately and is accessible wherever the Gemini API and Vertex AI are offered.

Gemini Embedding 2 marks a step up from previous text-only models by supporting multiple modalities in a single request. Technical specs include up to 8,192 text tokens, up to six images (PNG, JPEG), videos up to 120 seconds (MP4, MOV), native audio ingestion, and PDFs of up to six pages. The model incorporates Matryoshka Representation Learning (MRL) for flexible output dimensions, letting users choose between 3072, 1536, or 768 dimensions to trade storage against retrieval quality. Early-access partners have already begun using these capabilities for tasks such as Retrieval-Augmented Generation, semantic search, and data clustering.
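For a sense of how the adjustable output dimensions work in practice, here is a minimal sketch using the google-genai Python SDK, which exposes an `embed_content` call with an `output_dimensionality` option for Gemini embedding models. The model identifier `gemini-embedding-2` is an assumption for illustration; substitute whatever ID Google publishes for this release.

```python
# Minimal sketch: request an embedding at a reduced output dimension.
# Assumes the google-genai SDK (pip install google-genai) and an API key
# in the GOOGLE_API_KEY environment variable.
from google import genai
from google.genai import types

client = genai.Client()

result = client.models.embed_content(
    model="gemini-embedding-2",  # hypothetical ID for this release
    contents="Matryoshka embeddings trade dimensions for storage.",
    config=types.EmbedContentConfig(
        # MRL lets you request a smaller vector (e.g. 768 instead of 3072)
        # to cut vector-database storage and speed up similarity search.
        output_dimensionality=768,
    ),
)

vector = result.embeddings[0].values
print(len(vector))  # -> 768
```

One caveat worth checking in the model documentation: with MRL-style models, vectors returned at reduced dimensions may not be unit-length, so they may need L2-normalization before cosine-similarity comparisons.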

Google has a long history of embedding technology powering its core products and AI research. The Gemini architecture forms the backbone of this release, building on the company's expertise in large language models and multimodal understanding. Early industry feedback suggests Gemini Embedding 2 outperforms leading competitors on text, image, and video benchmarks, setting a new bar for multimodal AI.

Source