Google launches Gemini Omni Flash API at $0.10/second, making video edits “conversational” for enterprises

A stateful interactions API lets teams edit clips through multi-turn instructions, with SynthID, C2PA, and clear deepfake limits.

By Lama Al-Rashid, Technology Correspondent, The Executives Brief

about 5 hours ago·4 min read

Executive summary

Google’s Gemini Omni Flash, the first model in its “Omni” family, is now rolling out through an API to developers and enterprise customers, after debuting to consumers at I/O 2026. The move turns AI video from a one-shot render into an edit workflow, starting at $0.10 per second of generated 720p video.

If you’ve ever tried to get a 90-second training video approved, you already know the problem Gemini Omni Flash is aiming at: one tiny change can restart the whole pipeline. Google is rolling out Gemini Omni Flash through an API for developers and enterprise customers, and pricing starts at $0.10 per second of generated 720p video. Rough math? A 10-second clip costs about $1. The bigger shift is what happens after that first render: Omni Flash is designed for conversational editing, where each new instruction carries forward the clip and its references.

This is also a response to a real enterprise “catch” that appeared when Omni launched in May: with no programmatic interface, it behaved more like a consumer or prosumer tool than a production workflow. The API rollout changes the calculus for marketing and learning-and-development teams that routinely crank out internal videos. Google positions the Omni family as “from any input,” starting with video, but the headline feature for enterprise buyers is editing a finished clip through conversation. Think less “book a reshoot,” more “send a note,” because you can refine a clip iteratively instead of starting from scratch and hoping.

Under the hood, the magic is not just the model. It’s Google’s new interactions API, described as stateful and built for multi-turn tasks rather than open-ended chat. Each turn includes the previous video and the references, which is what makes edits accumulate coherently instead of drifting. Developers can chain generations: generate a clip, edit the cat into a puma kitten, restyle a video into 8-bit retro, then into a watercolor look, and store each version to branch later. That statefulness matters for budgets too, because conversational edits aim to reduce wasted iterations. If context carries across turns, your next attempt is usually “closer,” instead of re-deriving everything from a blank prompt.
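To make the “store each version to branch later” idea concrete, here is a minimal local sketch of a version history in Python. It is not Google’s interactions API: the `ClipVersion` and `EditHistory` names, their fields, and the choice to branch the watercolor look from the puma version are all assumptions made for illustration, and no video is generated.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ClipVersion:
    """One stored version of a clip (hypothetical local record, not an API type)."""
    version_id: int
    instruction: str
    parent_id: Optional[int] = None


@dataclass
class EditHistory:
    """Keeps every version so a later edit can branch from any earlier one."""
    versions: Dict[int, ClipVersion] = field(default_factory=dict)

    def add(self, instruction: str, parent_id: Optional[int] = None) -> int:
        """Record a new version and return its id."""
        if parent_id is not None and parent_id not in self.versions:
            raise KeyError(f"unknown parent version {parent_id}")
        version_id = len(self.versions) + 1
        self.versions[version_id] = ClipVersion(version_id, instruction, parent_id)
        return version_id

    def lineage(self, version_id: int) -> List[str]:
        """Return the instructions from the first generation up to this version."""
        steps: List[str] = []
        current: Optional[int] = version_id
        while current is not None:
            version = self.versions[current]
            steps.append(version.instruction)
            current = version.parent_id
        return list(reversed(steps))


history = EditHistory()
base = history.add("generate a clip of a cat")
puma = history.add("turn the cat into a puma kitten", parent_id=base)
retro = history.add("restyle as 8-bit retro", parent_id=puma)
# Branch from the puma version rather than from the retro one.
watercolor = history.add("restyle as watercolor", parent_id=puma)
```

Calling `history.lineage(watercolor)` walks back through the parents and returns the three instructions that led to that branch, which is the kind of bookkeeping a team would likely want alongside whatever state the API itself carries.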

Google also wants enterprises to care about control, not just convenience. Omni accepts more than text. In addition to words describing what you want, you can feed it multiple reference images and existing video clips. The point is reference-driven control: you can supply a photograph of a particular object, ask the model to place that object into a scene, and it reproduces the real thing’s coloring and rough shape instead of inventing a generic stand-in. Hand it a scene with signage and it can rewrite the signs in another language or for a particular brand, or even drop in a company’s logo. Google highlights two strengths tied directly to enterprise needs. First is a world model that keeps physical behavior consistent, like adding light rain and puddles and rendering reflections in wet pavement. Second is text and logo insertion, with the caveat that it is not flawless.

Because this is production-grade territory, the limitations and the “known open problems” matter as much as the wow factor. Clips currently cap at 10 seconds, according to the model’s published model card. Longer videos require chunking and editing together. Uploaded footage can be edited only if it runs 10 seconds or under and the user holds rights to it. Also, although the model generates audio alongside the video it produces, it does not yet accept audio as an input. On quality and behavior, the early signal is strong but uneven. For example, in LMArena’s Text-to-Video Arena, where people vote on head-to-head outputs, Omni Flash sat at number one with a score of 1527. But Google’s own model card is candid that holding consistency across edits and rendering accurate text remain open problems, and sign tracking in complex scenes can slip.
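What “chunking” might look like for a longer piece, such as the 90-second training video mentioned earlier, can be sketched as a simple planning step. The function below is an illustration only: `plan_chunks` is a hypothetical helper, it works in whole seconds, and the 3-second floor is taken from the clip-length range quoted later in this article. It says nothing about how to keep scenes visually consistent across chunks.

```python
from typing import List


def plan_chunks(total_seconds: int, max_len: int = 10, min_len: int = 3) -> List[int]:
    """Split a whole-second duration into clip lengths within [min_len, max_len]."""
    if min_len <= 0 or max_len < 2 * min_len:
        raise ValueError("expected 0 < min_len and max_len >= 2 * min_len")
    if total_seconds < min_len:
        raise ValueError("video is shorter than the minimum clip length")

    full, remainder = divmod(total_seconds, max_len)
    chunks = [max_len] * full
    if remainder == 0:
        return chunks
    if remainder >= min_len:
        chunks.append(remainder)
        return chunks

    # The leftover is too short to be its own clip, so borrow the
    # missing seconds from the last full-length clip.
    shortfall = min_len - remainder
    chunks[-1] -= shortfall
    chunks.append(min_len)
    return chunks


print(plan_chunks(90))  # nine clips of 10 seconds each
print(plan_chunks(21))  # [10, 8, 3]
```

A 90-second video therefore means at least nine separate generations before any edits, which is worth keeping in mind when reading the pricing below.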

Now the compliance and safety story, the part CISOs and procurement teams will ask about immediately. Every Omni clip includes Google’s SynthID watermark, and Google is extending C2PA Content Credentials across its generative tools. It also launched an AI Content Detection API that flags AI-generated media, both Google’s and other vendors’. And Google draws a deliberate line on deepfakes: the model won’t take a still photo of a person plus an audio clip and lip-sync them into speech. It will, however, take a recording of someone talking and translate it into another language, which fits enterprise localization for global training content.

Finally, the pricing and quality tradeoffs that will shape who buys what. Omni Flash costs $0.10 per second of generated 720p video, with no 1080p or 4K option. That 720p ceiling is a real constraint if you’re making premium brand work for big screens, and it helps explain why Google’s higher-resolution Veo 3.1 still has a job. Clips run 3 to 10 seconds in landscape (16:9) or portrait (9:16). As reference inputs, the model accepts up to seven images and up to three video clips of three seconds or less. Output is standard MP4, with SynthID watermarking and C2PA credentials baked in.
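Teams wiring this into a pipeline may want to check requests against those published limits before spending money on a generation. The sketch below does that locally in Python. The `ClipRequest` shape and its field names are hypothetical and do not reflect the actual API schema; only the numeric limits come from the figures above.

```python
from dataclasses import dataclass, field
from typing import List

ALLOWED_ASPECT_RATIOS = {"16:9", "9:16"}


@dataclass
class ClipRequest:
    """Hypothetical request shape; field names are illustrative only."""
    duration_s: float
    aspect_ratio: str
    reference_images: List[str] = field(default_factory=list)
    reference_video_lengths_s: List[float] = field(default_factory=list)


def validate(request: ClipRequest) -> List[str]:
    """Return a list of problems; an empty list means the request fits the limits."""
    problems: List[str] = []
    if not 3 <= request.duration_s <= 10:
        problems.append("duration must be between 3 and 10 seconds")
    if request.aspect_ratio not in ALLOWED_ASPECT_RATIOS:
        problems.append("aspect ratio must be 16:9 or 9:16")
    if len(request.reference_images) > 7:
        problems.append("at most seven reference images")
    if len(request.reference_video_lengths_s) > 3:
        problems.append("at most three reference video clips")
    if any(length > 3 for length in request.reference_video_lengths_s):
        problems.append("each reference video clip must be three seconds or less")
    return problems


ok = ClipRequest(8, "16:9", ["product.png"], [2.5])
too_long = ClipRequest(12, "4:3")
print(validate(ok))        # []
print(validate(too_long))  # two problems: duration and aspect ratio
```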

Strategically, this is less about undercutting competitors on spec sheets and more about changing the workflow. At $1 per 10-second pass, enterprises will still need to budget for edit-heavy sessions, because each conversational edit is a fresh generation you pay for. But the stateful model shifts the real cost from “restarting the pipeline” to “iterating toward a take that works.” In a world where generative video has often felt like a one-shot render that creates new approval bottlenecks, Gemini Omni Flash is trying to make video behave more like a living document.
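The budgeting arithmetic is simple enough to sketch. The Python snippet below assumes every pass, whether the first generation or a later edit, is billed for the full clip length at the quoted $0.10 per second, with no discounts or minimums. That billing model is an assumption drawn from the figures in this article, and `session_cost` is a hypothetical helper.

```python
from decimal import Decimal

# USD per generated second of 720p video, as quoted in the article.
RATE_PER_SECOND = Decimal("0.10")


def session_cost(clip_seconds: int, passes: int) -> Decimal:
    """Cost of a session where each pass regenerates the full clip length."""
    if clip_seconds <= 0 or passes <= 0:
        raise ValueError("clip_seconds and passes must be positive")
    return RATE_PER_SECOND * clip_seconds * passes


print(session_cost(10, 1))  # 1.00, a single 10-second clip
print(session_cost(10, 6))  # 6.00, one generation plus five edits
```

Under that assumption, a 10-second clip that takes one generation and five conversational edits comes to $6, so the number of passes, not the per-second rate, is the figure to watch.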
