Google launches Veo 3.1 video generator with audio and editing tools

Google’s Latest Video Generation Model

Google has released Veo 3.1, an updated version of its AI video generator that now includes audio across all features and introduces new editing capabilities. The timing is interesting—it comes as OpenAI’s Sora 2 app gains popularity with its TikTok-style interface that emphasizes sharing and remixing content. Sora 2 reportedly reached 1 million downloads within five days and topped Apple’s App Store charts.

Veo 3.1 brings several new features to the table. Users can create videos with synchronized ambient noise, dialogue, and sound effects using the “Ingredients to Video” tool, which combines multiple reference images into a single scene. The “Frames to Video” feature generates transitions between starting and ending images, while “Extend” can create clips up to a minute long by continuing motion from existing videos.

Testing the Model’s Capabilities

If you’re thinking about trying Veo 3.1, be prepared for the cost. It’s currently one of the most expensive video generation models available, comparable to Sora 2 and only behind Sora 2 Pro, which costs more than twice as much per generation. Free users get 100 monthly credits—enough for about five videos per month. Through the Gemini API, Veo 3.1 costs approximately $0.40 per second of generated video with audio, while a faster variant called Veo 3.1 Fast costs $0.15 per second.
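To make the pricing concrete, here is a minimal sketch of the arithmetic in Python. It uses only the per-second rates and free-tier figures quoted above. The dictionary keys are illustrative labels rather than official API model identifiers, and the 8-second clip length is an assumption chosen for the example, not a figure from this article.

```python
# Per-second API rates quoted in the article (USD, video with audio).
# The keys are illustrative labels, not official model identifiers.
RATE_PER_SECOND = {
    "veo-3.1": 0.40,
    "veo-3.1-fast": 0.15,
}


def clip_cost(seconds: float, model: str = "veo-3.1") -> float:
    """Estimated API cost in USD for a clip of the given length."""
    return round(seconds * RATE_PER_SECOND[model], 2)


def credits_per_video(monthly_credits: int = 100, videos_per_month: int = 5) -> float:
    """Credits implied per video, from '100 credits, about five videos'."""
    return monthly_credits / videos_per_month


if __name__ == "__main__":
    clip_seconds = 8  # assumed clip length, for illustration only
    for model in RATE_PER_SECOND:
        print(f"{model}: ${clip_cost(clip_seconds, model):.2f} for {clip_seconds}s")
    print(f"Implied free-tier cost: {credits_per_video():.0f} credits per video")
```

Under these assumptions, an 8-second clip comes to $3.20 at the standard rate and $1.20 with Veo 3.1 Fast, and the free tier works out to roughly 20 credits per video.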

In text-to-video mode, Veo 3.1 shows definite improvement over its predecessor. The model handles coherence well and demonstrates better understanding of contextual environments. It works across different styles, from photorealism to stylized content. Interestingly, Veo 3.1 produced better results from a text prompt alone than from the same prompt paired with an initial image.

The tradeoff seems to be movement speed. Veo 3.1 prioritizes coherence over fluidity, making it challenging to generate fast-paced action. Elements move more slowly but maintain consistency throughout the clip. Other models like Kling still lead in rapid movement, though they require more attempts to achieve usable results.

Strengths and Limitations

Where Veo 3.1 really shines is in text-to-video with dialogue. It handles lip sync better than any other model currently available, and its dialogue, intonation, voices, and emotions are more accurate than those of competing models. While other generators can produce ambient noise, only Sora, Veo, and Grok can generate actual words, and of those three, Veo 3.1 requires the fewest attempts to get good results.

However, image-to-video generation appears to be a weaker area in this update. When given starting frames in different aspect ratios, the model struggled to maintain the coherence it previously showed. If the prompt strays too far from what would logically follow the input image, Veo 3.1 tends to generate incoherent scenes or clips that jump between locations and setups.

The “Elements to Video” feature works like inpainting for video, letting users insert or delete elements from a scene. But don’t expect perfect coherence or exact replication of your reference images. Veo 3.1 treats uploaded elements as inspiration rather than strict templates—it generates scenes that follow the prompt and include objects that resemble what you provided, but it won’t precisely recreate specific elements.

Competitive Landscape

The AI video generation market has become increasingly crowded in 2025. Runway’s Gen-4 model targets filmmakers, Luma Labs offers fast generation for social media, and Adobe integrates Firefly Video into Creative Cloud, while updates from xAI, Kling, Meta, and Google all compete on realism, sound generation, and prompt adherence.

For image-to-video with dialogue specifically, Sora’s remix feature appears to be the best choice currently. While the model may be censored, its image-to-video capabilities, realistic lip sync, and focus on tone and emotion make it the clear winner for that particular use case. Grok’s video generator comes in second, respecting reference images better than Veo 3.1 while producing solid results.

The market is clearly evolving quickly, with each model carving out its own strengths and use cases. Veo 3.1 seems positioned as a professional tool with strong dialogue capabilities, while competitors focus on different aspects of the video generation experience.