But in reality, the model was not prompted with audio, and its responses were text-only. Nor were they generated in real time.
The person speaking in the demo was actually reading out some of the text prompts that were passed to the model, and the robot voice given to Gemini was reading out responses the model had generated as text. Still images taken from the video – such as those of the rock, paper, scissors game – were fed to the model, which was then asked to guess the game.
Oriol Vinyals, VP of research and deep learning lead at Google DeepMind, who helped lead the Gemini project, admitted that the video demonstrates "what the multimodal user experiences built with Gemini could look like."
https://twitter.com/OriolVinyalsML/status/1732885990291775553
Google then cherry-picked its best outputs and narrated them over the footage, making it seem as if the model could respond flawlessly in real time.
"For the purposes of this demo, latency has been reduced and Gemini outputs have been shortened for brevity," the description for the video on YouTube reads.