Google wowed the internet with a demo video showing off the multimodal capabilities of its latest large language model, Gemini. In one sequence, a rubber duck is placed on a paper atlas and Gemini identifies where the object has been put. The model appears to do all sorts of things – identifying objects, tracking items hidden and swapped under cups, and more.
But in reality, the model was not prompted using audio and its responses were only text-based. They were not generated in real time either.
The person speaking in the demo was actually reading out some of the text prompts that had been passed to the model, and Gemini's robot voice was reading out responses the model had generated in text. Still images taken from the video – such as the hands playing rock, paper, scissors – were fed to the model, which was then asked to guess the game.
Google then cherry-picked its best outputs and narrated them alongside the footage to make it seem as if the model could respond flawlessly in real time.
"For the purposes of this demo, latency has been reduced and Gemini outputs have been shortened for brevity," the description for the video on YouTube reads.