Multi-modal

An AI that handles text, voice, images, and video together.

A multi-modal model can accept input, produce output, or both, in more than one medium. It might read an image and answer in text, listen to audio and reply in voice, or watch a short video and summarise it. GIGI is multi-modal, so a single conversation can mix typed questions, photos of a product, and voice replies.