Neil Zeghidour - Multimodal language models
Filmed at dotAI on October 18, 2024 in Paris. More about the conference at https://www.dotai.io

In this talk, Neil will present Kyutai's work on open-source multimodal language models. The talk will cover how to build a multimodal LLM (text, audio, images) from scratch, and will uncover the technical secrets behind these models, along with the many surprises and challenges that arise when combining the intelligence of a text LLM with capabilities such as natural spoken conversation or image understanding.

Who is Neil Zeghidour?

Neil is co-founder and Chief Modeling Officer of Kyutai, a non-profit research lab. He was previously at Google DeepMind, where he led a team on generative audio, with contributions including Google's first text-to-music API and the first neural audio codec to outperform general-purpose audio codecs. Before that, Neil spent three years at Facebook AI Research, working on automatic speech recognition and audio understanding.