【BAYES COFFEE HOUSE TECH TALK SERIES】
Join us for Prof. Frank Keller’s upcoming talk on Grounding across Modalities and Domains – how AI systems connect entities, concepts, and actions across different modalities.
One of the key challenges in achieving reasoning capabilities in generative AI models is grounding: the ability to link pieces of information, such as entities, concepts, and actions, across different modalities.
This talk explores how explicit grounding can significantly improve multimodal AI systems through two case studies:
🧠 Visual Storytelling – Maintaining narrative coherence across modalities
🎬 Instructional Video Understanding – True comprehension of procedural knowledge
Title: Grounding across Modalities and Domains
Speaker: Frank Keller | School of Informatics, University of Edinburgh
Time: 29 April (Tues), 11:00–12:00 (UTC+01:00, London)
Location: Bayes Center G03
External: https://app.huawei.com/wmeeting/join/97996549/XSZeBJABg80yz3SqXLYLxSeeYuWYBzpdl
Meeting ID: 97996549
Passcode: 228808
Registration: https://www.smartsurvey.co.uk/s/3N8U7J/
Abstract:
In order to understand or generate multimodal inputs, AI systems must perform grounding — the process of linking entities or actions across different modalities. For example, objects depicted in images and videos need to be associated with corresponding textual references.
However, large language models struggle with grounding, limiting their performance in tasks such as image generation and video understanding.
In this talk, I will present two case studies demonstrating how explicit grounding can enhance multimodal AI. First, I will argue that character grounding is essential for visual storytelling — the task of turning a sequence of images into a coherent narrative. I will introduce a model that generates visually grounded stories by building coreference chains for characters across images and text, leading to stories that are more specific, coherent, and engaging.
The second case study focuses on understanding instructional videos, such as those demonstrating cooking or home improvement tasks. In this domain, entities are often implicit (not mentioned in text) and frequently change (being merged, separated, or transformed), making grounding particularly challenging. I will present models that address this challenge by computing the semantic roles of both explicit and implicit entities and tracking them across instructional steps, even as they undergo transformations. These models enhance procedural understanding, improving AI’s ability to follow and reason about complex tasks.
Bio:
Frank Keller is a professor in the School of Informatics at the University of Edinburgh. He has held visiting positions at MIT and the University of Washington. His research focuses on natural language processing, particularly language and vision tasks such as image description, video summarization, and visual storytelling. He also develops systems that understand long-form narratives, including books and screenplays, and builds computational models of human language processing.
Prof. Keller co-leads the UKRI Centre for Doctoral Training in Responsible Natural Language Processing, which aims to develop trustworthy and ethical NLP systems. He serves on the editorial board of Transactions of the ACL and is an ELLIS fellow. Previously, he was awarded an ERC grant for his research on language and vision.
【BAYES COFFEE HOUSE TECH TALK SERIES】 / Huawei-Edinburgh Joint Lab by blogadmin is licensed under a Creative Commons Attribution CC BY 3.0