【BAYES COFFEE HOUSE TECH TALK SERIES】
Join us for Prof. Frank Keller’s upcoming talk on Grounding across Modalities and Domains – how AI systems connect entities, concepts, and actions across different modalities.
One of the key challenges in achieving reasoning capabilities in generative AI models is grounding: the ability to link pieces of information, such as entities, concepts, and actions, across different modalities.
This talk explores how explicit grounding can significantly improve multimodal AI systems through two case studies:
🧠 Visual Storytelling – Maintaining narrative coherence across modalities
🎬 Instructional Video Understanding – True comprehension of procedural knowledge
Title: Grounding across Modalities and Domains
Speaker: Frank Keller | School of Informatics, University of Edinburgh
Time: 29 April (Tues), 11:00–12:00 (UTC+01:00, London)
Location: Bayes Center G03
External: https://app.huawei.com/wmeeting/join/97996549/XSZeBJABg80yz3SqXLYLxSeeYuWYBzpdl
Meeting ID: 97996549
Passcode: 228808
Registration: https://www.smartsurvey.co.uk/s/3N8U7J/
Abstract:
In order to understand or generate multimodal inputs, AI systems must perform grounding — the process of linking entities or actions across different modalities. For example, objects depicted in images and videos need to be associated with corresponding textual references.
However, large language models struggle with grounding, limiting their performance in tasks such as image generation and video understanding.
In this talk, I will present two case studies demonstrating how explicit grounding can enhance multimodal AI. First, I will argue that character grounding is essential for visual storytelling — the task of turning a sequence of images into a coherent narrative. I will introduce a model that generates visually grounded stories by building coreference chains for characters across images and text, leading to stories that are more specific, coherent, and engaging.
The second case study focuses on understanding instructional videos, such as those demonstrating cooking or home improvement tasks. In this domain, entities are often implicit (not mentioned in text) and frequently change (being merged, separated, or transformed), making grounding particularly challenging. I will present models that address this challenge by computing the semantic roles of both explicit and implicit entities and tracking them across instructional steps, even as they undergo transformations. These models enhance procedural understanding, improving AI’s ability to follow and reason about complex tasks.
Bio:
Frank Keller is a professor in the School of Informatics at the University of Edinburgh. He has held visiting positions at MIT and the University of Washington. His research focuses on natural language processing, particularly language and vision tasks such as image description, video summarization, and visual storytelling. He also develops systems that understand long-form narratives, including books and screenplays, and builds computational models of human language processing.
Prof. Keller co-leads the UKRI Centre for Doctoral Training in Responsible Natural Language Processing, which aims to develop trustworthy and ethical NLP systems. He serves on the editorial board of Transactions of the ACL and is an ELLIS fellow. Previously, he was awarded an ERC grant for his research on language and vision.
【BAYES COFFEE HOUSE TECH TALK SERIES】 / Huawei-Edinburgh Joint Lab by blogadmin is licensed under a Creative Commons Attribution CC BY 3.0