MMCTAgent: Enabling multimodal reasoning over large video and image collections

Microsoft Research · Wed, 12 Nov 2025 12:00:20 +0000

Modern multimodal AI models can recognize objects, describe scenes, and answer questions about images and short video clips, but they struggle with long-form and large-scale visual data, where real-wo...