Visual Chronicles: Using Multimodal LLMs to Analyze Massive Collections of Images
- URL: http://arxiv.org/abs/2504.08727v2
- Date: Mon, 14 Apr 2025 17:30:56 GMT
- Title: Visual Chronicles: Using Multimodal LLMs to Analyze Massive Collections of Images
- Authors: Boyang Deng, Songyou Peng, Kyle Genova, Gordon Wetzstein, Noah Snavely, Leonidas Guibas, Thomas Funkhouser
- Abstract summary: We present a system using Multimodal LLMs to analyze a large database with tens of millions of images. We aim to capture frequent co-occurring changes ("trends") across a city over a certain period. We find it significantly outperforms baselines and is able to discover interesting trends from images captured in large cities.
- Score: 58.38037252899024
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: We present a system using Multimodal LLMs (MLLMs) to analyze a large database with tens of millions of images captured at different times, with the aim of discovering patterns in temporal changes. Specifically, we aim to capture frequent co-occurring changes ("trends") across a city over a certain period. Unlike previous visual analyses, our analysis answers open-ended queries (e.g., "what are the frequent types of changes in the city?") without any predetermined target subjects or training labels. These properties render prior learning-based or unsupervised visual analysis tools unsuitable. We identify MLLMs as a novel tool for their open-ended semantic understanding capabilities. Yet, our datasets are four orders of magnitude too large for an MLLM to ingest as context. So we introduce a bottom-up procedure that decomposes the massive visual analysis problem into more tractable sub-problems. We carefully design MLLM-based solutions to each sub-problem. During experiments and ablation studies with our system, we find it significantly outperforms baselines and is able to discover interesting trends from images captured in large cities (e.g., "addition of outdoor dining", "overpass was painted blue", etc.). See more results and interactive demos at https://boyangdeng.com/visual-chronicles.
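The abstract describes the bottom-up procedure only at a high level, so the following is a minimal, hypothetical Python sketch of that kind of decomposition: per-location change descriptions from an MLLM, followed by aggregation of the descriptions into frequent "trends." The helpers `mllm_describe_change` and `llm_canonicalize` and the `min_support` threshold are illustrative assumptions, not the authors' actual pipeline.

```python
# Sketch of a bottom-up trend-discovery pass, NOT the Visual Chronicles implementation.
# The two callables stand in for whatever MLLM/LLM API is available (assumptions).
from collections import Counter
from typing import Callable, Iterable, Tuple

Image = bytes  # placeholder for an encoded image


def discover_trends(
    image_pairs: Iterable[Tuple[Image, Image]],           # (before, after) photos of the same place
    mllm_describe_change: Callable[[Image, Image], str],  # MLLM: describe the change in one pair
    llm_canonicalize: Callable[[str], str],               # LLM: map a description to a short
                                                          # canonical phrase, e.g. "added outdoor dining"
    min_support: int = 50,                                # how many locations make a "trend" (assumed)
) -> list:
    # Sub-problem 1: each MLLM call sees only one pair of images, never the whole dataset.
    local_changes = (mllm_describe_change(before, after) for before, after in image_pairs)

    # Sub-problem 2: normalize free-form descriptions so equivalent changes can be counted together.
    canonical = (llm_canonicalize(desc) for desc in local_changes if desc)

    # Aggregation: changes that co-occur frequently across the city are reported as trends.
    counts = Counter(canonical)
    return [(phrase, n) for phrase, n in counts.most_common() if n >= min_support]
```

The point of such a decomposition is that no single model call ever has to ingest the full collection, which is how the abstract motivates the bottom-up design for a dataset four orders of magnitude larger than an MLLM's context.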
Related papers
- Evaluating Graphical Perception with Multimodal LLMs [2.090547583226381]
Multimodal Large Language Models (MLLMs) have made remarkable progress in analyzing and understanding images. For visualization, how do MLLMs perform when applied to graphical perception tasks? Our study primarily evaluates fine-tuned and pretrained models and zero-shot prompting to determine if they closely match human graphical perception.
arXiv Detail & Related papers (2025-04-05T16:14:08Z)
- LLM Knows Geometry Better than Algebra: Numerical Understanding of LLM-Based Agents in A Trading Arena [44.13836431002364]
We design a virtual numerical game simulating complex economic systems through zero-sum games, where agents invest in stock portfolios. Our experiments reveal that LLMs, including GPT-4o, struggle with algebraic reasoning when dealing with plain-text stock data, often focusing on local details rather than global trends. In contrast, LLMs perform significantly better with geometric reasoning when presented with visual data, such as scatter plots or K-line charts.
arXiv Detail & Related papers (2025-02-25T08:41:01Z)
- Towards Text-Image Interleaved Retrieval [49.96332254241075]
We introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences. We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries. We propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularities.
arXiv Detail & Related papers (2025-02-18T12:00:47Z)
- MC-Bench: A Benchmark for Multi-Context Visual Grounding in the Era of MLLMs [61.56904387052982]
This paper proposes a new visual grounding task called multi-context visual grounding.
It aims to localize instances of interest across multiple images based on open-ended text prompts.
We benchmark over 20 state-of-the-art MLLMs and foundation models with potential multi-context visual grounding capabilities.
arXiv Detail & Related papers (2024-10-16T07:52:57Z)
- Semantic Alignment for Multimodal Large Language Models [72.10272479476161]
We introduce Semantic Alignment for Multi-modal large language models (SAM).
By involving the bidirectional semantic guidance between different images in the visual-token extraction process, SAM aims to enhance the preservation of linking information for coherent analysis.
arXiv Detail & Related papers (2024-08-23T06:48:46Z)
- Eyeballing Combinatorial Problems: A Case Study of Using Multimodal Large Language Models to Solve Traveling Salesman Problems [6.157421830538752]
Multimodal Large Language Models (MLLMs) have demonstrated proficiency in processing diverse modalities.
This paper investigates the use of MLLMs' visual capabilities to 'eyeball' solutions for the Traveling Salesman Problem.
arXiv Detail & Related papers (2024-06-11T00:41:08Z)
- The Instinctive Bias: Spurious Images lead to Illusion in MLLMs [34.91795817316696]
We identify a typical class of inputs that baffles MLLMs, which consist of images that are highly relevant but inconsistent with answers.
We propose CorrelationQA, the first benchmark that assesses the visual illusion level given spurious images.
We conduct a thorough analysis on 9 mainstream MLLMs, illustrating that they universally suffer from this instinctive bias to varying degrees.
arXiv Detail & Related papers (2024-02-06T06:48:46Z)
- Mementos: A Comprehensive Benchmark for Multimodal Large Language Model Reasoning over Image Sequences [80.54979242912944]
This paper introduces Mementos, a new benchmark designed to assess MLLMs' sequential image reasoning abilities.
We find that MLLMs struggle to accurately describe dynamic information about given image sequences, often leading to hallucinations/misrepresentations of objects.
arXiv Detail & Related papers (2024-01-19T07:10:13Z)
- DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models (Exemplified as A Video Agent) [73.10899129264375]
This paper explores DoraemonGPT, a comprehensive and conceptually elegant system driven by LLMs to understand dynamic scenes. Given a video with a question/task, DoraemonGPT begins by converting the input video into a symbolic memory that stores task-related attributes. We extensively evaluate DoraemonGPT's effectiveness on three benchmarks and several in-the-wild scenarios.
arXiv Detail & Related papers (2024-01-16T14:33:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.