Imagine the Unseen World: A Benchmark for Systematic Generalization in
Visual World Models
- URL: http://arxiv.org/abs/2311.09064v1
- Date: Wed, 15 Nov 2023 16:02:13 GMT
- Title: Imagine the Unseen World: A Benchmark for Systematic Generalization in
Visual World Models
- Authors: Yeongbin Kim, Gautam Singh, Junyeong Park, Caglar Gulcehre, Sungjin
Ahn
- Abstract summary: We introduce the Systematic Visual Imagination Benchmark (SVIB), the first benchmark designed to address this problem head-on.
SVIB offers a novel framework for a minimal world modeling problem, where models are evaluated on their ability to generate one-step image-to-image transformations under latent world dynamics.
We provide a comprehensive evaluation of various baseline models on SVIB, offering insight into the current state-of-the-art in systematic visual imagination.
- Score: 21.043565956630957
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Systematic compositionality, or the ability to adapt to novel situations by
creating a mental model of the world using reusable pieces of knowledge,
remains a significant challenge in machine learning. While there has been
considerable progress in the language domain, efforts towards systematic visual
imagination, or envisioning the dynamical implications of a visual observation,
are in their infancy. We introduce the Systematic Visual Imagination Benchmark
(SVIB), the first benchmark designed to address this problem head-on. SVIB
offers a novel framework for a minimal world modeling problem, where models are
evaluated based on their ability to generate one-step image-to-image
transformations under latent world dynamics. The framework provides benefits
such as the possibility of jointly optimizing for systematic perception and
imagination, a range of difficulty levels, and the ability to control the
fraction of possible factor combinations used during training. We provide a
comprehensive evaluation of various baseline models on SVIB, offering insight
into the current state-of-the-art in systematic visual imagination. We hope
that this benchmark will help advance visual systematic compositionality.
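To make the setup concrete, here is a minimal, hypothetical Python sketch of the kind of protocol the abstract describes: a set of latent factors, a simple transition rule standing in for the latent world dynamics, and a controllable fraction of factor combinations used for training. The factor names, the transition rule, and the split ratio are illustrative assumptions, not the actual SVIB specification.
```python
# Minimal, hypothetical sketch of the benchmark setup described above (not the
# released SVIB code). It shows how a fraction of the latent factor combinations
# can be withheld from training, so that one-step imagination must generalize
# systematically to unseen combinations.
import itertools
import random

FACTORS = {
    "shape": ["circle", "square", "triangle"],
    "color": ["red", "green", "blue"],
    "position": [0, 1, 2, 3],
}

def transition(state):
    """Hypothetical latent world dynamics: advance the object's position by one step."""
    nxt = dict(state)
    nxt["position"] = (state["position"] + 1) % len(FACTORS["position"])
    return nxt

def split_factor_combinations(alpha, seed=0):
    """Keep a fraction `alpha` of all factor combinations for training;
    the remainder is held out to test systematic generalization."""
    combos = [dict(zip(FACTORS, values)) for values in itertools.product(*FACTORS.values())]
    random.Random(seed).shuffle(combos)
    cut = int(alpha * len(combos))
    return combos[:cut], combos[cut:]

train_states, heldout_states = split_factor_combinations(alpha=0.3)

# Training pairs would be (render(s), render(transition(s))) for s in train_states,
# where render() turns a factor combination into an image. At test time the model
# receives render(s) for an unseen combination s and must imagine
# render(transition(s)); the prediction is scored against the ground-truth image.
print(len(train_states), "train combinations;", len(heldout_states), "held out")
```
Under such a split, a model that only memorizes training combinations cannot solve the held-out ones; it has to recombine factors it has seen separately, which is the systematic generalization the benchmark targets.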
Related papers
- V$^2$R-Bench: Holistically Evaluating LVLM Robustness to Fundamental Visual Variations [1.7971686967440696]
V$^2$R-Bench is a benchmark framework for evaluating Visual Variation Robustness of LVLMs.
We show that advanced models that excel at complex vision-language tasks significantly underperform on simple tasks such as object recognition.
These vulnerabilities stem from error accumulation in the pipeline architecture and inadequate multimodal alignment.
arXiv Detail & Related papers (2025-04-23T14:01:32Z)
- Mitigating Low-Level Visual Hallucinations Requires Self-Awareness: Database, Model and Training Strategy [53.07517728420411]
We introduce the first instruction database specifically focused on hallucinations in low-level vision tasks.
We propose the Self-Awareness Failure Elimination (SAFEQA) model to improve the perception and comprehension abilities of the model in low-level vision tasks.
We conduct comprehensive experiments on low-level vision tasks, with the results demonstrating that our proposed method significantly enhances self-awareness of the model in these tasks and reduces hallucinations.
arXiv Detail & Related papers (2025-03-26T16:05:01Z)
- Can-Do! A Dataset and Neuro-Symbolic Grounded Framework for Embodied Planning with Large Multimodal Models [85.55649666025926]
We introduce Can-Do, a benchmark dataset designed to evaluate embodied planning abilities.
Our dataset includes 400 multimodal samples, each consisting of natural language user instructions, visual images depicting the environment, state changes, and corresponding action plans.
We propose NeuroGround, a neurosymbolic framework that first grounds the plan generation in the perceived environment states and then leverages symbolic planning engines to augment the model-generated plans.
arXiv Detail & Related papers (2024-09-22T00:30:11Z)
- ARPA: A Novel Hybrid Model for Advancing Visual Word Disambiguation Using Large Language Models and Transformers [1.6541870997607049]
We present ARPA, an architecture that fuses the unparalleled contextual understanding of large language models with the advanced feature extraction capabilities of transformers.
ARPA's introduction marks a significant milestone in visual word disambiguation, offering a compelling solution.
We invite researchers and practitioners to explore the capabilities of our model, envisioning a future where such hybrid models drive unprecedented advancements in artificial intelligence.
arXiv Detail & Related papers (2024-08-12T10:15:13Z)
- Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs [56.391404083287235]
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-centric approach.
Our study uses LLMs and visual instruction tuning as an interface to evaluate various visual representations.
We provide model weights, code, supporting tools, datasets, and detailed instruction-tuning and evaluation recipes.
arXiv Detail & Related papers (2024-06-24T17:59:42Z)
- Aligning Vision Models with Human Aesthetics in Retrieval: Benchmarks and Algorithms [91.19304518033144]
We aim to align vision models with human aesthetic standards in a retrieval system.
We propose a preference-based reinforcement learning method that fine-tunes the vision models to better align them with human aesthetics.
arXiv Detail & Related papers (2024-06-13T17:59:20Z)
- Foundational Models Defining a New Era in Vision: A Survey and Outlook [151.49434496615427]
Vision systems that see and reason about the compositional nature of visual scenes are fundamental to understanding our world.
Models that learn to bridge the gap between such modalities, coupled with large-scale training data, facilitate contextual reasoning, generalization, and prompting capabilities at test time.
The output of such models can be modified through human-provided prompts without retraining, e.g., segmenting a particular object by providing a bounding box, holding an interactive dialogue by asking questions about an image or video scene, or steering a robot's behavior through language instructions.
arXiv Detail & Related papers (2023-07-25T17:59:18Z)
- Predictive Experience Replay for Continual Visual Control and Forecasting [62.06183102362871]
We present a new continual learning approach for visual dynamics modeling and explore its efficacy in visual control and forecasting.
We first propose the mixture world model that learns task-specific dynamics priors with a mixture of Gaussians, and then introduce a new training strategy to overcome catastrophic forgetting.
Our model remarkably outperforms the naive combinations of existing continual learning and visual RL algorithms on DeepMind Control and Meta-World benchmarks with continual visual control tasks.
arXiv Detail & Related papers (2023-03-12T05:08:03Z)
- ComplAI: Theory of A Unified Framework for Multi-factor Assessment of Black-Box Supervised Machine Learning Models [6.279863832853343]
ComplAI is a unique framework for enabling, observing, analyzing, and quantifying explainability, robustness, performance, fairness, and model behavior.
It evaluates different supervised Machine Learning models not just on their ability to make correct predictions but from an overall responsibility perspective.
arXiv Detail & Related papers (2022-12-30T08:48:19Z)
- The dynamics of belief: continuously monitoring and visualising complex systems [0.0]
The rise of AI in human contexts places new demands on automated systems to be transparent and explainable.
We develop a theoretical framework for thinking about digital systems in complex human contexts.
arXiv Detail & Related papers (2022-08-11T11:51:35Z)
- K-LITE: Learning Transferable Visual Models with External Knowledge [242.3887854728843]
K-LITE (Knowledge-augmented Language-Image Training and Evaluation) is a strategy to leverage external knowledge to build transferable visual systems.
In training, it enriches entities in natural language with WordNet and Wiktionary knowledge.
In evaluation, the natural language is also augmented with external knowledge and then used to reference learned visual concepts.
arXiv Detail & Related papers (2022-04-20T04:47:01Z)
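The knowledge-enrichment step described in the K-LITE entry above can be illustrated with a rough, hypothetical sketch (not the authors' implementation): a class name is expanded with its WordNet gloss before being used as a text prompt. The sketch assumes NLTK's WordNet corpus is available (pip install nltk, then nltk.download("wordnet")) and leaves out the Wiktionary side entirely.
```python
# Rough sketch of WordNet-based prompt enrichment in the spirit of K-LITE;
# not the authors' code. Requires: pip install nltk, then nltk.download("wordnet").
from nltk.corpus import wordnet as wn

def enrich_with_wordnet(class_name: str) -> str:
    """Append the first WordNet gloss of the class name to a simple prompt template."""
    synsets = wn.synsets(class_name.replace(" ", "_"))
    if not synsets:
        return f"a photo of a {class_name}."
    return f"a photo of a {class_name}, which is {synsets[0].definition()}."

if __name__ == "__main__":
    # Prints a prompt that pairs the class name with its WordNet definition.
    print(enrich_with_wordnet("kayak"))
```
A prompt enriched this way carries extra semantic context for rare or ambiguous class names, which is the intuition behind augmenting both training and evaluation text with external knowledge.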
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.