CLEVR Parser: A Graph Parser Library for Geometric Learning on Language
Grounded Image Scenes
- URL: http://arxiv.org/abs/2009.09154v2
- Date: Thu, 1 Oct 2020 22:56:35 GMT
- Title: CLEVR Parser: A Graph Parser Library for Geometric Learning on Language
Grounded Image Scenes
- Authors: Raeid Saqur and Ameet Deshpande
- Abstract summary: The CLEVR dataset has been used extensively in language-grounded visual reasoning in the Machine Learning (ML) and Natural Language Processing (NLP) domains.
We present a graph parser library for CLEVR that provides functionality for extracting object-centric attributes and relationships and for constructing structural graph representations for both the image and language modalities.
We discuss downstream usage and applications of the library and how it accelerates research for the NLP community.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The CLEVR dataset has been used extensively in language-grounded
visual reasoning in Machine Learning (ML) and Natural Language Processing (NLP)
domains. We present a graph parser library for CLEVR that provides
functionality for extracting object-centric attributes and relationships and
for constructing structural graph representations for both the image and
language modalities. Structural order-invariant representations enable
geometric learning and can aid in downstream tasks like language grounding to
vision, robotics, compositionality, interpretability, and computational grammar
construction. We provide three extensible main components (parser, embedder,
and visualizer) that can be tailored to suit specific learning setups. We also
provide out-of-the-box functionality for seamless integration with popular deep
graph neural network (GNN) libraries. Additionally, we discuss downstream usage
and applications of the library and how it accelerates research for the NLP
community.
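To make the parser and GNN-integration ideas concrete, here is a minimal illustrative sketch. It is not the CLEVR Parser API itself: it builds an object-centric scene graph with NetworkX from a CLEVR-style scene annotation and converts it into a PyTorch Geometric Data object suitable for geometric learning. The attribute and relation vocabularies follow the public CLEVR scenes JSON schema; the helper names scene_to_graph and graph_to_data are hypothetical, and NetworkX, PyTorch, and PyTorch Geometric are assumed to be installed.

import networkx as nx
import torch
from torch_geometric.data import Data  # assumes PyTorch Geometric is installed

# Attribute and relation vocabularies as defined by the CLEVR dataset.
COLORS = ["gray", "red", "blue", "green", "brown", "purple", "cyan", "yellow"]
SHAPES = ["cube", "sphere", "cylinder"]
SIZES = ["small", "large"]
MATERIALS = ["rubber", "metal"]
RELATIONS = ["left", "right", "front", "behind"]

def scene_to_graph(scene):
    # Hypothetical helper: one node per object with integer-encoded attributes,
    # one directed edge per spatial relation (MultiDiGraph keeps parallel edges).
    g = nx.MultiDiGraph()
    for idx, obj in enumerate(scene["objects"]):
        g.add_node(idx, x=[COLORS.index(obj["color"]),
                           SHAPES.index(obj["shape"]),
                           SIZES.index(obj["size"]),
                           MATERIALS.index(obj["material"])])
    for rel in RELATIONS:
        # CLEVR stores each relation as a per-object adjacency list.
        for src, targets in enumerate(scene["relationships"][rel]):
            for dst in targets:
                g.add_edge(src, dst, relation=RELATIONS.index(rel))
    return g

def graph_to_data(g):
    # Hypothetical helper: hand the order-invariant graph to a GNN library.
    x = torch.tensor([g.nodes[n]["x"] for n in g.nodes], dtype=torch.float)
    edges = list(g.edges())
    edge_index = torch.tensor(edges, dtype=torch.long).t().contiguous()
    edge_attr = torch.tensor([d["relation"] for _, _, d in g.edges(data=True)],
                             dtype=torch.long)
    return Data(x=x, edge_index=edge_index, edge_attr=edge_attr)

# Toy scene in the CLEVR_scenes.json layout: a large red metal cube with a
# small cyan rubber sphere to its left and behind it.
scene = {
    "objects": [
        {"color": "red", "shape": "cube", "size": "large", "material": "metal"},
        {"color": "cyan", "shape": "sphere", "size": "small", "material": "rubber"},
    ],
    "relationships": {
        "left": [[1], []], "right": [[], [0]],
        "front": [[], [0]], "behind": [[1], []],
    },
}

data = graph_to_data(scene_to_graph(scene))
print(data)  # Data(x=[2, 4], edge_index=[2, 4], edge_attr=[4])

The resulting data object can be passed to any PyTorch Geometric GNN layer. Integer-encoding the attributes (rather than keeping the raw strings) is a choice made here so that node features tensorize cleanly, and a MultiDiGraph is used so that each spatial relation keeps its own edge.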
Related papers
- Language is All a Graph Needs [33.9836278881785]
We propose InstructGLM (Instruction-finetuned Graph Language Model) with highly scalable prompts based on natural language instructions.
Our method surpasses all GNN baselines on the ogbn-arxiv, Cora, and PubMed datasets.
arXiv Detail & Related papers (2023-08-14T13:41:09Z)
- LAVIS: A Library for Language-Vision Intelligence [98.88477610704938]
LAVIS is an open-source library for LAnguage-VISion research and applications.
It features a unified interface for easy access to state-of-the-art image-language and video-language models and common datasets.
arXiv Detail & Related papers (2022-09-15T18:04:10Z)
- Video-Text Pre-training with Learned Regions [59.30893505895156]
Video-Text pre-training aims at learning transferable representations from large-scale video-text pairs.
We propose a module for video-text learning, RegionLearner, which can take the structure of objects into account during pre-training on large-scale video-text pairs.
arXiv Detail & Related papers (2021-12-02T13:06:53Z)
- DomiKnowS: A Library for Integration of Symbolic Domain Knowledge in Deep Learning [12.122347427933637]
We demonstrate a library for the integration of domain knowledge in deep learning architectures.
Using this library, the structure of the data is expressed symbolically via graph declarations.
The domain knowledge can be defined explicitly, which improves the models' explainability.
arXiv Detail & Related papers (2021-08-27T16:06:42Z)
- Leveraging Language to Learn Program Abstractions and Search Heuristics [66.28391181268645]
We introduce LAPS (Language for Abstraction and Program Search), a technique for using natural language annotations to guide joint learning of libraries and neurally-guided search models for synthesis.
When integrated into a state-of-the-art library learning system (DreamCoder), LAPS produces higher-quality libraries and improves search efficiency and generalization.
arXiv Detail & Related papers (2021-06-18T15:08:47Z)
- Neuro-Symbolic Representations for Video Captioning: A Case for Leveraging Inductive Biases for Vision and Language [148.0843278195794]
We propose a new model architecture for learning multi-modal neuro-symbolic representations for video captioning.
Our approach uses a dictionary learning-based method of learning relations between videos and their paired text descriptions.
arXiv Detail & Related papers (2020-11-18T20:21:19Z)
- Captum: A unified and generic model interpretability library for PyTorch [49.72749684393332]
We introduce a novel, unified, open-source model interpretability library for PyTorch.
The library contains generic implementations of a number of gradient and perturbation-based attribution algorithms.
It can be used for both classification and non-classification models.
arXiv Detail & Related papers (2020-09-16T18:57:57Z)
- Object Relational Graph with Teacher-Recommended Learning for Video Captioning [92.48299156867664]
We propose a complete video captioning system including both a novel model and an effective training strategy.
Specifically, we propose an object relational graph (ORG) based encoder, which captures more detailed interaction features to enrich visual representation.
Meanwhile, we design a teacher-recommended learning (TRL) method that makes full use of an external language model (ELM) to integrate abundant linguistic knowledge into the captioning model.
arXiv Detail & Related papers (2020-02-26T15:34:52Z)