VALUED -- Vision and Logical Understanding Evaluation Dataset
- URL: http://arxiv.org/abs/2311.12610v2
- Date: Tue, 6 Feb 2024 19:49:11 GMT
- Title: VALUED -- Vision and Logical Understanding Evaluation Dataset
- Authors: Soumadeep Saha, Saptarshi Saha, Utpal Garain
- Abstract summary: We present the VALUE (Vision And Logical Understanding Evaluation) dataset, consisting of 200,000+ annotated images and an associated rule set.
The curated rule set considerably constrains the set of allowable predictions and is designed to probe key semantic abilities.
We analyze several popular and state-of-the-art vision models on this task and show that, although their performance on standard metrics is laudable, they produce a plethora of incoherent results.
- Score: 1.8876415010297893
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Starting with early successes in computer vision tasks, deep learning-based
techniques have since overtaken state-of-the-art approaches in a multitude of
domains. However, it has been demonstrated time and again that these techniques
fail to capture semantic context and logical constraints, instead often relying
on spurious correlations to arrive at the answer. Since the application of deep
learning techniques to critical scenarios depends on adherence to
domain-specific constraints, several attempts have been made to address this issue.
One limitation holding back a thorough exploration of this area is a lack of
suitable datasets featuring a rich set of rules. To address this,
we present the VALUE (Vision And Logical Understanding Evaluation) Dataset,
consisting of 200,000+ annotated images and an associated rule set, based on
the popular board game of chess. The curated rule set considerably constrains
the set of allowable predictions and is designed to probe key semantic
abilities like localization and enumeration. Alongside standard metrics,
additional metrics that measure performance with regard to logical consistency
are presented. We analyze several popular and state-of-the-art vision models on
this task and show that, although their performance on standard metrics is
laudable, they produce a plethora of incoherent results, indicating that this
dataset presents a significant challenge for future work.
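To make the idea of logical-consistency metrics concrete, here is a minimal sketch of how predictions on a chess-board image could be checked against a few rules of chess. The prediction format, function names, and the particular rules are illustrative assumptions, not the paper's actual specification or rule set.

```python
# Hypothetical sketch: validating a model's predicted board labeling against
# a few basic rules of chess. A prediction that passes standard detection
# metrics can still violate these rules, which is the kind of incoherence
# consistency metrics are meant to expose.
from collections import Counter

# Basic per-side limits on piece counts (promotion raises some of these,
# so queens/rooks/bishops/knights allow up to 8 extra promoted pieces).
MAX_COUNTS = {"K": 1, "Q": 9, "R": 10, "B": 10, "N": 10, "P": 8}

def violations(predicted_pieces):
    """Return a list of rule violations for a predicted board.

    predicted_pieces: list of (color, piece, square) tuples, e.g.
    ("white", "K", "e1"), as a decoded detection output might look.
    """
    errors = []
    counts = Counter((color, piece) for color, piece, _ in predicted_pieces)
    for color in ("white", "black"):
        if counts[(color, "K")] != 1:
            errors.append(f"{color} must have exactly one king")
        for piece, limit in MAX_COUNTS.items():
            if counts[(color, piece)] > limit:
                errors.append(f"too many {color} {piece} pieces")
    # No two pieces may occupy the same square.
    squares = [sq for _, _, sq in predicted_pieces]
    if len(squares) != len(set(squares)):
        errors.append("two pieces predicted on the same square")
    # Pawns can never stand on the first or last rank.
    for color, piece, sq in predicted_pieces:
        if piece == "P" and sq[1] in ("1", "8"):
            errors.append(f"{color} pawn on rank {sq[1]}")
    return errors
```

A consistency metric could then be defined, for example, as the fraction of predictions with an empty violation list.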
Related papers
- Context is Key: A Benchmark for Forecasting with Essential Textual Information [87.3175915185287]
"Context is Key" (CiK) is a time series forecasting benchmark that pairs numerical data with diverse types of carefully crafted textual context.
We evaluate a range of approaches, including statistical models, time series foundation models, and LLM-based forecasters.
Our experiments highlight the importance of incorporating contextual information, demonstrate surprising performance when using LLM-based forecasting models, and also reveal some of their critical shortcomings.
arXiv Detail & Related papers (2024-10-24T17:56:08Z)
- Deep Learning-Based Object Pose Estimation: A Comprehensive Survey [73.74933379151419]
We discuss the recent advances in deep learning-based object pose estimation.
Our survey also covers multiple input data modalities, degrees-of-freedom of output poses, object properties, and downstream tasks.
arXiv Detail & Related papers (2024-05-13T14:44:22Z)
- Modeling Entities as Semantic Points for Visual Information Extraction in the Wild [55.91783742370978]
We propose an alternative approach to precisely and robustly extract key information from document images.
We explicitly model entities as semantic points, i.e., center points of entities are enriched with semantic information describing the attributes and relationships of different entities.
The proposed method can achieve significantly enhanced performance on entity labeling and linking, compared with previous state-of-the-art models.
arXiv Detail & Related papers (2023-03-23T08:21:16Z)
- Spatio-temporal predictive tasks for abnormal event detection in videos [60.02503434201552]
We propose new constrained pretext tasks to learn object-level normality patterns.
Our approach learns a mapping between down-scaled visual queries and their corresponding normal appearance and motion characteristics.
Experiments on several benchmark datasets demonstrate the effectiveness of our approach to localize and track anomalies.
arXiv Detail & Related papers (2022-10-27T19:45:12Z)
- Fine-Grained Visual Entailment [51.66881737644983]
We propose an extension of this task, where the goal is to predict the logical relationship of fine-grained knowledge elements within a piece of text to an image.
Unlike prior work, our method is inherently explainable and makes logical predictions at different levels of granularity.
We evaluate our method on a new dataset of manually annotated knowledge elements and show that our method achieves 68.18% accuracy at this challenging task.
arXiv Detail & Related papers (2022-03-29T16:09:38Z)
- I Know Therefore I Score: Label-Free Crafting of Scoring Functions using Constraints Based on Domain Expertise [6.26476800426345]
We introduce a label-free practical approach to learn a scoring function from multi-dimensional numerical data.
The approach incorporates insights and business rules from domain experts in the form of easily observable and specifiable constraints.
We convert such constraints into loss functions that are optimized simultaneously while learning the scoring function.
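The constraint-to-loss idea can be sketched in a few lines. This is an illustrative example under my own assumptions, not the paper's exact formulation: a domain rule ("the score must not decrease when feature 0 increases") is turned into a differentiable hinge penalty that could be minimized alongside the scoring function's main objective.

```python
# Illustrative sketch (not the paper's actual method): converting an easily
# specifiable domain rule into a penalty term. The rule here is that the
# score must be non-decreasing in feature 0.
import numpy as np

def score(w, x):
    """A simple linear scoring function; w are the learnable weights."""
    return x @ w

def monotonicity_penalty(w, x, delta=1.0):
    """Hinge penalty: positive whenever bumping feature 0 lowers the score."""
    x_bumped = x.copy()
    x_bumped[:, 0] += delta
    drop = score(w, x) - score(w, x_bumped)  # > 0 means the rule is violated
    return np.mean(np.maximum(drop, 0.0))

# Toy data and two weight vectors: w[0] < 0 violates the rule, w[0] > 0
# satisfies it, so the penalty separates them without any labels.
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 3))
bad_w = np.array([-1.0, 0.5, 0.2])
good_w = np.array([1.0, 0.5, 0.2])
```

In a full training loop, this penalty would be added (with a weighting coefficient) to whatever objective the scoring function is fit with, steering the learned function toward the expert-specified behavior.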
arXiv Detail & Related papers (2022-03-18T17:51:20Z)
- Logic Constraints to Feature Importances [17.234442722611803]
"Black box" nature of AI models is often a limit for a reliable application in high-stakes fields like diagnostic techniques, autonomous guide, etc.
Recent works have shown that an adequate level of interpretability can reinforce the broader notion of model trustworthiness.
The basic idea of this paper is to exploit the human prior knowledge of the features' importance for a specific task, in order to coherently aid the phase of the model's fitting.
arXiv Detail & Related papers (2021-10-13T09:28:38Z)
- On the Challenges of Open World Recognition under Shifting Visual Domains [23.999211737485812]
This work investigates whether Open World Recognition (OWR) algorithms are effective under domain-shift.
OWR aims to produce systems capable of breaking the semantic limits present in the initial training set.
Our analysis shows that the performance degradation under domain shift is only slightly mitigated by coupling OWR with domain generalization techniques.
arXiv Detail & Related papers (2021-07-09T14:25:45Z)
- Streaming Self-Training via Domain-Agnostic Unlabeled Images [62.57647373581592]
We present streaming self-training (SST) that aims to democratize the process of learning visual recognition models.
Key to SST are two crucial observations: (1) domain-agnostic unlabeled images enable us to learn better models with a few labeled examples without any additional knowledge or supervision; and (2) learning is a continuous process and can be done by constructing a schedule of learning updates.
arXiv Detail & Related papers (2021-04-07T17:58:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the accuracy of this information and is not responsible for any consequences arising from its use.