Related papers: SITUATE -- Synthetic Object Counting Dataset for VLM training

URL: http://arxiv.org/abs/2602.00108v1
Date: Mon, 26 Jan 2026 16:17:53 GMT
Title: SITUATE -- Synthetic Object Counting Dataset for VLM training
Authors: René Peinl, Vincent Tischler, Patrick Schröder, Christian Groth
Abstract summary: We present SITUATE, a novel dataset designed for training and evaluating Vision Language Models. The dataset bridges the gap between simple 2D datasets like VLMCountBench and often ambiguous real-life datasets like TallyQA.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We present SITUATE, a novel dataset designed for training and evaluating Vision Language Models on counting tasks with spatial constraints. The dataset bridges the gap between simple 2D datasets like VLMCountBench and often ambiguous real-life datasets like TallyQA, which lack control over occlusions and spatial composition. Experiments show that our dataset helps to improve generalization to out-of-distribution images: fine-tuning Qwen VL 2.5 7B on SITUATE improves accuracy on the Pixmo count test data, but not vice versa. We cross-validate this by comparing model performance across other established counting benchmarks and against an equally sized fine-tuning set derived from Pixmo count.

Related papers

Wireless Dataset Similarity: Measuring Distances in Supervised and Unsupervised Machine Learning [15.036550722400085]
This paper introduces a task- and model-aware framework for measuring similarity between wireless datasets. It enables applications such as dataset selection/augmentation, simulation-to-real (sim2real) comparison, and informing decisions on model training/adaptation to new deployments.
arXiv Detail & Related papers (2026-01-03T01:15:27Z)
Estimating Time Series Foundation Model Transferability via In-Context Learning [74.65355820906355]
Time series foundation models (TSFMs) offer strong zero-shot forecasting via large-scale pre-training. Fine-tuning remains critical for boosting performance in domains with limited public data. We introduce TimeTic, a transferability estimation framework that recasts model selection as an in-context-learning problem.
arXiv Detail & Related papers (2025-09-28T07:07:13Z)
Score Matching With Missing Data [7.9731667982734455]
We adapt score matching to work with missing data in a flexible setting. We provide two separate score matching variations for general use: an importance weighting (IW) approach and a variational approach. We show our variational approach to be strongest in more complex high-dimensional settings.
arXiv Detail & Related papers (2025-05-31T13:26:51Z)
A Framework for Fine-Tuning LLMs using Heterogeneous Feedback [69.51729152929413]
We present a framework for fine-tuning large language models (LLMs) using heterogeneous feedback. First, we combine the heterogeneous feedback data into a single supervision format, compatible with methods like SFT and RLHF. Next, given this unified feedback dataset, we extract a high-quality and diverse subset to obtain performance increases.
arXiv Detail & Related papers (2024-08-05T23:20:32Z)
Revisiting the Evaluation of Image Synthesis with GANs [55.72247435112475]
This study presents an empirical investigation into the evaluation of synthesis performance, with generative adversarial networks (GANs) as a representative of generative models. In particular, we make in-depth analyses of various factors, including how to represent a data point in the representation space, how to calculate a fair distance using selected samples, and how many instances to use from each set.
arXiv Detail & Related papers (2023-04-04T17:54:32Z)
MV-JAR: Masked Voxel Jigsaw and Reconstruction for LiDAR-Based Self-Supervised Pre-Training [58.07391711548269]
This work proposes the Masked Voxel Jigsaw and Reconstruction (MV-JAR) method for LiDAR-based self-supervised pre-training.
arXiv Detail & Related papers (2023-03-23T17:59:02Z)
A Bag-of-Prototypes Representation for Dataset-Level Applications [24.629132557336312]
This work investigates dataset vectorization for two dataset-level tasks: assessing training set suitability and test set difficulty. We propose a bag-of-prototypes (BoP) dataset representation that extends the image-level bag consisting of patch descriptors to a dataset-level bag consisting of semantic prototypes. BoP consistently shows its advantage over existing representations on a series of benchmarks for the two dataset-level tasks.
arXiv Detail & Related papers (2023-03-23T13:33:58Z)
Unified Visual Relationship Detection with Vision and Language Models [89.77838890788638]
This work focuses on training a single visual relationship detector predicting over the union of label spaces from multiple datasets. We propose UniVRD, a novel bottom-up method for Unified Visual Relationship Detection by leveraging vision and language models. Empirical results on both human-object interaction detection and scene-graph generation demonstrate the competitive performance of our model.
arXiv Detail & Related papers (2023-03-16T00:06:28Z)
TraVLR: Now You See It, Now You Don't! A Bimodal Dataset for Evaluating Visio-Linguistic Reasoning [25.520406167426135]
We present TraVLR, a synthetic dataset comprising four visio-linguistic (V+L) reasoning tasks. Each example in TraVLR redundantly encodes the scene in two modalities, allowing either to be dropped or added during training or testing without losing relevant information. We compare the performance of four state-of-the-art V+L models, finding that while they perform well on test examples from the same modality, they all fail at cross-modal transfer.
arXiv Detail & Related papers (2021-11-21T07:22:44Z)
Comparing Test Sets with Item Response Theory [53.755064720563]
We evaluate 29 datasets using predictions from 18 pretrained Transformer models on individual test examples. We find that Quoref, HellaSwag, and MC-TACO are best suited for distinguishing among state-of-the-art models. We also observe that the span selection task format, which is used for QA datasets like QAMR or SQuAD2.0, is effective in differentiating between strong and weak models.
arXiv Detail & Related papers (2021-06-01T22:33:53Z)
Evaluating Models' Local Decision Boundaries via Contrast Sets [119.38387782979474]
We propose a new annotation paradigm for NLP that helps to close systematic gaps in the test data. We demonstrate the efficacy of contrast sets by creating them for 10 diverse NLP datasets. Although our contrast sets are not explicitly adversarial, model performance is significantly lower on them than on the original test sets.
arXiv Detail & Related papers (2020-04-06T14:47:18Z)