Entity Re-identification in Visual Storytelling via Contrastive Reinforcement Learning
- URL: http://arxiv.org/abs/2507.07340v2
- Date: Fri, 11 Jul 2025 00:58:38 GMT
- Title: Entity Re-identification in Visual Storytelling via Contrastive Reinforcement Learning
- Authors: Daniel A. P. Oliveira, David Martins de Matos
- Abstract summary: Visual storytelling systems struggle to maintain character and object identity across frames. We propose a contrastive reinforcement learning approach that trains models to discriminate between coherent image sequences and stories from unrelated images.
- Score: 0.2455468619225742
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual storytelling systems, particularly large vision-language models, struggle to maintain character and object identity across frames, often failing to recognize when entities in different images represent the same individuals or objects, leading to inconsistent references and referential hallucinations. This occurs because models lack explicit training on when to establish entity connections across frames. We propose a contrastive reinforcement learning approach that trains models to discriminate between coherent image sequences and stories from unrelated images. We extend the StoryReasoning dataset with synthetic negative examples to teach appropriate entity connection behavior. We employ Direct Preference Optimization with a dual-component reward function that promotes grounding and re-identification of entities in real stories while penalizing incorrect entity connections in synthetic contexts. Using this contrastive framework, we fine-tune Qwen Storyteller (based on Qwen2.5-VL 7B). Evaluation shows improvements in grounding mAP from 0.27 to 0.31 (+14.8%) and F1 score from 0.35 to 0.41 (+17.1%). Pronoun grounding accuracy improved across all pronoun types except "its", and cross-frame character and object persistence increased across all frame counts, with entities appearing in 5 or more frames advancing from 29.3% to 33.3% (+13.7%). Well-structured stories, containing both the chain-of-thought and the grounded story, increased from 79.1% to 97.5% (+23.3%).
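The dual-component reward is only described at a high level in the abstract. As a rough illustration, a minimal sketch of how such a signal could be scored is shown below, assuming a simple mention/link representation; all field names, the IoU threshold, and the scoring details are hypothetical and not the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple

@dataclass
class Mention:
    entity_id: str                          # identity label the model assigned
    frame: int                              # frame index of this mention
    best_iou: float                         # IoU with the closest detected box
    links: Set[Tuple[int, str]] = field(default_factory=set)  # asserted (frame, entity) identity links

def dual_component_reward(mentions: List[Mention],
                          gold_links: Set[Tuple[int, str]],
                          is_synthetic: bool,
                          iou_thr: float = 0.5) -> float:
    """Reward grounding and correct re-identification in real stories;
    penalize cross-frame identity links asserted for synthetic sequences
    built from unrelated images."""
    if not mentions:
        return 0.0
    # Grounding component: fraction of mentions tied to a detected region.
    grounding = sum(m.best_iou >= iou_thr for m in mentions) / len(mentions)
    # All cross-frame identity links the model asserted.
    asserted = set().union(*(m.links for m in mentions))
    if is_synthetic:
        # Unrelated images: every asserted cross-frame identity is spurious.
        return grounding - len(asserted)
    # Real story: reward links that agree with annotated identity chains.
    reid = len(asserted & gold_links) / max(len(gold_links), 1)
    return grounding + reid
```

In a DPO setup, scores like these would only be used to rank candidate outputs into chosen/rejected preference pairs; the paper's exact reward formulation may differ.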
Related papers
- Synthetic Visual Genome [88.00433979509218]
We introduce ROBIN, an instruction-tuned model trained with densely annotated relationships that can construct high-quality dense scene graphs at scale. In total, our dataset contains 146K images and 5.6M relationships for 2.6M objects. The ROBIN-3B model, despite being trained on fewer than 3 million instances, outperforms similar-size models trained on over 300 million instances on relationship understanding benchmarks.
arXiv Detail & Related papers (2025-06-09T11:09:10Z) - Do It Yourself: Learning Semantic Correspondence from Pseudo-Labels [69.58063088519852]
We propose to improve semantic correspondence estimation via 3D-aware pseudo-labeling. Specifically, we train an adapter to refine off-the-shelf features using pseudo-labels obtained via 3D-aware chaining. While reducing the need for dataset-specific annotations, we set a new state of the art on SPair-71k with an absolute gain of over 4%.
arXiv Detail & Related papers (2025-06-05T17:54:33Z) - StoryReasoning Dataset: Using Chain-of-Thought for Scene Understanding and Grounded Story Generation [0.2455468619225742]
Visual storytelling systems struggle to maintain character identity across frames and link actions to appropriate subjects. We propose StoryReasoning, a dataset containing 4,178 stories derived from 52,016 movie images. We create Qwen Storyteller, which performs end-to-end object detection, re-identification, and landmark detection while maintaining consistent object references throughout the story.
arXiv Detail & Related papers (2025-05-15T13:42:14Z) - Barking Up The Syntactic Tree: Enhancing VLM Training with Syntactic Losses [31.85977999591524]
Vision-Language Models implicitly learn to associate image regions with words from large-scale training data. Rich semantic and syntactic structures within the text modality have been overlooked as sources of supervision. Hierarchically STructured Learning (HIST) enhances spatial vision-language alignment without using additional human annotations.
arXiv Detail & Related papers (2024-12-11T05:36:18Z) - Learning from Synthetic Data for Visual Grounding [55.21937116752679]
We show that SynGround can improve the localization capabilities of off-the-shelf vision-and-language models. Data generated with SynGround improves the pointing-game accuracy of pretrained ALBEF and BLIP models by 4.81 and 17.11 absolute percentage points, respectively.
arXiv Detail & Related papers (2024-03-20T17:59:43Z) - Improved Visual Grounding through Self-Consistent Explanations [58.51131933246332]
We propose a strategy for augmenting existing text-image datasets with paraphrases using a large language model.
SelfEQ is a weakly supervised objective that encourages self-consistency between the visual explanation maps produced for a phrase and its paraphrases.
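As a rough sketch of what a self-consistency objective over explanation maps could look like (the attribution method, the distance measure, and all names here are illustrative assumptions, not the SelfEQ formulation):

```python
import torch
import torch.nn.functional as F

def self_consistency_loss(map_phrase: torch.Tensor,
                          map_paraphrase: torch.Tensor) -> torch.Tensor:
    """Penalize disagreement between the visual explanation maps a model
    produces for a phrase and for its LLM-generated paraphrase on the same
    image. Inputs are (H, W) attribution maps; the symmetric KL distance
    used here is an illustrative choice."""
    p = F.softmax(map_phrase.flatten(), dim=0)      # normalize to a spatial distribution
    q = F.softmax(map_paraphrase.flatten(), dim=0)
    return 0.5 * (F.kl_div(q.log(), p, reduction="sum")
                  + F.kl_div(p.log(), q, reduction="sum"))
```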
arXiv Detail & Related papers (2023-12-07T18:59:22Z) - Zero-shot Referring Expression Comprehension via Structural Similarity Between Images and Captions [6.231370972617915]
Zero-shot referring expression comprehension aims at localizing bounding boxes in an image corresponding to provided textual prompts.
Existing vision-language alignment models, e.g., CLIP, struggle with both aspects, so they cannot be directly used for this task.
We leverage large foundation models to disentangle both images and texts into triplets of the form (subject, predicate, object).
arXiv Detail & Related papers (2023-11-28T18:55:37Z) - Improving Generation and Evaluation of Visual Stories via Semantic Consistency [72.00815192668193]
Given a series of natural language captions, an agent must generate a sequence of images that correspond to the captions.
Prior work has introduced recurrent generative models that outperform text-to-image synthesis models on this task.
We present a number of improvements to prior modeling approaches, including the addition of a dual learning framework.
arXiv Detail & Related papers (2021-05-20T20:42:42Z) - Visually Grounded Compound PCFGs [65.04669567781634]
Exploiting visual groundings for language understanding has recently been drawing much attention.
We study visually grounded grammar induction and learn a constituency parser from both unlabeled text and its visual captions.
arXiv Detail & Related papers (2020-09-25T19:07:00Z) - Tackling the Unannotated: Scene Graph Generation with Bias-Reduced Models [8.904910414410855]
State-of-the-art results are still far from satisfactory; for example, models obtain around 31% overall recall at R@100.
We propose a novel SGG training scheme that capitalizes on self-learned knowledge.
arXiv Detail & Related papers (2020-08-18T10:04:51Z) - Disentangled Graph Collaborative Filtering [100.26835145396782]
Disentangled Graph Collaborative Filtering (DGCF) is a new model for learning informative representations of users and items from interaction data.
By modeling a distribution over intents for each user-item interaction, we iteratively refine the intent-aware interaction graphs and representations.
DGCF achieves significant improvements over several state-of-the-art models like NGCF, DisenGCN, and MacridVAE.
arXiv Detail & Related papers (2020-07-03T15:37:25Z)