Detection and Description of Change in Visual Streams
- URL: http://arxiv.org/abs/2003.12633v2
- Date: Thu, 9 Apr 2020 20:32:19 GMT
- Title: Detection and Description of Change in Visual Streams
- Authors: Davis Gilton, Ruotian Luo, Rebecca Willett, Greg Shakhnarovich
- Abstract summary: We propose a new approach to incorporating unlabeled data into training to generate natural language descriptions of change.
We also develop a framework for estimating the time of change in a visual stream.
We use learned representations for change evidence and consistency of perceived change, and combine these in a regularized graph-cut-based change detector.
- Score: 20.62923173347949
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents a framework for the analysis of changes in visual
streams: ordered sequences of images, possibly separated by significant time
gaps. We propose a new approach to incorporating unlabeled data into training
to generate natural language descriptions of change. We also develop a
framework for estimating the time of change in a visual stream. We use learned
representations for change evidence and consistency of perceived change, and
combine these in a regularized graph-cut-based change detector. Experimental
evaluation on visual stream datasets, which we release as part of our
contribution, shows that representation learning driven by natural language
descriptions significantly improves change detection accuracy, compared to
methods that do not rely on language.
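The abstract describes the detector only at a high level: per-frame representations supply change evidence, a consistency term regularizes the solution, and the change time falls out of the optimization. As a rough illustration of that idea (a minimal sketch, not the authors' implementation; the objective and function names below are my own simplification), a single change time can be estimated by scanning splits of a sequence of frame embeddings:

```python
import numpy as np

def estimate_change_time(embeddings: np.ndarray, lam: float = 0.5) -> int:
    """Estimate a single change point in a stream of frame embeddings.

    Scores every candidate split t by the evidence of change between the
    two segments (distance between segment means), regularized by how
    internally consistent each segment is. A simplified stand-in for the
    paper's regularized graph-cut formulation.
    """
    T = len(embeddings)
    best_t, best_score = 1, -np.inf
    for t in range(1, T):                       # split: [0, t) vs [t, T)
        before, after = embeddings[:t], embeddings[t:]
        mu_b, mu_a = before.mean(axis=0), after.mean(axis=0)
        evidence = np.linalg.norm(mu_b - mu_a)  # change evidence
        spread = (np.linalg.norm(before - mu_b, axis=1).mean()
                  + np.linalg.norm(after - mu_a, axis=1).mean())
        score = evidence - lam * spread         # regularized objective
        if score > best_score:
            best_t, best_score = t, score
    return best_t

# Toy stream: 10 frames near one embedding, then 10 near another.
rng = np.random.default_rng(0)
stream = np.vstack([rng.normal(0.0, 0.1, (10, 16)),
                    rng.normal(1.0, 0.1, (10, 16))])
print(estimate_change_time(stream))  # expected: 10
```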
Related papers
- Enhancing Perception of Key Changes in Remote Sensing Image Change Captioning [49.24306593078429]
We propose a novel framework for remote sensing image change captioning, guided by Key Change Features and Instruction-tuned (KCFI).
KCFI includes a ViT encoder for extracting bi-temporal remote sensing image features, a key feature perceiver for identifying critical change areas, and a pixel-level change detection decoder.
To validate the effectiveness of our approach, we compare it against several state-of-the-art change captioning methods on the LEVIR-CC dataset.
arXiv Detail & Related papers (2024-09-19T09:33:33Z)
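The entry names three components but not how they connect. A loose PyTorch sketch of one plausible arrangement, feeding bi-temporal patch tokens to a pixel-level change decoder (the module design, shapes, and 16-pixel patch assumption are mine, not KCFI's):

```python
import torch
import torch.nn as nn

class BiTemporalChangeDecoder(nn.Module):
    """Toy pixel-level change head over bi-temporal patch features.

    Assumes each image was already encoded into a (B, N, D) grid of
    patch tokens (e.g., by a ViT); N must be a perfect square.
    """
    def __init__(self, dim: int = 256):
        super().__init__()
        self.fuse = nn.Sequential(           # fuse before/after tokens
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, feats_t0: torch.Tensor, feats_t1: torch.Tensor):
        B, N, _ = feats_t0.shape
        side = int(N ** 0.5)
        logits = self.fuse(torch.cat([feats_t0, feats_t1], dim=-1))  # (B, N, 1)
        change_map = logits.view(B, 1, side, side)
        # Upsample the patch-level map to a pixel-level change mask.
        return nn.functional.interpolate(
            change_map, scale_factor=16, mode="bilinear", align_corners=False)

# Usage with dummy 14x14 patch grids (224px image, 16px patches).
t0, t1 = torch.randn(2, 196, 256), torch.randn(2, 196, 256)
print(BiTemporalChangeDecoder()(t0, t1).shape)  # torch.Size([2, 1, 224, 224])
```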
- MS-Former: Memory-Supported Transformer for Weakly Supervised Change Detection with Patch-Level Annotations [50.79913333804232]
We propose a memory-supported transformer (MS-Former) for weakly supervised change detection.
MS-Former consists of a bi-directional attention block (BAB) and a patch-level supervision scheme (PSS).
Experimental results on three benchmark datasets demonstrate the effectiveness of our proposed method in the change detection task.
arXiv Detail & Related papers (2023-11-16T09:57:29Z)
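The patch-level supervision scheme (PSS) is only named in the summary. One plausible reading of patch-level weak supervision, pooling dense change logits to patch scores before applying the loss (my own illustrative construction, not MS-Former's PSS):

```python
import torch
import torch.nn.functional as F

def patch_supervision_loss(pixel_logits: torch.Tensor,
                           patch_labels: torch.Tensor,
                           patch: int = 16) -> torch.Tensor:
    """BCE between max-pooled patch scores and binary patch labels.

    pixel_logits: (B, 1, H, W) dense change logits.
    patch_labels: (B, 1, H//patch, W//patch) with 1 = "patch contains change".
    Max-pooling encodes the weak-label semantics: a patch is positive if
    any pixel inside it changed.
    """
    patch_logits = F.max_pool2d(pixel_logits, kernel_size=patch)
    return F.binary_cross_entropy_with_logits(patch_logits, patch_labels.float())

# Dummy check: 64x64 logits against 4x4 patch labels.
logits = torch.randn(2, 1, 64, 64)
labels = torch.randint(0, 2, (2, 1, 4, 4))
print(patch_supervision_loss(logits, labels).item())
```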
- VcT: Visual change Transformer for Remote Sensing Image Change Detection [16.778418602705287]
We propose a novel Visual change Transformer (VcT) model for the visual change detection problem.
Top-K reliable tokens can be mined from the map and refined using a clustering algorithm.
Extensive experiments on multiple benchmark datasets validated the effectiveness of our proposed VcT model.
arXiv Detail & Related papers (2023-10-17T17:25:31Z)
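As a rough sketch of the "mine Top-K reliable tokens, then refine by clustering" step (the selection rule and the use of scikit-learn's KMeans are my guesses at the idea, not VcT's algorithm):

```python
import numpy as np
from sklearn.cluster import KMeans

def mine_reliable_tokens(tokens_t0, tokens_t1, k=32, n_clusters=4):
    """Pick the K token pairs most similar across time (likely unchanged),
    then keep only the dominant cluster of their averaged features."""
    a = tokens_t0 / np.linalg.norm(tokens_t0, axis=1, keepdims=True)
    b = tokens_t1 / np.linalg.norm(tokens_t1, axis=1, keepdims=True)
    sim = (a * b).sum(axis=1)                  # per-token cosine similarity map
    top_idx = np.argsort(sim)[-k:]             # top-K most "reliable" tokens
    reliable = (tokens_t0[top_idx] + tokens_t1[top_idx]) / 2
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(reliable)
    dominant = np.bincount(labels).argmax()    # refine: keep the biggest cluster
    return top_idx[labels == dominant]

rng = np.random.default_rng(1)
t0, t1 = rng.normal(size=(196, 64)), rng.normal(size=(196, 64))
print(mine_reliable_tokens(t0, t1)[:5])
```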
- Neighborhood Contrastive Transformer for Change Captioning [80.10836469177185]
We propose a neighborhood contrastive transformer to improve the model's ability to perceive various changes across different scenes.
The proposed method achieves state-of-the-art performance on three public datasets with different change scenarios.
arXiv Detail & Related papers (2023-03-06T14:39:54Z)
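The summary does not specify how the neighborhood contrast is constructed. Purely as an illustration of a contrastive objective over paired before/after features, here is a standard InfoNCE form (my own generic construction, not the paper's loss):

```python
import torch
import torch.nn.functional as F

def info_nce(anchors: torch.Tensor, positives: torch.Tensor,
             temperature: float = 0.1) -> torch.Tensor:
    """InfoNCE over N paired feature vectors: each anchor's positive is its
    own pair; all other positives in the batch act as negatives."""
    a = F.normalize(anchors, dim=-1)
    p = F.normalize(positives, dim=-1)
    logits = a @ p.t() / temperature           # (N, N) similarity matrix
    targets = torch.arange(len(a))             # matching index = positive
    return F.cross_entropy(logits, targets)

# e.g., contrast "before" patch features against their "after" counterparts.
before, after = torch.randn(64, 128), torch.randn(64, 128)
print(info_nce(before, after).item())
```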
- SgVA-CLIP: Semantic-guided Visual Adapting of Vision-Language Models for Few-shot Image Classification [84.05253637260743]
We propose a new framework, named Semantic-guided Visual Adapting (SgVA), to extend vision-language pre-trained models.
SgVA produces discriminative task-specific visual features by comprehensively using a vision-specific contrastive loss, a cross-modal contrastive loss, and implicit knowledge distillation.
State-of-the-art results on 13 datasets demonstrate that the adapted visual features complement the cross-modal features well, improving few-shot image classification.
arXiv Detail & Related papers (2022-11-28T14:58:15Z)
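A schematic combination of the three training signals named above, with a KL term standing in for the implicit knowledge distillation (the loss weights and the distillation target are assumptions, not SgVA-CLIP's recipe):

```python
import torch
import torch.nn.functional as F

def sgva_style_loss(vis, vis_aug, text, teacher_logits, student_logits,
                    tau=0.1, w=(1.0, 1.0, 0.5)):
    """Sum of a vision-only contrastive loss (two views of the same images),
    a cross-modal image-text contrastive loss, and a distillation KL."""
    def nce(x, y):
        x, y = F.normalize(x, dim=-1), F.normalize(y, dim=-1)
        return F.cross_entropy(x @ y.t() / tau, torch.arange(len(x)))
    l_vis = nce(vis, vis_aug)                  # vision-specific contrastive
    l_xmodal = nce(vis, text)                  # cross-modal contrastive
    l_distill = F.kl_div(                      # distill teacher predictions
        F.log_softmax(student_logits, dim=-1),
        F.softmax(teacher_logits, dim=-1), reduction="batchmean")
    return w[0] * l_vis + w[1] * l_xmodal + w[2] * l_distill

B, D, C = 16, 128, 10
print(sgva_style_loss(torch.randn(B, D), torch.randn(B, D), torch.randn(B, D),
                      torch.randn(B, C), torch.randn(B, C)).item())
```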
- Integrating Visuospatial, Linguistic and Commonsense Structure into Story Visualization [81.26077816854449]
We first explore the use of constituency parse trees for encoding structured input.
Second, we augment the structured input with commonsense information and study the impact of this external knowledge on visual story generation.
Third, we incorporate visual structure via bounding boxes and dense captioning to provide feedback about the characters/objects in generated images.
arXiv Detail & Related papers (2021-10-21T00:16:02Z)
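To make the structured-input idea concrete: a constituency parse can be linearized into bracketed tokens before being fed to a sequence encoder. A small example using nltk's Tree (the tree is hand-written and the encoding is illustrative; the paper's actual scheme may differ):

```python
from nltk.tree import Tree

# Hand-written parse of a toy story sentence (normally a parser's output).
parse = Tree.fromstring(
    "(S (NP (DT the) (NN dog)) (VP (VBZ chases) (NP (DT a) (NN ball))))")

def linearize(tree):
    """Flatten a constituency tree into bracketed structure tokens,
    so a sequence model can see syntax alongside the words."""
    if isinstance(tree, str):                  # leaf: a plain word
        return [tree]
    tokens = [f"({tree.label()}"]
    for child in tree:
        tokens.extend(linearize(child))
    tokens.append(")")
    return tokens

print(" ".join(linearize(parse)))
# (S (NP (DT the ) (NN dog ) ) (VP (VBZ chases ) (NP (DT a ) (NN ball ) ) ) )
```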
- From Two to One: A New Scene Text Recognizer with Visual Language Modeling Network [70.47504933083218]
We propose a Visual Language Modeling Network (VisionLAN), which views the visual and linguistic information as a union.
VisionLAN significantly improves the speed by 39% and adaptively considers the linguistic information to enhance the visual features for accurate recognition.
arXiv Detail & Related papers (2021-08-22T07:56:24Z)
- Finding It at Another Side: A Viewpoint-Adapted Matching Encoder for Change Captioning [41.044241265804125]
We propose a novel visual encoder to explicitly distinguish viewpoint changes from semantic changes in the change captioning task.
We also propose a novel reinforcement learning process to fine-tune the attention directly with language evaluation rewards.
Our method outperforms state-of-the-art approaches by a large margin on both the Spot-the-Diff and CLEVR-Change datasets.
arXiv Detail & Related papers (2020-09-30T00:13:49Z)
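Fine-tuning "directly with language evaluation rewards", as this last entry describes, is commonly realized as self-critical sequence training: a sampled caption is rewarded by how much its metric score (e.g., CIDEr) beats that of the greedily decoded caption. A schematic of that policy-gradient step (the score inputs below are placeholders; the paper's exact reward and baseline may differ):

```python
import torch

def self_critical_loss(sample_logprobs: torch.Tensor,
                       sampled_scores: torch.Tensor,
                       greedy_scores: torch.Tensor) -> torch.Tensor:
    """REINFORCE with a greedy-decoding baseline (self-critical training).

    sample_logprobs: (B,) summed log-probs of each sampled caption.
    sampled_scores / greedy_scores: (B,) language-metric scores
    (e.g., CIDEr) for sampled and greedy captions -- placeholders here.
    """
    advantage = sampled_scores - greedy_scores  # beat-the-baseline reward
    return -(advantage.detach() * sample_logprobs).mean()

# Dummy values standing in for a real captioner and CIDEr scorer.
logp = torch.randn(4, requires_grad=True)
loss = self_critical_loss(logp, torch.rand(4), torch.rand(4))
loss.backward()
print(loss.item())
```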