Dependent Multi-Task Learning with Causal Intervention for Image
Captioning
- URL: http://arxiv.org/abs/2105.08573v1
- Date: Tue, 18 May 2021 14:57:33 GMT
- Title: Dependent Multi-Task Learning with Causal Intervention for Image
Captioning
- Authors: Wenqing Chen, Jidong Tian, Caoyun Fan, Hao He, and Yaohui Jin
- Abstract summary: In this paper, we propose a dependent multi-task learning framework with the causal intervention (DMTCI).
Firstly, we involve an intermediate task, bag-of-categories generation, before the final task, image captioning.
Secondly, we apply Pearl's do-calculus on the model, cutting off the link between the visual features and possible confounders.
Finally, we use a multi-agent reinforcement learning strategy to enable end-to-end training and reduce the inter-task error accumulations.
- Score: 10.6405791176668
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent work for image captioning mainly followed an extract-then-generate
paradigm, pre-extracting a sequence of object-based features and then
formulating image captioning as a single sequence-to-sequence task. Although
promising, we observed two problems in generated captions: 1) content
inconsistency, where models would generate contradictory facts; 2) insufficient
informativeness, where models would miss parts of the important information. From
a causal perspective, the reason is that models have captured spurious
statistical correlations between visual features and certain expressions (e.g.,
visual features of "long hair" and "woman"). In this paper, we propose a
dependent multi-task learning framework with the causal intervention (DMTCI).
Firstly, we involve an intermediate task, bag-of-categories generation, before
the final task, image captioning. The intermediate task would help the model
better understand the visual features and thus alleviate the content
inconsistency problem. Secondly, we apply Pearl's do-calculus on the model,
cutting off the link between the visual features and possible confounders and
thus letting models focus on the causal visual features. Specifically, the
high-frequency concept set is taken as a proxy for the confounders, while the
real confounders are inferred in the continuous space. Finally, we use a
multi-agent reinforcement learning (MARL) strategy to enable end-to-end
training and reduce the inter-task error accumulations. The extensive
experiments show that our model outperforms the baseline models and achieves
competitive performance with state-of-the-art models.
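
To make the described pipeline concrete, the following is a minimal, hypothetical PyTorch sketch of how a dependent multi-task captioner with a backdoor-style intervention over proxy confounders could be wired. All module names, dimensions, and the attention-based adjustment over a learned confounder dictionary are illustrative assumptions based only on the abstract; this is not the authors' implementation, and the MARL training stage is omitted.

```python
# Hypothetical sketch only: module names, sizes, and the exact adjustment are
# assumptions inferred from the abstract, not the released DMTCI code.
import torch
import torch.nn as nn


class CausalIntervention(nn.Module):
    """Backdoor-style adjustment: attend over a dictionary of proxy-confounder
    embeddings and fuse the expected confounder back into the visual features."""

    def __init__(self, feat_dim: int, num_confounders: int):
        super().__init__()
        # Learned embeddings standing in for the high-frequency proxy confounders.
        self.confounders = nn.Parameter(torch.randn(num_confounders, feat_dim))
        self.query = nn.Linear(feat_dim, feat_dim)
        self.fuse = nn.Linear(2 * feat_dim, feat_dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, num_regions, feat_dim)
        attn = torch.softmax(self.query(feats) @ self.confounders.t(), dim=-1)
        expected_z = attn @ self.confounders              # soft mixture of confounders
        return self.fuse(torch.cat([feats, expected_z], dim=-1))


class DependentMultiTaskCaptioner(nn.Module):
    """Intermediate bag-of-categories prediction conditioning a caption decoder."""

    def __init__(self, feat_dim=2048, hidden=512, num_categories=80,
                 vocab_size=10000, num_confounders=100):
        super().__init__()
        self.intervention = CausalIntervention(feat_dim, num_confounders)
        self.category_head = nn.Linear(feat_dim, num_categories)   # intermediate task
        self.category_embed = nn.Linear(num_categories, hidden)
        self.feat_proj = nn.Linear(feat_dim, hidden)
        self.word_embed = nn.Embedding(vocab_size, hidden)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.word_head = nn.Linear(hidden, vocab_size)              # final task

    def forward(self, feats: torch.Tensor, captions: torch.Tensor):
        # feats: (batch, num_regions, feat_dim); captions: (batch, seq_len) token ids
        feats = self.intervention(feats)
        pooled = feats.mean(dim=1)
        category_logits = self.category_head(pooled)
        # The caption decoder is conditioned on both the (intervened) visual
        # features and the predicted bag of categories.
        context = self.feat_proj(pooled) + self.category_embed(torch.sigmoid(category_logits))
        out, _ = self.decoder(self.word_embed(captions), context.unsqueeze(0))
        return category_logits, self.word_head(out)


# Toy usage: 2 images, 36 region features each, teacher-forced captions of length 12.
model = DependentMultiTaskCaptioner()
feats = torch.randn(2, 36, 2048)
captions = torch.randint(0, 10000, (2, 12))
cat_logits, cap_logits = model(feats, captions)
print(cat_logits.shape, cap_logits.shape)  # torch.Size([2, 80]) torch.Size([2, 12, 10000])
```

In this sketch, the intermediate bag-of-categories logits condition the caption decoder, mirroring the dependent-task idea, while the learned confounder dictionary stands in for the high-frequency concept set used as proxy confounders.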
Related papers
- Comprehensive Generative Replay for Task-Incremental Segmentation with Concurrent Appearance and Semantic Forgetting [49.87694319431288]
Generalist segmentation models are increasingly favored for diverse tasks involving various objects from different image sources.
We propose a Comprehensive Generative (CGR) framework that restores appearance and semantic knowledge by synthesizing image-mask pairs.
Experiments on incremental tasks (cardiac, fundus and prostate segmentation) show its clear advantage for alleviating concurrent appearance and semantic forgetting.
arXiv Detail & Related papers (2024-06-28T10:05:58Z)
- Harnessing Diffusion Models for Visual Perception with Meta Prompts [68.78938846041767]
We propose a simple yet effective scheme to harness a diffusion model for visual perception tasks.
We introduce learnable embeddings (meta prompts) to the pre-trained diffusion models to extract proper features for perception.
Our approach achieves new performance records in depth estimation tasks on NYU depth V2 and KITTI, and in the semantic segmentation task on CityScapes.
arXiv Detail & Related papers (2023-12-22T14:40:55Z)
- LOIS: Looking Out of Instance Semantics for Visual Question Answering [17.076621453814926]
We propose a model framework without bounding boxes to understand the causal nexus of object semantics in images.
We implement a mutual relation attention module to model sophisticated and deeper visual semantic relations between instance objects and background information.
Our proposed attention model can further analyze salient image regions by focusing on important word-related questions.
arXiv Detail & Related papers (2023-07-26T12:13:00Z)
- On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization [89.94078728495423]
We show that recent advances in each modality, CLIP image representations and scaling of language models, do not consistently improve multimodal self-rationalization of tasks with multimodal inputs.
Our findings call for a backbone modelling approach that can be built on to advance text generation from images and text beyond image captioning.
arXiv Detail & Related papers (2022-05-24T00:52:40Z)
- A Better Loss for Visual-Textual Grounding [74.81353762517979]
Given a textual phrase and an image, the visual grounding problem is defined as the task of locating the content of the image referenced by the sentence.
It is a challenging task that has several real-world applications in human-computer interaction, image-text reference resolution, and video-text reference resolution.
We propose a model that is able to achieve a higher accuracy than state-of-the-art models thanks to the adoption of a more effective loss function.
arXiv Detail & Related papers (2021-08-11T16:26:54Z)
- Improving Generation and Evaluation of Visual Stories via Semantic Consistency [72.00815192668193]
Given a series of natural language captions, an agent must generate a sequence of images that correspond to the captions.
Prior work has introduced recurrent generative models which outperform text-to-image synthesis models on this task.
We present a number of improvements to prior modeling approaches, including the addition of a dual learning framework.
arXiv Detail & Related papers (2021-05-20T20:42:42Z)
- Dense-Caption Matching and Frame-Selection Gating for Temporal Localization in VideoQA [96.10612095576333]
We propose a video question answering model which effectively integrates multi-modal input sources and finds the temporally relevant information to answer questions.
Our model also comprises dual-level attention (word/object and frame level), multi-head self-cross-integration for different sources (video and dense captions), and gates that pass more relevant information.
We evaluate our model on the challenging TVQA dataset, where each of our model components provides significant gains, and our overall model outperforms the state-of-the-art by a large margin.
arXiv Detail & Related papers (2020-05-13T16:35:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.