Are you doing what I say? On modalities alignment in ALFRED
- URL: http://arxiv.org/abs/2110.05665v1
- Date: Tue, 12 Oct 2021 01:05:37 GMT
- Title: Are you doing what I say? On modalities alignment in ALFRED
- Authors: Ting-Rui Chiang, Yi-Ting Yeh, Ta-Chung Chi, Yau-Shian Wang
- Abstract summary: ALFRED requires a model to complete tasks in simulated house environments specified by instructions in natural language.
Key to success is accurately aligning the text modality with visual inputs.
We introduce approaches aimed at improving model alignment and demonstrate how improved alignment improves end task performance.
- Score: 6.46147328920679
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: ALFRED is a recently proposed benchmark that requires a model to complete
tasks in simulated house environments specified by instructions in natural
language. We hypothesize that the key to success is accurately aligning the
text modality with visual inputs. Motivated by this, we inspect how well
existing models align these modalities using our proposed intrinsic metric,
the boundary adherence score (BAS). The results show that previous models are
indeed failing to perform proper alignment. To address this issue, we introduce
approaches aimed at improving model alignment and demonstrate how improved
alignment improves end task performance.
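The abstract does not spell out how BAS is computed, so the sketch below is only an illustrative interpretation, not the paper's exact formula: it assumes BAS measures, for each action step, the fraction of the model's attention over the instruction that falls inside the sub-instruction segment that step should be following, averaged over steps. The function name, argument layout, and toy numbers are all hypothetical.

```python
import numpy as np

def boundary_adherence_score(attention, segments, step_to_segment):
    """Toy BAS-style metric (an assumed interpretation of the paper's metric).

    attention:       (T, L) array; attention[t, i] is the attention the agent
                     places on instruction token i when predicting action t.
    segments:        list of (start, end) token spans, one per sub-instruction.
    step_to_segment: length-T list mapping each action step to the index of the
                     sub-instruction it should be following.
    Returns the mean fraction of attention mass inside the gold segment.
    """
    scores = []
    for t, seg_idx in enumerate(step_to_segment):
        start, end = segments[seg_idx]
        weights = attention[t]
        total = weights.sum()
        if total > 0:
            scores.append(weights[start:end].sum() / total)
    return float(np.mean(scores)) if scores else 0.0


# Toy usage: 3 action steps, 6 instruction tokens, two sub-instructions.
attn = np.array([
    [0.6, 0.3, 0.05, 0.05, 0.0, 0.0],  # step 0 attends mostly to segment 0
    [0.5, 0.4, 0.05, 0.05, 0.0, 0.0],  # step 1 attends mostly to segment 0
    [0.1, 0.1, 0.2, 0.2, 0.2, 0.2],    # step 2 spreads attention broadly
])
segments = [(0, 2), (2, 6)]            # token spans of the two sub-instructions
step_to_segment = [0, 0, 1]
print(boundary_adherence_score(attn, segments, step_to_segment))  # ~0.87
```

Under this reading, a high score indicates the agent attends to the instruction segment corresponding to the sub-goal it is currently executing, which is the alignment property the paper probes.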
Related papers
- Fine-Grained Verifiers: Preference Modeling as Next-token Prediction in Vision-Language Alignment [57.0121616203175]
We propose FiSAO, a novel self-alignment method that utilizes the model's own visual encoder as a fine-grained verifier to improve vision-language alignment.
By leveraging token-level feedback from the vision encoder, FiSAO significantly improves vision-language alignment, even surpassing traditional preference tuning methods that require additional data.
arXiv Detail & Related papers (2024-10-18T03:34:32Z) - Removing Distributional Discrepancies in Captions Improves Image-Text Alignment [76.31530836622694]
We introduce a model designed to improve the prediction of image-text alignment.
Our approach focuses on generating high-quality training datasets for the alignment task.
We also demonstrate the applicability of our model by ranking the images generated by text-to-image models based on text alignment.
arXiv Detail & Related papers (2024-10-01T17:50:17Z) - Hard Prompts Made Interpretable: Sparse Entropy Regularization for Prompt Tuning with RL [29.01858866450715]
We present RLPrompt, which aims to find optimal prompt tokens leveraging soft Q-learning.
While the results show promise, we have observed that the prompts frequently appear unnatural, which impedes their interpretability.
We address this limitation by using sparse Tsallis entropy regularization, a principled approach to filtering out unlikely tokens from consideration.
arXiv Detail & Related papers (2024-07-20T03:10:19Z) - Fine-Tuning with Divergent Chains of Thought Boosts Reasoning Through Self-Correction in Language Models [63.36637269634553]
We present a novel method of further improving performance by requiring models to compare multiple reasoning chains.
We find that instruction tuning on DCoT datasets boosts the performance of even smaller, and therefore more accessible, language models.
arXiv Detail & Related papers (2024-07-03T15:01:18Z) - Enhancing Visual-Language Modality Alignment in Large Vision Language Models via Self-Improvement [102.22911097049953]
SIMA is a framework that enhances visual and language modality alignment through self-improvement.
It employs an in-context self-critic mechanism to select response pairs for preference tuning.
We demonstrate that SIMA achieves superior modality alignment, outperforming previous approaches.
arXiv Detail & Related papers (2024-05-24T23:09:27Z) - Calibrated Self-Rewarding Vision Language Models [27.686545023186852]
Large Vision-Language Models (LVLMs) have made substantial progress by integrating pre-trained large language models (LLMs) and vision models through instruction tuning.
LVLMs often exhibit the hallucination phenomenon, where generated text responses appear linguistically plausible but contradict the input image.
We propose the Calibrated Self-Rewarding (CSR) approach, which enables the model to self-improve by iteratively generating candidate responses, evaluating the reward for each response, and curating preference data for fine-tuning.
arXiv Detail & Related papers (2024-05-23T14:30:33Z) - Instruction Position Matters in Sequence Generation with Large Language Models [67.87516654892343]
Large language models (LLMs) are capable of performing conditional sequence generation tasks, such as translation or summarization.
We propose enhancing the instruction-following capability of LLMs by shifting the position of task instructions to after the input sentences.
arXiv Detail & Related papers (2023-08-23T12:36:57Z) - Preserving Pre-trained Features Helps Calibrate Fine-tuned Language Models [23.881825575095945]
Large pre-trained language models (PLMs) have demonstrated strong performance on natural language understanding (NLU) tasks through fine-tuning.
However, fine-tuned models still suffer from overconfident predictions, especially in out-of-domain settings.
We demonstrate that the PLMs are well-calibrated on the masked language modeling task with robust predictive confidence under domain shift.
We show that preserving pre-trained features can improve the calibration of fine-tuned language models.
arXiv Detail & Related papers (2023-05-30T17:35:31Z) - On the Limits of Evaluating Embodied Agent Model Generalization Using Validation Sets [101.28658250723804]
This paper experiments with augmenting a transformer model with modules that effectively utilize a wider field of view and learn to choose whether the next step requires a navigation or manipulation action.
We observe that the proposed modules result in improved, and in fact state-of-the-art, performance on an unseen validation set of a popular benchmark dataset, ALFRED.
We highlight this result because we believe it may reflect a wider phenomenon in machine learning tasks, one that is primarily noticeable in benchmarks that limit evaluations on test splits.
arXiv Detail & Related papers (2022-05-18T23:52:21Z)