Related papers: CMFeed: A Benchmark Dataset for Controllable Multimodal Feedback Synthesis

CMFeed: A Benchmark Dataset for Controllable Multimodal Feedback Synthesis

URL: http://arxiv.org/abs/2402.07640v2
Date: Thu, 6 Jun 2024 00:26:26 GMT
Title: CMFeed: A Benchmark Dataset for Controllable Multimodal Feedback Synthesis
Authors: Puneet Kumar, Sarthak Malik, Balasubramanian Raman, Xiaobai Li,
Abstract summary: The Controllable Multimodal Feedback Synthesis (CMFeed) dataset enables the generation of sentiment-controlled feedback from multimodal inputs. We propose a benchmark feedback synthesis system comprising encoder, decoder, and controllability modules. It employs transformer and Faster R-CNN networks to extract features and generate sentiment-specific feedback, achieving a sentiment classification accuracy of 77.23%.
Score: 21.247650660908484
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The Controllable Multimodal Feedback Synthesis (CMFeed) dataset enables the generation of sentiment-controlled feedback from multimodal inputs. It contains images, text, human comments, comments' metadata and sentiment labels. Existing datasets for related tasks such as multimodal summarization, visual question answering, visual dialogue, and sentiment-aware text generation do not incorporate training models using human-generated outputs and their metadata, a gap that CMFeed addresses. This capability is critical for developing feedback systems that understand and replicate human-like spontaneous responses. Based on the CMFeed dataset, we define a novel task of controllable feedback synthesis to generate context-aware feedback aligned with the desired sentiment. We propose a benchmark feedback synthesis system comprising encoder, decoder, and controllability modules. It employs transformer and Faster R-CNN networks to extract features and generate sentiment-specific feedback, achieving a sentiment classification accuracy of 77.23%, which is 18.82% higher than models not leveraging the dataset's unique controllability features. Additionally, we incorporate a similarity module for relevance assessment through rank-based metrics.

Related papers

Visual Fidelity Index for Generative Semantic Communications with Critical Information Embedding [29.28886512743758]
We develop a hybrid Gen-SemCom system, where both text prompts and semantically critical features are extracted for transmissions.<n>By integrating the text prompt and critical features, the receiver reconstructs high-fidelity images using a diffusion-based generative model.<n> Experimental results validate the GVIF metric's sensitivity to visual fidelity, correlating with both the PSNR and critical information volume.
arXiv Detail & Related papers (2025-05-15T15:28:32Z)
LatteReview: A Multi-Agent Framework for Systematic Review Automation Using Large Language Models [0.0]
LatteReview is a Python-based framework that leverages large language models (LLMs) and multi-agent systems to automate key elements of the systematic review process. The framework supports features such as Retrieval-Augmented Generation (RAG) for incorporating external context, multimodal reviews, Pydantic-based validation for structured inputs and outputs, and asynchronous programming for handling large-scale datasets.
arXiv Detail & Related papers (2025-01-05T17:53:00Z)
See then Tell: Enhancing Key Information Extraction with Vision Grounding [54.061203106565706]
We introduce STNet (See then Tell Net), a novel end-to-end model designed to deliver precise answers with relevant vision grounding. To enhance the model's seeing capabilities, we collect extensive structured table recognition datasets.
arXiv Detail & Related papers (2024-09-29T06:21:05Z)
Beyond Thumbs Up/Down: Untangling Challenges of Fine-Grained Feedback for Text-to-Image Generation [67.88747330066049]
Fine-grained feedback captures nuanced distinctions in image quality and prompt-alignment. We show that demonstrating its superiority to coarse-grained feedback is not automatic. We identify key challenges in eliciting and utilizing fine-grained feedback.
arXiv Detail & Related papers (2024-06-24T17:19:34Z)
TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with Transformer architecture. TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling. It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
arXiv Detail & Related papers (2023-06-06T03:37:41Z)
Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation [26.933683814025475]
We introduce two novel multimodal datasets: the synthetic CLEVR-ATVC dataset (620K) and the manually pictured Fruit-ATVC dataset (50K). These datasets incorporate both visual and text-based inputs and outputs. To facilitate the accountability of multimodal systems in rejecting human requests, similar to language-based ChatGPT conversations, we introduce specific rules as supervisory signals within the datasets.
arXiv Detail & Related papers (2023-03-10T15:35:11Z)
PENTATRON: PErsonalized coNText-Aware Transformer for Retrieval-based cOnversational uNderstanding [18.788620612619823]
In a large fraction of the global traffic from customers using smart digital assistants, frictions in dialogues may be attributed to incorrect understanding. We build and evaluate a scalable entity correction system, PENTATRON. We show a significant upward movement of the key metric (Exact Match) by up to 500.97%.
arXiv Detail & Related papers (2022-10-22T00:14:47Z)
FlexLip: A Controllable Text-to-Lip System [6.15560473113783]
We tackle a subissue of the text-to-video generation problem, by converting the text into lip landmarks. Our system, entitled FlexLip, is split into two separate modules: text-to-speech and speech-to-lip. We show that by using as little as 20 min of data for the audio generation component, and as little as 5 min for the speech-to-lip component, the objective measures of the generated lip landmarks are comparable with those obtained when using a larger set of training samples.
arXiv Detail & Related papers (2022-06-07T11:51:58Z)
Affective Feedback Synthesis Towards Multimodal Text and Image Data [12.768277167508208]
We have defined a novel task of affective feedback synthesis that deals with generating feedback for input text & corresponding image. A feedback synthesis system has been proposed and trained using ground-truth human comments along with image-text input. The generated feedbacks have been analyzed using automatic and human evaluation.
arXiv Detail & Related papers (2022-03-23T19:28:20Z)
SIFN: A Sentiment-aware Interactive Fusion Network for Review-based Item Recommendation [48.1799451277808]
We propose a Sentiment-aware Interactive Fusion Network (SIFN) for review-based item recommendation. We first encode user/item reviews via BERT and propose a light-weighted sentiment learner to extract semantic features of each review. Then, we propose a sentiment prediction task that guides the sentiment learner to extract sentiment-aware features via explicit sentiment labels.
arXiv Detail & Related papers (2021-08-18T08:04:38Z)
Dynamic Graph Representation Learning for Video Dialog via Multi-Modal Shuffled Transformers [89.00926092864368]
We present a semantics-controlled multi-modal shuffled Transformer reasoning framework for the audio-visual scene aware dialog task. We also present a novel dynamic scene graph representation learning pipeline that consists of an intra-frame reasoning layer producing-semantic graph representations for every frame. Our results demonstrate state-of-the-art performances on all evaluation metrics.
arXiv Detail & Related papers (2020-07-08T02:00:22Z)
Exploring Explainable Selection to Control Abstractive Summarization [51.74889133688111]
We develop a novel framework that focuses on explainability. A novel pair-wise matrix captures the sentence interactions, centrality, and attribute scores. A sentence-deployed attention mechanism in the abstractor ensures the final summary emphasizes the desired content.
arXiv Detail & Related papers (2020-04-24T14:39:34Z)
Transformer Reasoning Network for Image-Text Matching and Retrieval [14.238818604272751]
We consider the problem of accurate image-text matching for the task of multi-modal large-scale information retrieval. We introduce the Transformer Reasoning Network (TERN), an architecture built upon one of the modern relationship-aware self-attentive, the Transformer. TERN is able to separately reason on the two different modalities and to enforce a final common abstract concept space.
arXiv Detail & Related papers (2020-04-20T09:09:01Z)

This list is automatically generated from the titles and abstracts of the papers in this site.