Synthesizing Sentiment-Controlled Feedback For Multimodal Text and Image Data
- URL: http://arxiv.org/abs/2402.07640v3
- Date: Fri, 18 Oct 2024 02:50:53 GMT
- Title: Synthesizing Sentiment-Controlled Feedback For Multimodal Text and Image Data
- Authors: Puneet Kumar, Sarthak Malik, Balasubramanian Raman, Xiaobai Li
- Abstract summary: We have constructed a large-scale Controllable Multimodal Feedback Synthesis dataset and propose a controllable feedback synthesis system.
The system features an encoder, decoder, and controllability block for textual and visual inputs.
The CMFeed dataset includes images, texts, reactions to the posts, human comments with relevance scores, and reactions to these comments.
These reactions train the model to produce feedback with specified sentiments, achieving a sentiment classification accuracy of 77.23%, which is 18.82% higher than the accuracy without controllability.
- Score: 21.247650660908484
- Abstract: The ability to generate sentiment-controlled feedback in response to multimodal inputs comprising text and images addresses a critical gap in human-computer interaction. This capability allows systems to provide empathetic, accurate, and engaging responses, with useful applications in education, healthcare, marketing, and customer service. To this end, we have constructed a large-scale Controllable Multimodal Feedback Synthesis (CMFeed) dataset and propose a controllable feedback synthesis system. The system features an encoder, decoder, and controllability block for textual and visual inputs. It extracts features using a transformer and Faster R-CNN networks, combining them to generate feedback. The CMFeed dataset includes images, texts, reactions to the posts, human comments with relevance scores, and reactions to these comments. These reactions train the model to produce feedback with specified sentiments, achieving a sentiment classification accuracy of 77.23%, which is 18.82% higher than the accuracy without controllability. The system also incorporates a similarity module for assessing feedback relevance through rank-based metrics and an interpretability technique to analyze the contributions of textual and visual features during feedback generation. Access to the CMFeed dataset and the system's code is available at https://github.com/MIntelligence-Group/CMFeed.
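The abstract describes an encoder-decoder with a controllability block that fuses transformer-based text features and Faster R-CNN region features and steers generation toward a requested sentiment. The authors' released code is at the GitHub link above; the minimal PyTorch-style sketch below only illustrates that conditioning idea, and all dimensions, class names, and the additive sentiment embedding are assumptions, not the paper's implementation.

```python
# Hedged sketch: fuse text-token and image-region features, add a sentiment
# embedding as the controllability signal, and decode feedback tokens.
import torch
import torch.nn as nn

class SentimentControlledFeedback(nn.Module):
    def __init__(self, vocab_size=30522, d_text=768, d_region=2048,
                 d_model=512, n_sentiments=2):
        super().__init__()
        # Project both modalities into a shared space before concatenation.
        self.text_proj = nn.Linear(d_text, d_model)      # transformer token features
        self.region_proj = nn.Linear(d_region, d_model)  # Faster R-CNN RoI features
        # Controllability block (assumed form): embedding of the requested
        # sentiment (e.g. 0 = negative, 1 = positive) added to the fused memory.
        self.sentiment_emb = nn.Embedding(n_sentiments, d_model)
        self.token_emb = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, text_feats, region_feats, prev_tokens, sentiment_id):
        # text_feats: (B, n_tokens, d_text); region_feats: (B, n_regions, d_region)
        memory = torch.cat(
            [self.text_proj(text_feats), self.region_proj(region_feats)], dim=1
        )
        memory = memory + self.sentiment_emb(sentiment_id).unsqueeze(1)
        out = self.decoder(tgt=self.token_emb(prev_tokens), memory=memory)
        return self.lm_head(out)  # next-token logits for the feedback text

model = SentimentControlledFeedback()
logits = model(
    text_feats=torch.randn(1, 12, 768),        # e.g. BERT-style token features
    region_feats=torch.randn(1, 36, 2048),     # e.g. detector region features
    prev_tokens=torch.randint(0, 30522, (1, 8)),
    sentiment_id=torch.tensor([1]),            # request positive feedback
)
print(logits.shape)  # torch.Size([1, 8, 30522])
```

Adding a sentiment embedding to the decoder memory is one simple way to realize a controllability block; the system described in the paper may condition generation differently.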
Related papers
- See then Tell: Enhancing Key Information Extraction with Vision Grounding [54.061203106565706]
We introduce STNet (See then Tell Net), a novel end-to-end model designed to deliver precise answers with relevant vision grounding.
To enhance the model's seeing capabilities, we collect extensive structured table recognition datasets.
arXiv Detail & Related papers (2024-09-29T06:21:05Z) - Beyond Thumbs Up/Down: Untangling Challenges of Fine-Grained Feedback for Text-to-Image Generation [67.88747330066049]
Fine-grained feedback captures nuanced distinctions in image quality and prompt-alignment.
We show that its superiority over coarse-grained feedback is not automatic.
We identify key challenges in eliciting and utilizing fine-grained feedback.
arXiv Detail & Related papers (2024-06-24T17:19:34Z) - TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
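As a rough illustration of that multi-task sharing (not the TextFormer code; all names and sizes are hypothetical), the sketch below runs classification, segmentation, and recognition heads over one shared set of query features so their losses can be optimized jointly.

```python
# Hedged sketch: three task heads over shared query features.
import torch
import torch.nn as nn

class MultiTaskHeads(nn.Module):
    def __init__(self, d_model=256, n_classes=2, n_chars=97, mask_size=28):
        super().__init__()
        self.cls_head = nn.Linear(d_model, n_classes)              # text / background
        self.seg_head = nn.Linear(d_model, mask_size * mask_size)  # coarse per-query mask
        self.rec_head = nn.Linear(d_model, n_chars)                # character logits

    def forward(self, query_feats):
        # query_feats: (batch, n_queries, d_model) from a shared image encoder + text decoder
        return (
            self.cls_head(query_feats),
            self.seg_head(query_feats),
            self.rec_head(query_feats),
        )

heads = MultiTaskHeads()
cls_logits, seg_logits, rec_logits = heads(torch.randn(2, 100, 256))
print(cls_logits.shape, seg_logits.shape, rec_logits.shape)
```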
arXiv Detail & Related papers (2023-06-06T03:37:41Z) - Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation [26.933683814025475]
We introduce two novel multimodal datasets: the synthetic CLEVR-ATVC dataset (620K) and the manually pictured Fruit-ATVC dataset (50K).
These datasets incorporate both visual and text-based inputs and outputs.
To facilitate the accountability of multimodal systems in rejecting human requests, similar to language-based ChatGPT conversations, we introduce specific rules as supervisory signals within the datasets.
arXiv Detail & Related papers (2023-03-10T15:35:11Z) - PENTATRON: PErsonalized coNText-Aware Transformer for Retrieval-based cOnversational uNderstanding [18.788620612619823]
In a large fraction of the global traffic from customers using smart digital assistants, frictions in dialogues may be attributed to incorrect understanding.
We build and evaluate a scalable entity correction system, PENTATRON.
We show a significant improvement in the key metric (Exact Match) of up to 500.97%.
arXiv Detail & Related papers (2022-10-22T00:14:47Z) - FlexLip: A Controllable Text-to-Lip System [6.15560473113783]
We tackle a sub-problem of text-to-video generation by converting the text into lip landmarks.
Our system, entitled FlexLip, is split into two separate modules: text-to-speech and speech-to-lip.
We show that by using as little as 20 min of data for the audio generation component, and as little as 5 min for the speech-to-lip component, the objective measures of the generated lip landmarks are comparable with those obtained when using a larger set of training samples.
arXiv Detail & Related papers (2022-06-07T11:51:58Z) - Affective Feedback Synthesis Towards Multimodal Text and Image Data [12.768277167508208]
We have defined a novel task of affective feedback synthesis that deals with generating feedback for input text and a corresponding image.
A feedback synthesis system has been proposed and trained using ground-truth human comments along with image-text input.
The generated feedback has been analyzed using automatic and human evaluation.
arXiv Detail & Related papers (2022-03-23T19:28:20Z) - SIFN: A Sentiment-aware Interactive Fusion Network for Review-based Item Recommendation [48.1799451277808]
We propose a Sentiment-aware Interactive Fusion Network (SIFN) for review-based item recommendation.
We first encode user/item reviews via BERT and propose a lightweight sentiment learner to extract semantic features of each review.
Then, we propose a sentiment prediction task that guides the sentiment learner to extract sentiment-aware features via explicit sentiment labels.
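A rough sketch of that setup, assuming a standard bert-base-uncased checkpoint and hypothetical layer sizes (not the SIFN implementation): reviews are encoded with BERT, a lightweight sentiment learner produces sentiment-aware features, and an auxiliary sentiment-prediction head trained on explicit sentiment labels guides that learner.

```python
# Hedged sketch: BERT review encoder + sentiment learner + auxiliary head.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class SentimentAwareReviewEncoder(nn.Module):
    def __init__(self, d_hidden=128, n_sentiments=2):
        super().__init__()
        self.bert = AutoModel.from_pretrained("bert-base-uncased")
        self.sentiment_learner = nn.Sequential(
            nn.Linear(self.bert.config.hidden_size, d_hidden), nn.ReLU()
        )
        self.sentiment_head = nn.Linear(d_hidden, n_sentiments)  # auxiliary task

    def forward(self, input_ids, attention_mask):
        # [CLS] token representation of the review
        pooled = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state[:, 0]
        feats = self.sentiment_learner(pooled)      # sentiment-aware review features
        return feats, self.sentiment_head(feats)    # features + sentiment logits

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = SentimentAwareReviewEncoder()
batch = tokenizer(["Great value, works as advertised."], return_tensors="pt")
features, sentiment_logits = encoder(batch["input_ids"], batch["attention_mask"])
# Cross-entropy on sentiment_logits against explicit sentiment labels is the
# auxiliary signal that guides the sentiment learner.
```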
arXiv Detail & Related papers (2021-08-18T08:04:38Z) - Dynamic Graph Representation Learning for Video Dialog via Multi-Modal Shuffled Transformers [89.00926092864368]
We present a semantics-controlled multi-modal shuffled Transformer reasoning framework for the audio-visual scene aware dialog task.
We also present a novel dynamic scene graph representation learning pipeline that consists of an intra-frame reasoning layer producing semantic graph representations for every frame.
Our results demonstrate state-of-the-art performances on all evaluation metrics.
arXiv Detail & Related papers (2020-07-08T02:00:22Z) - Exploring Explainable Selection to Control Abstractive Summarization [51.74889133688111]
We develop a novel framework that focuses on explainability.
A novel pair-wise matrix captures the sentence interactions, centrality, and attribute scores.
A sentence-deployed attention mechanism in the abstractor ensures the final summary emphasizes the desired content.
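A loose sketch of that selection idea follows, with a cosine-similarity matrix standing in for the pair-wise interaction matrix and with illustrative weights; all of these choices are assumptions, not the paper's scoring function.

```python
# Hedged sketch: rank sentences by combining pair-wise centrality with
# per-sentence attribute scores before abstraction.
import torch
import torch.nn.functional as F

def rank_sentences(sent_embs, attribute_scores, w_centrality=0.6, w_attribute=0.4):
    # sent_embs: (n_sentences, d) sentence embeddings
    # attribute_scores: (n_sentences,) e.g. novelty or relevance scores in [0, 1]
    pairwise = F.cosine_similarity(
        sent_embs.unsqueeze(1), sent_embs.unsqueeze(0), dim=-1
    )                                   # (n, n) interaction matrix
    centrality = pairwise.mean(dim=1)   # how strongly a sentence relates to the rest
    scores = w_centrality * centrality + w_attribute * attribute_scores
    return torch.argsort(scores, descending=True)  # extraction order

order = rank_sentences(torch.randn(5, 64), torch.rand(5))
print(order)
```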
arXiv Detail & Related papers (2020-04-24T14:39:34Z) - Transformer Reasoning Network for Image-Text Matching and Retrieval [14.238818604272751]
We consider the problem of accurate image-text matching for the task of multi-modal large-scale information retrieval.
We introduce the Transformer Reasoning Network (TERN), an architecture built upon one of the modern relationship-aware self-attentive architectures, the Transformer.
TERN is able to separately reason on the two different modalities and to enforce a final common abstract concept space.
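The general pattern this points to is a dual encoder that reasons over each modality separately and projects both into a common space for matching; a minimal sketch under assumed dimensions and mean-pooling (not the TERN code) is shown below.

```python
# Hedged sketch: separate per-modality Transformers, shared matching space.
import torch
import torch.nn as nn
import torch.nn.functional as F

def small_transformer(d_model=512):
    layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=2)

class DualModalityEncoder(nn.Module):
    def __init__(self, d_model=512, d_common=256):
        super().__init__()
        self.text_encoder = small_transformer(d_model)
        self.image_encoder = small_transformer(d_model)
        self.text_proj = nn.Linear(d_model, d_common)
        self.image_proj = nn.Linear(d_model, d_common)

    def forward(self, text_tokens, image_regions):
        # text_tokens: (batch, n_tokens, d_model); image_regions: (batch, n_regions, d_model)
        t = self.text_proj(self.text_encoder(text_tokens).mean(dim=1))
        v = self.image_proj(self.image_encoder(image_regions).mean(dim=1))
        return F.cosine_similarity(t, v, dim=-1)  # matching score used for retrieval

model = DualModalityEncoder()
score = model(torch.randn(2, 10, 512), torch.randn(2, 36, 512))
print(score.shape)  # torch.Size([2])
```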
arXiv Detail & Related papers (2020-04-20T09:09:01Z)