MultiSChuBERT: Effective Multimodal Fusion for Scholarly Document
Quality Prediction
- URL: http://arxiv.org/abs/2308.07971v1
- Date: Tue, 15 Aug 2023 18:18:34 GMT
- Authors: Gideon Maillette de Buy Wenniger, Thomas van Dongen, Lambert Schomaker
- Abstract summary: Multimodality has been shown to improve the performance on scholarly document quality prediction tasks.
We propose the multimodal predictive model MultiSChuBERT.
We show that gradual unfreezing of the weights of the visual sub-model reduces its tendency to overfit the data.
- Score: 2.900522306460408
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatic assessment of the quality of scholarly documents is a difficult
task with high potential impact. Multimodality, in particular the addition of
visual information next to text, has been shown to improve the performance on
scholarly document quality prediction (SDQP) tasks. We propose the multimodal
predictive model MultiSChuBERT. It combines a textual model based on chunking
full paper text and aggregating computed BERT chunk-encodings (SChuBERT), with
a visual model based on Inception V3. Our work contributes to the current
state-of-the-art in SDQP in three ways. First, we show that the method of
combining visual and textual embeddings can substantially influence the
results. Second, we demonstrate that gradual unfreezing of the weights of the
visual sub-model reduces its tendency to overfit the data, improving results.
Third, we show the retained benefit of multimodality when replacing standard
BERT$_{\textrm{BASE}}$ embeddings with more recent state-of-the-art text
embedding models.
Using BERT$_{\textrm{BASE}}$ embeddings, on the (log) number of citations
prediction task with the ACL-BiblioMetry dataset, our MultiSChuBERT
(text+visual) model obtains an $R^{2}$ score of 0.454 compared to 0.432 for the
SChuBERT (text only) model. Similar improvements are obtained on the PeerRead
accept/reject prediction task. In our experiments using SciBERT, SciNCL,
SPECTER and SPECTER2.0 embeddings, we show that each of these tailored
embeddings adds further improvements over the standard BERT$_{\textrm{BASE}}$
embeddings, with the SPECTER2.0 embeddings performing best.
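The $R^{2}$ scores quoted above are coefficients of determination on log-transformed citation counts. The following is a minimal sketch of how such a score can be computed, not the authors' evaluation code. The citation counts are made-up illustrative values, and the use of log(1 + c) as the transform is an assumption, since the abstract only says "(log) number of citations".

```python
import math


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: error of the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical citation counts (illustrative only, not from the dataset).
citations = [0, 3, 12, 150]

# Assumed transform: log(1 + c), which keeps zero-citation papers finite.
targets = [math.log1p(c) for c in citations]

# Two reference predictors that bracket the scale of the metric.
perfect_predictions = list(targets)
mean_predictions = [sum(targets) / len(targets)] * len(targets)

perfect_r2 = r2_score(targets, perfect_predictions)
mean_r2 = r2_score(targets, mean_predictions)
```

Under this definition a predictor that reproduces the targets exactly scores 1, and one that always outputs the mean target scores 0, so the reported 0.454 and 0.432 can be read as the fraction of variance in log citation counts that each model explains.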
Related papers
- NeKo: Toward Post Recognition Generative Correction Large Language Models with Task-Oriented Experts [57.53692236201343]
We propose a Multi-Task Correction MoE, where we train the experts to become an "expert" of speech-to-text, language-to-text and vision-to-text datasets.
NeKo performs competitively on grammar and post-OCR correction as a multi-task model.
arXiv Detail & Related papers (2024-11-08T20:11:24Z) - Multimodality Helps Few-Shot 3D Point Cloud Semantic Segmentation [61.91492500828508]
Few-shot 3D point cloud segmentation (FS-PCS) aims at generalizing models to segment novel categories with minimal support samples.
We introduce a cost-free multimodal FS-PCS setup, utilizing textual labels and the potentially available 2D image modality.
We propose a simple yet effective Test-time Adaptive Cross-modal Seg (TACC) technique to mitigate training bias.
arXiv Detail & Related papers (2024-10-29T19:28:41Z) - Triple Modality Fusion: Aligning Visual, Textual, and Graph Data with Large Language Models for Multi-Behavior Recommendations [12.154043062308201]
This paper introduces a novel framework for multi-behavior recommendations, leveraging the fusion of triple-modality.
Our proposed model called Triple Modality Fusion (TMF) utilizes the power of large language models (LLMs) to align and integrate these three modalities.
Extensive experiments demonstrate the effectiveness of our approach in improving recommendation accuracy.
arXiv Detail & Related papers (2024-10-16T04:44:15Z) - Multi-BERT for Embeddings for Recommendation System [0.0]
We propose a novel approach for generating document embeddings using a combination of Sentence-BERT and RoBERTa.
Our approach treats sentences as tokens and generates embeddings for them, allowing the model to capture both intra-sentence and inter-sentence relations within a document.
We evaluate our model on a book recommendation task and demonstrate its effectiveness in generating more semantically rich and accurate document embeddings.
arXiv Detail & Related papers (2023-08-24T19:36:05Z) - Exploring Multimodal Sentiment Analysis via CBAM Attention and
Double-layer BiLSTM Architecture [3.9850392954445875]
In our model, we use BERT + BiLSTM as a new feature extractor to capture the long-distance dependencies in sentences.
To remove redundant information, CNN and CBAM attention are added after splicing text features and picture features.
The experimental results show that our model achieves a sound effect, similar to the advanced model.
arXiv Detail & Related papers (2023-03-26T12:34:01Z) - Team Triple-Check at Factify 2: Parameter-Efficient Large Foundation
Models with Feature Representations for Multi-Modal Fact Verification [5.552606716659022]
Multi-modal fact verification has become an important but challenging issue on social media.
In this paper, we propose the Pre-CoFactv2 framework for modeling fine-grained text and input embeddings with lightening parameters.
We show that Pre-CoFactv2 outperforms Pre-CoFact by a large margin and achieved new state-of-the-art results at the Factify challenge at AAAI 2023.
arXiv Detail & Related papers (2023-02-12T18:08:54Z) - Beyond Triplet: Leveraging the Most Data for Multimodal Machine
Translation [53.342921374639346]
Multimodal machine translation aims to improve translation quality by incorporating information from other modalities, such as vision.
Previous MMT systems mainly focus on better access and use of visual information and tend to validate their methods on image-related datasets.
This paper establishes new methods and new datasets for MMT.
arXiv Detail & Related papers (2022-12-20T15:02:38Z) - Long Document Summarization with Top-down and Bottom-up Inference [113.29319668246407]
We propose a principled inference framework to improve summarization models on two aspects.
Our framework assumes a hierarchical latent structure of a document where the top-level captures the long range dependency.
We demonstrate the effectiveness of the proposed framework on a diverse set of summarization datasets.
arXiv Detail & Related papers (2022-03-15T01:24:51Z) - BERMo: What can BERT learn from ELMo? [6.417011237981518]
We use the linear combination scheme proposed in Embeddings from Language Models (ELMo) to combine the scaled internal representations from different network depths.
Our approach has two-fold benefits: (1) improved gradient flow for the downstream task and (2) increased representative power.
arXiv Detail & Related papers (2021-10-18T17:35:41Z) - LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document
Understanding [49.941806975280045]
Pre-training of text and layout has proved effective in a variety of visually-rich document understanding tasks.
We present LayoutLMv2 by pre-training text, layout and image in a multi-modal framework.
arXiv Detail & Related papers (2020-12-29T13:01:52Z) - Pre-training for Abstractive Document Summarization by Reinstating
Source Text [105.77348528847337]
This paper presents three pre-training objectives which allow us to pre-train a Seq2Seq based abstractive summarization model on unlabeled text.
Experiments on two benchmark summarization datasets show that all three objectives can improve performance upon baselines.
arXiv Detail & Related papers (2020-04-04T05:06:26Z)