Related papers: HER2 Expression Prediction with Flexible Multi-Modal Inputs via Dynamic Bidirectional Reconstruction

Related papers

BHaRNet: Reliability-Aware Body-Hand Modality Expertized Networks for Fine-grained Skeleton Action Recognition [12.973657570368317]
This work presents a probabilistic dual-stream framework that unifies reliability modeling and multi-modal integration, generalizing expertized learning under uncertainty across both intra-skeleton and cross-modal domains.<n>The framework comprises three key components: (1) a calibration-free preprocessing pipeline that removes canonical-space transformations and learns directly from native coordinates; (2) a probabilistic Noisy-OR fusion that stabilizes reliability-aware dual-stream learning without requiring explicit confidence supervision; and (3) an intra- to cross-modal ensemble that couples four skeleton modalities (Joint, Bone, Joint Motion, and Bone Motion) to RGB representations, bridging structural and visual motion
arXiv Detail & Related papers (2026-01-01T15:13:11Z)
DSTED: Decoupling Temporal Stabilization and Discriminative Enhancement for Surgical Workflow Recognition [13.734575975699963]
Surgical workflow recognition enables context-aware assistance and skill assessment in computer-assisted interventions.<n>Current methods suffer from two critical challenges: prediction jitter across consecutive frames and poor discrimination of ambiguous phases.<n>This paper aims to develop a stable framework by selectively propagating reliable historical information and explicitly modeling uncertainty for hard sample enhancement.
arXiv Detail & Related papers (2025-12-22T13:36:26Z)
Dual-level Modality Debiasing Learning for Unsupervised Visible-Infrared Person Re-Identification [59.59359638389348]
We propose a Dual-level Modality Debiasing Learning framework that implements debiasing at both the model and optimization levels.<n>Experiments on benchmark datasets demonstrate that DMDL could enable modality-invariant feature learning and a more generalized model.
arXiv Detail & Related papers (2025-12-03T12:43:16Z)
Transparent Early ICU Mortality Prediction with Clinical Transformer and Per-Case Modality Attribution [42.85462513661566]
We present a lightweight, transparent multimodal ensemble that fuses physiological time-series measurements with unstructured clinical notes from the first 48 hours of an ICU stay.<n>A logistic regression model combines predictions from two modality-specific models: a bidirectional LSTM for vitals and a finetuned ClinicalModernBERT transformer for notes.<n>On the MIMIC-III benchmark, our late-fusion ensemble improves discrimination over the best single model while maintaining well-calibrated predictions.
arXiv Detail & Related papers (2025-11-19T20:11:49Z)
scMRDR: A scalable and flexible framework for unpaired single-cell multi-omics data integration [53.683726781791385]
We introduce a scalable and flexible generative framework called single-cell Multi-omics Regularized Disentangled Representations (scMRDR) for unpaired multi-omics integration.<n>Our method achieves excellent performance on benchmark datasets in terms of batch correction, modality alignment, and biological signal preservation.
arXiv Detail & Related papers (2025-10-28T21:28:39Z)
DRBD-Mamba for Robust and Efficient Brain Tumor Segmentation with Analytical Insights [54.87947751720332]
Accurate brain tumor segmentation is significant for clinical diagnosis and treatment.<n>Mamba-based State Space Models have demonstrated promising performance.<n>We propose a dual-resolution bi-directional Mamba that captures multi-scale long-range dependencies with minimal computational overhead.
arXiv Detail & Related papers (2025-10-16T07:31:21Z)
Optimization of bi-directional gated loop cell based on multi-head attention mechanism for SSD health state classification model [2.5670390559986442]
This study proposes a hybrid BiGRU-MHA model that incorporates a multi-head attention mechanism to enhance the accuracy and stability of storage device health classification.<n> Experimental results show that the proposed model achieves classification accuracies of 92.70% on the training set and 92.44% on the test set, with a minimal performance gap of only 0.26%.
arXiv Detail & Related papers (2025-06-13T22:01:57Z)
Boosting LLM Reasoning via Spontaneous Self-Correction [43.4980625253775]
One of the approaches for improving math reasoning is self-correction.<n>Existing self-correction approaches treat corrections as standalone post-generation refinements.<n>We propose SPOC, a spontaneous self-correction approach that enables LLMs to generate interleaved solutions and verifications in a single inference pass.
arXiv Detail & Related papers (2025-06-07T21:23:00Z)
BiXFormer: A Robust Framework for Maximizing Modality Effectiveness in Multi-Modal Semantic Segmentation [55.486872677160015]
We reformulate multi-modal semantic segmentation as a mask-level classification task.<n>We propose BiXFormer, which integrates Unified Modality Matching (UMM) and Cross Modality Alignment (CMA)<n> Experiments on both synthetic and real-world multi-modal benchmarks demonstrate the effectiveness of our method.
arXiv Detail & Related papers (2025-06-04T08:04:58Z)
Is Architectural Complexity Overrated? Competitive and Interpretable Knowledge Graph Completion with RelatE [6.959701672059059]
RelatE is an interpretable and modular method that efficiently integrates dual representations for entities and relations.<n>It achieves competitive or superior performance on standard benchmarks.<n>Perturbation studies demonstrate improved robustness, with MRR reduced by up to 61% relative to TransE and by up to 19% compared to RotatE.
arXiv Detail & Related papers (2025-05-25T04:36:52Z)
SAMba-UNet: Synergizing SAM2 and Mamba in UNet with Heterogeneous Aggregation for Cardiac MRI Segmentation [6.451534509235736]
This study proposes an innovative dual-encoder architecture named SAMba-UNet.<n>The framework achieves cross-modal feature collaborative learning by integrating the vision foundation model SAM2, the state-space model Mamba, and the classical UNet.<n> Experiments on the ACDC cardiac MRI dataset demonstrate that the proposed model achieves a Dice coefficient of 0.9103 and an HD95 boundary error of 1.0859 mm.
arXiv Detail & Related papers (2025-05-22T06:57:03Z)
Decoupling Multi-Contrast Super-Resolution: Pairing Unpaired Synthesis with Implicit Representations [6.255537948555454]
Multi-Contrast Super-Resolution techniques can boost the quality of their low-resolution counterparts.<n>Existing MCSR methods often assume fixed resolution settings and all require large, perfectly paired training datasets.<n>We propose a novel Modular Multi-Contrast Super-Resolution framework that eliminates the need for paired training data and supports arbitrary upscaling.
arXiv Detail & Related papers (2025-05-09T07:48:52Z)
ReCoM: Realistic Co-Speech Motion Generation with Recurrent Embedded Transformer [58.49950218437718]
We present ReCoM, an efficient framework for generating high-fidelity and generalizable human body motions synchronized with speech.<n>The core innovation lies in the Recurrent Embedded Transformer (RET), which integrates Dynamic Embedding Regularization (DER) into a Vision Transformer (ViT) core architecture.<n>To enhance model robustness, we incorporate the proposed DER strategy, which equips the model with dual capabilities of noise resistance and cross-domain generalization.
arXiv Detail & Related papers (2025-03-27T16:39:40Z)
Benchmarking Multi-modal Semantic Segmentation under Sensor Failures: Missing and Noisy Modality Robustness [61.87055159919641]
Multi-modal semantic segmentation (MMSS) addresses the limitations of single-modality data by integrating complementary information across modalities.<n>Despite notable progress, a significant gap persists between research and real-world deployment due to variability and uncertainty in multi-modal data quality.<n>We introduce a robustness benchmark that evaluates MMSS models under three scenarios: Entire-Missing Modality (EMM), Random-Missing Modality (RMM), and Noisy Modality (NM)
arXiv Detail & Related papers (2025-03-24T08:46:52Z)
Towards Clinical Practice in CT-Based Pulmonary Disease Screening: An Efficient and Reliable Framework [16.98886836566185]
Cluster-based Sub-Sampling (CSS) method efficiently selects a compact yet comprehensive subset of CT slices.<n>Hybrid Uncertainty Quantification (HUQ) mechanism assesses both Aleatoric Uncertainty (AU) and Epistemic Uncertainty (EU) with minimal computational overhead.
arXiv Detail & Related papers (2024-12-02T14:18:17Z)
Hybrid-Segmentor: A Hybrid Approach to Automated Fine-Grained Crack Segmentation in Civil Infrastructure [52.2025114590481]
We introduce Hybrid-Segmentor, an encoder-decoder based approach that is capable of extracting both fine-grained local and global crack features. This allows the model to improve its generalization capabilities in distinguish various type of shapes, surfaces and sizes of cracks. The proposed model outperforms existing benchmark models across 5 quantitative metrics (accuracy 0.971, precision 0.804, recall 0.744, F1-score 0.770, and IoU score 0.630), achieving state-of-the-art status.
arXiv Detail & Related papers (2024-09-04T16:47:16Z)
Enhancing Emotion Recognition in Incomplete Data: A Novel Cross-Modal Alignment, Reconstruction, and Refinement Framework [11.278202284982209]
We present the Cross-Modal Alignment, Reconstruction, and Refinement (CM-ARR) framework. This framework engages in cross-modal alignment, reconstruction, and refinement phases to handle missing modalities. Experiments on the IEMOCAP and MSP-IMPROV datasets confirm the superior performance of CM-ARR under conditions of both missing and complete modalities.
arXiv Detail & Related papers (2024-07-12T06:44:42Z)
Multi-modal MRI Translation via Evidential Regression and Distribution Calibration [29.56726531611307]
We propose a novel framework that reformulates multi-modal MRI translation as a multi-modal evidential regression problem with distribution calibration.<n>Our approach incorporates two key components: 1) an evidential regression module that estimates uncertainties from different source modalities and an explicit distribution mixture strategy for transparent multi-modal fusion, and 2) a distribution calibration mechanism that adapts to source-target mapping shifts.
arXiv Detail & Related papers (2024-07-10T05:17:01Z)
Expert-Token Resonance MoE: Bidirectional Routing with Efficiency Affinity-Driven Active Selection [16.062265609569003]
Mixture-of-Experts (MoE) architectures have emerged as a paradigm-shifting approach for large language models (LLMs)<n>We propose a novel expert routing framework that incorporates: (1) An efficient routing mechanism with lightweight computation; (2) An adaptive bidirectional selection mechanism leveraging resonance between experts and tokens; and (3) A module that determines the lower bounds of expert capacity based on dynamic token distribution analysis.
arXiv Detail & Related papers (2024-05-24T02:50:44Z)
Joint Multimodal Transformer for Emotion Recognition in the Wild [49.735299182004404]
Multimodal emotion recognition (MMER) systems typically outperform unimodal systems. This paper proposes an MMER method that relies on a joint multimodal transformer (JMT) for fusion with key-based cross-attention.
arXiv Detail & Related papers (2024-03-15T17:23:38Z)
Modality-Collaborative Transformer with Hybrid Feature Reconstruction for Robust Emotion Recognition [35.15390769958969]
We propose a unified framework, Modality-Collaborative Transformer with Hybrid Feature Reconstruction (MCT-HFR) MCT-HFR consists of a novel attention-based encoder which concurrently extracts and dynamically balances the intra- and inter-modality relations. During model training, LFI leverages complete features as supervisory signals to recover local missing features, while GFA is designed to reduce the global semantic gap between pairwise complete and incomplete representations.
arXiv Detail & Related papers (2023-12-26T01:59:23Z)
Towards Calibrated Robust Fine-Tuning of Vision-Language Models [97.19901765814431]
This work proposes a robust fine-tuning method that improves both OOD accuracy and confidence calibration simultaneously in vision language models. We show that both OOD classification and OOD calibration errors have a shared upper bound consisting of two terms of ID data. Based on this insight, we design a novel framework that conducts fine-tuning with a constrained multimodal contrastive loss enforcing a larger smallest singular value.
arXiv Detail & Related papers (2023-11-03T05:41:25Z)
Consistency Trajectory Models: Learning Probability Flow ODE Trajectory of Diffusion [56.38386580040991]
Consistency Trajectory Model (CTM) is a generalization of Consistency Models (CM) CTM enables the efficient combination of adversarial training and denoising score matching loss to enhance performance. Unlike CM, CTM's access to the score function can streamline the adoption of established controllable/conditional generation methods.
arXiv Detail & Related papers (2023-10-01T05:07:17Z)
Exploiting modality-invariant feature for robust multimodal emotion recognition with missing modalities [76.08541852988536]
We propose to use invariant features for a missing modality imagination network (IF-MMIN) We show that the proposed model outperforms all baselines and invariantly improves the overall emotion recognition performance under uncertain missing-modality conditions.
arXiv Detail & Related papers (2022-10-27T12:16:25Z)
Bi-Bimodal Modality Fusion for Correlation-Controlled Multimodal Sentiment Analysis [96.46952672172021]
Bi-Bimodal Fusion Network (BBFN) is a novel end-to-end network that performs fusion on pairwise modality representations. Model takes two bimodal pairs as input due to known information imbalance among modalities.
arXiv Detail & Related papers (2021-07-28T23:33:42Z)
Performance of Dual-Augmented Lagrangian Method and Common Spatial Patterns applied in classification of Motor-Imagery BCI [68.8204255655161]
Motor-imagery based brain-computer interfaces (MI-BCI) have the potential to become ground-breaking technologies for neurorehabilitation. Due to the noisy nature of the used EEG signal, reliable BCI systems require specialized procedures for features optimization and extraction.
arXiv Detail & Related papers (2020-10-13T20:50:13Z)
Towards a Competitive End-to-End Speech Recognition for CHiME-6 Dinner Party Transcription [73.66530509749305]
In this paper, we argue that, even in difficult cases, some end-to-end approaches show performance close to the hybrid baseline. We experimentally compare and analyze CTC-Attention versus RNN-Transducer approaches along with RNN versus Transformer architectures. Our best end-to-end model based on RNN-Transducer, together with improved beam search, reaches quality by only 3.8% WER abs. worse than the LF-MMI TDNN-F CHiME-6 Challenge baseline.
arXiv Detail & Related papers (2020-04-22T19:08:33Z)

This list is automatically generated from the titles and abstracts of the papers in this site.