Reshaping Representation Space to Balance the Safety and Over-rejection in Large Audio Language Models
- URL: http://arxiv.org/abs/2505.19670v1
- Date: Mon, 26 May 2025 08:25:25 GMT
- Title: Reshaping Representation Space to Balance the Safety and Over-rejection in Large Audio Language Models
- Authors: Hao Yang, Lizhen Qu, Ehsan Shareghi, Gholamreza Haffari
- Abstract summary: Large Audio Language Models (LALMs) have extended the capabilities of Large Language Models (LLMs). Recent research has revealed that LALMs remain vulnerable to harmful queries due to insufficient safety-alignment.
- Score: 50.89022445197919
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Audio Language Models (LALMs) have extended the capabilities of Large Language Models (LLMs) by enabling audio-based human interactions. However, recent research has revealed that LALMs remain vulnerable to harmful queries due to insufficient safety-alignment. Despite advances in defence measures for text and vision LLMs, effective safety-alignment strategies and audio-safety datasets specifically targeting LALMs are notably absent. Meanwhile, defence measures based on Supervised Fine-tuning (SFT) struggle to improve safety while avoiding over-rejection, significantly compromising helpfulness. In this work, we propose an unsupervised safety fine-tuning strategy as a remedy that reshapes the model's representation space to enhance existing LALMs' safety-alignment while balancing the risk of over-rejection. Our experiments, conducted across three generations of Qwen LALMs, demonstrate that our approach significantly improves LALM safety under three modality input conditions (audio-text, text-only, and audio-only) while increasing the over-rejection rate by only 0.88% on average. Warning: this paper contains harmful examples.
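As a rough illustration of what "reshaping the representation space" can mean in practice, the sketch below implements a hypothetical contrastive-style objective on extracted hidden states: harmful-query representations are pulled toward a refusal direction, while benign-query representations are anchored to their pre-fine-tuning positions to limit over-rejection. The function names, loss form, and hyperparameters are assumptions made for illustration, not the authors' actual objective.

```python
# Minimal sketch only: the paper's exact unsupervised objective is not reproduced here.
# We assume hidden states have already been extracted from the LALM (e.g., the
# last-layer representation of the final token of each audio/text query).
import torch
import torch.nn.functional as F


def reshaping_loss(h_harmful, h_benign, h_benign_orig, refusal_dir,
                   margin=1.0, beta=0.1):
    """Toy representation-reshaping objective (hypothetical).

    h_harmful:     (B, d) hidden states of harmful queries under the tuned model
    h_benign:      (B, d) hidden states of benign queries under the tuned model
    h_benign_orig: (B, d) hidden states of the same benign queries under the
                   original model, used as anchors against over-rejection
    refusal_dir:   (d,)  a unit vector assumed to point toward the refusal region
    """
    refusal_dir = F.normalize(refusal_dir, dim=-1)

    # Push harmful-query representations toward the refusal direction (cosine margin).
    harm_proj = (F.normalize(h_harmful, dim=-1) * refusal_dir).sum(-1)
    safety_term = F.relu(margin - harm_proj).mean()

    # Keep benign-query representations close to where they started; this is the
    # part intended to balance safety against over-rejection (helpfulness).
    anchor_term = F.mse_loss(h_benign, h_benign_orig)

    return safety_term + beta * anchor_term


if __name__ == "__main__":
    d = 64  # toy hidden size; random tensors stand in for real LALM activations
    loss = reshaping_loss(
        torch.randn(8, d), torch.randn(8, d), torch.randn(8, d), torch.randn(d)
    )
    print(f"toy loss: {loss.item():.4f}")
```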
Related papers
- LoX: Low-Rank Extrapolation Robustifies LLM Safety Against Fine-tuning [61.594212398272184]
Low-Rank Extrapolation (LoX) improves robustness against benign and malicious fine-tuning attacks.
LoX leads to 11% to 54% absolute reductions in attack success rates.
arXiv Detail & Related papers (2025-06-18T16:30:02Z)
- Almost Surely Safe Alignment of Large Language Models at Inference-Time [20.5164976103514]
Even highly capable large language models (LLMs) can produce biased or unsafe responses.
This paper introduces a novel inference-time alignment approach.
We achieve this by framing the safe generation of inference-time responses as a constrained Markov decision process.
arXiv Detail & Related papers (2025-02-03T09:59:32Z)
- Safeguard Fine-Tuned LLMs Through Pre- and Post-Tuning Model Merging [43.44112117935541]
Fine-tuning large language models (LLMs) for downstream tasks often leads to safety degradation in safety-aligned LLMs.
We propose a method that maintains the inherent safety of LLMs while enhancing their downstream task performance.
arXiv Detail & Related papers (2024-12-27T08:03:22Z)
- Jailbreak Antidote: Runtime Safety-Utility Balance via Sparse Representation Adjustment in Large Language Models [8.024771725860127]
Jailbreak attacks manipulate large language models into generating harmful content.
Jailbreak Antidote enables real-time adjustment of safety preferences by manipulating a sparse subset of the model's internal states.
Our analysis reveals that safety-related information in LLMs is sparsely distributed.
arXiv Detail & Related papers (2024-10-03T08:34:17Z)
- CoCA: Regaining Safety-awareness of Multimodal Large Language Models with Constitutional Calibration [90.36429361299807]
Multimodal large language models (MLLMs) have demonstrated remarkable success in engaging in conversations involving visual inputs.
The integration of visual modality has introduced a unique vulnerability: the MLLM becomes susceptible to malicious visual inputs.
We introduce a technique termed CoCA, which amplifies the safety-awareness of the MLLM by calibrating its output distribution.
arXiv Detail & Related papers (2024-09-17T17:14:41Z)
- SCANS: Mitigating the Exaggerated Safety for LLMs via Safety-Conscious Activation Steering [56.92068213969036]
Safety alignment is indispensable for Large Language Models (LLMs) to defend against threats from malicious instructions.
Recent research reveals that safety-aligned LLMs are prone to rejecting benign queries due to the exaggerated safety issue.
We propose a Safety-Conscious Activation Steering (SCANS) method to mitigate the exaggerated safety concerns.
arXiv Detail & Related papers (2024-08-21T10:01:34Z)
- Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training [67.30423823744506]
We introduce a novel approach, Decoupled Refusal Training (DeRTa), designed to empower LLMs to refuse compliance with harmful prompts at any response position.
DeRTa incorporates two novel components: (1) Maximum Likelihood Estimation with Harmful Response Prefix, which trains models to recognize and avoid unsafe content by appending a segment of a harmful response to the beginning of a safe response, and (2) Reinforced Transition Optimization (RTO), which equips models with the ability to transition from potential harm to safety refusal consistently throughout the harmful response sequence.
arXiv Detail & Related papers (2024-07-12T09:36:33Z)
- Towards Comprehensive Post Safety Alignment of Large Language Models via Safety Patching [74.62818936088065]
SafePatching is a novel framework for comprehensive post safety alignment (PSA).
SafePatching achieves a more comprehensive PSA than baseline methods.
SafePatching demonstrates its superiority in continual PSA scenarios.
arXiv Detail & Related papers (2024-05-22T16:51:07Z)
- Uncovering Safety Risks of Large Language Models through Concept Activation Vector [13.804245297233454]
We introduce a Safety Concept Activation Vector (SCAV) framework to guide attacks on large language models (LLMs).
We then develop an SCAV-guided attack method that can generate both attack prompts and embedding-level attacks.
Our attack method significantly improves the attack success rate and response quality while requiring less training data.
arXiv Detail & Related papers (2024-04-18T09:46:25Z)
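Several of the papers above (SCAV, SCANS, Jailbreak Antidote) operate on a linear "safety direction" in activation space. The sketch below shows only the basic idea, under the assumption that the direction is estimated as a difference of class means and applied as an additive shift at inference time; the estimation procedure and application point differ across the actual methods.

```python
# Illustrative sketch of the activation-direction idea; not a reproduction of
# any specific paper's method. Random tensors stand in for real activations.
import torch


def safety_direction(h_harmful, h_benign):
    """Difference-of-means estimate of a 'safety concept' direction.

    h_harmful, h_benign: (N, d) hidden states collected at some layer for
    harmful and benign prompts, respectively.
    """
    direction = h_harmful.mean(0) - h_benign.mean(0)
    return direction / direction.norm()


def steer(hidden, direction, alpha=4.0):
    """Shift hidden states away from the harmful region at inference time.

    A larger alpha pushes harder toward refusal; tuning this trade-off is what
    the papers above use to avoid exaggerated safety / over-rejection.
    """
    return hidden - alpha * direction


if __name__ == "__main__":
    d = 64
    v = safety_direction(torch.randn(32, d) + 1.0, torch.randn(32, d))
    h = torch.randn(4, d)
    print(steer(h, v).shape)  # torch.Size([4, 64])
```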