Counterfactual Multi-Token Fairness in Text Classification
- URL: http://arxiv.org/abs/2202.03792v2
- Date: Wed, 9 Feb 2022 04:29:13 GMT
- Title: Counterfactual Multi-Token Fairness in Text Classification
- Authors: Pranay Lohia
- Abstract summary: The concept of Counterfactual Generation has been extended to multi-token support, valid over all forms of texts and documents.
We define the method of generating counterfactuals by perturbing multiple sensitive tokens as Counterfactual Multi-token Generation.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Counterfactual token generation has so far been limited to perturbing only a
single token in texts that are generally short, single sentences. These
tokens are often associated with one of many sensitive attributes. With only
a limited set of counterfactuals generated, the goal of making machine
learning classification models invariant to any sensitive attribute remains
bounded, and the formulation of Counterfactual Fairness is narrowed. In this
paper, we overcome these limitations by addressing the root problems and
opening larger domains for understanding. We have curated a resource of
sensitive tokens and their corresponding perturbation tokens, extending
support beyond the traditionally used sensitive attributes of Age, Gender,
and Race to Nationality, Disability, and Religion. The concept of
Counterfactual Generation has been extended to multi-token support, valid
over all forms of texts and documents. We define the method of generating
counterfactuals by perturbing multiple sensitive tokens as Counterfactual
Multi-token Generation. The method shows significant performance improvement
over single-token methods and is validated on multiple benchmark datasets.
This improvement in counterfactual generation propagates to improved
Counterfactual Multi-token Fairness.
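The abstract's core idea can be illustrated with a minimal sketch: substitute every sensitive token in a text simultaneously, producing one counterfactual per combination of substitutions. This is not the paper's actual implementation; the small lexicon below is a hypothetical sample, whereas the paper curates a much larger resource covering Age, Gender, Race, Nationality, Disability, and Religion.

```python
# Illustrative sketch of Counterfactual Multi-token Generation (not the
# paper's implementation). The lexicon is a tiny hypothetical sample.
from itertools import product

SENSITIVE_LEXICON = {
    "he": ["she", "they"],              # Gender
    "christian": ["muslim", "jewish"],  # Religion
    "american": ["indian", "french"],   # Nationality
}

def multi_token_counterfactuals(text):
    """Perturb every sensitive token found in the text at once,
    yielding one counterfactual per combination of substitutions."""
    tokens = text.lower().split()
    # Positions of tokens that appear in the sensitive lexicon.
    slots = [i for i, t in enumerate(tokens) if t in SENSITIVE_LEXICON]
    if not slots:
        return []
    choices = [SENSITIVE_LEXICON[tokens[i]] for i in slots]
    counterfactuals = []
    for combo in product(*choices):
        perturbed = list(tokens)
        for i, replacement in zip(slots, combo):
            perturbed[i] = replacement
        counterfactuals.append(" ".join(perturbed))
    return counterfactuals

cfs = multi_token_counterfactuals("he is an american christian doctor")
# Three sensitive tokens, two replacements each: 2 * 2 * 2 = 8 counterfactuals,
# each perturbing all three tokens simultaneously.
```

A fairness probe would then check that a classifier's prediction is invariant across the original text and all generated counterfactuals.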
Related papers
- LiteToken: Removing Intermediate Merge Residues From BPE Tokenizers [76.59130257385826]
Intermediate merge residues in BPE vocabularies arise frequently during merge learning and are retained in the final vocabulary, but they are mostly merged further and rarely emitted when tokenizing a corpus during tokenizer usage.
We present a systematic empirical characterization of this phenomenon across commonly used tokenizers and introduce LiteToken, a simple method for removing residue tokens.
Experiments show that LiteToken reduces token fragmentation, reduces parameters, and improves robustness to noisy or misspelled inputs, while preserving overall performance.
arXiv Detail & Related papers (2026-02-04T16:19:05Z) - Are you going to finish that? A Practical Study of the Partial Token Problem [85.49816027251013]
Language models (LMs) are trained over sequences of tokens, whereas users interact with LMs via text.
This mismatch gives rise to the partial token problem, which occurs when a user ends their prompt in the middle of the expected next token.
In this work, we identify three domains where token and "word" boundaries often do not line up.
arXiv Detail & Related papers (2026-01-30T17:47:16Z) - Training Language Models with homotokens Leads to Delayed Overfitting [2.531076482407163]
Subword tokenization introduces a computational layer in language models where many distinct token sequences decode to the same surface form and preserve meaning.
We formalize homotokens as a strictly meaning-preserving form of data augmentation.
In data-constrained pretraining, homotoken augmentation consistently delays overfitting under repeated data exposure.
In multilingual fine-tuning, we find that the effectiveness of homotokens depends on tokenizer quality.
arXiv Detail & Related papers (2026-01-06T09:57:00Z) - Ming-UniVision: Joint Image Understanding and Generation with a Unified Continuous Tokenizer [50.69959748410398]
We introduce MingTok, a new family of visual tokenizers with a continuous latent space for unified autoregressive generation and understanding.
MingTok adopts a three-stage sequential architecture involving low-level encoding, semantic expansion, and visual reconstruction.
Built on top of it, Ming-UniVision eliminates the need for task-specific visual representations and unifies diverse vision-language tasks under a single autoregressive prediction paradigm.
arXiv Detail & Related papers (2025-10-08T02:50:14Z) - TrimTokenator: Towards Adaptive Visual Token Pruning for Large Multimodal Models [4.779482139419908]
We introduce a mutual information-based token pruning strategy that removes visual tokens that are semantically redundant with respect to the textual tokens.
Our method maintains strong performance while reducing visual tokens by 88.9% on models such as LLaVA-1.5-7B and LLaVA--7B.
arXiv Detail & Related papers (2025-08-30T02:43:50Z) - SemToken: Semantic-Aware Tokenization for Efficient Long-Context Language Modeling [6.185573921868495]
SemToken is a semantic-aware tokenization framework that reduces token redundancy and improves efficiency.
It can be seamlessly integrated with modern language models and attention acceleration methods.
Our findings suggest that semantic structure offers a promising new axis for optimizing tokenization and computation in large language models.
arXiv Detail & Related papers (2025-08-21T03:01:53Z) - Improving Large Language Models with Concept-Aware Fine-Tuning [55.59287380665864]
Concept-Aware Fine-Tuning (CAFT) is a novel multi-token training method for large language models (LLMs).
CAFT enables the learning of sequences that span multiple tokens, fostering stronger concept-aware learning.
Experiments demonstrate significant improvements compared to conventional next-token fine-tuning methods.
arXiv Detail & Related papers (2025-06-09T14:55:00Z) - Think Before You Accept: Semantic Reflective Verification for Faster Speculative Decoding [48.52389201779425]
Speculative decoding accelerates inference by generating multiple draft tokens using a lightweight model and verifying them in parallel.
Existing verification methods rely heavily on distributional consistency while overlooking semantic correctness.
We propose Reflective Verification, a training-free and semantics-aware approach that achieves a better trade-off between correctness and efficiency.
arXiv Detail & Related papers (2025-05-24T10:26:27Z) - Achieving Tokenizer Flexibility in Language Models through Heuristic Adaptation and Supertoken Learning [0.0]
Pretrained large language models (LLMs) are often constrained by their fixed tokenization schemes.
Our framework introduces two innovations: first, Tokenadapt, a model-agnostic tokenizer transplantation method, and second, novel pre-tokenization for multi-word Supertokens.
arXiv Detail & Related papers (2025-05-14T19:00:27Z) - Disentangling Reasoning Tokens and Boilerplate Tokens For Language Model Fine-tuning [46.43130011147807]
We argue that tokens serving different roles - specifically, reasoning tokens versus boilerplate tokens - differ significantly in importance and learning complexity.
We propose a novel Shuffle-Aware Discriminator (SHAD) for adaptive token discrimination.
Using SHAD, we propose the Reasoning-highlighted Fine-Tuning (RFT) method, which adaptively emphasizes reasoning tokens during fine-tuning.
arXiv Detail & Related papers (2024-12-19T12:06:24Z) - ElasticTok: Adaptive Tokenization for Image and Video [109.75935878130582]
We introduce ElasticTok, a method that conditions on prior frames to adaptively encode a frame into a variable number of tokens.
During inference, ElasticTok can dynamically allocate tokens when needed.
Our evaluations on images and video demonstrate the effectiveness of our approach in efficient token usage.
arXiv Detail & Related papers (2024-10-10T20:54:15Z) - STORE: Streamlining Semantic Tokenization and Generative Recommendation with A Single LLM [59.08493154172207]
We propose a unified framework to streamline the semantic tokenization and generative recommendation process.
We formulate semantic tokenization as a text-to-token task and generative recommendation as a token-to-token task, supplemented by a token-to-text reconstruction task and a text-to-token auxiliary task.
All these tasks are framed in a generative manner and trained using a single large language model (LLM) backbone.
arXiv Detail & Related papers (2024-09-11T13:49:48Z) - SEP: Self-Enhanced Prompt Tuning for Visual-Language Model [93.94454894142413]
We introduce a novel approach named Self-Enhanced Prompt Tuning (SEP).
SEP explicitly incorporates discriminative prior knowledge to enhance both textual-level and visual-level embeddings.
Comprehensive evaluations across various benchmarks and tasks confirm SEP's efficacy in prompt tuning.
arXiv Detail & Related papers (2024-05-24T13:35:56Z) - Token Alignment via Character Matching for Subword Completion [34.76794239097628]
This paper examines a technique to alleviate the tokenization artifact on text completion in generative models.
The method, termed token alignment, involves backtracking to the last complete tokens and ensuring the model's generation aligns with the prompt.
arXiv Detail & Related papers (2024-03-13T16:44:39Z) - Token Fusion: Bridging the Gap between Token Pruning and Token Merging [71.84591084401458]
Vision Transformers (ViTs) have emerged as powerful backbones in computer vision, outperforming many traditional CNNs.
However, their computational overhead, largely attributed to the self-attention mechanism, makes deployment on resource-constrained edge devices challenging.
We introduce "Token Fusion" (ToFu), a method that amalgamates the benefits of both token pruning and token merging.
arXiv Detail & Related papers (2023-12-02T04:29:19Z) - mCL-NER: Cross-Lingual Named Entity Recognition via Multi-view
Contrastive Learning [54.523172171533645]
Cross-lingual named entity recognition (CrossNER) faces challenges stemming from uneven performance due to the scarcity of multilingual corpora.
We propose Multi-view Contrastive Learning for Cross-lingual Named Entity Recognition (mCL-NER).
Our experiments on the XTREME benchmark, spanning 40 languages, demonstrate the superiority of mCL-NER over prior data-driven and model-based approaches.
arXiv Detail & Related papers (2023-08-17T16:02:29Z) - Beyond Attentive Tokens: Incorporating Token Importance and Diversity
for Efficient Vision Transformers [32.972945618608726]
Vision transformers have achieved significant improvements on various vision tasks but their quadratic interactions between tokens significantly reduce computational efficiency.
We propose an efficient token decoupling and merging method that can jointly consider the token importance and diversity for token pruning.
Our method can even improve the accuracy of DeiT-T by 0.1% after reducing its FLOPs by 40%.
arXiv Detail & Related papers (2022-11-21T09:57:11Z) - Practical Approaches for Fair Learning with Multitype and Multivariate
Sensitive Attributes [70.6326967720747]
It is important to guarantee that machine learning algorithms deployed in the real world do not result in unfairness or unintended social consequences.
We introduce FairCOCCO, a fairness measure built on cross-covariance operators on reproducing kernel Hilbert spaces.
We empirically demonstrate consistent improvements against state-of-the-art techniques in balancing predictive power and fairness on real-world datasets.
arXiv Detail & Related papers (2022-11-11T11:28:46Z) - Flexible text generation for counterfactual fairness probing [8.262741696221143]
A common approach for testing fairness issues in text-based classifiers is through the use of counterfactuals.
Existing counterfactual generation methods rely on wordlists or templates, producing simple counterfactuals that don't take into account grammar, context, or subtle sensitive attribute references.
In this paper, we introduce a task for generating counterfactuals that overcomes these shortcomings, and demonstrate how large language models (LLMs) can be leveraged to make progress on this task.
arXiv Detail & Related papers (2022-06-28T05:07:20Z) - Token Manipulation Generative Adversarial Network for Text Generation [0.0]
We decompose conditional text generation problem into two tasks, make-a-blank and fill-in-the-blank, and extend the former to handle more complex manipulations on the given tokens.
We show that the proposed model not only addresses the limitations but also provides good results without compromising the performance in terms of quality and diversity.
arXiv Detail & Related papers (2020-05-06T13:10:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.