Language Modeling and Understanding Through Paraphrase Generation and Detection
- URL: http://arxiv.org/abs/2602.08274v2
- Date: Sun, 15 Feb 2026 10:09:40 GMT
- Title: Language Modeling and Understanding Through Paraphrase Generation and Detection
- Authors: Jan Philip Wahle
- Abstract summary: We can express the same thoughts in virtually infinite ways using different words and structures. Modeling paraphrases is a keystone to meaning in computational language models. I propose that decomposing paraphrases into their constituent linguistic aspects offers a more cognitively grounded view of semantic equivalence.
- Score: 4.080540555071174
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Language enables humans to share knowledge, reason about the world, and pass on strategies for survival and innovation across generations. At the heart of this process is not just the ability to communicate but also the remarkable flexibility in how we can express ourselves. We can express the same thoughts in virtually infinite ways using different words and structures; this ability to rephrase and reformulate expressions is known as paraphrasing. Modeling paraphrases is a keystone to meaning in computational language models: being able to construct different variations of a text that either preserve or change its meaning demonstrates strong semantic understanding. If computational language models are to represent meaning, they must understand and control, at a fine granularity, the different aspects that construct the same meaning as opposed to different meanings. Yet most existing approaches reduce paraphrasing to a binary decision between two texts or to producing a single rewrite of a source, obscuring which linguistic factors are responsible for meaning preservation. In this thesis, I propose that decomposing paraphrases into their constituent linguistic aspects (paraphrase types) offers a more fine-grained and cognitively grounded view of semantic equivalence. I show that even advanced machine learning models struggle with this task. Yet, when explicitly trained on paraphrase types, models achieve stronger performance on related paraphrase tasks and downstream applications. For example, in plagiarism detection, language models trained on paraphrase types surpass human baselines: 89.6% accuracy compared to 78.4% for plagiarism cases from Wikipedia, and 66.5% compared to 55.7% for plagiarism of scientific papers from arXiv. In identifying duplicate questions on Quora, models trained with paraphrase types improve over models trained on binary pairs. Furthermore, I demonstrate that...
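To make the contrast with binary paraphrase detection concrete, the following is a minimal sketch of how paraphrase type detection could be framed as multi-label classification over a sentence pair. The label set, the base model (roberta-base), and the decision threshold are illustrative assumptions rather than the thesis's actual setup, and the classification head below is untrained.

```python
# Minimal sketch: paraphrase *type* detection as multi-label classification.
# The label set and base model are illustrative assumptions; the thesis's
# setup may differ, and the classification head here is randomly initialized.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Hypothetical paraphrase types (one pair can exhibit several at once).
PARAPHRASE_TYPES = [
    "lexical_substitution",   # synonym or word-level swaps
    "syntactic_change",       # e.g. active/passive alternation
    "addition_deletion",      # information added or removed
    "meaning_change",         # the candidate is no longer a paraphrase
]

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base",
    num_labels=len(PARAPHRASE_TYPES),
    problem_type="multi_label_classification",  # sigmoid per label, not softmax
)

def predict_types(source: str, candidate: str, threshold: float = 0.5) -> list[str]:
    """Return the paraphrase types predicted for a (source, candidate) pair."""
    inputs = tokenizer(source, candidate, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.sigmoid(logits).squeeze(0)
    return [t for t, p in zip(PARAPHRASE_TYPES, probs.tolist()) if p >= threshold]

print(predict_types(
    "The committee approved the proposal.",
    "The proposal was approved by the committee.",
))
```

A binary paraphrase detector would collapse this decision into a single yes/no label; the multi-label framing is what lets a model state which linguistic aspects changed between the two texts.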
Related papers
- ChatGPT-generated texts show authorship traits that identify them as non-human [0.6741942263052466]
This work examines whether a language model can also be linked to a specific fingerprint. We find that the model can successfully adapt its style depending on whether it is prompted to produce a Wikipedia entry vs. a college essay. Our results suggest that the model prefers nouns to verbs, thus showing a distinct linguistic backbone from humans.
arXiv Detail & Related papers (2025-08-22T13:38:58Z)
- A Distributional Perspective on Word Learning in Neural Language Models [57.41607944290822]
There are no widely agreed-upon metrics for word learning in language models. We argue that distributional signatures studied in prior work fail to capture key distributional information. We obtain learning trajectories for a selection of small language models we train from scratch.
arXiv Detail & Related papers (2025-02-09T13:15:59Z)
- Mitigating Paraphrase Attacks on Machine-Text Detectors via Paraphrase Inversion [4.148732457277201]
High-quality paraphrases are easy to produce using instruction-tuned language models, and such paraphrase attacks are known to significantly degrade the performance of machine-text detectors. We propose an approach which frames the problem as paraphrasing from paraphrased text back to the original text.
arXiv Detail & Related papers (2024-10-29T00:46:24Z)
- Paraphrase Types for Generation and Detection [7.800428507692341]
We name these tasks Paraphrase Type Generation and Paraphrase Type Detection (a minimal sketch of type-conditioned generation appears after this related-papers list).
Our results suggest that while current techniques perform well in a binary classification scenario, the inclusion of fine-grained paraphrase types poses a significant challenge.
We believe paraphrase types can unlock a new paradigm for developing paraphrase models and solving tasks in the future.
arXiv Detail & Related papers (2023-10-23T12:32:41Z)
- Physics of Language Models: Part 3.2, Knowledge Manipulation [51.68385617116854]
This paper investigates four fundamental knowledge manipulation tasks.
We show that language models excel in knowledge retrieval but struggle even in the simplest classification or comparison tasks.
Our findings also apply to modern pretrained language models such as GPT-4.
arXiv Detail & Related papers (2023-09-25T17:50:41Z)
- Transparency Helps Reveal When Language Models Learn Meaning [71.96920839263457]
Our systematic experiments with synthetic data reveal that, with languages where all expressions have context-independent denotations, both autoregressive and masked language models learn to emulate semantic relations between expressions.
Turning to natural language, our experiments with a specific phenomenon -- referential opacity -- add to the growing body of evidence that current language models do not well-represent natural language semantics.
arXiv Detail & Related papers (2022-10-14T02:35:19Z)
- Psychologically-informed chain-of-thought prompts for metaphor understanding in large language models [29.993190226231793]
We use chain-of-thought prompts to introduce structures from probabilistic models into large language models.
Our prompts lead language models to infer latent variables and reason about their relationships in order to choose appropriate paraphrases for metaphors.
arXiv Detail & Related papers (2022-09-16T19:23:13Z)
- Do Language Models Plagiarize? [22.02731537718498]
We investigate whether language models not only memorize but also plagiarize training samples when generating artificial texts.
Our findings support that they, especially GPT-2, reuse particular pieces of texts from the training corpus with or without obfuscation.
Our work implies that future research on neural language models should take precautions to avoid models plagiarizing their training datasets.
arXiv Detail & Related papers (2022-03-15T03:11:11Z)
- Provable Limitations of Acquiring Meaning from Ungrounded Form: What will Future Language Models Understand? [87.20342701232869]
We investigate the abilities of ungrounded systems to acquire meaning.
We study whether assertions enable a system to emulate representations preserving semantic relations like equivalence.
We find that assertions enable semantic emulation if all expressions in the language are referentially transparent.
However, if the language uses non-transparent patterns like variable binding, we show that emulation can become an uncomputable problem.
arXiv Detail & Related papers (2021-04-22T01:00:17Z)
- Read Like Humans: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Recognition [80.446770909975]
Linguistic knowledge is of great benefit to scene text recognition.
How to effectively model linguistic rules in end-to-end deep networks remains a research challenge.
We propose an autonomous, bidirectional and iterative ABINet for scene text recognition.
arXiv Detail & Related papers (2021-03-11T06:47:45Z)
- My Teacher Thinks The World Is Flat! Interpreting Automatic Essay Scoring Mechanism [71.34160809068996]
Recent work shows that automated scoring systems are prone to even common-sense adversarial samples.
We utilize recent advances in interpretability to find the extent to which features such as coherence, content and relevance are important for automated scoring mechanisms.
We also find that since the models are not semantically grounded with world-knowledge and common sense, adding false facts such as "the world is flat" actually increases the score instead of decreasing it.
arXiv Detail & Related papers (2020-12-27T06:19:20Z)
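The "Paraphrase Types for Generation and Detection" entry above names a second task, Paraphrase Type Generation. As a companion to the detection sketch earlier, here is a minimal, hypothetical sketch of framing that task as type-conditioned rewriting with an off-the-shelf instruction-tuned seq2seq model; the prompt wording and the model choice (google/flan-t5-base) are assumptions for illustration, not the paper's actual method.

```python
# Minimal sketch: paraphrase type *generation* as type-conditioned rewriting.
# Prompt wording and model (google/flan-t5-base) are illustrative assumptions.
from transformers import pipeline

generator = pipeline("text2text-generation", model="google/flan-t5-base")

def generate_typed_paraphrase(text: str, paraphrase_type: str) -> str:
    """Ask the model to rewrite `text` using only the requested paraphrase type."""
    prompt = (
        f"Paraphrase the following sentence using only a {paraphrase_type} change, "
        f"keeping the meaning identical: {text}"
    )
    return generator(prompt, max_new_tokens=64)[0]["generated_text"]

print(generate_typed_paraphrase(
    "The committee approved the proposal.",
    "syntax (active to passive voice)",
))
```

A small general-purpose model will not follow the type constraint reliably; the point of the sketch is only to show how conditioning on a named paraphrase type turns a single unconstrained rewrite into a controllable generation task.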