Exploiting Language Model for Efficient Linguistic Steganalysis: An
Empirical Study
- URL: http://arxiv.org/abs/2107.12168v1
- Date: Mon, 26 Jul 2021 12:37:18 GMT
- Title: Exploiting Language Model for Efficient Linguistic Steganalysis: An
Empirical Study
- Authors: Biao Yi, Hanzhou Wu, Guorui Feng and Xinpeng Zhang
- Abstract summary: We present two methods for efficient linguistic steganalysis.
One is to pre-train a language model based on RNN, and the other is to pre-train a sequence autoencoder.
- Score: 23.311007481830647
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in linguistic steganalysis have successively applied CNNs,
RNNs, GNNs and other deep learning models for detecting secret information in
generative texts. These methods tend to seek stronger feature extractors to
achieve higher steganalysis effects. However, we have found through experiments
that there actually exists significant difference between automatically
generated steganographic texts and carrier texts in terms of the conditional
probability distribution of individual words. Such kind of statistical
difference can be naturally captured by the language model used for generating
steganographic texts, which drives us to give the classifier a priori knowledge
of the language model to enhance the steganalysis ability. To this end, we
present two methods for efficient linguistic steganalysis in this paper. One is
to pre-train a language model based on RNN, and the other is to pre-train a
sequence autoencoder. Experimental results show that both methods improve
detection performance to different degrees over a randomly initialized RNN
classifier and converge significantly faster. Moreover, our methods achieve the
best detection results.
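The statistical signal described in the abstract (generated steganographic text and carrier text differ in the conditional probability each word receives under a language model) can be illustrated with a toy model. The sketch below is only ours, not the authors' RNN implementation: it fits an add-alpha-smoothed bigram LM on carrier sentences and uses the average per-token log-probability as a steganalysis feature; all names and data are hypothetical.

```python
import math
from collections import Counter

def train_bigram_lm(sentences):
    """Count unigram and bigram frequencies over tokenized carrier sentences."""
    unigrams, bigrams = Counter(), Counter()
    for toks in sentences:
        toks = ["<s>"] + toks
        unigrams.update(toks)
        bigrams.update(zip(toks, toks[1:]))
    return unigrams, bigrams

def avg_logprob(toks, unigrams, bigrams, vocab_size, alpha=1.0):
    """Mean add-alpha-smoothed conditional log-probability per token."""
    toks = ["<s>"] + toks
    total = 0.0
    for prev, cur in zip(toks, toks[1:]):
        p = (bigrams[(prev, cur)] + alpha) / (unigrams[prev] + alpha * vocab_size)
        total += math.log(p)
    return total / (len(toks) - 1)

carrier = [s.split() for s in ["the cat sat on the mat",
                               "the dog sat on the rug",
                               "the cat lay on the rug"]]
uni, bi = train_bigram_lm(carrier)
V = len(uni)

natural = "the cat sat on the rug".split()
odd     = "rug the on sat mat cat".split()   # stand-in for a stego-like sequence
# The natural word order scores higher under the carrier-trained LM.
print(avg_logprob(natural, uni, bi, V) > avg_logprob(odd, uni, bi, V))  # True
```

A classifier given this per-token conditional-probability view (in the paper, via LM pre-training) starts with exactly the feature that separates the two text classes.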
Related papers
- Towards Next-Generation Steganalysis: LLMs Unleash the Power of Detecting Steganography [18.7168443402118]
Linguistic steganography provides convenient implementation to hide messages, particularly with the emergence of AI generation technology.
Existing methods are limited to finding distribution differences between steganographic texts and normal texts at the level of symbolic statistics.
This paper proposes to employ the human-like text processing abilities of large language models (LLMs) to detect such differences from the perspective of human perception.
arXiv Detail & Related papers (2024-05-15T04:52:09Z)
- Transparency at the Source: Evaluating and Interpreting Language Models With Access to the True Distribution [4.01799362940916]
We present a setup for training, evaluating and interpreting neural language models, that uses artificial, language-like data.
The data is generated using a massive probabilistic grammar, that is itself derived from a large natural language corpus.
With access to the underlying true source, our results show striking differences and outcomes in learning dynamics between different classes of words.
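The summary above describes training on artificial, language-like data sampled from a probabilistic grammar derived from a natural corpus. The paper's grammar is massive; the hand-written PCFG and sampler below are a minimal sketch of the generation mechanism only, with every rule, probability, and name ours.

```python
import random

# Toy PCFG: each nonterminal maps to (probability, expansion) pairs.
# Probabilities for each nonterminal sum to 1.
GRAMMAR = {
    "S":   [(1.0, ["NP", "VP"])],
    "NP":  [(0.7, ["Det", "N"]), (0.3, ["N"])],
    "VP":  [(0.6, ["V", "NP"]), (0.4, ["V"])],
    "Det": [(1.0, ["the"])],
    "N":   [(0.5, ["cat"]), (0.5, ["dog"])],
    "V":   [(0.5, ["sees"]), (0.5, ["chases"])],
}

def sample(symbol="S", rng=random):
    """Recursively expand `symbol`, returning a list of terminal tokens."""
    if symbol not in GRAMMAR:          # terminal symbol: emit it
        return [symbol]
    r, acc = rng.random(), 0.0
    for prob, expansion in GRAMMAR[symbol]:
        acc += prob
        if r <= acc:
            return [t for part in expansion for t in sample(part, rng)]
    # Guard against floating-point rounding: fall back to the last rule.
    return [t for part in GRAMMAR[symbol][-1][1] for t in sample(part, rng)]

random.seed(0)
print(" ".join(sample()))
```

Because the true rule probabilities are known, any quantity the model should learn (e.g. a word's conditional distribution) can be computed exactly from the grammar and compared against the trained LM.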
arXiv Detail & Related papers (2023-10-23T12:03:01Z)
- SLCNN: Sentence-Level Convolutional Neural Network for Text Classification [0.0]
Convolutional neural network (CNN) has shown remarkable success in the task of text classification.
New CNN-based baseline models for text classification have been studied.
Results show that the proposed models perform better, particularly on longer documents.
arXiv Detail & Related papers (2023-01-27T13:16:02Z)
- Multi-Scales Data Augmentation Approach In Natural Language Inference For Artifacts Mitigation And Pre-Trained Model Optimization [0.0]
We provide a variety of techniques for analyzing and locating dataset artifacts inside the crowdsourced Stanford Natural Language Inference corpus.
To mitigate dataset artifacts, we employ a unique multi-scale data augmentation technique with two distinct frameworks.
Our combination method enhances our model's resistance to perturbation testing, enabling it to continuously outperform the pre-trained baseline.
arXiv Detail & Related papers (2022-12-16T23:37:44Z)
- Improving Pre-trained Language Model Fine-tuning with Noise Stability Regularization [94.4409074435894]
We propose a novel and effective fine-tuning framework, named Layerwise Noise Stability Regularization (LNSR)
Specifically, we propose to inject the standard Gaussian noise and regularize hidden representations of the fine-tuned model.
We demonstrate the advantages of the proposed method over other state-of-the-art algorithms including L2-SP, Mixout and SMART.
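The noise-injection step in the summary can be sketched in a few lines. The toy network below is not the fine-tuned transformer the paper regularizes; the shapes, names, and mean-squared penalty form are our simplification of the idea that hidden representations perturbed by scaled standard Gaussian noise should leave the output nearly unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy two-layer network standing in for one layer of a fine-tuned model.
W1 = rng.normal(size=(4, 8))
W2 = rng.normal(size=(8, 3))

def forward(x, hidden_noise=None):
    h = np.tanh(x @ W1)
    if hidden_noise is not None:        # inject noise at the hidden layer
        h = h + hidden_noise
    return h @ W2

def noise_stability_penalty(x, sigma=0.1, n_samples=8):
    """Mean squared change in the output when standard Gaussian noise
    (scaled by sigma) perturbs the hidden representation."""
    clean = forward(x)
    total = 0.0
    for _ in range(n_samples):
        noise = sigma * rng.normal(size=8)
        total += np.mean((forward(x, noise) - clean) ** 2)
    return total / n_samples

x = rng.normal(size=4)
penalty = noise_stability_penalty(x)
print(penalty >= 0.0)  # True: a non-negative scalar added to the training loss
```

During fine-tuning, this penalty would be weighted and added to the task loss, discouraging representations whose outputs are brittle under small hidden-state perturbations.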
arXiv Detail & Related papers (2022-06-12T04:42:49Z)
- Better Language Model with Hypernym Class Prediction [101.8517004687825]
Class-based language models (LMs) have long been devised to address context sparsity in $n$-gram LMs.
In this study, we revisit this approach in the context of neural LMs.
arXiv Detail & Related papers (2022-03-21T01:16:44Z)
- A Latent-Variable Model for Intrinsic Probing [93.62808331764072]
We propose a novel latent-variable formulation for constructing intrinsic probes.
We find empirical evidence that pre-trained representations develop a cross-lingually entangled notion of morphosyntax.
arXiv Detail & Related papers (2022-01-20T15:01:12Z)
- Factorized Neural Transducer for Efficient Language Model Adaptation [51.81097243306204]
We propose a novel model, factorized neural Transducer, by factorizing the blank and vocabulary prediction.
It is expected that this factorization can transfer the improvement of the standalone language model to the Transducer for speech recognition.
We demonstrate that the proposed factorized neural Transducer yields 15% to 20% WER improvements when out-of-domain text data is used for language model adaptation.
arXiv Detail & Related papers (2021-09-27T15:04:00Z)
- Unsupervised Domain Adaptation of a Pretrained Cross-Lingual Language Model [58.27176041092891]
Recent research indicates that pretraining cross-lingual language models on large-scale unlabeled texts yields significant performance improvements.
We propose a novel unsupervised feature decomposition method that can automatically extract domain-specific features from the entangled pretrained cross-lingual representations.
Our proposed model leverages mutual information estimation to decompose the representations computed by a cross-lingual model into domain-invariant and domain-specific parts.
arXiv Detail & Related papers (2020-11-23T16:00:42Z)
- Exploring Software Naturalness through Neural Language Models [56.1315223210742]
The Software Naturalness hypothesis argues that programming languages can be understood through the same techniques used in natural language processing.
We explore this hypothesis through the use of a pre-trained transformer-based language model to perform code analysis tasks.
arXiv Detail & Related papers (2020-06-22T21:56:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.