Related papers: StylOch at PAN: Gradient-Boosted Trees with Frequency-Based Stylometric Features

StylOch at PAN: Gradient-Boosted Trees with Frequency-Based Stylometric Features

URL: http://arxiv.org/abs/2507.12064v1
Date: Wed, 16 Jul 2025 09:21:20 GMT
Title: StylOch at PAN: Gradient-Boosted Trees with Frequency-Based Stylometric Features
Authors: Jeremi K. Ochab, Mateusz Matias, Tymoteusz Boba, Tomasz Walkowiak,
Abstract summary: This submission to the binary AI detection task is based on a modular stylometric pipeline.<n>We collect a large corpus of more than 500 000 machine-generated texts for the classifier's training.<n>Our approach follows the non-neural, computationally inexpensive but explainable approach found effective previously.
Score: 0.1499944454332829
License: http://creativecommons.org/licenses/by/4.0/
Abstract: This submission to the binary AI detection task is based on a modular stylometric pipeline, where: public spaCy models are used for text preprocessing (including tokenisation, named entity recognition, dependency parsing, part-of-speech tagging, and morphology annotation) and extracting several thousand features (frequencies of n-grams of the above linguistic annotations); light-gradient boosting machines are used as the classifier. We collect a large corpus of more than 500 000 machine-generated texts for the classifier's training. We explore several parameter options to increase the classifier's capacity and take advantage of that training set. Our approach follows the non-neural, computationally inexpensive but explainable approach found effective previously.

Related papers

Training-Free Tokenizer Transplantation via Orthogonal Matching Pursuit [45.18582668677648]
We present a training-free method to transplant tokenizers in large language models.<n>We approximate each out-of-vocabulary token as a sparse linear combination of shared tokens.<n>We show that OMP achieves best zero-shot preservation of the base model's performance.
arXiv Detail & Related papers (2025-06-07T00:51:27Z)
Beyond General Prompts: Automated Prompt Refinement using Contrastive Class Alignment Scores for Disambiguating Objects in Vision-Language Models [0.0]
We introduce a method for automated prompt refinement using a novel metric called the Contrastive Class Alignment Score (CCAS)<n>Our method generates diverse prompt candidates via a large language model and filters them through CCAS, computed using prompt embeddings from a sentence transformer.<n>We evaluate our approach on challenging object categories, demonstrating that our automatic selection of high-precision prompts improves object detection accuracy without the need for model training or labeled data.
arXiv Detail & Related papers (2025-05-14T04:43:36Z)
Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification [5.583948835737293]
We propose a novel evolutionary verbalizer search (EVS) algorithm to improve prompt-based tuning with the high-performance verbalizer. In this paper, we focus on automatically constructing the optimal verbalizer and propose a novelEVS algorithm to improve prompt-based tuning with the high-performance verbalizer.
arXiv Detail & Related papers (2023-06-18T10:03:11Z)
Multi-Modal Classifiers for Open-Vocabulary Object Detection [104.77331131447541]
The goal of this paper is open-vocabulary object detection (OVOD) We adopt a standard two-stage object detector architecture. We explore three ways via: language descriptions, image exemplars, or a combination of the two.
arXiv Detail & Related papers (2023-06-08T18:31:56Z)
Scalable Learning of Latent Language Structure With Logical Offline Cycle Consistency [71.42261918225773]
Conceptually, LOCCO can be viewed as a form of self-learning where the semantic being trained is used to generate annotations for unlabeled text. As an added bonus, the annotations produced by LOCCO can be trivially repurposed to train a neural text generation model.
arXiv Detail & Related papers (2023-05-31T16:47:20Z)
A pipeline and comparative study of 12 machine learning models for text classification [0.0]
Text-based communication is highly favoured as a communication method, especially in business environments. Many machine learning methods for text classification have been proposed and incorporated into the services of most email providers. However, optimising text classification algorithms and finding the right tradeoff on their aggressiveness is still a major research problem.
arXiv Detail & Related papers (2022-04-04T23:51:22Z)
Preliminary study on using vector quantization latent spaces for TTS/VC systems with consistent performance [55.10864476206503]
We investigate the use of quantized vectors to model the latent linguistic embedding. By enforcing different policies over the latent spaces in the training, we are able to obtain a latent linguistic embedding. Our experiments show that the voice cloning system built with vector quantization has only a small degradation in terms of perceptive evaluations.
arXiv Detail & Related papers (2021-06-25T07:51:35Z)
Few-shot Sequence Learning with Transformers [79.87875859408955]
Few-shot algorithms aim at learning new tasks provided only a handful of training examples. In this work we investigate few-shot learning in the setting where the data points are sequences of tokens. We propose an efficient learning algorithm based on Transformers.
arXiv Detail & Related papers (2020-12-17T12:30:38Z)
Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence Modeling [61.351967629600594]
This paper proposes an any-to-many location-relative, sequence-to-sequence (seq2seq), non-parallel voice conversion approach. In this approach, we combine a bottle-neck feature extractor (BNE) with a seq2seq synthesis module. Objective and subjective evaluations show that the proposed any-to-many approach has superior voice conversion performance in terms of both naturalness and speaker similarity.
arXiv Detail & Related papers (2020-09-06T13:01:06Z)
Anchor & Transform: Learning Sparse Embeddings for Large Vocabularies [60.285091454321055]
We design a simple and efficient embedding algorithm that learns a small set of anchor embeddings and a sparse transformation matrix. On text classification, language modeling, and movie recommendation benchmarks, we show that ANT is particularly suitable for large vocabulary sizes.
arXiv Detail & Related papers (2020-03-18T13:07:51Z)

This list is automatically generated from the titles and abstracts of the papers in this site.