Is it feasible to detect FLOSS version release events from textual
messages? A case study on Stack Overflow
- URL: http://arxiv.org/abs/2003.14257v3
- Date: Sat, 19 Dec 2020 12:49:35 GMT
- Title: Is it feasible to detect FLOSS version release events from textual
messages? A case study on Stack Overflow
- Authors: A. Sokolovsky, T. Gross, J. Bacardit
- Abstract summary: The study investigates the feasibility of micro-event detection on textual data using a sample of messages from the Stack Overflow Q&A platform.
We build pipelines for detection of micro-events using three different estimators whose parameters are optimized using a grid search approach.
In our experiments we investigate whether there is a characteristic change in the topic distribution or sentiment features before or after micro-events take place.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Topic Detection and Tracking (TDT) is a very active research topic within
the area of text mining, generally applied to news feeds and Twitter datasets,
where topics and events are detected. The notion of "event" is broad, but
typically it applies to occurrences that can be detected from a single post or
a message. Little attention has been drawn to what we call "micro-events",
which, due to their nature, cannot be detected from a single piece of textual
information. The study investigates the feasibility of micro-event detection on
textual data using a sample of messages from the Stack Overflow Q&A platform
and Free/Libre Open Source Software (FLOSS) version releases from the Libraries.io
dataset. We build pipelines for detection of micro-events using three different
estimators whose parameters are optimized using a grid search approach. We
consider two feature spaces: LDA topic modeling with sentiment analysis, and
hSBM topics with sentiment analysis. The feature spaces are optimized using the
recursive feature elimination with cross validation (RFECV) strategy.
In our experiments we investigate whether there is a characteristic change in
the topic distribution or sentiment features before or after micro-events take
place and we thoroughly evaluate the capacity of each variant of our analysis
pipeline to detect micro-events. Additionally, we perform a detailed
statistical analysis of the models, including influential cases, variance
inflation factors, validation of the linearity assumption, pseudo R squared
measures and the no-information rate. Finally, to study the limits of
micro-event detection, we design a method for generating micro-event synthetic
datasets with similar properties to the real-world data, and use them to
identify the micro-event detectability threshold for each of the evaluated
classifiers.
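The pipeline structure described in the abstract (feature elimination with RFECV, an estimator tuned by grid search, and a comparison against the no-information rate) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the synthetic features from `make_classification` stand in for the topic and sentiment features, logistic regression is an assumed choice of estimator, and the parameter grid, fold counts, and F1 scoring are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline

# Assumption: synthetic binary data standing in for per-window
# topic/sentiment features and "release / no release" labels.
X, y = make_classification(
    n_samples=200, n_features=8, n_informative=3, n_redundant=2, random_state=0
)

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Step 1: recursive feature elimination with cross-validation (RFECV).
# Step 2: the classifier whose hyperparameters are tuned by grid search.
pipeline = Pipeline([
    ("select", RFECV(LogisticRegression(max_iter=1000), step=1,
                     cv=inner_cv, scoring="f1")),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Hypothetical grid over the regularization strength of the classifier.
param_grid = {"clf__C": [0.1, 1.0, 10.0]}
search = GridSearchCV(pipeline, param_grid, cv=outer_cv, scoring="f1")
search.fit(X, y)

# Number of features kept by RFECV in the refitted best pipeline.
n_selected = search.best_estimator_.named_steps["select"].n_features_

# No-information rate: accuracy of always predicting the majority class.
no_information_rate = np.bincount(y).max() / len(y)

print("best params:", search.best_params_)
print("features kept:", n_selected)
print("no-information rate:", round(no_information_rate, 3))
```

A classifier is only informative if its accuracy clearly exceeds the no-information rate, which is why that baseline is computed alongside the tuned model.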
Related papers
- Pretraining Data Detection for Large Language Models: A Divergence-based Calibration Method [108.56493934296687]
We introduce a divergence-based calibration method, inspired by the divergence-from-randomness concept, to calibrate token probabilities for pretraining data detection.
We have developed a Chinese-language benchmark, PatentMIA, to assess the performance of detection approaches for LLMs on Chinese text.
arXiv Detail & Related papers (2024-09-23T07:55:35Z)
- Downstream-Pretext Domain Knowledge Traceback for Active Learning [138.02530777915362]
We propose a downstream-pretext domain knowledge traceback (DOKT) method that traces the data interactions of downstream knowledge and pre-training guidance.
DOKT consists of a traceback diversity indicator and a domain-based uncertainty estimator.
Experiments conducted on ten datasets show that our model outperforms other state-of-the-art methods.
arXiv Detail & Related papers (2024-07-20T01:34:13Z)
- Detecting Pretraining Data from Large Language Models [90.12037980837738]
We study the pretraining data detection problem.
Given a piece of text and black-box access to an LLM without knowing the pretraining data, can we determine if the model was trained on the provided text?
We introduce a new detection method Min-K% Prob based on a simple hypothesis.
arXiv Detail & Related papers (2023-10-25T17:21:23Z)
- Joint Microseismic Event Detection and Location with a Detection Transformer [8.505271826735118]
We propose an approach to unify event detection and source location into a single framework.
The proposed network is trained on synthetic data simulating multiple microseismic events corresponding to random source locations.
arXiv Detail & Related papers (2023-07-16T10:56:46Z)
- On the Possibilities of AI-Generated Text Detection [76.55825911221434]
We argue that as machine-generated text approximates human-like quality, the sample size needed for detection bounds increases.
We test various state-of-the-art text generators, including GPT-2, GPT-3.5-Turbo, Llama, Llama-2-13B-Chat-HF, and Llama-2-70B-Chat-HF, against detectors, including RoBERTa-Large/Base-Detector and GPTZero.
arXiv Detail & Related papers (2023-04-10T17:47:39Z)
- DetectGPT: Zero-Shot Machine-Generated Text Detection using Probability Curvature [143.5381108333212]
We show that text sampled from a large language model tends to occupy negative curvature regions of the model's log probability function.
We then define a new curvature-based criterion for judging if a passage is generated from a given LLM.
We find DetectGPT is more discriminative than existing zero-shot methods for model sample detection.
arXiv Detail & Related papers (2023-01-26T18:44:06Z)
- Data Leakage and Evaluation Issues in Micro-Expression Analysis [45.215233522470115]
We show that data leakage and fragmented evaluation protocols are issues among the micro-expression literature.
We propose a new standardized evaluation protocol using facial action units with over 2000 micro-expression samples, and provide an open source library that implements the evaluation protocols in a standardized manner.
arXiv Detail & Related papers (2022-11-21T13:12:07Z)
- Unsupervised Event Detection, Clustering, and Use Case Exposition in Micro-PMU Measurements [0.0]
We develop an unsupervised event detection method based on the concept of Generative Adversarial Networks (GAN).
We also propose a two-step unsupervised clustering method, based on a novel linear mixed integer programming formulation.
Results show that they can outperform the prevalent methods in the literature.
arXiv Detail & Related papers (2020-07-30T05:20:29Z)
- Meta Learning for Causal Direction [29.00522306460408]
We introduce a novel generative model that allows distinguishing cause and effect in the small data setting.
We demonstrate our method on various synthetic as well as real-world data and show that it is able to maintain high accuracy in detecting directions across varying dataset sizes.
arXiv Detail & Related papers (2020-07-06T15:12:05Z)
- Stance Detection Benchmark: How Robust Is Your Stance Detection? [65.91772010586605]
Stance Detection (StD) aims to detect an author's stance towards a certain topic or claim.
We introduce a StD benchmark that learns from ten StD datasets of various domains in a multi-dataset learning setting.
Within this benchmark setup, we are able to present new state-of-the-art results on five of the datasets.
arXiv Detail & Related papers (2020-01-06T13:37:51Z)