The Influence of Domain-Based Preprocessing on Subject-Specific
Clustering
- URL: http://arxiv.org/abs/2011.08127v1
- Date: Mon, 16 Nov 2020 17:47:19 GMT
- Title: The Influence of Domain-Based Preprocessing on Subject-Specific
Clustering
- Authors: Alexandra Gkolia, Nikhil Fernandes, Nicolas Pizzo, James Davenport and
Akshar Nair
- Abstract summary: The sudden change of moving the majority of teaching
online at universities has caused an increased workload for academics.
One way to deal with this problem is to cluster student questions by topic.
In this paper, we explore approaches to tagging data sets, focusing on identifying code excerpts, and provide empirical results.
- Score: 55.41644538483948
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The sudden change of moving the majority of teaching online at
universities due to the global Covid-19 pandemic has caused an increased
workload for academics. One of the contributing factors is answering a high
volume of queries coming from students. As these queries are not limited to
the synchronous time frame of a lecture, there is a high chance of many of
them being related or even equivalent. One way to deal with this problem is
to cluster these questions by topic. In our previous work, we aimed to find
an improved method of clustering that would give us high efficiency, using a
recurring LDA model. Our data set contained questions posted online from a
Computer Science course at the University of Bath. A significant number of
these questions contained code excerpts, which we found caused a problem in
clustering, as certain terms were being treated as common English words
rather than being recognised as specific code terms. To address this, we
implemented tagging of these technical terms using Python, as part of
preprocessing the data set. In this paper, we explore approaches to tagging
data sets, focusing on identifying code excerpts, and provide empirical
results to justify our reasoning.
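The abstract does not describe the tagging implementation, so the following is only a minimal sketch of the idea: code-like tokens are detected with regular expressions and rewritten so a topic model cannot mistake them for ordinary English words. The patterns, the `code_` prefix, and the `tag_code_terms` helper are illustrative assumptions, not the authors' actual code:

```python
import re

# Illustrative patterns for code-like tokens in student questions;
# the paper does not list its actual patterns, so these are assumptions.
CODE_PATTERNS = [
    re.compile(r"`[^`]+`"),         # inline code quoted in backticks
    re.compile(r"\b\w+\([^)]*\)"),  # function calls, e.g. print(x)
    re.compile(r"\b\w+\.\w+\b"),    # attribute access, e.g. list.append
    re.compile(r"\b(def|class|import|return|lambda)\b"),  # Python keywords
]

def tag_code_terms(text: str) -> str:
    """Rewrite code-like excerpts with a 'code_' prefix so a topic
    model treats them as distinct vocabulary items rather than as
    common English words."""
    def tag(match: re.Match) -> str:
        token = re.sub(r"\W+", "_", match.group(0)).strip("_")
        return f"code_{token}"
    for pattern in CODE_PATTERNS:
        text = pattern.sub(tag, text)
    return text

print(tag_code_terms("Why does print(x) fail after I import numpy?"))
# -> "Why does code_print_x fail after I code_import numpy?"
```

After tagging, the questions can be tokenised and passed to the LDA pipeline as before; whether to prefix tokens, replace them with a single placeholder, or keep a separate code vocabulary is a design choice the abstract leaves open.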
Related papers
- Open Domain Question Answering with Conflicting Contexts [55.739842087655774]
We find that as much as 25% of unambiguous, open domain questions can lead to conflicting contexts when retrieved using Google Search.
We ask our annotators to provide explanations for their selections of correct answers.
arXiv Detail & Related papers (2024-10-16T07:24:28Z)
- QUITO-X: An Information Bottleneck-based Compression Algorithm with Cross-Attention [37.25151458038128]
We introduce information bottleneck theory to examine the properties required by the metric.
Inspired by this, we use cross-attention in encoder-decoder architecture as a new metric.
Our simple method leads to significantly better performance in smaller models with lower latency.
arXiv Detail & Related papers (2024-08-20T02:44:45Z)
- Cache & Distil: Optimising API Calls to Large Language Models [82.32065572907125]
Large-scale deployment of generative AI tools often depends on costly API calls to a Large Language Model (LLM) to fulfil user queries.
To curtail the frequency of these calls, one can employ a smaller language model -- a student.
This student gradually gains proficiency in independently handling an increasing number of user requests.
arXiv Detail & Related papers (2023-10-20T15:01:55Z)
- Can Large Language Models Infer Causation from Correlation? [104.96351414570239]
We test the pure causal inference skills of large language models (LLMs).
We formulate a novel task, Corr2Cause, which takes a set of correlational statements and determines the causal relationship between the variables.
We show that these models achieve close to random performance on the task.
arXiv Detail & Related papers (2023-06-09T12:09:15Z)
- Attention-based model for predicting question relatedness on Stack Overflow [0.0]
We propose an Attention-based Sentence pair Interaction Model (ASIM) to predict the relatedness between questions on Stack Overflow automatically.
ASIM achieves significant improvements over the baseline approaches in the Precision, Recall, and Micro-F1 evaluation metrics.
Our model also performs well in the duplicate question detection task of Ask Ubuntu.
arXiv Detail & Related papers (2021-03-19T12:18:03Z)
- Unification of HDP and LDA Models for Optimal Topic Clustering of Subject Specific Question Banks [55.41644538483948]
An increase in the popularity of online courses would result in an increase in the number of course-related queries for academics.
In order to reduce the time spent on answering each individual question, clustering them is an ideal choice.
We use the Hierarchical Dirichlet Process to determine an optimal topic number input for our LDA model runs (a minimal sketch of this idea appears after this list).
arXiv Detail & Related papers (2020-10-04T18:21:20Z)
- ClarQ: A large-scale and diverse dataset for Clarification Question Generation [67.1162903046619]
We devise a novel bootstrapping framework that assists in the creation of a diverse, large-scale dataset of clarification questions based on post comments extracted from StackExchange.
We quantitatively demonstrate the utility of the newly created dataset by applying it to the downstream task of question-answering.
We release this dataset in order to foster research into the field of clarification question generation with the larger goal of enhancing dialog and question answering systems.
arXiv Detail & Related papers (2020-06-10T17:56:50Z)
- Active Learning for Skewed Data Sets [25.866341631677688]
We focus on problems with two distinguishing characteristics: severe class imbalance (skew) and small amounts of initial training data.
We propose a hybrid active learning algorithm (HAL) that balances exploiting the knowledge available through the currently labeled training examples with exploring the large amount of unlabeled data.
arXiv Detail & Related papers (2020-05-23T01:50:19Z)
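The HDP-then-LDA pipeline from the "Unification of HDP and LDA Models" entry above can be illustrated with gensim. This is a hedged sketch, not the authors' code: the toy corpus and the 0.01 weight threshold for counting active topics are our assumptions, not their criterion.

```python
from gensim.corpora import Dictionary
from gensim.models import HdpModel, LdaModel

# Toy corpus; in the paper this would be the preprocessed (tagged)
# student questions from the course forum.
docs = [
    ["recursion", "base", "case", "stack"],
    ["code_print_x", "loop", "error"],
    ["code_import", "numpy", "array", "shape"],
]
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

# HDP infers the number of topics nonparametrically.
hdp = HdpModel(corpus, id2word=dictionary)

# Count topics with non-negligible mixture weight (heuristic threshold)
# and use that count as the fixed topic number for the LDA runs.
alpha, _ = hdp.hdp_to_lda()
num_topics = max(1, int((alpha > 0.01).sum()))
lda = LdaModel(corpus, num_topics=num_topics, id2word=dictionary)
print(lda.print_topics(num_topics))
```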
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.