The Influence of Domain-Based Preprocessing on Subject-Specific
Clustering
- URL: http://arxiv.org/abs/2011.08127v1
- Date: Mon, 16 Nov 2020 17:47:19 GMT
- Title: The Influence of Domain-Based Preprocessing on Subject-Specific
Clustering
- Authors: Alexandra Gkolia, Nikhil Fernandes, Nicolas Pizzo, James Davenport and
Akshar Nair
- Abstract summary: The sudden change of moving the majority of teaching
online at universities has caused an increased workload for academics.
One way to deal with this problem is to cluster student questions by topic.
In this paper, we explore approaches to tagging data sets, focusing on identifying code excerpts, and provide empirical results.
- Score: 55.41644538483948
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The sudden change of moving the majority of teaching online at
universities due to the global Covid-19 pandemic has caused an increased
workload for academics. One of the contributing factors is answering a high
volume of queries coming from students. As these queries are not limited to
the synchronous time frame of a lecture, there is a high chance of many of
them being related or even equivalent. One way to deal with this problem is
to cluster these questions by topic. In our previous work, we aimed to find
an improved method of clustering that would give us high efficiency, using a
recurring LDA model. Our data set contained questions posted online from a
Computer Science course at the University of Bath. A significant number of
these questions contained code excerpts, which we found caused a problem in
clustering, as certain terms were being treated as common English words
rather than being recognised as specific code terms. To address this, we
implemented tagging of these technical terms using Python, as part of
preprocessing the data set. In this paper, we explore approaches to tagging
data sets, focusing on identifying code excerpts, and provide empirical
results to justify our reasoning.
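The abstract does not describe the tagging implementation, so the following is only a minimal sketch of the idea: code-like tokens are detected with regular expressions and rewritten so a topic model cannot mistake them for ordinary English words. The patterns, the `code_` prefix, and the `tag_code_terms` helper are illustrative assumptions, not the authors' actual code:

```python
import re

# Illustrative patterns for code-like tokens in student questions;
# the paper does not list its actual patterns, so these are assumptions.
CODE_PATTERNS = [
    re.compile(r"`[^`]+`"),         # inline code quoted in backticks
    re.compile(r"\b\w+\([^)]*\)"),  # function calls, e.g. print(x)
    re.compile(r"\b\w+\.\w+\b"),    # attribute access, e.g. list.append
    re.compile(r"\b(def|class|import|return|lambda)\b"),  # Python keywords
]

def tag_code_terms(text: str) -> str:
    """Rewrite code-like excerpts with a 'code_' prefix so a topic
    model treats them as distinct vocabulary items rather than as
    common English words."""
    def tag(match: re.Match) -> str:
        token = re.sub(r"\W+", "_", match.group(0)).strip("_")
        return f"code_{token}"
    for pattern in CODE_PATTERNS:
        text = pattern.sub(tag, text)
    return text

print(tag_code_terms("Why does print(x) fail after I import numpy?"))
# -> "Why does code_print_x fail after I code_import numpy?"
```

After tagging, the questions can be tokenised and passed to the LDA pipeline as before; whether to prefix tokens, replace them with a single placeholder, or keep a separate code vocabulary is a design choice the abstract leaves open.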
Related papers
- Open Domain Question Answering with Conflicting Contexts [55.739842087655774]
We find that as much as 25% of unambiguous, open domain questions can lead to conflicting contexts when retrieved using Google Search.
We ask our annotators to provide explanations for their selections of correct answers.
arXiv Detail & Related papers (2024-10-16T07:24:28Z)
- QUITO-X: An Information Bottleneck-based Compression Algorithm with Cross-Attention [37.25151458038128]
We introduce information bottleneck theory to examine the properties required by the metric.
Inspired by this, we use cross-attention in encoder-decoder architecture as a new metric.
Our simple method leads to significantly better performance in smaller models with lower latency.
arXiv Detail & Related papers (2024-08-20T02:44:45Z)
- Cache & Distil: Optimising API Calls to Large Language Models [82.32065572907125]
Large-scale deployment of generative AI tools often depends on costly API calls to a Large Language Model (LLM) to fulfil user queries.
To curtail the frequency of these calls, one can employ a smaller language model -- a student.
This student gradually gains proficiency in independently handling an increasing number of user requests.
arXiv Detail & Related papers (2023-10-20T15:01:55Z)
- Can Large Language Models Infer Causation from Correlation? [104.96351414570239]
We test the pure causal inference skills of large language models (LLMs).
We formulate a novel task, Corr2Cause, which takes a set of correlational statements and determines the causal relationship between the variables.
We show that these models achieve close to random performance on the task.
arXiv Detail & Related papers (2023-06-09T12:09:15Z)
- Attention-based model for predicting question relatedness on Stack Overflow [0.0]
We propose an Attention-based Sentence pair Interaction Model (ASIM) to predict the relatedness between questions on Stack Overflow automatically.
ASIM achieves significant improvements over the baseline approaches in the Precision, Recall, and Micro-F1 evaluation metrics.
Our model also performs well in the duplicate question detection task of Ask Ubuntu.
arXiv Detail & Related papers (2021-03-19T12:18:03Z)
- Unification of HDP and LDA Models for Optimal Topic Clustering of Subject Specific Question Banks [55.41644538483948]
An increase in the popularity of online courses would result in an increase in the number of course-related queries for academics.
In order to reduce the time spent on answering each individual question, clustering them is an ideal choice.
We use the Hierarchical Dirichlet Process to determine an optimal topic number input for our LDA model runs (a minimal sketch of this idea appears after this list).
arXiv Detail & Related papers (2020-10-04T18:21:20Z)
- ClarQ: A large-scale and diverse dataset for Clarification Question Generation [67.1162903046619]
We devise a novel bootstrapping framework that assists in the creation of a diverse, large-scale dataset of clarification questions based on post comments extracted from StackExchange.
We quantitatively demonstrate the utility of the newly created dataset by applying it to the downstream task of question-answering.
We release this dataset in order to foster research into the field of clarification question generation with the larger goal of enhancing dialog and question answering systems.
arXiv Detail & Related papers (2020-06-10T17:56:50Z)
- Active Learning for Skewed Data Sets [25.866341631677688]
We focus on problems with two distinguishing characteristics: severe class imbalance (skew) and small amounts of initial training data.
We propose a hybrid active learning algorithm (HAL) that balances exploiting the knowledge available through the currently labeled training examples with exploring the large amount of unlabeled data.
arXiv Detail & Related papers (2020-05-23T01:50:19Z)
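The HDP-then-LDA pipeline from the "Unification of HDP and LDA Models" entry above can be illustrated with gensim. This is a hedged sketch, not the authors' code: the toy corpus and the 0.01 weight threshold for counting active topics are our assumptions, not their criterion.

```python
from gensim.corpora import Dictionary
from gensim.models import HdpModel, LdaModel

# Toy corpus; in the paper this would be the preprocessed (tagged)
# student questions from the course forum.
docs = [
    ["recursion", "base", "case", "stack"],
    ["code_print_x", "loop", "error"],
    ["code_import", "numpy", "array", "shape"],
]
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

# HDP infers the number of topics nonparametrically.
hdp = HdpModel(corpus, id2word=dictionary)

# Count topics with non-negligible mixture weight (heuristic threshold)
# and use that count as the fixed topic number for the LDA runs.
alpha, _ = hdp.hdp_to_lda()
num_topics = max(1, int((alpha > 0.01).sum()))
lda = LdaModel(corpus, num_topics=num_topics, id2word=dictionary)
print(lda.print_topics(num_topics))
```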
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.