Automatic Detection of Industry Sectors in Legal Articles Using Machine
Learning Approaches
- URL: http://arxiv.org/abs/2303.05387v1
- Date: Wed, 8 Mar 2023 12:41:56 GMT
- Title: Automatic Detection of Industry Sectors in Legal Articles Using Machine
Learning Approaches
- Authors: Hui Yang (1 and 2), Stella Hadjiantoni (1), Yunfei Long (3), Ruta
Petraityte (2), Berthold Lausen (1 and 4) ((1) Department of Mathematical
Sciences, University of Essex, Wivenhoe Park, Colchester, CO43SQ, UK, (2)
Mondaq Ltd, Bristol, UK, (3) School of Computer Science and Electronic
Engineering, University of Essex, Wivenhoe Park, Colchester, CO43SQ, UK, (4)
Institute of Medical Informatics, Biometry and Epidemiology, School of
Medicine, Friedrich-Alexander University Erlangen-Nuremberg, Waldstr. 6,
Erlangen, 91054, Germany)
- Abstract summary: A dataset consisting of over 1,700 annotated legal articles was created for the identification of six industry sectors.
The system achieved promising results with area under the receiver operating characteristic curve scores above 0.90 and F-scores above 0.81 with respect to the six industry sectors.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The ability to automatically identify industry sector coverage in articles on
legal developments, or any kind of news articles for that matter, can bring
plenty of benefits both to the readers and the content creators themselves.
By having articles tagged based on industry coverage, readers from all around
the world would be able to get to legal news that is specific to their region
and professional industry. Simultaneously, writers would benefit from
understanding which industries potentially lack coverage or which industries
readers are currently most interested in, and could thus focus their
writing efforts on more inclusive and relevant legal news coverage. In
this paper, a Machine Learning-powered industry analysis approach which
combined Natural Language Processing (NLP) with Statistical and Machine
Learning (ML) techniques was investigated. A dataset consisting of over 1,700
annotated legal articles was created for the identification of six industry
sectors. Text-based and legal-based features were extracted from the text. Both
traditional ML methods (e.g. gradient boosting machine algorithms and
decision-tree based algorithms) and deep neural networks (e.g. transformer
models) were applied for performance comparison of predictive models. The
system achieved promising results with area under the receiver operating
characteristic curve scores above 0.90 and F-scores above 0.81 with respect to
the six industry sectors. The experimental results show that the suggested
automated industry analysis which employs ML techniques allows the processing
of large collections of text data in an easy, efficient, and scalable way.
Traditional ML methods perform better than deep neural networks when only a
small, domain-specific training dataset is available for the study.
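The abstract reports per-sector area under the ROC curve and F-scores. As a minimal sketch of how those two metrics can be computed for a single sector treated as a binary (one-vs-rest) problem, the following uses only the Python standard library. The labels, scores, the 0.5 threshold, and the function names `roc_auc` and `f1_score` are illustrative assumptions, not data or code from the paper, and the paper does not state how its scores were computed or averaged.

```python
from typing import Sequence


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC as the probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # Count pairwise wins of positives over negatives; a tie counts as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 score: harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data for one sector: 1 = article covers the sector, 0 = it does not.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
# Hypothetical classifier scores for the same eight articles.
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.2, 0.7]

# Assumed decision threshold for turning scores into hard labels.
threshold = 0.5
y_pred = [1 if s >= threshold else 0 for s in scores]

auc = roc_auc(y_true, scores)
f1 = f1_score(y_true, y_pred)
print(f"AUC = {auc:.4f}, F1 = {f1:.4f}")
```

AUC is computed from the raw scores and does not depend on the threshold, whereas the F-score depends on the hard labels and therefore on the threshold chosen.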
Related papers
- A Bayesian Approach to Harnessing the Power of LLMs in Authorship Attribution [57.309390098903]
Authorship attribution aims to identify the origin or author of a document.
Large Language Models (LLMs), with their deep reasoning capabilities and ability to maintain long-range textual associations, offer a promising alternative.
Our results on the IMDb and blog datasets show an impressive 85% accuracy in one-shot authorship classification across ten authors.
(2024-10-29T04:14:23Z)
- Uncovering Key Trends in Industry 5.0 through Advanced AI Techniques [0.0]
This article analyzes around 200 online articles to identify trends within Industry 5.0 using artificial intelligence techniques.
The results reveal a convergence around a core set of themes while also highlighting that Industry 5.0 spans a wide range of topics.
(2024-10-22T07:06:00Z)
- LLM-DetectAIve: a Tool for Fine-Grained Machine-Generated Text Detection [87.43727192273772]
It is often hard to tell whether a piece of text was human-written or machine-generated.
We present LLM-DetectAIve, designed for fine-grained detection.
It supports four categories: (i) human-written, (ii) machine-generated, (iii) machine-written, then machine-humanized, and (iv) human-written, then machine-polished.
(2024-08-08T07:43:17Z)
- Envisioning Outlier Exposure by Large Language Models for Out-of-Distribution Detection [71.93411099797308]
Detecting out-of-distribution (OOD) samples is crucial when deploying machine learning models in open-world scenarios.
We propose to tackle this constraint by leveraging the expert knowledge and reasoning capability of large language models (LLMs) to envision potential Outlier Exposure, termed EOE.
EOE can be generalized to different tasks, including far, near, and fine-grained OOD detection.
EOE achieves state-of-the-art performance across different OOD tasks and can be effectively scaled to the ImageNet-1K dataset.
(2024-06-02T17:09:48Z)
- Automatic explanation of the classification of Spanish legal judgments in jurisdiction-dependent law categories with tree estimators [6.354358255072839]
This work contributes a system combining Natural Language Processing (NLP) with Machine Learning (ML) to classify legal texts in an explainable manner.
We analyze the features involved in the decision and the threshold bifurcation values of the decision paths of tree structures.
Legal experts have validated our solution, and this knowledge has also been incorporated into the explanation process as "expert-in-the-loop" dictionaries.
(2024-03-30T17:59:43Z)
- Unsupervised Sentiment Analysis of Plastic Surgery Social Media Posts [91.3755431537592]
The massive collection of user posts across social media platforms is primarily untapped for artificial intelligence (AI) use cases.
Natural language processing (NLP) is a subfield of AI that leverages bodies of documents, known as corpora, to train computers in human-like language understanding.
This study demonstrates that the applied results of unsupervised analysis allow a computer to predict either negative, positive, or neutral user sentiment towards plastic surgery.
(2023-07-05T20:16:20Z)
- Application of Transformers based methods in Electronic Medical Records: A Systematic Literature Review [77.34726150561087]
This work presents a systematic literature review of state-of-the-art advances using transformer-based methods on electronic medical records (EMRs) in different NLP tasks.
(2023-04-05T22:19:42Z)
- Multidimensional Perceptron for Efficient and Explainable Long Text Classification [31.31206469613901]
We propose a simple but effective model, Segment-aWare multIdimensional PErceptron (SWIPE) to replace attention/RNNs in the framework.
SWIPE can effectively learn the label of the entire text with supervised training, while perceiving the labels of the segments and estimating their contributions to the long-text labeling.
(2023-04-04T08:49:39Z)
- RaFoLa: A Rationale-Annotated Corpus for Detecting Indicators of Forced Labour [4.393754160527062]
This paper presents the first openly accessible English corpus annotated for multi-class and multi-label forced labour detection.
The corpus consists of 989 news articles retrieved from specialised data sources and annotated according to risk indicators defined by the International Labour Organization (ILO).
(2022-05-05T14:43:31Z)
- Rebuilding Trust in Active Learning with Actionable Metrics [77.99796068970569]
Active Learning (AL) is an active domain of research, but is seldom used in the industry despite the pressing needs.
This is in part due to a misalignment of objectives: research strives to get the best results on selected datasets.
We present various actionable metrics to help rebuild trust of industrial practitioners in Active Learning.
(2020-12-18T09:34:59Z)
- Supervised Text Classification using Text Search [0.0]
The authors describe a class of industry-standard algorithms which can accurately predict the classification of any text given prior labelled text data.
These algorithms were used to automate routing of issue tickets to the appropriate team.
(2020-11-14T19:51:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.