A Greek Parliament Proceedings Dataset for Computational Linguistics and
Political Analysis
- URL: http://arxiv.org/abs/2210.12883v1
- Date: Sun, 23 Oct 2022 23:23:28 GMT
- Title: A Greek Parliament Proceedings Dataset for Computational Linguistics and
Political Analysis
- Authors: Konstantina Dritsa, Kaiti Thoma, John Pavlopoulos, Panos Louridas
- Abstract summary: We introduce a curated dataset of the Greek Parliament Proceedings that extends chronologically from 1989 up to 2020.
It consists of more than 1 million speeches with extensive metadata, extracted from 5,355 parliamentary record files.
- Score: 4.396860522241306
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large, diachronic datasets of political discourse are hard to come across,
especially for resource-lean languages such as Greek. In this paper, we
introduce a curated dataset of the Greek Parliament Proceedings that extends
chronologically from 1989 up to 2020. It consists of more than 1 million
speeches with extensive metadata, extracted from 5,355 parliamentary record
files. We explain how it was constructed and the challenges that we had to
overcome. The dataset can be used for both computational linguistics and
political analysis, ideally combining the two. We present such an application,
showing (i) how the dataset can be used to study the change of word usage
through time, (ii) between significant historical events and political parties,
(iii) by evaluating and employing algorithms for detecting semantic shifts.
Related papers
- SpeakGer: A meta-data enriched speech corpus of German state and federal parliaments [0.12277343096128711]
We provide the SpeakGer data set, consisting of German parliament debates from all 16 federal states of Germany as well as the German Bundestag from 1947-2023.
This data set includes rich metadata: audience reactions to each speech, as well as the speaker's party, age, constituency and their party's political alignment.
arXiv Detail & Related papers (2024-10-23T14:00:48Z)
- Learning Phonotactics from Linguistic Informants [54.086544221761486]
Our model iteratively selects or synthesizes a data-point according to one of a range of information-theoretic policies.
We find that the information-theoretic policies that our model uses to select items to query the informant achieve sample efficiency comparable to, or greater than, fully supervised approaches.
arXiv Detail & Related papers (2024-05-08T00:18:56Z)
- PILA: A Historical-Linguistic Dataset of Proto-Italic and Latin [11.820097994590672]
We introduce the Proto-Italic to Latin dataset, which consists of roughly 3,000 pairs of forms from Proto-Italic and Latin.
We present baseline results for PILA on a pair of traditional computational historical linguistics tasks.
We demonstrate PILA's capability for enhancing other historical-linguistic datasets.
arXiv Detail & Related papers (2024-04-25T05:33:47Z)
- Syntactic Language Change in English and German: Metrics, Parsers, and Convergences [56.47832275431858]
The current paper looks at diachronic trends in syntactic language change in both English and German, using corpora of parliamentary debates from the last c. 160 years.
We base our observations on five dependency parsers, including the widely used Stanford CoreNLP as well as 4 newer alternatives.
We show that changes in syntactic measures seem to be more frequent at the tails of sentence length distributions.
arXiv Detail & Related papers (2024-02-18T11:46:16Z)
- Multilingual estimation of political-party positioning: From label
aggregation to long-input Transformers [3.651047982634467]
We implement and compare two approaches to automatic scaling analysis of political-party manifestos.
We find that the task can be efficiently solved by state-of-the-art models, with label aggregation producing the best results.
arXiv Detail & Related papers (2023-10-19T08:34:48Z)
- The ParlaSent Multilingual Training Dataset for Sentiment Identification in Parliamentary Proceedings [0.0]
The paper presents a new training dataset of sentences in 7 languages, manually annotated for sentiment.
The paper additionally introduces the first domain-specific multilingual transformer language model for political science applications.
arXiv Detail & Related papers (2023-09-18T14:01:06Z)
- Panning for gold: Lessons learned from the platform-agnostic automated
detection of political content in textual data [48.7576911714538]
We discuss how these techniques can be used to detect political content across different platforms.
We compare the performance of three groups of detection techniques relying on dictionaries, supervised machine learning, or neural networks.
Our results show the limited impact of preprocessing on model performance, with the best results for less noisy data being achieved by neural network- and machine-learning-based models.
arXiv Detail & Related papers (2022-07-01T15:23:23Z)
- BasqueParl: A Bilingual Corpus of Basque Parliamentary Transcriptions [3.4447242282168777]
We release the first version of a newly compiled corpus from Basque parliamentary transcripts.
The corpus is characterized by heavy Basque-Spanish code-switching, and represents an interesting resource to study political discourse in contrasting languages such as Basque and Spanish.
arXiv Detail & Related papers (2022-05-03T14:02:24Z)
- Models and Datasets for Cross-Lingual Summarisation [78.56238251185214]
We present a cross-lingual summarisation corpus with long documents in a source language associated with multi-sentence summaries in a target language.
The corpus covers twelve language pairs and directions for four European languages, namely Czech, English, French and German.
We derive cross-lingual document-summary instances from Wikipedia by combining lead paragraphs and articles' bodies from language-aligned Wikipedia titles.
arXiv Detail & Related papers (2022-02-19T11:55:40Z)
- A Massively Multilingual Analysis of Cross-linguality in Shared
Embedding Space [61.18554842370824]
In cross-lingual language models, representations for many different languages live in the same space.
We compute a task-based measure of cross-lingual alignment in the form of bitext retrieval performance.
We examine a range of linguistic, quasi-linguistic, and training-related features as potential predictors of these alignment metrics.
arXiv Detail & Related papers (2021-09-13T21:05:37Z)
- A Corpus for Large-Scale Phonetic Typology [112.19288631037055]
We present VoxClamantis v1.0, the first large-scale corpus for phonetic typology.
The corpus provides aligned segments and estimated phoneme-level labels in 690 readings spanning 635 languages, along with acoustic-phonetic measures of vowels and sibilants.
arXiv Detail & Related papers (2020-05-28T13:03:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.