Related papers: Differentiating Emigration from Return Migration of Scholars Using Name-Based Nationality Detection Models

URL: http://arxiv.org/abs/2505.06107v1
Date: Fri, 09 May 2025 15:03:39 GMT
Title: Differentiating Emigration from Return Migration of Scholars Using Name-Based Nationality Detection Models
Authors: Faeze Ghorbanpour, Thiago Zordan Malaguth, Aliakbar Akbaritabar
Abstract summary: Most web and digital trace data do not include information about an individual's nationality due to privacy concerns. We propose methods to detect the nationality with the least available data, i.e., full names. Our results show that using the country of first publication as a proxy for nationality underestimates the size of return flows.
Score: 0.0
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Most web and digital trace data do not include information about an individual's nationality due to privacy concerns. The lack of data on nationality can create challenges for migration research. It can lead to a left-censoring issue since we are uncertain about the migrant's country of origin. Once we observe an emigration event, if we know the nationality, we can differentiate it from return migration. We propose methods to detect the nationality with the least available data, i.e., full names. We use the detected nationality in comparison with the country of academic origin, which is a common approach in studying the migration of researchers. We gathered 2.6 million unique name-nationality pairs from Wikipedia and categorized them into families of nationalities with three granularity levels to use as our training data. Using a character-based machine learning model, we achieved a weighted F1 score of 84% for the broadest and 67% for the most granular, country-level categorization. In our empirical study, we used the trained and tested model to assign nationality to 8+ million scholars' full names in Scopus data. Our results show that using the country of first publication as a proxy for nationality underestimates the size of return flows, especially for countries with a more diverse academic workforce, such as the USA, Australia, and Canada. We found that around 48% of emigration from the USA was return migration once we used the country of name origin, in contrast to 33% based on academic origin. In the most recent period, 79% of scholars whose affiliation has consistently changed from the USA to China, and are considered emigrants, have Chinese names in contrast to 41% with a Chinese academic origin. Our proposed methods for addressing left-censoring issues are beneficial for other research that uses digital trace data to study migration.
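The distinction the abstract draws between emigration and return migration can be illustrated with a minimal Python sketch. It assumes each observed move is a record with a source country, a destination country, and two origin proxies (academic origin and name-based origin). The field names, the helper functions `classify_move` and `return_share`, and the four toy records are all hypothetical and are not taken from the paper or its data.

```python
from collections import Counter


def classify_move(origin, source, destination):
    """Label one move, given an assumed country of origin for the scholar."""
    if source == destination:
        return "no_move"
    # Moving to the country of origin counts as return migration.
    if destination == origin:
        return "return_migration"
    return "emigration"


def return_share(moves, origin_key):
    """Share of departures labeled as return migration under one origin proxy."""
    labels = Counter(
        classify_move(m[origin_key], m["source"], m["destination"]) for m in moves
    )
    departures = labels["return_migration"] + labels["emigration"]
    return labels["return_migration"] / departures if departures else 0.0


# Toy records (hypothetical), all departing from the USA.
moves = [
    {"source": "USA", "destination": "CHN", "academic_origin": "USA", "name_origin": "CHN"},
    {"source": "USA", "destination": "CHN", "academic_origin": "CHN", "name_origin": "CHN"},
    {"source": "USA", "destination": "DEU", "academic_origin": "USA", "name_origin": "USA"},
    {"source": "USA", "destination": "CAN", "academic_origin": "USA", "name_origin": "IND"},
]

share_academic = return_share(moves, "academic_origin")
share_name = return_share(moves, "name_origin")
print(f"return share by academic origin: {share_academic:.2f}")
print(f"return share by name origin:     {share_name:.2f}")
```

In this toy data the name-based proxy labels more departures as return migration than the academic-origin proxy does, which mirrors the direction of the effect the abstract reports. The numbers themselves are illustrative only.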

Related papers

Digital Diasporas: How Origin Characteristics and Host-Native Distance Shape Immigrants' Online Cultural Retention [23.221303294436492]
We identify the antecedents of the mosaic hypothesis or factors that enhance (or diminish) the propensity for cultural retention among immigrants. Based on Facebook advertising data for immigrants from 8 countries residing in the USA, our findings suggest that greater host-native distance is linked to higher online cultural retention.
arXiv Detail & Related papers (2025-11-21T20:15:12Z)
Do You Know About My Nation? Investigating Multilingual Language Models' Cultural Literacy Through Factual Knowledge [68.6805229085352]
Most multilingual question-answering benchmarks do not factor in regional diversity in the information they capture. XNationQA encompasses a total of 49,280 questions on the geography, culture, and history of nine countries, presented in seven languages. We benchmark eight standard multilingual LLMs on XNationQA and evaluate them using two novel transference metrics.
arXiv Detail & Related papers (2025-11-01T18:41:34Z)
Measuring Global Migration Flows using Online Data [0.38836072943850625]
Using privacy-protected records from three billion Facebook users, we estimate country-to-country migration flows at monthly granularity for 181 countries. We estimate that 39.1 million people migrated internationally in 2022 (0.63% of the population of the countries in our sample). To support research and policy interventions, we will release these estimates publicly through the Humanitarian Data Exchange.
arXiv Detail & Related papers (2025-04-16T01:19:26Z)
Inferring fine-grained migration patterns across the United States [1.6594124470436404]
We develop a scalable iterative-proportional-fitting based method that reconciles high-resolution but biased proprietary data with low-resolution but more reliable Census data. We produce MIGRATE, a dataset of annual migration matrices from 2010 - 2019 that captures flows between 47.4 billion pairs of Census Block Groups. These estimates are highly correlated with external ground-truth datasets, and improve accuracy and reduce bias relative to raw proprietary data.
arXiv Detail & Related papers (2025-03-26T21:07:44Z)
The diaspora model for human migration [0.07852714805965527]
Existing models primarily rely on population size and travel distance to explain flow fluctuations. We propose the diaspora model of migration, incorporating intensity (the number of people moving to a country) and assortativity (the destination within the country). Our model considers only the existing diaspora sizes in the destination country, influencing the probability of migrants selecting a specific residence.
arXiv Detail & Related papers (2023-09-06T15:17:53Z)
This Land is {Your, My} Land: Evaluating Geopolitical Biases in Language Models [40.61046400448044]
We show that large language models (LLM) recall certain geographical knowledge inconsistently when queried in different languages. As a targeted case study, we consider territorial disputes, an inherently controversial and multilingual task. We propose a suite of evaluation metrics to precisely quantify bias and consistency in responses across different languages.
arXiv Detail & Related papers (2023-05-24T01:16:17Z)
Transfer to a Low-Resource Language via Close Relatives: The Case Study on Faroese [54.00582760714034]
Cross-lingual NLP transfer can be improved by exploiting data and models of high-resource languages. We release a new web corpus of Faroese and Faroese datasets for named entity recognition (NER), semantic text similarity (STS) and new language models trained on all Scandinavian languages.
arXiv Detail & Related papers (2023-04-18T08:42:38Z)
Statistical analysis of word flow among five Indo-European languages [0.0]
We use the Google Books Ngram dataset to analyze word flow among English, French, German, Italian, and Spanish. We study what we define as "migrant words", a type of loanword that does not change its spelling.
arXiv Detail & Related papers (2023-01-17T16:12:42Z)
Geographic Citation Gaps in NLP Research [63.13508571014673]
This work asks a series of questions on the relationship between geographical location and publication success. We first created a dataset of 70,000 papers from the ACL Anthology, extracted their meta-information, and generated their citation network. We show that not only are there substantial geographical disparities in paper acceptance and citation but also that these disparities persist even when controlling for a number of variables such as venue of publication and sub-field of NLP.
arXiv Detail & Related papers (2022-10-26T02:25:23Z)
GeoMLAMA: Geo-Diverse Commonsense Probing on Multilingual Pre-Trained Language Models [68.50584946761813]
We introduce a framework for geo-diverse commonsense probing on multilingual Pre-Trained Language Models (mPLMs). We benchmark 11 standard mPLMs, which include variants of mBERT, XLM, mT5, and XGLM, on the GeoMLAMA dataset. We find that 1) larger mPLM variants do not necessarily store geo-diverse concepts better than their smaller variants; 2) mPLMs are not intrinsically biased towards knowledge from the Western countries; and 3) a language may better probe knowledge about a non-native country than its native country.
arXiv Detail & Related papers (2022-05-24T17:54:50Z)
'Moving On' -- Investigating Inventors' Ethnic Origins Using Supervised Learning [0.0]
Patent data provides rich information about technical inventions, but does not disclose the ethnic origin of inventors. I construct a dataset of 95'202 labeled names and train an artificial recurrent neural network with long-short-term memory (LSTM) to predict ethnic origins. I use this model to classify and investigate the ethnic origins of 2.68 million inventors and provide novel descriptive evidence regarding their ethnic origin composition.
arXiv Detail & Related papers (2022-01-03T10:47:47Z)
Return migration of German-affiliated researchers: Analyzing departure and return by gender, cohort, and discipline using Scopus bibliometric data 1996-2020 [0.6299766708197883]
We use Scopus bibliometric data on 8 million publications from 1.1 million researchers who have published at least once with an affiliation address from Germany in 1996-2020. Our analyses shed light on important career stages and gender disparities among researchers who remain in Germany, those who migrate out, and those who eventually return.
arXiv Detail & Related papers (2021-10-15T19:59:21Z)
Cross-Lingual Training with Dense Retrieval for Document Retrieval [56.319511218754414]
We explore different transfer techniques for document ranking from English annotations to multiple non-English languages. We run experiments on test collections in six languages (Chinese, Arabic, French, Hindi, Bengali, Spanish) from diverse language families. We find that weakly-supervised target language transfer yields competitive performance against generation-based target language transfer.
arXiv Detail & Related papers (2021-09-03T17:15:38Z)
Brain Drain and Brain Gain in Russia: Analyzing International Migration of Researchers by Discipline using Scopus Bibliometric Data 1996-2020 [77.34726150561087]
We analyze all researchers who have published with a Russian affiliation address in Scopus-indexed sources in 1996-2020. While Russia was a donor country in the late 1990s and early 2000s, it has experienced a relatively balanced circulation of researchers in more recent years. Overall, researchers emigrating from Russia outnumbered and outperformed researchers immigrating to Russia.
arXiv Detail & Related papers (2020-08-07T12:47:38Z)
Learning to Learn Morphological Inflection for Resource-Poor Languages [105.11499402984482]
We propose to cast the task of morphological inflection - mapping a lemma to an indicated inflected form - for resource-poor languages as a meta-learning problem. Treating each language as a separate task, we use data from high-resource source languages to learn a set of model parameters. Experiments with two model architectures on 29 target languages from 3 families show that our suggested approach outperforms all baselines.
arXiv Detail & Related papers (2020-04-28T05:13:17Z)
