The global competition for human capital is fuelled by intricate brain circulation dynamics, where individuals with specialized skills traverse geographic, organizational, and national boundaries to address workforce demands. However, a comprehensive framework for integrating and interpreting heterogeneous data on global brain circulation remains elusive. Here we introduce the Global Brain Circulation Dynamics (GBCD) corpus, a longitudinally integrated repository of geo-information encompassing 223 countries/regions from 2000 to 2024.
Garnered from diachronic narrative texts, the GBCD corpus provides granular insights into transnational brain circulation patterns and their interconnections with sociocultural progress. Continuously updated to reflect spatiotemporal dynamics, the GBCD corpus serves as a definitive reference for real-time and ex-post analysis of global brain circulation. Our analysis reveals two pivotal findings:
- narrative brain circulation closely mirrors physical brain mobility
- geopolitical relations and spatiotemporal dynamics exhibit distinct patterns across countries/regions
The GBCD corpus establishes a novel benchmark for examining spatiotemporal brain circulation worldwide, empowering policymakers to develop evidence-based strategies for attracting and retaining human capital in a rapidly evolving global landscape.
The GBCD corpus is a comprehensive dataset comprising 2,904,663,710 processed tokens, structured into two distinct corpora: diachronic and synchronic. The corpus encompasses 1,764,234 entries related to brain circulation features, with the diachronic corpus accounting for 1,311,616 entries that span a 24-year period (2000-2024). Notably, the diachronic corpus is continuously updated in real time, so the data remains current for both real-time and ex-post analyses of brain circulation. In contrast, the synchronic corpus contains 452,618 entries and deliberately excludes timestamp features to facilitate synchronic research.
Version | Update Time | Corpus | Entry Count | Processed Token Count | Token Count | Sentence Count |
---|---|---|---|---|---|---|
V1.0 | 2024-8-29 | Diachronic corpus | 623,072 | 1,134,253,949 | 422,954,074 | 16,914,973 |
V1.0 | 2024-8-29 | Synchronic corpus | 348,508 | 606,015,828 | 158,891,392 | 11,250,558 |
V2.0 | 2024-12-16 | Diachronic corpus | 1,111,644 | 2,087,930,788 | 707,785,647 | 38,900,418 |
V2.0 | 2024-12-16 | Synchronic corpus | 452,618 | 816,732,922 | 328,842,410 | 19,253,646 |
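
As a quick sanity check, the token total quoted in the description can be reproduced from the V2.0 rows of the table. The snippet below is only an illustrative sketch: the dictionary name is made up here, and the figures are copied from the table by hand.

```python
# Processed token counts for V2.0, copied from the version table.
# The variable name is illustrative and not part of the corpus tooling.
V2_PROCESSED_TOKENS = {
    "diachronic": 2_087_930_788,
    "synchronic": 816_732_922,
}

# Sum the two corpora to get the overall processed token count.
total_processed_tokens = sum(V2_PROCESSED_TOKENS.values())
print(f"{total_processed_tokens:,}")  # 2,904,663,710
```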
The corpus captures key attributes relevant to brain circulation, including origin, destination, diachronic narrative text, URL, and timestamp. Geographic entities are mapped to the global country or region level, facilitating the analysis of transnational brain circulation. Each country or region is accompanied by Countrycode, ISO2, and ISO3 identifiers, enabling multidimensional organization of brain circulation data. Furthermore, we distinguish between origin and destination in geographic entities related to circulation flow, which allows brain gain and brain drain to be represented and provides insights into bilateral brain circulation between countries/regions.
Data Label | Data Description | Data Type |
---|---|---|
circulation id | Unique identifier of the circulation behaviour text | int |
content | The narrative text content in the web address | long text |
countrycode | ISO country code | string |
URL | Source link to the narrative text, usually pointing to a web page or domain name | string |
timestamp | Month and year of the transfer behaviour described in the text | date object |
sampling | The collection timestamp of the text data in the source dataset | date object |
iso2code | Country ISO 2 letter code | string |
iso3code | Country ISO 3 letter code | string |
origin | The origin of the circulation behaviour, expressed as geopolitical entity, including country or region | string |
destination | The destination of circulation behaviour, expressed as geopolitical entity, including country or region | string |
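
To make the schema concrete, here is a minimal sketch of how one entry could be represented in Python. It assumes that rows arrive as dictionaries keyed by the data labels above, that `timestamp` is serialized as `YYYY-MM`, and that `sampling` is an ISO date; none of these storage details are specified in this README. The class name, helper names, and sample values are hypothetical.

```python
from dataclasses import dataclass
from datetime import date, datetime


@dataclass(frozen=True)
class CirculationRecord:
    """One corpus entry, with fields following the data label table."""
    circulation_id: int
    content: str
    countrycode: str
    url: str
    timestamp: date    # month and year of the described transfer
    sampling: date     # when the text was collected in the source dataset
    iso2code: str
    iso3code: str
    origin: str
    destination: str


def parse_month(value: str) -> date:
    """Parse 'YYYY-MM' (an assumed format) to the first day of that month."""
    return datetime.strptime(value, "%Y-%m").date()


def record_from_row(row: dict) -> CirculationRecord:
    """Build a record from a dict keyed by the data labels (assumed layout)."""
    return CirculationRecord(
        circulation_id=int(row["circulation id"]),
        content=row["content"],
        countrycode=row["countrycode"],
        url=row["URL"],
        timestamp=parse_month(row["timestamp"]),
        sampling=date.fromisoformat(row["sampling"]),
        iso2code=row["iso2code"],
        iso3code=row["iso3code"],
        origin=row["origin"],
        destination=row["destination"],
    )


# Entirely made-up sample row, for illustration only.
sample_row = {
    "circulation id": "1",
    "content": "An engineer moved from India to Canada to join a research lab.",
    "countrycode": "IN",
    "URL": "https://example.org/story",
    "timestamp": "2021-03",
    "sampling": "2024-08-29",
    "iso2code": "IN",
    "iso3code": "IND",
    "origin": "India",
    "destination": "Canada",
}
record = record_from_row(sample_row)
print(record.origin, "->", record.destination, record.timestamp.year)
```

For the synchronic corpus, which excludes timestamp features, the `timestamp` field would need to be made optional in a sketch like this.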
Some regions covered by the corpus do not possess formal recognition or authority under international law, meaning they lack official ISO codes and CountryCodes. As a result, they are not represented in global standards used for identifying sovereign states.
To delineate the geographic boundaries of such regions, we rely on Polygon-type geospatial data. This approach allows for the precise definition of the spatial extent of these areas, even in the absence of formal sovereignty. The polygon format enables the mapping of complex territorial claims or disputed regions, capturing their exact geographic features.
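As an illustration of how polygon boundaries can be used, the sketch below tests whether a coordinate falls inside a polygon using the standard ray-casting method. It assumes the polygon is stored in the GeoJSON `Polygon` layout (a list of rings of `[lon, lat]` pairs, exterior ring first), which this README does not confirm. The function names and the toy square are hypothetical, and coordinates are treated as planar, so this is not a geodesic test.

```python
def point_in_ring(lon, lat, ring):
    """Ray-casting test: is (lon, lat) inside a closed ring of [lon, lat] pairs?"""
    inside = False
    j = len(ring) - 1
    for i in range(len(ring)):
        xi, yi = ring[i]
        xj, yj = ring[j]
        # Does the edge (j -> i) straddle the horizontal line through the point?
        if (yi > lat) != (yj > lat):
            # Longitude at which the edge crosses that horizontal line.
            x_cross = (xj - xi) * (lat - yi) / (yj - yi) + xi
            if lon < x_cross:
                inside = not inside
        j = i
    return inside


def point_in_polygon(lon, lat, polygon):
    """Inside the exterior ring and outside every hole (GeoJSON-style Polygon)."""
    rings = polygon["coordinates"]
    exterior, holes = rings[0], rings[1:]
    if not point_in_ring(lon, lat, exterior):
        return False
    return not any(point_in_ring(lon, lat, hole) for hole in holes)


# Toy polygon: a 4x4 square with a 1x1 hole. Not a real territory.
toy_region = {
    "type": "Polygon",
    "coordinates": [
        [[0, 0], [4, 0], [4, 4], [0, 4], [0, 0]],
        [[1, 1], [1, 2], [2, 2], [2, 1], [1, 1]],
    ],
}
print(point_in_polygon(3, 3, toy_region))      # True
print(point_in_polygon(1.5, 1.5, toy_region))  # False (inside the hole)
print(point_in_polygon(5, 5, toy_region))      # False
```

For real boundaries, especially ones that cross the antimeridian or need geodesic accuracy, a dedicated geospatial library would be a safer choice than this planar sketch.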
Detailed and structured data related to these regions, as well as fully recognized countries, can be accessed in the Supplementary information. This repository includes comprehensive information about their geographic, political, and other relevant attributes, offering an in-depth look at the regions' boundaries, history, and territorial disputes.
The GBCD corpus spans 223 countries/regions worldwide, encompassing 193 UN member states, one observer state, and 29 non-sovereign island territories. Our national geographic divisions adhere to methods endorsed by the United Nations Statistics Division for international statistical data collection, ensuring consistency and compatibility with global standards.
- Member State of the United Nations: refers to a sovereign country that has been officially admitted to the United Nations (UN) and holds full membership status. Member States enjoy voting rights, participate in all UN activities, and are bound by the principles outlined in the UN Charter.
- Non-Member Observer State of the United Nations: refers to an entity recognized by the United Nations General Assembly that has observer status, granting it certain privileges and participation rights in UN activities, but without full membership or voting rights in the General Assembly.
- Territories and Islands without Internationally Recognized Sovereignty: refer to territories and islands that declare themselves as independent or autonomous but lack widespread recognition as sovereign states under international law or by the global community, including the United Nations.
Leveraging data mining techniques on the GBCD corpus enables researchers to map and characterize the brain circulation patterns of skilled professionals across different countries. Furthermore, researchers can gain a deeper understanding of the complex dynamics underlying brain circulation and make informed decisions about attracting and retaining human capital. This study highlights the potential of data-driven approaches to inform policy and promote more effective brain circulation strategies.
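A minimal sketch of one such analysis is shown below: counting bilateral flows from (origin, destination) pairs and deriving a net gain per country/region. The function names and sample pairs are hypothetical. The counts reflect the number of narrative entries, not verified numbers of people, so the result should be read as an indicator rather than a migration statistic.

```python
from collections import Counter


def bilateral_flows(pairs):
    """Count entries for each (origin, destination) pair."""
    return Counter(pairs)


def net_gain(flows):
    """Inflow minus outflow per country/region, in number of entries."""
    net = {}
    for (origin, destination), n in flows.items():
        net[destination] = net.get(destination, 0) + n  # brain gain
        net[origin] = net.get(origin, 0) - n            # brain drain
    return net


# Made-up (origin, destination) pairs, for illustration only.
sample_pairs = [
    ("India", "Canada"),
    ("India", "Canada"),
    ("Canada", "Germany"),
    ("Germany", "India"),
]
flows = bilateral_flows(sample_pairs)
net = net_gain(flows)
print(flows[("India", "Canada")])  # 2
print(net)
```

Grouping the same pairs by the year of the `timestamp` field before counting would give a diachronic view of how each bilateral flow changes over time.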
The GBCD corpus enables the comprehensive assessment and characterization of global brain circulation, facilitating planning and analysis at the national and geographic levels. To ensure high data quality and extensive geographic coverage, specific names, materials, and map layouts have been employed. It is essential to note that these choices do not imply any endorsement or stance by the authors or their respective countries regarding the legal status of any nation, territory, or region. Additionally, the depiction of borders and boundaries on the maps is purely indicative and does not signify formal recognition or acceptance by the publisher. The maps and database are intended to provide a neutral representation of geographic information, and any interpretation or inference of political boundaries or affiliations is explicitly excluded.
The relevant paper is currently under review, during which time this repository is private. Once it goes public, a BibTeX reference will be provided here.