Comprehensive mapping of local and diaspora scientists: a database and analysis of 63951 Greek scientists

Research policy and planning for a given country may benefit from reliable data on both its scientific workforce as well as the diaspora of scientists for countries with substantial brain drain. Here we use a systematic approach using Scopus to generate a comprehensive country-level database of all scientists in Greece. Moreover, we expand that database to include also Greek diaspora scientists. The database that we have compiled includes 63951 scientists who have published at least 5 papers indexed in Scopus. Of those, 35116 have an affiliation in Greece. We validate the sensitivity and specificity of the database against different control sets of scientists. We also analyze the scientific disciplines of these scientists according to the Science Metrix classification (174 subfield disciplines) and provide detailed data on each of the 63951 scientists using multiple citation indicators and a composite thereof. These analyses demonstrate differential concentrations in specific subfields for the local versus the diaspora cohorts, as well as an advantage of the diaspora cohort in terms of citation indicators especially among top-impact researchers. The approach that we have taken can be applied to map also the scientific workforce of other countries and nations for evaluation, planning and policy purposes.


ABSTRACT (198 words)
Research policy and planning for a given country may benefit from reliable data on both its scientific workforce as well as the diaspora of scientists for countries with substantial brain drain. Here we use a systematic approach using Scopus to generate a comprehensive country-level database of all scientists in Greece. Moreover, we expand that database to include also Greek diaspora scientists. The database that we have compiled includes 63951 scientists who have published at least 5 papers indexed in Scopus. Of those, 35116 have an affiliation in Greece. We validate the sensitivity and specificity of the database against different control sets of scientists. We also analyze the scientific disciplines of these scientists according to the Science Metrix classification (174 subfield disciplines) and provide detailed data on each of the 63951 scientists using multiple citation indicators and a composite thereof. These analyses demonstrate differential concentrations in specific subfields for the local versus the diaspora cohorts, as well as an advantage of the diaspora cohort in terms of citation indicators especially among top-impact researchers. The approach that we have taken can be applied to map also the scientific workforce of other countries and nations for evaluation, planning and policy purposes.

INTRODUCTION
Construction of scientist databases can be useful tools for evaluation, planning and policy-making related to science. Effort to compile national databases of scientists with performance metrics, in particular citation indices, are sometimes undertaken by research assessment authorities. 1,2 Often these efforts may not be sufficiently inclusive. For example, they may depend on non-systematic efforts where scientists voluntarily contribute information by themselves in order to be included. Moreover, citation metrics are difficult to standardize, especially when they are not calculated according to the same processes for all scientists, and when differences of scientific fields are insufficiently accounted for. Importantly, for many countries, brain drain is a major challenge for their scientific workforce. [3][4][5] In these countries, planning and policy decisions would greatly benefit from mapping not only the disciplines and impact of scientists who still work in the country but also of those who have emigrated elsewhere. First generation emigrating scientists and often even second and higher generation emigrants may often still be interested in engaging with the scientific workforce of their country of origin, thus contributing valuable expertise. Countries with strong diaspora may benefit from the skills of diaspora scientists. Scientific diaspora can be useful for both the mobile scientists 6-8 and the countries involved at both ends as it can constitute a modern tool of scientific diplomacy and cooperation between the two countries. 9,10 Here we demonstrate how a large scale, standardized approach can be used to create an inclusive, comprehensive database of scientists in a specific nation. Moreover, we show how one can expand that database to include also scientists who have migrated to other countries. We focus our efforts on Greece and its national workforce and scientific diaspora. Greece is a country that has sustained over the years a very strong current of brain drain. [11][12][13][14] Moreover, the country has been hit by a major economic crisis which has severely limited funding for research and development. Despite some improvements in recent years, funding remains highly suboptimal. Furthermore, scientists of Greek origin include many extremely influential scientists worldwide and past analyses suggest that there are many high-impact Greek scientists, both in Greece and abroad, who are leaders in their fields. 15 Moreover, such previous work has suggested that the number of Greek scientists with substantial impact is much higher proportionally than the share of Greeks in the global population (10 million in Greece and perhaps another 3 million in the diaspora). 15 Mapping Greek scientists in Greece and worldwide would be a valuable resource. The availability of comprehensive science publications databases such as Scopus and the fact that many first Greek names and the large majority of last Greek names tend to be highly specific for Greek descent allow creating a database of scientists of Greek origin. In this paper, we describe how we have constructed such a database and how we have examined its sensitivity and specificity in validation samples. We also present descriptive data for the entire database and for comparative evaluations of Greek origin scientists who have an affiliation in Greece and for those who have an affiliation in other countries. Our work may offer a template for similar scientist-mapping efforts on other countries.

Eligibility criteria
We aimed to capture all scientists of Greek origin who have at least 5 published papers (articles, reviews, and conference proceedings). Eligible scientists were both those born in Greece and those born elsewhere (second or higher generation), but their family had a Greek origin. Scientists were eligible regardless of whether they had their current main affiliation in Greece or elsewhere. We excluded scientists who had fewer than 2 papers published after 1950.
In order to capture eligible scientists with an affiliation in Greece, we queried Scopus 16 as of January 15, 2020 and identified all the last names that had at least one author ID (with any number of papers assigned) that included an affiliation in Greece. We found 70967 names where at least one author ID has an affiliation address in Greece. One researcher manually screened all of these names to identify those that seemed to be of Greek origin, allowing for inclusion of those who might be probable, to avoid losing potentially eligible names. A second researcher then examined the manual extraction and made amendments. Eventually, 57732 last names were retained.
We also screened manually the files of the top-100,000 most cited scientists based on a composite indicator that had been published previously. 17  In addition, two online sources of common Greek first names were screened manually starting from the most common ones until 86 first names were selected that were thought to be relatively specific for Greek origin people. For example, George is a common name in Greece, but it is not Greek-specific, i.e. the vast majority of people with first name George are not of Greek origin. Conversely, Georgios is highly Greekspecific.
At a next step, we retrieved from Scopus all author ID files with at least 5 papers (articles, reviews, or conference papers) where scientists had either a seemingly Greek-specific last name (any of the 57332 last names mentioned above, or any of the last names of highly-cited Greek scientists according to any of the three previously published lists) or a seemingly Greek-specific first name (any of the 86 mentioned above).
Eventually, a total of 124656 author ID files were retrieved.
These 124656 files were manually screened, perusing information for each scientist including the first name, last name, country of listed affiliation, and institution of listed affiliation that could help identify if the scientist was of Greek origin or not. The availability of all scientists who shared one of the seemingly Greekspecific names along with country information allowed to identify whether any of these names were in fact not Greek-specific. Some last names occur identically both in Greeks and in some other nationality (e.g. Adam or Spinelli). In these cases, information on first name could help classify that individual if the first name was characteristically Greek. If the first name did not help to differentiate in this regard, the country information was used to arbitrate. The site https://forebears.io/ was consulted also in ambiguous cases, since it shows the relative frequency of surnames and names across different countries.
Of 124656 author files, it was concluded that 62837 were very likely Greek. We listed alphabetically by last names the 62837 authors and recorded additional first names that seemed to be Greek-specific. By screening 2,000 names at a time, it was found that relatively few new Greek-specific names were added after screening 8,000 authors and the incremental addition of eligible Greek origin authors would be limited by adding more first names. This process yielded a total of 370 seemingly Greek-specific first names and we then searched Scopus for all additional author IDs with these first names that had not been already captured in the 62837. These additional authors were then manually screened, and 1012 were deemed (based on their name and country information) to be eligible. The resulting database which comprised of 63849 author IDs was subjected to validation checks, as described below. Additions and deletions emerging during these validation checks and a final contribution by the authors of the present study of Greek scientists they knew of, who had not been captured, increased the final count by 102 to a final count of 63951 author IDs.

Validation -sensitivity for capturing scientists of Greek origin who are in Greece
In order to evaluate the sensitivity of the compiled database in capturing scientists who work in Greece, we searched whether it had included scientists working at a university in Greece, the University of Thessaly. Scientists working in different universities and research institutions in Greece are not likely to have systematically different names, so one university is likely to provide a reasonably representative sample. We searched Google Scholar as the reference database since scientists need to enter their names and affiliations by themselves in creating a profile in Google Scholar. The 130 most-cited scientists with profiles and University of Thessaly affiliation in Google Scholar were screened and it was found that all of them (130/130) had been included in our compiled database. Therefore, the sensitivity was 100%, with binomial 95% confidence interval of 97.2%-100%.

Validation -sensitivity for capturing scientists of Greek origin who are not in Greece
In order to evaluate the sensitivity of the compiled database in capturing scientists of Greek origin who do not work in Greece, we used two approaches.
First, we used a sample of scientists who had entered their names in a Linked In database of Greek biomedical scientists created by one of us (K.D.) for the World Hellenic Biomedical Association. We only considered names that had been entered by the scientists themselves, proving that they identified themselves as Greek; and we further limited the search to scientists who gave an address outside of Greece and who had a work title suggesting that they are faculty or other people in senior positions, as opposed to students. Of 42 such individuals, 34 were found to have at least 5 papers in Scopus. Of those 34, 26 were captured in the compiled database, for a sensitivity of 76.5% (95% confidence interval, 58.8% to 89.3%).
Second, we used the names of people listed in the Wikipedia entry on Greek Diaspora. These names are not necessarily of scientists, therefore we examined if each of the names would have been captured through either one of the Greek-specific last names or one of the first Greek-specific names that we had put together in order to compile our database of Greek scientists. For artists and other people who had acquired an artistic/stage name, we used their original name, since change to artistic/stage names would not apply for scientists. We excluded from the screening people born before 1900, as Greek names in the remote past may have been different. Eventually, 28 first-generation and 88 second-or later-generation Greeks were eligible for screening. 14/28 and 35/88 would have been captured by our last or first name searches, corresponding to sensitivity of 50% (95% confidence interval, 30.6% to 69.4%) and 39.8% (95% confidence interval, 29.4% to 50.8%), respectively.

Validation -specificity for capturing scientists of Greek origin
In order to evaluate whether the compiled database might have captured any scientists who were not actually Greek, we randomly selected 100 of the 63849 author IDs. For each of them, we tried to find whether we could find their name written in Greek in the web. Of the 100, their Scopus affiliation was in Greece for 62, in Cyprus for 4, in other countries for 32, and for 2 authors we had no listed affiliation in Scopus. We could find their name written in Greek for all 100 authors. Therefore the specificity was 100% (95% confidence interval 96.4% to 100%).

Evaluation of split author files
Some scientists in Scopus may have their published work split in 2 or more author ID files, and Scopus encourages authors to communicate directly with them to merge such split files. In order to assess how common this pattern might be in the compiled database of Greek authors, after listing the names alphabetically, every 100 th name was selected and assessed whether more than one author ID files may exist for that person in the database. Of 106 screened names, 9 (8.5%, 95% CI, 4.0 to 15.5%) had their work split in 2 files (n=8) or 3 files (n=1).

Data included for each scientist in the database
From each author ID file included in the database, the following information is included based on data directly imported from Scopus on October 1, 2020 (when 7,983,030 author ID files with at least 5 papers [articles, reviews, or conference papers] were available in Scopus) and calculations that are the same to those performed for a recently published list of top-cited scientists: 18  19 and 20. For authors where Scopus listed an affiliation but not a country, we tried to identify the country whenever it would be unambiguous based on the provided affiliation.

Main descriptive characteristics
Of the 63951 author ID files included in the final database, country of affiliation was available for As shown in Table 1, scientists with affiliation in Greece had an almost similar number of papers as scientists with affiliation outside of Greece, but they had substantially fewer citations, fewer papers that cited their work, and were placed on average in lower ranks compared with scientists with affiliation outside of Greece. Results were qualitatively similar regardless of whether self-citations were counted or excluded (Table   1) (Table 3). Almost all of them (30/32, 94%) were listed by Scopus with an affiliation outside of Greece. Of the 32 scientists, information on place of birth could be found on 28 (except for Terzopoulos, Stamatakis, Pavlou, Argyropoulos); three were born in Cyprus (Nicolaides, Nicolaou, Kalogirou), three were born in the USA (Ioannidis, Joannopoulos, Alivisatos), one was born in the UK (Lyketsos), and the remaining 21 had been born in Greece. Of the 32, 18 had received their first degree from an institution in Greece.

DISCUSSION
We have created and validated a database of scientists of Greek origin that may be helpful for evaluation, planning, and research policy purposes. It may also serve as a template for similar efforts to be undertaken for other countries to map their scientific workforce. The iterative approach that we followed may also have special added value for countries that have sustained heavy brain drain and/or who have a substantial scientific diaspora.
The database includes close to 64000 author ID files representing scientists who have published at least 5 papers. Given that some scientists have their publications split in more than one file, the database probably includes close to 60000 unique scientists. Validation exercises suggest that it probably misses very few scientists who meet the productivity eligibility criteria and who have an affiliation in a Greek institution.
Conversely, a more substantial proportion has been missed among those who have an affiliation in an institution outside of Greece. The estimate of the missingness in this regard varies according to different validation sets that we used. Based on scientists working abroad who on their own initiative offered to be included in a database of Greek scientists, about one in 4 scientists were missed with our approach. The percentage of missingness was higher based on a Wikipedia list of diasporeans, even more when extending beyond first generation emigrants. It is unavoidable that our approach would miss Greeks that acquire non-Greek names (upon 2 nd and subsequent generations and for people who change their names (e.g. through marriage, by making the name less foreign-sounding in their new country, or other reasons) and those who have a Greek last name that was not among the ones we searched for. Some of these individuals may still be captured if they possess a highly Greek-specific first name among the list of first names that we screened for.
Therefore, even though scientists with an affiliation in Greece were a slight majority in the compiled database, Greek origin scientists with an affiliation outside of Greece would probably be the majority if all Greek origin scientists could have been retrieved. The total of Greek origin scientists meeting the productivity eligibility criteria may be in the range of 80,000-100,000 (~1.00-1.25% of global total). Conversely, a few scientists included in the database by failed disambiguation. The validation process suggests that this situation is probably very uncommon.
The database reflects the large extent of general emigration from Greece as well as the massive brain drain that the country has sustained over the years, with accelerated rates in the last decade in conjunction with the economic crisis that hit Greece worse than almost any other highly developed country. We noted that the Of interest, the large majority of these extremely highly-cited scientists were born in Greece, and the majority also received their first degree in Greece. This further demonstrates the power of the brain drain process. At the same time, scientists who have remained in Greece still include large numbers placed at the top-20% of their subfield. Thus, the local scientific workforce still has considerable capacity for excellence.
The database includes scientists scattered across almost every scientific subfield. Scientists with an affiliation in Greece have stronger concentrations than those with affiliations outside of Greece in many fields of clinical medicine, several fields of biology, and agriculture/fisheries/forestry. Greece has one of the highest rates of physicians per population in the world, if not the highest. 21 Many of them are engaged in research, authoring or co-authoring papers, since scientific publications are requested and appraised not only for academic track positions, but even for regular clinical positions in the national health system. The advantage is that these incentives create a large pool of physicians with exposure to research. The disadvantage is that much of this research may not be of high quality and these authors have no lasting commitment to research. The concentration in subfields of agriculture, fisheries, and biology is explained probably by the nature of the economy, although agriculture and related fields have shrunk in latest years. Conversely, there are several other fields where most scientists of Greek origin do not work in Greece. This pattern is particularly strong in the social and economic sciences and some cutting-edge biomedical sciences, such as developmental biology.
Some limitations need to be discussed. First, as we have already acknowledged, the database is still missing several Greek origin scientists, in particular among those living and working abroad. We encourage providing relevant information at www.drosatos.com/greekscientists to bring such cases into our attention.
While it is impossible to update the database adding one more scientist at a time, collecting information on missing individuals may allow us to consider further optimized automated processes in the future.
Second, the constructed database was restricted to scientists with at least 5 full papers. In the entire Scopus database, roughly four-fifths of author ID files have fewer than 5 papers. Some of the author ID files with sparse papers may be split-off fragments of the publication corpus of authors represented by some larger file. Nevertheless, by extrapolation, the total number of Greek authors who have published at least one paper may be in the range of 250,000-500,000. The overwhelming majority of authors of 1-4 papers are not major contributors or leaders in the scientific enterprise. However, many young scientists in this group may become major contributors or leaders in the future. Therefore, follow-up updates would be useful to perform.
Third, errors (either splitting the same author in two or more author ID files or including some papers by two or more authors in the same file) and inaccuracies in affiliations are possible. Authors who recognize errors should contact directly Scopus to make these corrections in Scopus per se, so that they may be carried over in our database, with any potential future updates. The entire Scopus database currently has overall 98.1% precision (proportion of papers in an author ID file that belong to the author) and 94.4% recall (proportion of papers of an author included in the largest profile). Precision and recall may be even better for Greek-name authors, because Greek names are more rare and thus more specific than those of most other origins (e.g. disambiguation challenges for "Liu Wang" are greater than for "Yiannis Triantafyllou"). We found 8.5% of the authors in the database to have a split profile and, given that even when one profile carries the large majority of the author's papers, recall probably substantially exceed 94.4% for our database. Finally, all citation metrics have limitations and their use should be done with caution and not as absolute indicators. [22][23][24] We made no effort to assess the quality of the published works. Some authors may rank high, but may have other reasons for concern, e.g. retracted papers, or implausibly high self-citation metrics or evidence for citation farms. These need to be carefully scrutinized on a case-by-case basis.
Acknowledging these caveats, the compiled database offers a tool that may be useful for both research and policy purposes. For a country that is trying to recover from a lengthy economic crisis and a superimposed crisis from the recent COVID-19 pandemic, realization of its scientific potential, deceleration and reversal of the brain drain and informed decision-making in the interface of science and society may offer substantial added value. Brain drain and diaspora does not need to have negative consequences for the home country; mapping of the scientific workforce and diaspora may help to maximize positive impact. 9,10,25 We also hope that the iterative approach used here may be applied also to map the scientific workforce and scientific diaspora of other countries/nations as well. Scopus data can readily identify scientists with affiliation in a given country. In the case of Greece, where few scientists immigrate to from other countries, almost all scientists with affiliation in Greece have Greek names. This would not be true for countries that attract many scientists from other countries, but usually it is more important to map the entire scientific workforce rather than just native scientists. The ability to map the diaspora of different countries depends on whether there are many first and last names that are country-specific. Specificity may vary substantially across countries and careful validation and cross-checking procedures should be applied accordingly.

Conflicts of interest: none
Jeroen Baas is an employee of Elsevier that runs Scopus