Methodology for Global AI Talent Tracker 2.0

Since 2020, MacroPolo’s Global AI Talent Tracker has sought to quantify and chronicle the balance and flow of top-tier AI talent across countries. We aim to update our database every few years to assess how various factors may be affecting the distribution of AI talent, arguably the most crucial ingredient in the intensifying competition for leadership in this general-purpose technology.

This comparative study is centered on answering these main questions over time:

Where do top-tier AI researchers come from?
Where do top-tier AI researchers work today?
What are their career paths?

Below we detail our approach and the choices we made in collecting and analyzing the data.

Why focus on top-tier AI researchers?

There is a robust debate over what type of talent is most important to enhancing national or institutional AI capabilities. While some argue that countries should prioritize cultivating a large workforce of relatively lower-skilled AI engineers, others contend that it’s more important to prioritize developing and attracting elite researchers.

While we acknowledge this debate, this study isn’t intended to address it or take a position. Our rationale for focusing on top-tier AI researchers rests on a simple premise: this cohort is the most likely to lead the way on potential breakthroughs and on applying AI to complex real-world problems.

Why NeurIPS?

The Neural Information Processing Systems conference (NeurIPS) is generally recognized as one of the top AI conferences, and perhaps the top one. Although the number of accepted papers has nearly doubled since 2019, the acceptance rate still stood at around 25% in 2022. The research presented at NeurIPS has a specific focus on theoretical advances in neural networks and deep learning, two of the subfields that have driven many of the recent advances in AI.

Given its popularity and selectivity, we use a random but representative sample of papers accepted at the most recent conference for which

Key Stats	NeurIPS 2019	NeurIPS 2022
Papers submitted	6,614	10,411
Papers accepted	1,428	2,671
Paper acceptance rate	21.6%	25.7%

Collecting the author data

Given the high volume of accepted papers (2,671 for the 2022 conference), gathering granular career information on all researchers would be very time-consuming and costly. We therefore selected a random sample of 186 papers with a total of 867 authors, which gives a confidence level of 95% and a margin of error of 7% for estimates made about the entire population of NeurIPS papers (compared to 175 papers by 675 authors in 2019, at the same confidence level).

Sampling at the paper level has two advantages: it makes our sample representative of the quality of papers accepted at the conference, and it allows us to make estimates at both the author and paper levels. For estimates made regarding subpopulations (such as the post-graduation employment of international students in the United States), there is a marginal decrease in the confidence level and an increase in the margin of error.
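The source does not state which formula produced the 7% figure. As a rough check, the sketch below assumes a simple random sample of papers, the most conservative proportion (p = 0.5), a z-score of 1.96 for 95% confidence, and a finite population correction. The function name and parameters are illustrative, not taken from the study.

```python
import math


def margin_of_error(sample_size, population_size, z=1.96, p=0.5):
    """Margin of error for a proportion, with finite population correction.

    Assumptions (not stated in the source): simple random sampling,
    z = 1.96 for a 95% confidence level, and worst-case p = 0.5.
    """
    standard_error = math.sqrt(p * (1 - p) / sample_size)
    # The correction shrinks the error as the sample approaches the population.
    fpc = math.sqrt((population_size - sample_size) / (population_size - 1))
    return z * standard_error * fpc


# Sample sizes and accepted-paper counts quoted in the text.
moe_2022 = margin_of_error(sample_size=186, population_size=2671)
moe_2019 = margin_of_error(sample_size=175, population_size=1428)

# A smaller, hypothetical subgroup of 60 sampled papers, shown only to
# illustrate that the margin widens for subpopulations.
moe_subgroup = margin_of_error(sample_size=60, population_size=2671)

print(f"2022: {moe_2022:.1%}, 2019: {moe_2019:.1%}, subgroup: {moe_subgroup:.1%}")
```

Under these assumptions both years round to a 7% margin of error, and the margin falls to zero when the sample equals the whole population. Because papers rather than authors are sampled, author-level estimates are clustered by paper, so this formula is only an approximation for them.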

For the Oral Presentations at the 2022 conference, there were 190 papers and 955 authors. Given the smaller population, we were able to collect the same career and educational data for all 955 authors, yielding a true population statistic with zero margin of error. We consider this cohort of authors a proxy for the “most elite” AI researchers (approximately the top 2%). (We did the same for the 2019 Oral Presentations, which had an acceptance rate of 0.5%.)

Coding the author data

To ensure an “apples-to-apples” comparison for each update, we remain faithful to the same data collection criteria. For all authors in our sample, we used LinkedIn, personal websites, and other publicly available sources to gather the following information: 1) undergraduate university and country; 2) graduate university and country; 3) current institutional affiliation and country; 4) the country where the headquarters of the author’s current affiliation is located (e.g., a researcher working at Tencent in the United States would have their current country designated as “USA” and their headquarters country designated as “China”); 5) whether the researcher is currently a graduate student; 6) institution type: private sector vs. academia.
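One way to picture the six coded fields is as a single record per author. The sketch below is a hypothetical layout: the class name, field names, and helper method are assumptions for illustration and do not describe the study's actual database.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AuthorRecord:
    """Hypothetical record holding the six coded fields for one author."""
    undergrad_university: Optional[str]   # 1) undergraduate university
    undergrad_country: Optional[str]      #    and country
    grad_university: Optional[str]        # 2) graduate university
    grad_country: Optional[str]           #    and country
    current_affiliation: str              # 3) current institution
    current_country: str                  #    and country of work
    headquarters_country: str             # 4) country of the institution's HQ
    is_graduate_student: bool             # 5) currently a graduate student?
    institution_type: str                 # 6) "private sector" or "academia"

    def works_outside_headquarters_country(self) -> bool:
        """True when the work location differs from the HQ country."""
        return self.current_country != self.headquarters_country


# The Tencent example from the text; education fields are left unknown (None).
record = AuthorRecord(
    undergrad_university=None,
    undergrad_country=None,
    grad_university=None,
    grad_country=None,
    current_affiliation="Tencent",
    current_country="USA",
    headquarters_country="China",
    is_graduate_student=False,
    institution_type="private sector",
)
print(record.works_outside_headquarters_country())
```

Keeping the work country and the headquarters country as separate fields lets the same record support both "where researchers work" and "which country's institutions employ them" tallies.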

Multiple institutional affiliations

If an author lists multiple affiliations on a paper, we use the affiliation associated with the email address listed on that paper.

Institution rankings

For the ranking of top institutions, we use all accepted papers and apply a “fractional count” method to assign credit to institutions. In a fractional count, each paper is given a value of 1, and that value is then divided equally among its authors, with each author’s share credited to that author’s institution.

For example, consider a paper that was co-authored by two researchers: one affiliated with Tsinghua University and the other affiliated with Stanford University. Both Tsinghua and Stanford would be credited with a count of 0.5. We assume the fractional count method for this metric yields a fair representation of the institution’s contribution to the research paper.
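The fractional count can be sketched in a few lines. This is a minimal illustration assuming each paper is represented as a list with one institution per author; the function name and the second sample paper (including "Hypothetical Lab") are invented for the example.

```python
from collections import defaultdict


def fractional_counts(papers):
    """Sum institutional credit where each paper is worth 1 in total.

    `papers` is a list of papers; each paper is a list holding one
    institution name per author (an assumed input format).
    """
    credit = defaultdict(float)
    for author_institutions in papers:
        share = 1.0 / len(author_institutions)  # equal share per author
        for institution in author_institutions:
            credit[institution] += share
    return dict(credit)


papers = [
    # The two-author example from the text.
    ["Tsinghua University", "Stanford University"],
    # A made-up four-author paper, two of whose authors share an institution.
    ["Stanford University", "Stanford University",
     "Tsinghua University", "Hypothetical Lab"],
]

counts = fractional_counts(papers)
for institution, value in sorted(counts.items(), key=lambda item: -item[1]):
    print(f"{institution}: {value:.2f}")
```

The first paper alone gives Tsinghua and Stanford 0.5 each, matching the example above. Across all papers the credits always sum to the number of papers, so an institution cannot gain credit simply by appearing on papers with long author lists.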

Graduate school and country: Master’s vs. PhD

When coding the institution and country affiliations for an author’s graduate school, we use the highest degree they earned or are currently pursuing. For example, if the author received a Master’s from Tsinghua University in China and is pursuing a PhD from Stanford University, we code Stanford University as their graduate institutional affiliation and “USA” as their graduate country affiliation.
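The highest-degree rule can be expressed as a small selection function. The sketch below is illustrative only: the class, the `DEGREE_RANK` table, and the function name are assumptions, and degrees in progress are treated the same as completed ones, as the rule describes.

```python
from dataclasses import dataclass

# Assumed ordering of graduate degrees; a higher number means a higher degree.
DEGREE_RANK = {"Master's": 1, "PhD": 2}


@dataclass
class GraduateDegree:
    level: str                 # "Master's" or "PhD"
    institution: str
    country: str
    in_progress: bool = False  # a degree being pursued counts like an earned one


def graduate_affiliation(degrees):
    """Return (institution, country) of the highest degree, or None if none."""
    if not degrees:
        return None
    top = max(degrees, key=lambda degree: DEGREE_RANK[degree.level])
    return top.institution, top.country


# The example from the text: a Master's from Tsinghua, a PhD under way at Stanford.
degrees = [
    GraduateDegree("Master's", "Tsinghua University", "China"),
    GraduateDegree("PhD", "Stanford University", "USA", in_progress=True),
]
print(graduate_affiliation(degrees))
```

For this author the function returns Stanford University and "USA", matching the coding described above.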

Note on regions

The regional categories encompass the following:

Asia: China (includes Hong Kong), India, Iran, Mongolia, Malaysia, Japan, Pakistan, Philippines, Vietnam, Russia, South Korea, Singapore, Taiwan

Europe: Austria, Belgium, Croatia, Czech Republic, Denmark, Finland, France, Germany, Greece, Ireland, Italy, Netherlands, Poland, Romania, Spain, Sweden, Switzerland

United Kingdom: England, Scotland, Wales, Northern Ireland

Citation-based metrics vs. publication-based metrics

One main methodological divide among metrics assessing AI research capabilities is between metrics based on citations and those based on conference publications. While both methods can bring valuable insights, in this study we have opted for conference publications. We believe that the acceptance metric for a selective conference like NeurIPS strikes a good balance between the quantity and the quality of research papers considered.

Citation counts can provide a measure of quality, but we believe they are more susceptible to irregularities and behaviors that do not necessarily reflect the quality and importance of the research. Examples of this include outsized citation counts for survey papers (“easy cite”); outsized counts for papers on a highly specific topic (“only cite”); gaming of citation counts via “citation cartels”; and cultural biases that affect the visibility of research by different groups.

On balance, studies based on citation counts tend to credit China with a substantially larger share of global AI research, particularly when the threshold for the number of citations required decreases, and the body of papers in the dataset grows very large (above 100,000).

Metrics based on conference acceptances are also subject to some irregularities, including those due to biases held by the paper reviewers. Some of these biases can be muted by a double-blind review process, but such mechanisms remain imperfect. (NeurIPS review tends to be predominantly double-blind, with the exception of being single-blind for senior area chairs and program chairs.) Further bias could stem from the geographic location of the conference, but attending the conference is not a requirement for a paper’s acceptance. In fact, such locational bias may have been reduced at the 2022 conference because it adopted a hybrid in-person and remote model.

While acknowledging the limitations of a conference-based approach, we believe that on balance it captures a large and meaningful sample of the researchers that are driving forward the fields of AI and machine learning and making an impact on private sector companies.

Credits

Product Managers: Ruihan Huang, AJ Cortese, Graham Chamness

Design and Development: Annie Inacker, Yna Mataya, Chris Roche

Research Assistance: Jingxi (Jersey) Yang, Duoji Jiang, Wenhao Li, Joe Killion, Jiawei Xie, Di Lu