Much Ado About Data: How America and China Stack Up

Analysts often cite the amount of data in China as a core advantage of its artificial intelligence (AI) ecosystem compared to the United States. That’s true to a certain extent: 1.4 billion people + deep smartphone penetration + 24/7 online and offline data collection = staggering amount of data.

But the reality is far more complex, because data is not a single-dimensional input into AI, something that China simply has “more” of. The relationship between data and AI prowess is analogous to the relationship between labor and the economy. China may have an abundance of workers, but the quality, structure, and mobility of that labor force are just as important to economic development.

Likewise, data is better understood as a key input with five different dimensions—quantity, depth, quality, diversity, and access—all of which affect what data can do for AI systems.

What follows is a framework for analyzing the comparative advantages of countries and companies across the five dimensions, with the aim of bringing more precision to comparisons of how America and China stack up. This is, however, just one framework, and I welcome critiques and suggestions on how to quantitatively measure each of these dimensions.

Why Does Data Matter To AI Systems?

Before getting to the five dimensions, a detour into data’s role in AI systems is in order.

Advances in AI have given computers superhuman pattern-recognition skills: the ability to wade through oceans of digital data, spotting thousands of hidden patterns or correlations between inputs and outcomes. AI systems then use those correlations to make inferences or predictions, “learning” how to perform a task based on the examples they have seen in the data.

No single correlation can correctly predict an outcome on its own. But increases in computing power now allow AI algorithms to examine correlations across millions or even billions of examples. As more or better data is fed into the system, the accuracy of these predictions can improve dramatically.

That is why data is crucial to machine learning today. It is the fuel that most AI applications run on, from online shopping recommendations and facial recognition to autonomous vehicles and machine translation, and it is what allows them to learn and master a specific task.
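The link between the number of examples and predictive accuracy can be illustrated with a toy experiment. The sketch below is only an illustration under invented assumptions: the two-class data, the midpoint "learning" rule, and every function name are hypothetical and are not drawn from any real AI system.

```python
import random

def sample(rng, n):
    """Draw n labelled examples; label 1 readings centre on 2.0, label 0 on 0.0."""
    return [(rng.gauss(2.0 * (i % 2), 1.0), i % 2) for i in range(n)]

def fit_threshold(train):
    """'Learn' a decision rule: the midpoint between the two class averages.

    Needs at least one example of each label, so train must hold 2+ examples.
    """
    zeros = [x for x, label in train if label == 0]
    ones = [x for x, label in train if label == 1]
    return (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2

def accuracy(threshold, test):
    """Share of test examples where 'x above threshold' matches the label."""
    hits = sum((x > threshold) == (label == 1) for x, label in test)
    return hits / len(test)

def average_accuracy(train_size, trials=100, seed=0):
    """Average test accuracy over many independently drawn training sets."""
    rng = random.Random(seed)
    test = sample(rng, 2000)  # same held-out set for every training size
    scores = [accuracy(fit_threshold(sample(rng, train_size)), test)
              for _ in range(trials)]
    return sum(scores) / trials

for n in (4, 40, 1000):
    print(f"{n:>5} training examples -> accuracy {average_accuracy(n):.3f}")
```

In this toy setup the average accuracy rises as the training set grows and then levels off, because the two classes overlap and no amount of data can separate them perfectly. Real systems are far more complex, but the same pattern of diminishing returns is one reason quantity alone does not settle the comparison.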

Breaking Down the Five Dimensions

The following section provides an overview of the five dimensions (see Table), followed by analysis of each one, and concludes with a brief look at how the balance of capabilities could change over time.

Table. The 5 Dimensions of Data in China and the US

Note: The term “advantage” simply connotes the respective capability in each dimension and is not meant to render a value judgment on how the capability is deployed and to what end. Data can be used for everything from improving cancer diagnoses to expanding a surveillance state.


Many assume the size of China’s population gives it an advantage in the quantity of data, but that assumption is misleading. Chinese tech companies can tap the world’s largest domestic population, but very few of them have succeeded in reaching global users. In contrast, American tech giants make up for their far smaller pool of domestic users by drawing the majority of their users (and data) from global markets.

WeChat and Facebook make for a clear contrast. WeChat has leveraged China’s 800 million internet users to rapidly scale up, but it has weak global penetration, capping out at 1.1 billion users today. Facebook, however, has long outgrown its US home market and now reaches 2.3 billion users globally.

This means that—for now, at least—Chinese tech companies can scale up faster by relying on only domestic users, while US (and European) companies tend to have a higher ceiling for total users given their global reach.


Depth of data refers to different aspects of user behavior captured in digital form. The more an algorithm is trained on different types of user behavior, the more sophisticated its recommendations or predictions can be for that user.

China’s advantage mainly lies in the fact that its leading tech companies have many more windows into a user’s online and offline behaviors. This is because a far larger portion of an urban Chinese citizen’s real-world activities is funneled through smartphones (see ChinAI for an interactive demonstration).

Each of those real-world activities—bikeshare trips, meals ordered, appointments booked—is a small window into user habits, which can be used to more accurately tailor recommendations for that user. While US tech giants often know a lot about their users’ online habits (search history, pages “liked”, etc.), they have more limited insight into users’ real-world activities compared with Chinese counterparts like Tencent, Alibaba, and Meituan.


Quality refers to both the accuracy of the training data and the way that data is structured and stored. The United States has an edge on both because its data tend to be more reliable, and much more of its data have been digitized and stored in easily retrievable formats.

First, on accuracy. When machine learning applications rely on training data, they are subject to a longstanding rule of computer science: “garbage in, garbage out.” If an AI algorithm is fed inaccurate data, it will produce inaccurate outputs.

For example, if the Chinese government wanted an early warning system for “airpocalypse” days, it might train an algorithm using historical data to find correlations between pollution and hundreds of variables. But if the historical data is inaccurate, the algorithm will learn faulty correlations and produce inaccurate predictions. That kind of inaccuracy is common across many public and private sector datasets in China, giving the US an advantage from its (relatively) reliable data.
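A small numerical sketch makes the "garbage in, garbage out" point concrete. Everything in it is hypothetical: the linear relationship between an emissions index and pollution, the assumption that historical records were capped at 150, and all variable names are invented for illustration and do not describe any real dataset.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return mean_y - slope * mean_x, slope

def predict(model, x):
    """Apply a fitted (intercept, slope) pair to a new input."""
    intercept, slope = model
    return intercept + slope * x

# Hypothetical ground truth: pollution = 50 + 3 * emissions_index.
emissions = list(range(100))
true_pollution = [50 + 3 * e for e in emissions]
# Assumed flaw: historical records never report a reading above 150.
recorded_pollution = [min(p, 150) for p in true_pollution]

clean_model = fit_line(emissions, true_pollution)
flawed_model = fit_line(emissions, recorded_pollution)

bad_day = 90  # true pollution on this day is 50 + 3 * 90 = 320
clean_forecast = predict(clean_model, bad_day)
flawed_forecast = predict(flawed_model, bad_day)
print(f"trained on accurate records: {clean_forecast:.0f}")
print(f"trained on capped records:   {flawed_forecast:.0f}")
```

The model trained on the capped records learns a much flatter relationship and badly underestimates the worst days, which are exactly the days an early warning system exists to catch. The learning procedure is identical in both cases; only the quality of the records differs.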

Second, on structure and storage. Data is useful to AI algorithms when it is stored in a computer-readable format and structured consistently. A consistent digital database of medical symptoms and their corresponding diagnoses can be used to train an AI doctor, whereas thousands of handwritten slips of diagnoses cannot.

On this front, American hospitals, companies, and bureaucracies have an enormous head start on their Chinese peers, which have not invested as much in enterprise software or digitizing data. That may change over time, however, as Beijing is investing heavily and incentivizing localities to digitize records and adopt AI-powered analytical tools.


Diversity, or data heterogeneity, matters because an algorithm trained on a wide range of examples can handle more variations of a given task.

America holds a clear advantage in this dimension because of its diverse domestic population and the global user base of many Silicon Valley companies. Users of Google and Facebook represent a far greater range of languages, ethnicities, and nationalities than users of WeChat or Baidu.

In contrast, a facial recognition algorithm trained on one billion Chinese faces will be excellent at identifying another Chinese face, but it may struggle when deployed in Ethiopia or Norway. The same challenge applies to machine translation and speech recognition with different accents.

One potential advantage for China is the economic diversity of users on which it has deep consumer data. While US companies reach users across the globe, they don’t often draw the same depth of data from those populations.

Chinese companies may have limited global reach, but their insights on the consumption habits of an economically diverse population at home run the gamut: from the global elite of Shanghai (comparable to rich Singaporeans) to poor Guizhou farmers (comparable to parts of Indonesia or India). Such rich data on an economically diverse population may give Chinese AI companies crossover potential in other emerging markets.


China holds a distinct advantage in accessing data from public spaces. That data is gathered through the country’s sprawling network of surveillance, security, and traffic cameras—tools that can “datatize” public spaces by identifying and analyzing the movement of each car, bike, bus, and pedestrian.

Chinese city governments have initiated dozens of partnerships with private firms like Alibaba on “smart city” projects, granting them access to these data streams in a bid to optimize everything from Big Brother surveillance to traffic management. Partnerships between China’s leading facial recognition startups and law enforcement are similarly vacuuming up hundreds of millions of face scans, using them to stitch together a national surveillance system and track the country’s Uighur minority.

Source: Alibaba Cloud.

Even with that access, perception often outstrips the reality of Chinese capabilities. Many installed surveillance cameras are not currently equipped with AI technology, and even those that are often cannot effectively store or integrate data into larger systems.

Still, the growing access of Chinese government and private actors to this data marks a major departure from the United States, where some municipalities have proactively banned facial recognition technology over concerns about privacy, personal freedoms, and racial profiling.

Where Things Are Headed

The above assessments represent a snapshot—and a relatively subjective one—of where the two countries stand today. So which of these dimensions might see significant shifts in coming years?

Chinese apps such as TikTok have recently met with major success outside of China, and if that trend continues it will increase the quantity and diversity of users for Chinese companies. Chinese government incentives for applying AI in the public sector are also likely to raise the quality of data through better structuring and storage.

American tech companies are increasing the depth of their data, with Apple pushing mobile payments and smart home technologies like Amazon’s Alexa capturing more offline activities in digital data.

But perhaps bigger than any relative gains across these dimensions would be advances in the field of AI that dramatically reduce the need for large amounts of user-generated training data. Cutting-edge AI systems like DeepMind’s AlphaGo Zero have already demonstrated the power of approaches like reinforcement learning, which generates its own data through simulations.

If those approaches prove widely applicable, they could devalue the relative importance of data while increasing the value of advanced semiconductors or research talent.
