Economic Observer
2026-04-20 16:02

By Liu Jin, Duan Lei, and Wu Wenxuan
Modern mainstream AI is mostly built on machine learning and deep learning models, whose mechanism is to "learn" patterns and regularities from data. Without data, models cannot be trained, and the intelligence of such AI is out of the question. Data is therefore often likened to the fuel or lifeblood of AI.
In the era of large models, the pre-training paradigm based on self-supervised learning has greatly reduced the dependence on manual annotation, enabling models to learn from large-scale data at low cost and high efficiency. This has driven the rapid, coordinated development of data, model parameters, and computing power.
From this, researchers distilled the famous Scaling Law: the performance of large language models follows a smooth power-law relationship with model parameter count, training data volume, and computation. Simply put, the larger the model, the more data, and the greater the computing power, the better the model performs.
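The power-law shape of this relationship can be sketched in a few lines of Python. The constants `n_c` and `alpha` below are illustrative placeholders in the spirit of published scaling-law fits, not authoritative values for any real model.

```python
def scaling_loss(n_params: float, n_c: float = 8.8e13, alpha: float = 0.076) -> float:
    """Toy power-law loss curve: predicted loss falls smoothly as parameters grow.

    n_c and alpha are illustrative placeholder constants, not fitted
    values for any real model.
    """
    return (n_c / n_params) ** alpha

# Scaling the model up keeps lowering the predicted loss, but with
# diminishing returns -- the hallmark of a power law.
assert scaling_loss(1e10) < scaling_loss(1e9)
```

The same functional form, with different constants, is conventionally used for the data and compute axes as well.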
But the next stage of AI development faces enormous data challenges, the most discussed of which is "data exhaustion".
The reason is not hard to understand: to a large extent, AI training draws on the "stock" of data accumulated by humanity. Internet data, which accounts for a large share of pre-training data, is the information that humans have produced, digitized, and deposited online over the past few decades. Wikipedia, for example, makes up only a small share of the data but provides high-quality material for training large models; it is the result of more than 20 years of work and maintenance by thousands of contributors. Some of the books and classic literature in training corpora represent thousands of years of human accumulation.
Although human society adds a large amount of data every year (news, new books, new papers, and so on), this high-quality data grows roughly linearly and struggles to keep pace with the super-linear growth that AI development currently expects. According to recent estimates by the independent research institute Epoch AI, language model training will exhaust the publicly available stock of human text data between 2026 and 2032.
AI development faces data challenges along two dimensions: one is whether there is enough data, that is, its quantity and coverage; the other is data quality, including authenticity, annotation level, and degree of structure.
Every stage and scenario of AI development and application faces challenges along these two dimensions: the pre-training stage faces the data exhaustion and internet data quality problems mentioned above; the post-training and alignment stages face a shortage of high-quality annotated data; industry fine-tuning and application of foundation models face professional data that is extremely scarce and noisy; multimodal model training faces a shortage of high-quality paired data (such as image-text pairs); and embodied models face the constraint that real-world data is extremely costly to collect.
How can AI development address these data challenges? There are broadly three directions: deeper exploration and governance of the data accumulated by human society and the knowledge in human minds; relying on machine intelligence to mine and generate data; and innovating in algorithms and model paradigms to reduce dependence on data. Here we focus on the first two.
Data augmentation, approach one: collect and organize scattered data
Where humanity's accumulated data is concerned, what is called "data exhaustion" mainly reflects that the low-hanging fruit has nearly been picked: publicly available, unprotected text data is indeed being rapidly consumed by large models, but a vast amount of undeveloped data and knowledge remains in human society and in human minds.
First, there is a massive amount of non-public data across industries. Much high-value data is held in platforms, enterprises, professional institutions, equipment, and workflow systems: transactions, reviews, and user profiles on e-commerce platforms; medical records, imaging, and diagnostic notes in healthcare; process parameters, quality inspection standards, and fault records in manufacturing; and experimental data, process data, and unpublished negative results in scientific research.
This data often involves privacy, property rights, trade secrets, or regulatory compliance, and exists as private, scattered "data islands". Through methods such as RAG (Retrieval-Augmented Generation), it can deliver local value in specific applications, but it is difficult to aggregate into a large-scale training corpus that can sustainably enhance general intelligence.
Most of the scenarios above are easy to understand, but here is an easily overlooked example: the scientific community has long suffered from "publication bias", in which only successful experiments are published while failed ones are discarded. For AI, however, failure cases are as instructive as successes, and the large number of unshared failed experiments constitutes an untapped mine of knowledge.
AI researchers are already exploring technical means to unlock the training potential of this data. Typical practices include federated learning, which enables joint training without moving the raw data, and techniques such as differential privacy, which mathematically guarantee that individual records cannot be reconstructed, providing a safe boundary for cross-institutional data collaboration. These approaches address how data can participate in training without compromising privacy.
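The core intuition of differential privacy can be sketched with the classic Laplace mechanism: add calibrated noise to an aggregate statistic so that no single record can be recovered, while the aggregate stays useful. This is a minimal illustration; the epsilon value and sensitivity below are assumptions chosen for the example, not recommendations.

```python
import random

def laplace_noise(scale: float) -> float:
    """Sample Laplace(0, scale): exponential magnitude with a random sign."""
    magnitude = random.expovariate(1.0 / scale)
    return magnitude if random.random() < 0.5 else -magnitude

def private_count(true_count: int, epsilon: float, sensitivity: float = 1.0) -> float:
    """Release a count with Laplace noise scaled to sensitivity / epsilon.

    Smaller epsilon -> more noise -> stronger privacy; a count query has
    sensitivity 1 because one record changes it by at most 1.
    """
    return true_count + laplace_noise(sensitivity / epsilon)

random.seed(0)
released = private_count(1000, epsilon=1.0)  # noisy, privacy-preserving release
```

In a cross-institutional setting, each party would release only such noised aggregates (or noised model updates, in federated learning) rather than raw records.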
But to maximize the value of this data for AI development, institutional and mechanism design is needed alongside technology.
Two paths are worth exploring. One is bottom-up, market-oriented, and interest-driven: data trading markets, data trusts, and recognizing data elements as balance-sheet assets, which incentivize data owners to open up their data and share in the value created, under compliance constraints. The other is top-down: in areas involving the national economy and people's livelihood, public safety, and basic scientific research, government or industry regulators make unified arrangements, building common standards, basic platforms, and public datasets to accelerate the transformation of data from "fragmented resources" into "public infrastructure". Technical means provide the safety valve; mechanism design provides liquidity and sustainable incentives. Both are indispensable.
Second, many cognitive assets in human minds have not yet been digitized. Two types are especially critical to the upper limit of AI's capabilities: the trajectories of thought behind complex decisions, and the tacit knowledge of experts. Unless these cognitions are digitized, AI will find them difficult to learn and replicate, leaving great room for future exploration.
Consider thought trajectories. Many high-value tasks, such as an entrepreneur's major decisions, a doctor's diagnosis of difficult cases, or an engineer's handling of rare faults, are usually recorded only as "what was done" and "what the result was", lacking detailed trajectory data such as "the reasoning behind the choice and the alternatives that were considered". It is like storing only the question and the answer of a math problem without the intermediate steps of the solution.
For AI, without such "chain of thought" data it is hard to truly learn transferable reasoning; the model can only fit patterns over large numbers of input-output pairs. This is also why models that added "chain of thought" capabilities over the past year have often improved markedly, yet high-quality thought-trajectory data remains very scarce.
Consider tacit knowledge. Much of human cognition is hard to articulate: the intuition of senior experts, situational awareness, embodied "muscle memory", and the unspoken rules of team collaboration. In the context of AI, such tacit knowledge is difficult to annotate fully into training samples, and therefore difficult for AI to exploit.
Systematically converting thought trajectories and tacit knowledge into data is costly and difficult, but it is a gold mine of high information density and uniqueness, and is likely to become one of the key sources of continued improvement in AI capabilities.
Third, governing and improving the quality of humanity's accumulated knowledge is equally crucial. In AI training it is often said that "garbage in, garbage out": data quality largely determines model capability, because models cannot automatically judge authenticity and importance and will readily learn erroneous patterns from low-quality data.
The quality of information on the internet is mixed, full of erroneous, false, outdated, one-sided, and repetitive content. Used directly for training, it amplifies hallucination and bias in model outputs. In the AI era, excessive and even malicious GEO (Generative Engine Optimization) aimed at winning "the right to be cited and sampled by models" has opened new channels for knowledge pollution.
A full stack of work can therefore be organized around improving the quality of data and knowledge themselves. At the bottom layer are routine data cleaning, deduplication, error correction, and noise filtering. One level up is establishing traceability and version-control mechanisms for important knowledge, clarifying sources, update times, and responsible parties, and unifying concepts and structured relationships through knowledge graphs and similar methods. In high-value professional fields, "small but precise" high-confidence datasets should be built through fine-grained annotation engineering with the participation of domain experts, serving as benchmarks for model calibration and evaluation.
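The bottom layer of this pipeline can be sketched in a few lines. The normalization rule and the minimum-length threshold below are invented for illustration; production pipelines use far richer heuristics (near-duplicate hashing, language ID, quality classifiers).

```python
import re

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so near-identical lines match."""
    return re.sub(r"\s+", " ", text.strip().lower())

def clean_corpus(docs: list[str], min_words: int = 3) -> list[str]:
    """Toy cleaning pass: filter fragments, drop exact duplicates."""
    seen: set[str] = set()
    kept: list[str] = []
    for doc in docs:
        norm = normalize(doc)
        if len(norm.split()) < min_words:   # noise filter: too short
            continue
        if norm in seen:                    # deduplication
            continue
        seen.add(norm)
        kept.append(doc)
    return kept

corpus = [
    "Data quality largely determines model ability.",
    "data quality   largely determines model ability.",  # duplicate
    "ok",                                                # fragment
]
assert clean_corpus(corpus) == ["Data quality largely determines model ability."]
```

The higher layers (traceability, knowledge graphs, expert-built benchmark sets) are organizational rather than algorithmic and do not reduce to a snippet like this.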
Only after human knowledge has gone through such a round of "AI-oriented governance and purification" can subsequent model training and reasoning truly stand on a solid, clean knowledge foundation rather than on a sediment of mixed information.
Data augmentation, approach two: harness machine intelligence
Besides thoroughly exploring the data accumulated by human society and the cognition in human minds, the other approach is to use AI systems themselves to mine and generate data.
First, synthetic data. Synthetic data can be generated in several ways: from rules and templates, from statistical distributions, from machine learning models, and from simulation environments. We focus on the latter two, which play the larger role in current AI training.
Why can data generated by large models be used to train new large models? The easiest case to understand is training a student model on the high-quality outputs of a teacher model, known as "knowledge distillation". Even for training frontier models, synthetic data based on previous-generation models can still play an important role in some cases.
For example, ask a model to answer the same math problem 100 times and keep only the 20 correct answers as training data for a new model; this essentially amplifies effective samples from the model's own high-quality subset. On one hand, automated generation and filtering expand the scarce record of high-quality human problem-solving into more diverse, logically correct solution trajectories; on the other, synthetic data can deliberately "oversample" harder, sparsely distributed problem types to fill the weak spots in real data.
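The "answer many times, keep only the verified answers" idea is a form of rejection sampling, and can be sketched as follows. The `noisy_solver` here is a stand-in for a model that is usually right but sometimes wrong; everything about it is invented for illustration.

```python
import random

def noisy_solver(x: int, y: int, rng: random.Random) -> int:
    """Stand-in for a model: usually returns x + y, sometimes off by one."""
    answer = x + y
    return answer if rng.random() < 0.7 else answer + rng.choice([-1, 1])

def sample_verified(x: int, y: int, n: int, rng: random.Random) -> list[int]:
    """Generate n candidates, keep only those an external verifier accepts."""
    candidates = [noisy_solver(x, y, rng) for _ in range(n)]
    return [c for c in candidates if c == x + y]   # verifier = ground truth

rng = random.Random(42)
kept = sample_verified(3, 4, n=100, rng=rng)
assert all(c == 7 for c in kept)   # survivors are all correct training samples
```

The key point, elaborated below, is that the filter depends on an external verifier (the standard answer); the model alone cannot certify its own outputs.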
Another example is autonomous-driving training that uses synthetic data to generate extremely rare accident scenarios. Collecting long-tail accidents from real road tests is inefficient, but from real data we can extract driving elements such as scene type (intersections, highways, city streets, parking lots), weather (sunny, rainy, foggy, snowy, icy), road surface (dry, slippery, icy, gravel), and time of day (day, night, dusk), and combine these elements into extreme scenarios for training in a simulation environment.
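The combinatorial leverage of this element-recombination approach is easy to see in code: a handful of extracted elements yields hundreds of scenarios, including long-tail combinations rarely seen on real roads. The element lists mirror the ones named in the text.

```python
import itertools

scenes   = ["intersection", "highway", "city street", "parking lot"]
weather  = ["sunny", "rainy", "foggy", "snowy", "icy"]
surfaces = ["dry", "slippery", "icy", "gravel"]
times    = ["day", "night", "dusk"]

# Cross all extracted driving elements to enumerate simulator scenarios.
scenarios = list(itertools.product(scenes, weather, surfaces, times))
assert len(scenarios) == 4 * 5 * 4 * 3   # 240 combinations from 16 elements

# A long-tail case that is rare in real road tests but trivial to enumerate:
assert ("highway", "foggy", "icy", "night") in scenarios
```

In practice each tuple would parameterize a simulation run rather than being training data by itself.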
But these two examples also show that with such synthetic data, AI cannot create entirely new knowledge out of thin air.
The first example relies on an external verifier (the standard answer) to draw training samples from the upper end of the model's ability distribution rather than its mean, optimizing the data distribution. The second is the recombination and amplification of known elements. Strictly speaking, AI here has not truly expanded the boundary of data; rather, it treats the raw data contributed by human society as ore, refining, blending, and processing it into a "data alloy" better suited for training, extracting more value within the boundary of existing knowledge.
Second, AI can use reinforcement learning to expand data (which can also be seen as synthetic data in a broad sense). Unlike synthesis based on human samples, this approach genuinely goes beyond existing human data: the model actively generates new trajectory data through continuous interaction with an environment, exploring regions of the strategy space that have not yet been charted. The core of reinforcement learning is the "state-action-feedback" loop, in which an agent learns high-reward strategies through trial and error, and every behavior sequence is itself newly generated data.
The classic example is AlphaZero. In deterministic board games such as Go and chess, it required almost no human game records, relying only on the rules, random initialization, and self-play. Through billions of self-play positions and outcome feedback, it continuously updated its policy and value networks, surpassing all human players and traditional chess engines. This shows that in a closed environment with clear rules and feedback, AI can start from scratch and, through self-generated data, approach or even break through the ceiling of human experience.
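The "state-action-feedback" loop can be illustrated far more modestly than AlphaZero, with tabular Q-learning on a toy environment. The five-cell corridor below is invented purely for illustration; the point is that every trial-and-error episode generates a new trajectory, and the agent needs no human demonstrations at all.

```python
import random

N_STATES = 5          # corridor cells 0..4; the reward sits at cell 4
ACTIONS = (-1, +1)    # step left or step right

def step(state: int, action: int) -> tuple[int, float, bool]:
    """Environment dynamics: move, clamp to the corridor, reward at the end."""
    nxt = max(0, min(N_STATES - 1, state + action))
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done

def train(episodes: int = 500, alpha: float = 0.5, gamma: float = 0.9,
          eps: float = 0.2, seed: int = 0) -> dict:
    """Epsilon-greedy Q-learning: learn action values purely by trial and error."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            if rng.random() < eps:                       # explore
                a = rng.choice(ACTIONS)
            else:                                        # exploit current estimate
                a = max(ACTIONS, key=lambda x: q[(s, x)])
            nxt, r, done = step(s, a)
            best_next = max(q[(nxt, b)] for b in ACTIONS)
            q[(s, a)] += alpha * (r + gamma * best_next - q[(s, a)])
            s = nxt
    return q

q = train()
# After training, stepping right (toward the reward) is valued more highly
# in the cells near the goal, learned entirely from self-generated episodes.
```

AlphaZero replaces the lookup table with neural networks and the corridor with Go or chess, but the loop of acting, receiving feedback, and updating is the same.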
An important development on open-ended tasks is the "chain-of-thought reinforcement learning" reasoning model, represented by DeepSeek-R1. The idea is to let the model freely generate chains of thought on tasks whose correctness can be verified automatically, such as mathematics and programming, and then reward or penalize it according to whether the final answer is correct and the chain of thought is sound, driving the model to continuously adjust its reasoning strategy.
Unlike traditional process supervision that relies on manual annotation, this approach does not prepare a large human chain-of-thought dataset in advance. Instead, the model continuously generates and filters reasoning trajectories during training, in effect building a new data factory that automatically produces high-quality thought trajectories.
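What makes this loop run without human annotation is that the reward itself is computable. A minimal sketch of such a verifiable reward is below; the `<answer>` tag convention and the bonus weights are invented for illustration, and DeepSeek-R1's actual reward design differs in its details.

```python
def verifiable_reward(completion: str, expected: str) -> float:
    """Score a completion on a task with an automatically checkable answer.

    Assumes (for illustration) that the model wraps its final answer in
    <answer>...</answer>: a small bonus for following the format, the
    bulk of the reward for a correct answer. No human labeling needed.
    """
    reward = 0.0
    start, end = completion.find("<answer>"), completion.find("</answer>")
    if start != -1 and end > start:
        reward += 0.1                                     # format bonus
        answer = completion[start + len("<answer>"):end].strip()
        if answer == expected.strip():
            reward += 1.0                                 # correctness reward
    return reward

good = "2 + 2: pair the ones into twos... <answer>4</answer>"
bad  = "2 + 2 is probably <answer>5</answer>"
# good earns format bonus plus correctness; bad earns only the format bonus.
```

Because the check is mechanical, millions of generated trajectories can be scored and filtered automatically, which is precisely what turns the training loop into a data factory.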
Embodied intelligence offers even more room for imagination. Simulation environments are already widely used in autonomous driving and robot training: through large-scale simulated driving, grasping, and assembly, reinforcement learning and related methods generate interaction data far exceeding what real roads and factories can provide, covering long-tail risk scenarios and rare operating conditions. In the real world, robots undergoing long-term embodied training will likewise continuously produce sensor readings, action sequences, and task feedback, all of which are high-value new data for future use.
Third, another direction is to develop active learning in AI. Instead of passively waiting for humans to feed it data, the core idea of active learning is that the model decides what to learn and what to ask whom.
In scenarios where annotation is expensive, the model can select the most valuable samples to send to humans for labeling, based on its current uncertainty or the potential information gain, or focus its exploration in a simulation environment on the states and tasks that reduce uncertainty the most. Under the same annotation budget, the model thus obtains a small set of samples with the highest information density, rather than a thin, even layer of supervision spread across all samples.
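Uncertainty-based selection, the simplest form of this idea, can be sketched as follows: given the model's predicted class probabilities for an unlabeled pool, request human labels only for the examples it is least sure about. The pool and probabilities are invented for illustration.

```python
import math

def entropy(probs: list[float]) -> float:
    """Shannon entropy (in bits) of a predicted class distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def select_for_labeling(pool: dict, budget: int) -> list[str]:
    """Pick the `budget` most uncertain samples from the unlabeled pool."""
    ranked = sorted(pool, key=lambda k: entropy(pool[k]), reverse=True)
    return ranked[:budget]

pool = {
    "easy_1": [0.98, 0.02],   # model is confident: low value to label
    "hard_1": [0.51, 0.49],   # near the decision boundary: label this
    "easy_2": [0.95, 0.05],
    "hard_2": [0.60, 0.40],
}
assert select_for_labeling(pool, budget=2) == ["hard_1", "hard_2"]
```

With a budget of two labels, the entire budget goes to the two boundary cases, which is exactly the "highest information density" allocation described above.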
Taking a longer view, the combination of active learning, reinforcement learning, and embodied intelligence is expected to transform AI from a passive consumer of existing data into a learner that actively plans its learning path and actively creates key data (which is, in fact, also a way of probing human cognitive processes).
In the AI era, the data field holds huge opportunities
The next stage of AI development largely depends on who handles data well, for at least two reasons. First, as discussed above, data has hit a new ceiling in both scale and quality; solutions that ease these bottlenecks and improve the effective supply of data correspond directly to enormous economic value. Especially as frontier model capabilities converge, the focus of AI competition is likely to shift towards "who holds cleaner, rarer, harder-to-replicate data".
Second, among the three elements of AI, the entry barriers for computing power and base models are extremely high: computing power calls to mind chipmakers such as Nvidia, AMD, and Cambricon; models call to mind leading laboratories and platforms such as OpenAI and DeepSeek. Data, by contrast, is more like an ecosystem that can accommodate many participants: it is highly dispersed across vertical industries and scenarios.
This means that leading enterprises in different industries, small and medium-sized companies with unique data insights, and even startup teams all have the chance to build their own moats in the AI era through high-quality data assets, products, and services, without competing directly in computing power or general-purpose large models.
Beyond the opportunities for companies, governments also have a key role to play. We distinguished above between top-down and bottom-up approaches to data governance. In areas suited to the top-down approach, government should quickly establish sharing platforms and institutional frameworks so that this data better serves AI training and public services; in areas suited to market mechanisms, it should leave room for innovation and avoid excessive concentration or one-size-fits-all regulation.
Roughly speaking, data involving national security, public interest, and basic services, such as meteorological data, geographic information (e.g., surveying and mapping results), basic population information, macroeconomic statistics, and social security data, is better led by government to ensure order and availability. Because of strong externalities and the difficulty of internalizing all risks within a single entity, "livelihood data" such as healthcare and transportation also requires strong top-down mechanisms, including unified standards, public data infrastructure, rules for cross-departmental data sharing, and strict privacy and security boundaries.
By contrast, fields closer to commercial competition, such as e-commerce behavioral data, consumer finance data, and enterprises' internal operational data, should rely more on the market to discover data value and optimize allocation, with government confining itself to regulation rather than replacing the market.
For China, large language model training depends heavily on internet data, but owing to factors such as the later start of internet development, the Chinese-language internet as a whole lags far behind the English-language internet in both scale and quality of data (fortunately, most internet data is public, so English data can be used as well).
However, China has potential structural advantages in other types of data: a large population and market generate rich consumption and scenario data; a complete industrial system and manufacturing chain accumulate vast industrial and IoT data; and relatively advanced smart cities and digital government infrastructure produce rich data on urban operations and government affairs.
If China can improve data regulations, clarify property rights and revenue distribution, build high-quality public data platforms, and encourage industry players to create high-quality data products around concrete scenarios, data may well become an important fulcrum for advancing domestic AI development and gaining competitive advantage.
(Liu Jin is a director and distinguished expert at the Greater Bay Area Artificial Intelligence Application Research Institute, and professor of accounting and finance and director of the Investment Research Center at Cheung Kong Graduate School of Business; Duan Lei is research director of the Greater Bay Area Artificial Intelligence Application Research Institute; Wu Wenxuan is an assistant researcher at the Greater Bay Area Artificial Intelligence Application Research Institute)
