Trend 2: Synthetic data breaks the bottleneck of AI training data
The data bottleneck refers to the limited amount of high-quality data that can be used to train AI, and synthetic data promises to break this bottleneck.
Synthetic data is data generated by machine learning models that use mathematical and statistical principles to imitate real data. A simple metaphor: synthetic data is like a textbook written for AI. Fictional names such as “Xiao Ming” and “Xiao Hong” may appear in the dialogues of an English textbook, but that does not stop students from mastering English. In a sense, for students, a textbook is itself “synthetic data” that has been compiled, screened, and processed.
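The textbook metaphor can be made concrete with a minimal sketch. The function and templates below are hypothetical, not from any real pipeline: they generate English-practice dialogues from templates and fictional names, so no real personal data is involved.

```python
import random

# Hypothetical illustration: fictional names and templated dialogues,
# the same idea as a language textbook's invented characters.
NAMES = ["Xiao Ming", "Xiao Hong", "Alice", "Bob"]
TEMPLATES = [
    "{a}: Where is the library?\n{b}: It is next to the park.",
    "{a}: What time is it?\n{b}: It is three o'clock.",
]

def make_synthetic_dialogues(n: int, seed: int = 0) -> list[str]:
    """Return n synthetic dialogues; a fixed seed makes output reproducible."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        a, b = rng.sample(NAMES, 2)  # two distinct fictional speakers
        out.append(rng.choice(TEMPLATES).format(a=a, b=b))
    return out

if __name__ == "__main__":
    for d in make_synthetic_dialogues(2):
        print(d, end="\n\n")
```

Real synthetic-data generation uses trained generative models rather than fixed templates, but the principle is the same: the statistical shape of real data is preserved while the specifics are invented.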
Some papers have shown that a model must reach at least 62 billion parameters before chain-of-thought ability, that is, step-by-step logical reasoning, can emerge in training. The awkward reality is that humanity has not yet produced nearly enough good, non-duplicated, trainable data to feed models at that scale. By using generative AI such as ChatGPT to produce high-quality synthetic data in unprecedented quantities, future AI systems may achieve higher performance as a result.
Beyond the demand for large volumes of high-quality data, data security is another important driver of demand for synthetic data. In recent years, countries have introduced stricter data protection laws, making it objectively more cumbersome to train AI on human-generated data. Such data may contain personal information, and much of it is also protected by copyright. With no unified standard or mature framework yet in place for Internet privacy and copyright protection, training on Internet data can easily lead to a large number of legal disputes. Desensitizing this data, on the other hand, raises its own challenge: accurately screening and recognizing what must be removed. In this dilemma, synthetic data becomes the cheapest option.
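Why desensitization is hard to do accurately can be seen in a toy sketch. The patterns below are hypothetical and deliberately naive: regexes catch obviously structured identifiers like emails and phone numbers, but miss context-dependent personal data such as names and addresses, which is exactly the screening-accuracy problem described above.

```python
import re

# Hypothetical, naive desensitization pass. Real systems combine
# pattern matching with trained named-entity recognizers.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE = re.compile(r"\b\d{3}[-\s]?\d{3,4}[-\s]?\d{4}\b")

def desensitize(text: str) -> str:
    """Replace obviously structured identifiers with placeholder tags."""
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    return text
```

Note that a sentence like "Jane lives on Baker Street" passes through unchanged: the name and address are personal data, but nothing in their surface form triggers a pattern. Closing that gap reliably is what makes desensitization expensive compared with generating synthetic data in the first place.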
Training on human data can also lead AI to learn harmful content. Some of it includes recipes for making bombs from household items and controlled chemicals, while other parts model bad habits an AI should not have, such as slacking off during tasks the way humans do, lying to please users, and producing bias and discrimination. If AI is trained with as little exposure to harmful content as possible by using synthetic data, it may be able to avoid these drawbacks of training on human data.
As the analysis above shows, synthetic data is quite groundbreaking: it promises to resolve the tension between AI development and data privacy protection. At the same time, however, it will be challenging for China to ensure that companies and institutions produce synthetic data responsibly, and to produce synthetic training sets consistent with the country's culture and values at the same scale and technical level as the West's English-centric, web-based data.
In addition, a major change brought about by synthetic data is that big data from human society may no longer be necessary for AI training. In the digital world of the future, the generation, storage and use of human data will continue to follow the laws and order of human society, including maintaining national data security, keeping commercial data secret and respecting personal data privacy, while synthetic data required for AI training will be managed by a different set of standards.