Synthetic data was once considered less desirable than real data, but some people now see it as a panacea. Real data is messy and riddled with bias, and new data privacy regulations make it hard to collect. By contrast, synthetic data is clean and can be used to build more diverse data sets. You can generate perfectly labeled faces of different ages, shapes, and races, for example, to build a face detection system that works across different groups of people.
But synthetic data has its limitations. If it does not reflect reality, it may end up producing AI that is even worse than AI trained on messy, biased real-world data, or it may simply inherit the same problems. “What I don’t want to do is give a thumbs up to this paradigm and say, ‘Oh, this will solve so many problems,’” said Cathy O’Neil, a data scientist and founder of the algorithm auditing company ORCAA. “Because it also ignores a lot of things.”
Deep learning has always been about data. But in the past few years, the artificial intelligence community has learned that good data is more important than big data. Even a small amount of the right, cleanly labeled data can do more to improve an AI system’s performance than 10 times as much uncurated data, or even a more advanced algorithm.
Ofir Chakon, CEO and co-founder of Datagen, said this changes how companies should approach developing their artificial intelligence models. Today, they start by acquiring as much data as possible and then repeatedly tune their algorithms for better performance. Instead, they should do the opposite: keep the algorithm fixed while improving the composition of their data.
But collecting real-world data to run this kind of iterative experiment is too costly and time-consuming. This is where Datagen comes in. With a synthetic data generator, teams can create and test dozens of new data sets a day to determine which one maximizes a model’s performance.
To ensure the realism of its data, Datagen gives its suppliers detailed instructions on how many people to scan in each age group, BMI range, and race, along with a list of actions for them to perform, such as walking around a room or drinking a soda. The suppliers send back high-fidelity still images and motion-capture data of those actions. Datagen’s algorithms then expand this data into hundreds of thousands of combinations. The synthetic data is sometimes checked again afterward: fake faces are compared with real faces, for example, to see whether they look realistic.
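How a small number of scans can multiply into a very large data set is mostly a matter of combinatorics. The sketch below is not Datagen's algorithm, which the article does not describe. It only illustrates the arithmetic, and every attribute name and value in it is a hypothetical placeholder.

```python
from itertools import product

# Hypothetical attribute lists. A real pipeline would draw these from scan
# and motion-capture data rather than from short hand-written lists.
subjects = ["subject_01", "subject_02", "subject_03"]
actions = ["walk_around_room", "drink_soda"]
lighting = ["daylight", "dim", "fluorescent"]
camera_angles = [0, 45, 90, 135]


def expand_variations(*attribute_lists):
    """Return every combination of one value from each attribute list."""
    return list(product(*attribute_lists))


scenes = expand_variations(subjects, actions, lighting, camera_angles)
print(len(scenes))  # 3 * 2 * 3 * 4 = 72 scene descriptions
```

Each added attribute multiplies the total, so a dozen attributes with three values each would already give more than half a million combinations.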
Datagen is now generating facial expressions to monitor driver alertness in smart cars, body motions to track customers in cashierless stores, and iris and hand motions to improve the eye- and hand-tracking capabilities of VR headsets. The company says its data has already been used to develop computer vision systems serving tens of millions of users.
It is not just synthetic humans that are being manufactured at scale. Click-Ins is a startup that uses synthetic AI to perform automated vehicle inspections. Using design software, it recreates every car make and model its AI needs to recognize, then renders them with different colors, damage, and deformations under different lighting conditions and against different backgrounds. This lets the company update its AI when automakers launch new models, and it helps avoid data privacy violations in countries where license plates are considered private information and so cannot appear in photos used to train AI.
Mostly.ai works with financial, telecommunications, and insurance companies to provide spreadsheets of fake customer data, allowing those companies to share their customer databases with outside vendors in a legally compliant way. Anonymization can reduce a data set’s richness and still fail to fully protect people’s privacy. Synthetic data, by contrast, can be used to generate detailed fake data sets that share the same statistical properties as a company’s real data. It can also be used to simulate data the company does not yet have, such as a more diverse customer population or scenarios like fraudulent activity.
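What it means for fake records to share the statistical properties of real ones can be shown with a deliberately simple model. The sketch below is not Mostly.ai's method. It assumes two numeric columns (a hypothetical age and monthly spend), fits their means, standard deviations, and correlation, and then samples new rows from a Gaussian with those same parameters. Commercial generators use far richer models, and this approach alone offers no privacy guarantee.

```python
import math
import random
import statistics


def fit_gaussian_pair(rows):
    """Estimate means, standard deviations, and correlation of two columns."""
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sx, sy = statistics.pstdev(xs), statistics.pstdev(ys)
    cov = sum((x - mx) * (y - my) for x, y in rows) / len(rows)
    rho = cov / (sx * sy)
    return mx, my, sx, sy, rho


def sample_synthetic(params, n, rng):
    """Draw n fake rows from a bivariate Gaussian with the fitted parameters."""
    mx, my, sx, sy, rho = params
    # Clamp guards against tiny floating-point overshoot when |rho| is near 1.
    residual = math.sqrt(max(0.0, 1.0 - rho * rho))
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = mx + sx * z1
        y = my + sy * (rho * z1 + residual * z2)
        rows.append((x, y))
    return rows


# Hypothetical "real" table: (age, monthly_spend), with spend rising with age.
rng = random.Random(0)
real_rows = []
for _ in range(5000):
    age = rng.gauss(40, 10)
    real_rows.append((age, 20 + 0.8 * age + rng.gauss(0, 5)))

params = fit_gaussian_pair(real_rows)
fake_rows = sample_synthetic(params, 5000, rng)
print(fit_gaussian_pair(fake_rows))  # close to params, but no row is copied
```

No fake row corresponds to a real customer, yet averages and the age-to-spend relationship computed on the fake table come out close to those of the real one.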
Proponents of synthetic data say it can also help evaluate artificial intelligence. In a recent paper published at an AI conference, Suchi Saria, an associate professor of machine learning and healthcare at Johns Hopkins University, and her co-authors demonstrated how data generation techniques could be used to extrapolate different patient populations from a single set of data. This could be useful if, for example, a company only had data from New York City’s younger population but wanted to understand how its AI would perform on an aging population with a higher prevalence of diabetes. She is now starting her own company, Bayesian Health, which will use this technique to help test medical AI systems.
The limits of faking it
But is synthetic data overhyped?
When it comes to privacy, “just because the data is ‘synthetic’ and does not directly correspond to real user data does not mean that it will not encode sensitive information about real people,” said Aaron Roth, a professor of computer and information science at the University of Pennsylvania. Some data generation techniques have been shown to closely reproduce images or text found in their training data, for example, while others are vulnerable to attacks that make them regurgitate that data in full.
This may be fine for a company like Datagen, whose synthetic data is not meant to conceal the identity of the individuals who agreed to be scanned. But it would be bad news for companies that offer their solutions as a way to protect sensitive financial or patient information.
Research suggests that the combination of two synthetic data techniques in particular, differential privacy and generative adversarial networks, can produce the strongest privacy protections, said Bernease Herman, a data scientist at the University of Washington eScience Institute. But skeptics worry that this nuance may get lost in the marketing language of synthetic data vendors, who are not always forthcoming about the techniques they use.
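For readers unfamiliar with differential privacy, the core idea is to add random noise calibrated so that no single person's record noticeably changes what is released. The sketch below shows the textbook Laplace mechanism applied to a simple count. It is not the GAN-based approach mentioned above, and the records, function names, and the epsilon value are hypothetical choices made for illustration.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def private_count(records, predicate, epsilon, rng):
    """Release a count with epsilon-differential privacy (Laplace mechanism).

    Adding or removing one record changes a count by at most 1, so the
    sensitivity is 1 and the noise scale is 1 / epsilon.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1 / epsilon, rng)


rng = random.Random(42)
ages = [23, 35, 41, 67, 72, 29, 58, 80]  # hypothetical patient ages
noisy = private_count(ages, lambda a: a >= 65, epsilon=0.5, rng=rng)
print(noisy)  # the true count is 3; the released value is 3 plus random noise
```

A smaller epsilon means more noise and stronger privacy. Systems that pair differential privacy with generative adversarial networks typically inject calibrated noise during model training instead of into a single statistic, but the trade-off between accuracy and protection is the same.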