Synthetic data

Synthetic data is defined as “production data applicable to a given situation that are not obtained by direct measurement” (according to Wikipedia’s listed McGraw-Hill definition). Such data are usually produced by Computer simulations or Generative models and aid the task at hand in some way, such as anonymizing sensitive data or filling out a dataset (e.g. making up for undersampled areas of the desired domain).
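
As a concrete (if simplistic) illustration, here’s a minimal sketch of one common way to fill out an undersampled region: interpolating between nearby real samples, SMOTE-style. Everything below (the function name, the toy data) is illustrative, not any particular library’s API:

```python
# A minimal sketch (not any specific library's API) of filling an
# undersampled region by interpolating between nearby real samples,
# SMOTE-style. All names and numbers here are illustrative.
import numpy as np

rng = np.random.default_rng(0)

def synthesize(samples: np.ndarray, n_new: int, k: int = 5) -> np.ndarray:
    """Generate n_new synthetic points, each lying on a segment between
    a real sample and one of its k nearest neighbors."""
    n = len(samples)
    out = np.empty((n_new, samples.shape[1]))
    for i in range(n_new):
        a = samples[rng.integers(n)]
        d = np.linalg.norm(samples - a, axis=1)          # distances from a
        b = samples[rng.choice(np.argsort(d)[1:k + 1])]  # a near neighbor
        out[i] = a + rng.random() * (b - a)              # random point on a->b
    return out

# e.g. 200 synthetic points for a region with only 20 real observations
rare_region = rng.normal(loc=[5.0, 5.0], scale=0.3, size=(20, 2))
synthetic = synthesize(rare_region, n_new=200)
```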

The main application of interest here is the use of synthetic data for Machine learning-related tasks.

Bias

Over the last few years, there’s been a growing amount of discussion surrounding bias in training Machine learning models, with much of the responsibility placed on the dataset. Datasets can poorly represent reality and produce less-than-ideal outputs for sensitive prediction tasks. As mentioned above, synthetic data can be one way to help produce a “fuller” dataset where certain areas of the input domain are undersampled or skewed. This raises an obvious question: how much, and what type of, synthetic data is satisfactory? Why doesn’t the underlying data speak for itself, and how can we know the right amount or kind of synthetic data to better represent reality?
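
To make the question concrete, here’s a rough sketch of one way to measure undersampling: compare each subgroup’s share of the dataset against some reference distribution. Note that the reference itself (census shares? equal shares?) is already a human choice, which is exactly the problem. The labels and numbers below are made up:

```python
# A rough sketch of quantifying undersampling: observed subgroup shares
# minus a chosen reference distribution. The reference itself is already
# a human decision; the labels and shares below are illustrative.
from collections import Counter

def undersampling(labels: list[str], reference: dict[str, float]) -> dict[str, float]:
    """Observed share minus reference share per subgroup;
    negative values indicate undersampling."""
    counts = Counter(labels)
    total = len(labels)
    return {g: counts[g] / total - share for g, share in reference.items()}

# illustrative: subgroup "b" is undersampled relative to an equal-shares
# reference, so its value comes out around -0.3
labels = ["a"] * 800 + ["b"] * 200
print(undersampling(labels, {"a": 0.5, "b": 0.5}))
```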

I think the answer to this question is clearer in some situations than in others. In any case, we seemingly must question the source of the synthetic data just as much as (if not more than) the original data source. Without checks and balances here, synthetic data is just another source of potentially harmful or unrealistic bias affecting our models. A couple of interesting articles to this point:

The first article (or at least its relevant “Battling Bias” section) is essentially a distilled version of the second. The following is a quote from the second article (original typos preserved) addressing tests from the synthetic data company Mostly AI:

“In May, Mostly AI published a discussion of two of its experiments. In the first, researchers started with income data from the 1994 U.S. census and sought to generate a synthetic data set in which the proportion of men and women who earned more than $50,000 a year was more equal that in the original data. In the second, they used data from a controversial recidivism prediction program to generate a synthetic data set in which criminality was less linked to gender and skin color. The resulting data sets aren’t strictly “accurate”—women did earn less in 1994 (and now) and Black men are arrested at a higher rate than other groups—but they are far more useful in contexts where the goal is not to perpetuate sexism and racism. A synthetic data set generated to equalize the income gap between men and women, for example, could help a company make fairer decisions about how much to compensate its employees.”

This demonstrates the level of control that those producing synthetic data have over representation in the output. The article directly acknowledges the purposeful inaccuracy of these data; they instead represent a wishful reality, which, as mentioned, can be appropriate in certain situations. I think the main takeaway here, however, is that synthetic data can be made to portray reality in whatever way the operators deem fit, and that in itself is inherently dangerous. I don’t mean to say this should be stopped; synthetic data is playing an increasingly large role across the internet thanks to sophisticated Generative models (e.g. GPT-3) and will likely be difficult to moderate in any meaningful way. The point is instead that synthetic data used to “address bias” in Machine learning cannot be blindly touted as a solution, given that human beings are making decisions about the desired statistical qualities of the “better” data.
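
To make that concrete: the “desired statistical quality” is literally a parameter someone picks. The arithmetic below is not Mostly AI’s actual method; it’s a hypothetical back-of-the-envelope showing how many synthetic high-income rows an operator would add to push a subgroup’s share to a chosen target:

```python
# Not Mostly AI's actual method -- a hypothetical back-of-the-envelope
# showing that the "fairer" distribution is an explicit parameter.
def synthetic_rows_needed(n_group: int, n_high: int, target_share: float) -> int:
    """Synthetic high-income rows to add so the group's high-income
    share reaches target_share: solve (n_high + x) / (n_group + x) = t."""
    x = (target_share * n_group - n_high) / (1.0 - target_share)
    return max(0, round(x))

# illustrative numbers: 4,000 of 30,000 women earn >$50k (~13%); an
# operator chooses a target share of 30% -- a normative decision, not
# something recovered from the data itself
print(synthetic_rows_needed(30_000, 4_000, 0.30))  # -> 7143
```

Everything interesting happens in the choice of target_share; the data never tells you what it should be.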

Here’s a related quote the article uses from NYU professor Julia Stoyanovich:

Julia Stoyanovich, a computer science professor at New York University, says the debate in the industry shouldn’t be “accuracy versus fairness.” That is, companies don’t have to choose. Instead, “the data should represent the world how it should be.”

The same point from above applies here (unless I’m misinterpreting the phrasing of this quote; I’m reading it as equivalent to “the data should represent how the world should be”). How the world “should be” is incredibly subjective, and the quote seems to ignore that opinions here can vary widely. It suggests that the “accuracy versus fairness” tradeoff should simply be tossed aside in favor of some agreed-upon view of a perfect world.