Written by Jonathan Bowden

Solving the machine learning data challenge: How well can you fake it?

Data engineering

2 minutes

In this month's Machine Learning blog, Jonathan Bowden, hyperexponential Senior Model Developer, explores some ways we can create "fake" data to validate ideas for potential models. Keep reading to learn more.

Anyone familiar with machine learning will know that many experiments and proofs of concept fail at the first hurdle: we don’t have any data! In this month’s machine learning blog, I’ve turned my attention to generating fake data using various methods. To test them out, I sourced an aircraft dataset from Top Trumps cards. My goal was to augment the dataset with believable, but entirely made-up, aircraft data.

I have run into a lack of data on several occasions, and it frequently prevents me from developing ideas for potential models. Sometimes the issue is rooted in data privacy, and the last thing I want to do, even with the best of intentions, is be responsible for a breach. Other times, the dataset is simply too small to draw meaningful conclusions from.

Statistical methods 

However, all hope is not lost: an extensive range of tools can either anonymise existing datasets or use a dataset's properties to create new, semi-believable, fictitious data. For today's example, I'll focus on generating new data; we can explore anonymisation and other data manipulations in a future post.

A quick search will yield simple approaches like Faker, which can be used to generate fake names, addresses and phone numbers. It's fast and easy, but it won't help with problems where the data points are numerical and related to one another. For a long time, statistical simulation was the only reliable way to generate fictitious data: fitting a normal, lognormal or gamma distribution to each feature and sampling from it usually produces convincing values. However, this method tends to fall apart when you look at the correlations. A fictitious aeroplane that is 100m long but has a wingspan of 2m isn't particularly believable. The issue is that each feature is sampled without any awareness of the others, and there is no design manager overseeing how the features should relate to each other.
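To make the correlation problem concrete, here is a minimal sketch using only Python's standard library. The "real" fleet is made up for illustration (it is not the Top Trumps data), and the choice of a lognormal fit for both features is an assumption. Each feature is fitted and sampled on its own, and the link between length and wingspan disappears.

```python
import math
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def fit_lognormal(values):
    """Fit a lognormal by taking the mean and std dev of the logged values."""
    logs = [math.log(v) for v in values]
    return statistics.fmean(logs), statistics.stdev(logs)


rng = random.Random(42)

# Made-up "real" fleet: wingspan is roughly 0.9 x length, with a little noise.
lengths = [rng.lognormvariate(3.5, 0.4) for _ in range(2000)]
wingspans = [0.9 * length * rng.lognormvariate(0.0, 0.1) for length in lengths]

# Fit each feature separately, then sample each one independently.
mu_len, sd_len = fit_lognormal(lengths)
mu_wing, sd_wing = fit_lognormal(wingspans)
fake_lengths = [rng.lognormvariate(mu_len, sd_len) for _ in range(2000)]
fake_wingspans = [rng.lognormvariate(mu_wing, sd_wing) for _ in range(2000)]

real_corr = pearson(lengths, wingspans)
fake_corr = pearson(fake_lengths, fake_wingspans)
print(f"real correlation: {real_corr:.2f}")
print(f"fake correlation: {fake_corr:.2f}")
```

Each fake column looks plausible on its own, but the real correlation is strongly positive while the fake one sits close to zero, which is exactly how a long aeroplane ends up with a tiny wingspan.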

A way to combat this is Cholesky decomposition, which takes the correlation matrix you want and uses its Cholesky factor to mix independent samples so that they come out with those correlations. The problem with this approach is that the resulting distributions may not keep all the properties of your original input distributions. So, if you used a gamma distribution as your input, you're unlikely to get a gamma out. The result will be a bit more random than we want, although the correlations themselves will be believable. To make genuinely believable fake data with these methods, we must turn to copulas, which can be complicated, and it's probably at this point that we ask ourselves: is it worth all this statistical effort to generate fake data?
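Both ideas can be sketched in a few lines of standard-library Python. Everything here is an illustrative assumption: the target correlation of 0.8, the Exponential(1) marginals (chosen because their inverse CDF is a one-liner), and the helper names. The first half mixes exponential samples with the Cholesky factor and shows that the output is no longer exponential; the second half is a basic Gaussian copula, which applies the Cholesky factor to normal samples and only then maps each column onto the marginal we want.

```python
import math
import random
from statistics import fmean, median


def cholesky(a):
    """Lower-triangular L with L * L^T == a (a symmetric positive definite)."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def pearson(xs, ys):
    mx, my = fmean(xs), fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def to_exponential(z):
    """Map a standard normal value onto Exponential(1) via the inverse CDF.

    The inverse CDF at u is -ln(1 - u); for u = Phi(z), the survival
    probability 1 - Phi(z) equals 0.5 * erfc(z / sqrt(2)).
    """
    return -math.log(0.5 * math.erfc(z / math.sqrt(2)))


rng = random.Random(7)
RHO, N = 0.8, 5000
L = cholesky([[1.0, RHO], [RHO, 1.0]])  # approximately [[1, 0], [0.8, 0.6]]

# 1) Naive mixing: centre Exponential(1) draws (mean 1), mix with L, re-centre.
e1 = [rng.expovariate(1.0) for _ in range(N)]
e2 = [rng.expovariate(1.0) for _ in range(N)]
mixed = [1.0 + L[1][0] * (a - 1.0) + L[1][1] * (b - 1.0) for a, b in zip(e1, e2)]

# 2) Gaussian copula: correlate normal draws first, then transform each column.
z1 = [rng.gauss(0.0, 1.0) for _ in range(N)]
z2 = [rng.gauss(0.0, 1.0) for _ in range(N)]
cop_a = [to_exponential(a) for a in z1]
cop_b = [to_exponential(L[1][0] * a + L[1][1] * b) for a, b in zip(z1, z2)]

print(f"naive:  corr={pearson(e1, mixed):.2f}  min={min(mixed):.2f}")
print(f"copula: corr={pearson(cop_a, cop_b):.2f}  min={min(cop_b):.2f}")
```

The naive mix hits the target correlation but produces negative values, which an exponential distribution can never do. The copula version keeps every value non-negative with the right marginal, at the cost of a Pearson correlation that is close to, but not exactly, the 0.8 that was imposed on the underlying normals.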

Generative adversarial networks 

If, like me, your answer is "I don't have time for that", we can explore the world of Generative Adversarial Networks (GANs). GANs are best known for image generation, the same field that has produced AI-artwork tools such as DALL·E 2 (although DALL·E 2 itself is built on a diffusion model rather than a GAN). Image generation is undoubtedly more complex than generating tabular, structured data, but the same principles of generators and discriminators apply. For this experiment, I will use a package called TABGAN, which is open source and promises to be an easy way to use GANs for tabular data generation.

The idea is that our GAN will contain two machine learning models: one is a "generator" and the other a "discriminator". The generator's job is to create new data that could pass for rows from the original dataset. It starts from random noise and learns to turn that noise into data with the same properties as the real thing. The discriminator's job is to compare the generator's output with real data and judge how believable it is. The two models are entirely separate and essentially play a game with each other (hence the "adversarial" part of the name). Over time, the generator learns how best to convince the discriminator, which in turn gets better at spotting when data might be fake. Given enough time and data, some believable fake data points can make it through to the output.
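The game between the two models can be sketched with a deliberately tiny GAN in standard-library Python. This is a toy and not how TABGAN works internally: the "dataset" is a single standardised feature drawn from a normal distribution, the generator G(z) = b + z can only learn a shift, the discriminator is a one-feature logistic regression, and the learning rate, batch size and step count are arbitrary assumptions.

```python
import math
import random


def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


rng = random.Random(0)
w, c = 0.0, 0.0   # discriminator: D(x) = sigmoid(w * x + c), "probability x is real"
b = -3.0          # generator: G(z) = b + z, starts far from the real mean of 0
LR, BATCH, STEPS = 0.05, 64, 2000

for _ in range(STEPS):
    real = [rng.gauss(0.0, 1.0) for _ in range(BATCH)]
    fake = [b + rng.gauss(0.0, 1.0) for _ in range(BATCH)]

    # Discriminator step: minimise -log D(real) - log(1 - D(fake)).
    grad_w = grad_c = 0.0
    for x in real:
        err = sigmoid(w * x + c) - 1.0   # gradient of -log D w.r.t. the logit
        grad_w += err * x
        grad_c += err
    for x in fake:
        err = sigmoid(w * x + c)         # gradient of -log(1 - D) w.r.t. the logit
        grad_w += err * x
        grad_c += err
    w -= LR * grad_w / BATCH
    c -= LR * grad_c / BATCH

    # Generator step: minimise -log D(fake), i.e. try to fool the discriminator.
    fake = [b + rng.gauss(0.0, 1.0) for _ in range(BATCH)]
    grad_b = sum((sigmoid(w * x + c) - 1.0) * w for x in fake) / BATCH
    b -= LR * grad_b

print(f"generator shift b = {b:.2f} (real mean is 0)")
print(f"D(0) = {sigmoid(c):.2f}")
```

The generator never sees the real samples. It only follows the discriminator's gradient, yet its shift ends up near the real mean, at which point the discriminator can do little better than guess.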

The results 

Using the TABGAN package and fewer than 20 lines of Python, I doubled the size of my dataset from 70 aircraft to 140.

Here are two of the planes from my output dataset. One is a real plane, and the other is entirely fictitious and generated by my GAN. If you look closely enough, I think it’s possible to tell which is real and which is fake, but it’s pretty convincing.

The answer is at the bottom of this page. 

On a line-by-line basis, it's rather convincing, and if you've spent more than 2 minutes looking for oddities in the above, my exercise is a success. It is on an aggregate basis, however, that things somewhat fall apart.

I mentioned correlations earlier, and my hope was to avoid Cholesky decomposition or copulas, as they felt like over-engineering. However, it seems that TABGAN could not preserve those correlations, at least with a dataset this small.
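One way to run this kind of aggregate check is to compare the correlation of every pair of columns in the real and generated data. The sketch below uses standard-library Python; the function name, column names and rows are hypothetical and are not taken from my aircraft dataset.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def correlation_gaps(real, fake, columns):
    """Return {(col_a, col_b): fake correlation - real correlation}."""
    gaps = {}
    for a, b in combinations(columns, 2):
        real_r = pearson([row[a] for row in real], [row[b] for row in real])
        fake_r = pearson([row[a] for row in fake], [row[b] for row in fake])
        gaps[(a, b)] = fake_r - real_r
    return gaps


# Hypothetical rows: the fake set reuses plausible values but scrambles the pairing.
real = [
    {"length_m": 20.0, "wingspan_m": 18.0},
    {"length_m": 40.0, "wingspan_m": 36.0},
    {"length_m": 60.0, "wingspan_m": 55.0},
    {"length_m": 70.0, "wingspan_m": 64.0},
]
fake = [
    {"length_m": 20.0, "wingspan_m": 55.0},
    {"length_m": 40.0, "wingspan_m": 18.0},
    {"length_m": 60.0, "wingspan_m": 64.0},
    {"length_m": 70.0, "wingspan_m": 36.0},
]

gaps = correlation_gaps(real, fake, ["length_m", "wingspan_m"])
for pair, gap in gaps.items():
    print(pair, round(gap, 2))
```

A gap near zero means the generated data kept the relationship between the two columns. A large negative gap, as in this toy example, means a relationship that exists in the real data has been lost.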

Unfortunately, this trend is fairly common across all of the remaining data features.

We are left with the following Pros and Cons: 

Pros  

  1. GANs can bridge the gap when the real data is private: we run the investigation on generated data instead and, if done correctly, the results of the investigation are preserved.  

  2. There is no need to resort to complex statistical simulation; the above output was produced with minimal fiddling and took less than 20 lines of code.  

  3. With this method, we can potentially increase the accuracy of our models, since the model has more data to learn from.  

Cons  

  1. In tandem with point #3 above, we also introduce random noise into our dataset, which could work against our accuracy and magnify any underlying biases. As a result, we could end up drawing some seriously false conclusions.  

  2. In a few cases, the GAN simply duplicated a real aircraft. A duplicate adds no new information, and it is the last thing we would want with a private dataset, because a real record ends up in the "fake" output.
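The duplication problem in the last point is easy to screen for before any generated data is shared. Here is a minimal sketch in standard-library Python; the function name, the rounding tolerance and the example rows are all hypothetical.

```python
def find_copied_rows(real, synthetic, ndigits=6):
    """Return the indices of synthetic rows that exactly match a real row.

    Floats are rounded to `ndigits` so tiny floating-point differences
    do not hide a copy.
    """
    def key(row):
        return tuple(round(v, ndigits) if isinstance(v, float) else v for v in row)

    real_keys = {key(row) for row in real}
    return [i for i, row in enumerate(synthetic) if key(row) in real_keys]


# Hypothetical rows: (name, length in m, wingspan in m)
real = [
    ("Plane A", 47.3, 38.0),
    ("Plane B", 70.7, 64.4),
]
synthetic = [
    ("Plane A", 45.1, 37.2),   # new row
    ("Plane B", 70.7, 64.4),   # exact copy of a real row
    ("Plane C", 33.8, 29.9),   # new row
]

copied = find_copied_rows(real, synthetic)
print(copied)
```

This only catches exact copies. Rows that are nearly identical to a real record would need a distance-based check, which is beyond this sketch.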

In conclusion 

In this world of deep fakes and fake news, it may seem like this experiment is a bad idea. However, the argument will always revert to trust. Can I trust the data I’m using, and can I trust the conclusion? These are natural questions to ask of any experiment, whether it involves fake data or not. 

We should not use fake data approaches for anything other than testing theories or proving concepts. As soon as real, more sensitive data is used, this becomes a very different investigation, and all the highest standards of care and diligence must apply. There might be a use case for GANs in improving those pesky, imbalanced datasets where under-sampling and over-sampling methods may not be appropriate. However, this should come with a warning, as generating tons of data from existing large datasets could lead to horrific results that might go unnoticed because a human cannot check every single data point. 

As for my investigation, I am left thinking that, ironically, I didn’t have enough data for my GAN to convincingly create more. There really is no free lunch. 

Answer to above: ID 62 is a Boeing 757-200 PF. ID 107 is not a real plane. The giveaways might be either of the following: 

a) The max altitude of 13,021m is a bit too specific 

b) The max take-off mass is probably too low for a civil aircraft - a large Russian high-altitude reconnaissance plane (the Myasishchev M-55 Geophysica) of a similar wingspan must have confused the GAN in this area