Heading towards the deepfake point for data : The Tribune India



Statistical literature frequently uses computer algorithms for computational ease to construct hypothetical datasets.


Fabrication: Data fraud is a far more severe global phenomenon than can be imagined. Photo: iStock



Atanu Biswas

Professor, Indian Statistical Institute, Kolkata

WHEN Kobe Steel, the third-largest steel manufacturer in Japan, announced in 2018 that its CEO would resign to accept accountability for the extensive data fraud scandal that surfaced the year before, the company acknowledged that the fraud had been going on for almost 50 years. Yes, data fraud is a more severe global phenomenon than can be imagined.

There’s nothing new in fabricating data to further business, scientific or societal narratives. Allegations of falsified or wrong data on various topics important to the common people, such as GDP, employment, inflation and Covid fatalities, frequently rock different societies. According to a 2016 paper in the Statistical Journal of the IAOS, roughly one in every five surveys may contain fake data. Unbelievable, eh?

Ideally, opinion and exit polls, presidential and prime ministerial approval ratings, and the popularity of various commercial brands should all be based on precisely crafted and painstakingly surveyed data. Nevertheless, there may be reason to suspect that some such statistics are fabricated.

The same is true in the scientific domain. Data fabrication is the reporting of invented, nonexistent data, in whole or in part: either the means of obtaining the data is forged, or the report describes experiments that were never carried out. A 1982 book, Betrayers of the Truth, by William Broad and Nicholas Wade, and a 1997 paper in Science by L Corry, J Renn and J Stachel detail how some of the greatest scientists, including Galileo, Newton, Dalton, Mendel, Millikan and Einstein, have been accused of fabricating, falsifying and plagiarising data!

But how easy is it to produce false data satisfying specific requirements? Well, it might not be convincing in every respect. Suppose we require a dataset in which people fall into four income groups in prefixed percentages, say 35 per cent, 30 per cent, 25 per cent and 10 per cent, and support two political parties in percentages of 45 and 55. It is simple to create such a two-way table by organising income levels into rows and political inclinations into columns, say. It becomes very challenging, however, if we must also comply with a predetermined correlation, say 0.5, between income and political preference. And if the fictitious data had to meet further requirements, the problem would get more intricate still.
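The easy first step can be sketched in a few lines. This is a hypothetical illustration using the article's figures; a fictitious sample of 400 respondents is assumed so that every cell works out to a whole number. Under independence, each cell is just the row share times the column share, which matches the preset margins exactly but yields a correlation near zero, not the 0.5 a fabricator might want.

```python
# Hypothetical margins from the article: four income groups, two parties.
income_pct = [35, 30, 25, 10]   # row percentages
party_pct = [45, 55]            # column percentages
n = 400                         # fictitious sample size (keeps cells integral)

# Under independence, cell count = n * row% * col% / 10000.
table = [[r * c * n // 10_000 for c in party_pct] for r in income_pct]

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
print(table)        # [[63, 77], [54, 66], [45, 55], [18, 22]]
print(row_totals)   # [140, 120, 100, 40] -> 35/30/25/10 per cent of 400
print(col_totals)   # [180, 220]          -> 45/55 per cent of 400
```

Forcing a given correlation on top of these margins would mean shifting counts between cells while keeping every row and column total fixed, which is exactly where, as the article notes, the fabrication problem becomes intricate.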

Nonetheless, the statistical literature routinely uses computer algorithms, known as computer simulation techniques, to construct hypothetical datasets: microdata generated, for computational ease, to adhere to certain statistical patterns. In fact, many universities and institutes teach this in statistical computing courses. A proposed statistical method can then be applied to such a simulated dataset; the exercise may be repeated millions of times, and the method's average performance, as well as its variation, assessed. Since it is made clear that this approach is based on simulations rather than real data, there is no dishonesty involved: simulated datasets are employed only to understand how a proposed statistical technique or metric performs.
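The simulation workflow just described can be sketched as follows. This is a generic textbook illustration, not any particular study: hypothetical datasets are drawn from a normal distribution with a known mean, the sample mean stands in for the "proposed method", and its average error and variability are assessed over many repetitions.

```python
import random
import statistics

random.seed(42)            # reproducible hypothetical datasets
TRIALS = 10_000            # number of simulated datasets
SAMPLE_SIZE = 30
TRUE_MEAN = 5.0

errors = []
for _ in range(TRIALS):
    # Construct one hypothetical dataset adhering to a known pattern.
    data = [random.gauss(TRUE_MEAN, 2.0) for _ in range(SAMPLE_SIZE)]
    # Apply the statistical method under study (here, the sample mean).
    errors.append(statistics.mean(data) - TRUE_MEAN)

bias = statistics.mean(errors)      # average performance
spread = statistics.stdev(errors)   # its variation
print(f"bias ~ {bias:.3f}, spread ~ {spread:.3f}")
```

Because the data-generating mechanism is known, the bias should hover near zero and the spread near the theoretical value of 2/sqrt(30), about 0.37; that transparency is what separates simulation from fabrication.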

If, however, someone asserts that such generated data is authentic, that is dishonesty. No foolproof technology for detecting data manipulation exists; the statistical methods now in use still cannot handle too many scenarios, although numerous techniques for judging whether data is genuinely random can at least raise questions about it. Benford's law is a widely used tool. It states that in a large number of real-life numerical datasets, the leading digits appear in fixed proportions: the digit 1 leads about 30 per cent of the time, the digit 2 about 18 per cent, and so on down to 9. When a dataset deviates from Benford's law, there may be an issue. The US Internal Revenue Service uses it to identify tax evaders, or at the very least to narrow the field and better channel resources. Benford's law did not concur with the revenue growth figures of the erstwhile energy behemoth Enron Corporation. In 2001, Enron went bankrupt; the smoke clearly pointed to fire.
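A Benford's-law screen of the kind described here can be sketched in a few lines. The chi-square comparison below is a common textbook formulation, not the IRS's actual procedure; under Benford's law the digit d leads with probability log10(1 + 1/d).

```python
import math
from collections import Counter

def leading_digit(x) -> int:
    """First significant digit of a nonzero number."""
    x = abs(x)
    while x >= 10:
        x /= 10
    while x < 1:
        x *= 10
    return int(x)

def benford_chi2(values) -> float:
    """Chi-square distance between observed leading-digit
    frequencies and Benford's law, P(d) = log10(1 + 1/d)."""
    counts = Counter(leading_digit(v) for v in values if v)
    n = sum(counts.values())
    chi2 = 0.0
    for d in range(1, 10):
        expected = n * math.log10(1 + 1 / d)
        chi2 += (counts.get(d, 0) - expected) ** 2 / expected
    return chi2

# Powers of 2 famously follow Benford's law; uniformly spread numbers do not.
print(benford_chi2([2 ** k for k in range(1, 200)]))   # small distance
print(benford_chi2(range(100, 1000)))                  # large distance
```

A dataset whose chi-square distance is large relative to the usual critical values is not proof of fraud, but, as with Enron's revenue figures, it is smoke worth investigating.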

And now generative AI has become involved in generating fake data in one shot for the purpose of disseminating false information about scientific and social issues. With AI permeating every corner of our lives, this was inevitable sooner or later. According to an article published in Patterns in March, ChatGPT can create remarkably realistic-looking medical data. Then, a new paper, published on November 9 in JAMA Ophthalmology, used GPT-4 in conjunction with Advanced Data Analysis (ADA) to construct a dataset describing 160 men and 140 women with keratoconus, an eye disorder that results in corneal thinning and can impair focus and vision. Wow!

Researchers and journal editors, however, are becoming increasingly concerned about research integrity due to AI’s presumed ability to create convincing data. AI will make it very simple for any researcher or group of researchers to fabricate data, such as fake questionnaire answers or measurements on nonexistent patients, or to create large datasets on any social scenario or scientific experiment.

Nevertheless, the aforementioned GPT-4-generated fabricated data was an ostensibly credible clinical-trial dataset designed to bolster a scientific claim. Crucially, though, a forensic analysis revealed that it was not legitimate! Well, AI's failure to generate convincing data may be a great relief for now. However, that may not be the end of such agony; more realistic-looking datasets may soon be produced by more updated versions of generative AIs. By contrast, it is improbable that the statistical methods and instruments needed to recognise them as fraudulent will be researched and updated so quickly. So, overall, it could soon be a dismal situation involving an AI-aided tyranny of fake data on societal issues, important planning, election forecasting, medical research and scientific endeavour.

Sincere attempts are being made from many angles to limit the negative impacts of AI, though. For instance, a broad political agreement on a new law governing the use of AI technologies was reached by the European Union last week. In order to establish its initial guidelines for encouraging ‘safe, secure and trustworthy development and use of AI’, the Biden administration released a comprehensive executive order in October. Since the genie can’t be put back into the bottle, will it be possible to clip its wings, at least? If not, it may lead to a dystopian future where people will stop believing data, real or fraudulent. That’s the ‘deepfake point’ for data we are perhaps heading towards.


