Why identical data yields contradictory findings : The Tribune India

Experts argue that crowdsourcing research can balance discussions, validate findings and better inform policy.

INTERPRETATION: Finding the ‘best’ model for analysing a given dataset is the data analyst’s job. Photo: iStock



Atanu Biswas

Professor, Indian Statistical Institute, Kolkata

ARE there many nuances and conflicting shades to data? Big or small, complex or simple, when we spin a dataset to unravel its mysteries, we frequently find ourselves akin to the blind men of the parable, standing before an elephant and trying to figure out what the giant animal looks like. Depending on their choice of analytic tools and interpretations, data analysts repeatedly depict different parts of the elephant, such as its legs, tusks, trunk and ears, rather than the entire animal, and thus sometimes disseminate misleading, partial, divergent and contradictory pictures.

First, an illustration from the scientific domain. There are concerns over a potential link between higher anaesthetic doses and early deaths among older patients. In 2019, a paper published in the British Journal of Anaesthesia dismissed such a connection, declaring: “These results are reassuring.” Yet the same data was examined in another study published in the same journal that year, and it came to a different conclusion about death rates: it argued that the trial had too few participants to draw that conclusion, or any conclusion, about mortality. This unusual experiment (two articles based on identical experimental data) was designed to extend replicability efforts beyond methods and results to the interpretation of data itself. A recent research paper in Nature shows that data-based conclusions widely lack reproducibility: 246 biologists analysed the same sets of ecological data and arrived at widely divergent results.

However, such circumstances are not restricted to scientific studies. Any socioeconomic data analysis conducted in parallel by multiple experts may produce varying (and sometimes conflicting) results. The Upshot, a website published by The New York Times, in collaboration with Siena College, surveyed 867 prospective voters in Florida, politically a crucial US state with razor-thin margins, a few months before the 2016 US presidential election. According to the poll, Hillary Clinton had a 1 per cent advantage over Donald Trump. The Upshot then gave the raw data to four reputable pollsters and asked them to forecast the outcome. Three of them anticipated a victory for Clinton, with margins of 4 per cent, 3 per cent and 1 per cent, while the fourth predicted a victory for Trump by a 1 per cent margin: a full 5-percentage-point spread between the estimates. It makes sense that pollsters working from different samples might produce different projections. But how can they use the same data to make predictions that are wildly different, if not diametrically opposed?

In an article in The New York Times, journalist Nate Cohn stated that “polling results rely as much on the judgments of pollsters as on the science of survey methodology. Two good pollsters, both looking at the same underlying data, could come up with two very different results.” In this particular case, the pollsters made different choices of weighting factors, including race, sex, age, region, party registration, education, past turnout, self-reported vote intentions and voting history, while adjusting the sample. These choices were the source of the discrepancies.
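Cohn’s point can be made concrete with a toy example. The sketch below uses entirely invented numbers and a single weighting factor (age), whereas real pollsters juggle many; it only shows the mechanism by which two defensible turnout assumptions turn the identical raw sample into opposite predictions.

```python
# Invented illustration: one raw sample of 10 respondents, re-weighted
# under two pollsters' different turnout assumptions.

# Each respondent: (preferred candidate, age group)
sample = [("A", "young"), ("A", "young"), ("A", "old"), ("B", "old"),
          ("B", "old"), ("A", "young"), ("B", "old"), ("B", "old"),
          ("A", "young"), ("B", "old")]

def weighted_margin(sample, weights):
    """Candidate A's lead over B, in percentage points, after weighting."""
    total = sum(weights[age] for _, age in sample)
    a = sum(weights[age] for cand, age in sample if cand == "A")
    return round(100 * (2 * a - total) / total, 1)

# Pollster 1 expects high youth turnout; pollster 2 expects the opposite.
print(weighted_margin(sample, {"young": 1.5, "old": 1.0}))  # 16.7 (A ahead)
print(weighted_margin(sample, {"young": 1.0, "old": 1.5}))  # -15.4 (A behind)
```

Neither weighting is wrong on its face; each encodes a judgment about who will actually vote, and that judgment, not the raw data, decides the headline.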

Such circumstances may, of course, give rise to controversies and problems, especially if significant national policies, economic strategies or socio-political frameworks are to be developed on the basis of that data analysis. We have witnessed a variety of inconsistent estimates and projections about the number of infected people, deaths and job losses during the Covid-19 pandemic, all based on the same (publicly available) data. These were brought out by various specialists’ studies using different epidemiologic models. For a new disease like Covid, whose contagious nature and transmission dynamics were completely unknown at the outset, most of these models were inapplicable.

Subsequently, as the world recovered from the shock of the pandemic, various economies started to flourish again. Surprisingly, different experts, examining the same data, concluded variously that a particular economy was recovering in a V-, W- or K-shaped fashion. The average individual is confused by this strange situation, since he/she does not understand the reasons for the inconsistencies and, more importantly, who is depicting the elephant’s trunk and who the tusk. Whose analysis can you trust? Experts as well as lay people have difficulty weighing the merits of the numerous contradictory results reached by various data analysts and subject experts.

Even if data analysts are truthful in their analyses and all of their calculations and procedures are accurate, the findings they produce could nevertheless be vastly different, if not contradictory. The obvious question keeps haunting us: how can the same data yield different (and contradictory) conclusions when analysed by different experts? Some experts argue that crowdsourcing research, where possible, can balance discussions, validate findings and better inform policy. Sort of taking an average?

Statistics is, in part, an improvised science: no statistical analysis is unique. Different scientists develop statistical models, measurements and analysis procedures according to their own judgment. Finding the ‘best’ model for analysing a given dataset is the data analyst’s job, and it is no easy task. These models, the pertinent metrics and the corresponding margins of error substantially influence the decision-making processes that rest on them. Strikingly, depending on these choices, many conclusions that seem to contradict one another could all be true.
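A deliberately tiny, invented example makes the point: two standard summary statistics, each a perfectly reasonable measure of the ‘typical’ change, read the same five numbers in opposite directions.

```python
# Invented data: change in some outcome for five units,
# four small declines and one large gain.
import statistics

changes = [-1, -2, -3, -4, 20]

print(statistics.mean(changes))    # analyst 1: average change is +2, things improved
print(statistics.median(changes))  # analyst 2: typical change is -2, things worsened
```

Both analysts are arithmetically correct; they simply chose different, equally legitimate metrics, which is exactly how contradictory conclusions can all be ‘true’.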

The history of human civilisation demonstrates that incongruent inferences can be drawn even from the seemingly obvious ‘data’ that the sun traverses the sky every day; it took ages to resolve that conflict. Domain knowledge is therefore required: the basis of any data analysis should be a grasp of the causality behind an event. In general, it is difficult to extract factual information and accurately represent the elephant. At best, top statistical experts, working alongside domain experts, can aspire to show the entire animal. And one cannot hold it against the common person if he/she struggles to understand any of it in this age of contradictions.

