How Artificial Intelligence Might Be Worsening the Reproducibility Crisis in Science and Technology
Artificial Intelligence has become an integral tool in scientific research, but concerns are growing that the misuse of these powerful tools is leading to a reproducibility crisis in science and its technological applications. Let's explore the fundamental issues contributing to this detrimental effect, which applies not only to AI in scientific research but also to AI development and utilization in general.
Artificial Intelligence, or AI, has become an integral part of society and of technology in general, finding several new applications every month in medicine, engineering, and the sciences. In particular, AI has become a very important tool in scientific research and in the development of new technology-based products. It enables researchers to identify patterns in data that may not be obvious to the human eye, among many other kinds of computational data processing. All this certainly amounts to a revolution, one that in many cases materializes in the form of game-changing software solutions. Among tens of examples stand large language models that can be made to "think", speech recognition models with superb capabilities, and programs like DeepMind's AlphaFold 2 that revolutionized biology.
Despite AI's growing role in society, concerns are mounting that the misuse of these powerful tools is deepening the already serious and dangerous reproducibility crisis that threatens science and technology. Here, I will discuss the reasons behind this phenomenon, focusing mainly on the high-level factors that apply broadly to data science and AI development beyond strictly scientific applications. I believe the discussion presented here is valuable for everyone involved in developing, researching, and teaching about AI models.
On reproducibility in science and problems specific to AI-based science
First, let's see what reproducibility is and why it is an issue, especially in the context of science and technology.
Reproducibility is a major principle underpinning the scientific method. It states that the results of experiments, or, more pertinent to this post, the results of training or executing an AI model, must be obtainable again by others; that is, they must be fully replicable and repeatable.
For an AI project to be reproducible, papers and code should be clear enough to delineate conditions, input data, network architectures, algorithms, and any other elements of the AI-building process. In an ideal open-source world, all these elements should be made available clearly enough that others could replicate and repeat what the original developers did faithfully.
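To make this concrete at the most basic level, a first requirement for "replicating what the original developers did" is controlling randomness. Below is a minimal sketch, assuming a Python/PyTorch workflow (the exact settings differ between libraries), of how one might fix the relevant random seeds before training:

```python
import os
import random

import numpy as np
import torch

def set_reproducible_state(seed: int = 42) -> None:
    """Fix the main sources of randomness so that re-running training starts
    from the same state (a sketch; full determinism on GPU may require
    additional, library-specific settings)."""
    random.seed(seed)                         # Python's built-in RNG
    np.random.seed(seed)                      # NumPy RNG
    torch.manual_seed(seed)                   # PyTorch CPU RNG
    torch.cuda.manual_seed_all(seed)          # PyTorch GPU RNGs (no-op without a GPU)
    torch.backends.cudnn.deterministic = True # prefer deterministic cuDNN kernels
    torch.backends.cudnn.benchmark = False    # disable auto-tuning that breaks determinism
    os.environ["PYTHONHASHSEED"] = str(seed)  # hash-based operations

set_reproducible_state(42)
```

Seeds alone do not make a project reproducible, of course, but reporting them (together with software versions and hardware) is the cheapest part of delineating the conditions of an AI-building process.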
From this definition, you can probably already see an obvious problem regarding proprietary models: for obvious reasons, their full details will never be disclosed, and of course releasing them would make it impossible to control copyright violations. But even for supposedly open models, incomplete descriptions are very common. Together with other problems that I will discuss below, this all contributes to a growing reproducibility problem in science, engineering and technology.
In AI, reproducibility is essential for ensuring the validity and reliability of a new model, or of scientific work that uses an AI model. More broadly, reproducible models foster a sense of trust, which is essential for an AI-based tool to be accepted in the scientific community.
Reproducibility also facilitates the accumulation and the integration of knowledge, as new studies can build upon previous ones and confirm or challenge their results. Furthermore, reproducibility fosters innovation and creativity, as researchers can use existing data and methods with a certain degree of safety to explore new questions and hypotheses.
However, reproducibility is not always easy to achieve, and many factors can affect it. Although we are discussing reproducibility specifically in the context of AI tools for science, engineering and technology, the core factors are essentially the same ones that underpin the reproducibility of science in general. The caveat is that in the AI field, many of these factors probably play an even bigger role.
Data quality and availability
Of course, science relies on data, and the data must be good. Here, a distinguishing feature of AI-based science is that, contrary to analytical models, which can be fitted with relatively few data points, AI models require vast amounts of data for training.
The data used in a study should be accurate, complete, and consistent, and should be made accessible to other researchers who want to reproduce the study. However, data quality can be compromised by errors, noise, outliers, missing values, or inconsistencies. Data availability can also be limited by ethical, legal, or technical barriers, or by the reluctance of researchers to share their data due to concerns about privacy, competition, or criticism. In the context of AI this is especially critical, because AI models require huge amounts of training data, and that data must be reliable, well spread throughout the input domain, free of problems and biases, and curated properly: defective points should be removed, while great care must be taken before discarding points flagged as outliers.
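As a rough sketch of what a minimal data audit could look like before training, the snippet below (using pandas; the file and column choices are hypothetical) counts missing values, duplicates, and candidate outliers so that any decision about them can be documented rather than made silently:

```python
import pandas as pd

# Hypothetical dataset; in practice replace with your own file and columns.
df = pd.read_csv("measurements.csv")

# Quantify basic quality issues before any modeling.
report = {
    "n_rows": len(df),
    "n_duplicated_rows": int(df.duplicated().sum()),
    "missing_per_column": df.isna().sum().to_dict(),
}

# Flag (do not silently drop!) candidate outliers, e.g. beyond 4 standard deviations.
numeric = df.select_dtypes("number")
z_scores = (numeric - numeric.mean()) / numeric.std()
report["n_candidate_outliers"] = int((z_scores.abs() > 4).any(axis=1).sum())

print(report)  # keep this report alongside the dataset, so curation decisions are traceable
```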
Modeling details and transparency
This is where proprietary systems, intellectual property, patent protection, and other copyright-related issues typically interfere.
Ideally, at least for open projects, the methods and models used in a study should be clearly described and documented, and should be made available to other researchers who want to reproduce the study. For big and complex AI models with many different sub-networks, architectures, activation functions, bias terms, pre- and post-processing modules, and other elements, this might be very difficult to achieve. Even without any bad intentions, elements might be left out simply because of their sheer number or complexity. In large open-source projects involving many people, a simple miscommunication could result in a whole component being improperly described. And even when source code and models are shared through a resource like GitHub, one might overlook the need to explain how inputs must be processed or data curated.
Modeling details can simply be omitted, be ambiguous, or be plain incorrect. Transparency can be lacking due to the complexity or the proprietary nature of the methods and models. Moreover, some methods and models may have hidden assumptions, parameters, or dependencies that can affect their performance and generalizability.
Note in particular that in AI model development it is very common to tune large numbers of parameters and procedures in order to optimize the results of training and testing. Most often, these practices are neither transparent nor documented, and they tend to be guided by subjective metrics and "hunches" about improved losses and performance.
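One inexpensive habit that mitigates this is to dump every tuned decision, seed, and data fingerprint into a machine-readable record stored next to the trained model. A minimal sketch (file names and fields are invented for illustration):

```python
import hashlib
import json
from datetime import datetime, timezone

def dataset_fingerprint(path: str) -> str:
    """Hash the raw data file so the exact training data can be identified later."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

# Hypothetical experiment description; every value tuned by hand belongs here.
experiment = {
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "data_file": "training_set_v3.csv",
    "data_sha256": dataset_fingerprint("training_set_v3.csv"),
    "random_seed": 42,
    "architecture": "3-layer MLP, 256 units, ReLU",
    "optimizer": {"name": "Adam", "lr": 1e-3, "batch_size": 64},
    "early_stopping": {"metric": "val_loss", "patience": 10},
    "notes": "dropped 12 points flagged as sensor glitches (see data audit report)",
}

# Save the record alongside the model weights so every "hunch" leaves a trace.
with open("experiment_record.json", "w") as f:
    json.dump(experiment, f, indent=2)
```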
On top of all this, the extreme difficulty of interpreting the inner workings of AI models further contributes to the problem, especially when the people in charge of preparing documentation are not the developers themselves.
Risks of data leakage and manipulation
The data used to train an AI system should be sufficiently separated from the test data, and the results (that is, how the trained network performs) should be maximally independent of the particular split. In other words, training and testing data should not overlap, and if a model is well trained, then re-running its training with different training and testing sets should produce models that work similarly well.
Data leakage can occur when there is an overlap or a correlation between subsets of the training and testing data, leading to overfitting or bias. In turn, data manipulation can occur when researchers modify the data or the parameters to obtain the desired results, either intentionally or unintentionally, leading to unrealistically high accuracy that does not reflect true performance.
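A common safeguard against the first problem is to split data by the entity that could leak, for example by patient or by instrument, rather than by individual samples. Below is a minimal sketch with scikit-learn's GroupShuffleSplit, using made-up data and a hypothetical "patient" grouping:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical data: X are features, y are labels, groups identify the patient
# (or instrument) each sample came from.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
y = rng.integers(0, 2, size=1000)
groups = rng.integers(0, 50, size=1000)  # 50 patients, many samples each

# Split so that no patient appears in both the training and the test set.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(X, y, groups=groups))

# Sanity check: the two sets of patients must not overlap.
assert set(groups[train_idx]).isdisjoint(set(groups[test_idx]))
```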
Challenges in creating AI models for real-world conditions
The data used to train and test an AI system should reflect the real-world conditions in which the system will be deployed. However, real-world data can be more diverse, complex, and noisy than the data used in lab conditions, and can introduce new sources of variability and uncertainty. I will present a concrete example of this in the next section.
Furthermore, real-world conditions can change over time, and the AI system may need to adapt to new situations and scenarios not represented in the original training and testing data.
For more on the broad problem of reproducibility in science (not particularly centered on AI), you can read this article, which I drew on while writing this post.
How AI can worsen the reproducibility crisis in science and technology
Let me now discuss specifically how AI can worsen the reproducibility crisis in science and technology, and then briefly present what has been proposed to improve the situation.
I wrote this part of the blog post based on information and examples that I condensed from this article about AI and the reproducibility crisis in science in general, and from this one focused on the reproducibility of AI models for chemistry.
Data leakage
As we saw above, data leakage occurs when there is insufficient separation between the data used to train an AI system and the data used to test it. This requirement sounds obvious, but meeting it turns out to be an important part of the problem: avoiding data leakage is complicated by the high-dimensional nature of the data often used for AI models and by the correlations and other unwanted features that the data might present.
Data leakage has been shown to bias AI systems towards learning to identify features associated with specific individuals or instruments, rather than the scientific phenomena of interest. For example, a team of scientists reported that an AI system could diagnose COVID-19 infection by analyzing chest X-ray images, but computer scientists from Kansas State University then showed that an AI algorithm trained on the same images, but using only blank background sections that showed no body parts, could still identify COVID-19 cases at well above chance level. This indicated that the AI system was picking up on consistent differences in the backgrounds of the medical images in the data set, rather than any clinically relevant features. In other words, the AI system learned to identify features associated with specific individuals or instruments (here, the backgrounds of the medical images) rather than the scientific phenomenon of interest (here, the presence or absence of COVID-19 infection). See here for a full peer-reviewed paper reporting this.
The problem of data leakage may be subtle yet have deeply detrimental effects on an AI model, even rendering it practically useless. In particular, leakage may occur if a random subset of test data is drawn from the same pool used for training. In this example study, which examined an AI model developed in another work to analyze histopathology images, the authors found that if medical data from the same person (or from the same scientific device) appear in both the training and test sets, the AI model learns to recognize features linked to that person rather than to a particular medical condition. The study also reports that the same can happen when data from different imaging devices are mixed together in training and test sets. In brief, in these cases the AI system performs well on the data, but not because it is learning the patterns that are relevant to the disease; rather, it learns patterns that are specific to the individual or instrument. The conclusion of the study is that extra care has to be taken when splitting data into training, test and validation sets, and that it is crucial to run control trials on blank image backgrounds to determine whether the output of the algorithm makes sense.
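The blank-background control recommended by that study can be sketched in a few lines: train the same kind of classifier on image patches containing no tissue at all and check whether it still beats chance. The helper below is only a schematic illustration (the cropping function and inputs are hypothetical):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def extract_blank_corner(image: np.ndarray, size: int = 32) -> np.ndarray:
    """Hypothetical helper: crop a corner patch assumed to contain only background."""
    return image[:size, :size].ravel()

def background_control_score(images, labels) -> float:
    """Train a simple classifier on background-only patches and return its
    cross-validated accuracy. `images` is a list of 2D arrays and `labels`
    the diagnosis for each image (hypothetical inputs)."""
    X = np.stack([extract_blank_corner(img) for img in images])
    clf = LogisticRegression(max_iter=1000)
    # If this score is well above chance (~0.5 for balanced binary labels),
    # the model can "diagnose" from backgrounds alone: a red flag for leakage.
    return cross_val_score(clf, X, labels, cv=5).mean()
```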
Even the AI giants risk falling into this kind of trap. For example, an AI system developed by researchers at Google Health to analyze retinal images for signs of diabetic retinopathy was trained on high-quality scans, and then rejected most positive images taken under suboptimal but still valid conditions that a human expert would have processed perfectly well. The AI system had learned to identify features associated with high-quality scans rather than the signs of diabetic retinopathy itself. As a result, the AI's performance was not reproducible in real-world conditions.
Manipulation of data and parameters
The flexibility and tunability of AI, coupled with a lack of highly standardized rigor in developing these models, can lead researchers to manipulate data and parameters until the results align with what they expect, even without any bad intentions.
This issue is compounded by the fact that many researchers have not been adequately trained in the correct application of ML to testing scientific hypotheses and modeling problems, and are often not experts in AI at all. This is not a criticism but a reality: most scientists creating AI models to address problems in some field of science do not have a formal background in computer science or another discipline directly related to AI systems; rather, most have degrees in a natural science or in engineering.
In one example, a team of researchers used AI to predict future climate patterns from historical weather data, under the hypothesis that global temperatures will rise significantly over the next century due to human-induced climate change. They developed an AI model and trained it on historical weather data, but the initial results did not support their hypothesis: the model predicted only a slight increase in global temperatures, which they took to be in the right direction but too small. The researchers then decided to tweak the AI model by adjusting parameters, giving more weight to the recent (warmer) years. They also excluded certain data points that they considered to be outliers. After these adjustments, the AI model's predictions aligned with their original hypothesis. In doing so, the researchers essentially overfitted their model to the hypothesis they wanted to confirm, rather than letting the AI learn unbiased patterns from the data.
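A toy demonstration of why post hoc exclusion is dangerous (synthetic data, not the actual study): if you drop the points that disagree most with your fit and then refit, the apparent quality of the model improves even though its real predictive value has not:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 200).reshape(-1, 1)
y = 0.3 * x.ravel() + rng.normal(scale=2.0, size=200)  # weak trend, lots of noise

model = LinearRegression().fit(x, y)
print("R2 on all data:", r2_score(y, model.predict(x)))

# "Outlier removal" done after looking at the residuals:
# keep only the 70% of points that best fit the model.
residuals = np.abs(y - model.predict(x))
keep = residuals < np.quantile(residuals, 0.7)
model2 = LinearRegression().fit(x[keep], y[keep])
print("R2 after post hoc exclusion:", r2_score(y[keep], model2.predict(x[keep])))
# The second R2 is substantially higher, yet the underlying relationship is unchanged.
```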
Challenges when it comes to AI models for real-world conditions
Another very common problem arises when the test dataset does not accurately reflect real-world data. AI models that perform well in "lab conditions" may fail when deployed in the real world due to the greater variation in conditions and the presence of noise unseen during training.
We already saw an instance of this above, with Google Health's AI system for disease detection from imaging scans.
This kind of problem appears everywhere, especially when working with image or video data or with information from sensors. In some cases these problems are not "just" important for better science but are actually a matter of safety. Take, for example, AI models for self-driving cars trained and tested on a dataset composed of thousands of hours of driving footage captured in clear weather conditions during the day. The AI model performs exceptionally well in these conditions, accurately detecting other vehicles, pedestrians, and traffic signs, and making correct driving decisions. But it then struggles, and perhaps fails fatally, in situations that are largely underrepresented in the dataset, such as detecting pedestrians in low-light conditions or recognizing traffic signs obscured by snow or rain.
While this conclusion may seem trivial, it is crucial to ensure that the test dataset accurately reflects the conditions in which the AI model will be deployed.
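A cheap sanity check along these lines, sketched below for an image classifier (the perturbations and their magnitudes are arbitrary illustrative choices), is to re-evaluate the trained model on darkened and noise-corrupted copies of the test set and compare accuracies:

```python
import numpy as np

def darken(images: np.ndarray, factor: float = 0.3) -> np.ndarray:
    """Simulate low-light conditions by scaling pixel intensities down."""
    return images * factor

def add_noise(images: np.ndarray, sigma: float = 0.1, seed: int = 0) -> np.ndarray:
    """Simulate sensor noise with additive Gaussian perturbations."""
    rng = np.random.default_rng(seed)
    return np.clip(images + rng.normal(scale=sigma, size=images.shape), 0.0, 1.0)

def robustness_report(model, X_test: np.ndarray, y_test: np.ndarray) -> dict:
    """Compare accuracy on clean and perturbed copies of the same test set.
    `model` is any object exposing a .predict() method (hypothetical interface)."""
    def accuracy(X):
        return float((model.predict(X) == y_test).mean())
    return {
        "clean": accuracy(X_test),
        "darkened": accuracy(darken(X_test)),
        "noisy": accuracy(add_noise(X_test)),
    }

# A large drop between "clean" and the perturbed entries signals that the
# lab-condition performance will not transfer to deployment conditions.
```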
Caveats of synthetic data generation
Several techniques can be used to generate synthetic data for under-sampled regions, thus effectively augmenting the datasets; for example, DeepMind used this strategy to increase the amount of data used to train its AlphaFold 2 model.
However, these methods for correcting imbalances in training or test datasets can themselves lead to problems. Although the practice might help stabilize training if done properly, it can also be very dangerous, because it might strongly bias the model, and there is always the risk that the "interpolated" data are simply wrong. Moreover, the bias tends to yield overly optimistic performance estimates that then translate into poor performance on real-world problems, and the augmented (now partially synthetic) dataset might perpetuate biases inherent in the original data.
The compromise is to generate data very carefully, perhaps keeping it not too different from the existing data. However, this might introduce correlations and lead to data leakage, and it might not end up serving the purpose of smoothly covering the input domain.
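If synthetic oversampling is used at all, one basic precaution is to generate synthetic points only inside the training split, after the test set has been set aside, so that the performance estimate is computed on real data only. A sketch using imbalanced-learn's SMOTE on a made-up dataset (one of several possible tools):

```python
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# A hypothetical imbalanced dataset standing in for real scientific data.
X, y = make_classification(
    n_samples=2000, n_features=20, weights=[0.95, 0.05], random_state=0
)

# Put the test set aside first; it must remain purely real data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Generate synthetic minority-class points from the training split ONLY,
# so that nothing derived from test examples can leak back into training
# and the performance estimate is not inflated by synthetic points.
X_train_aug, y_train_aug = SMOTE(random_state=42).fit_resample(X_train, y_train)

# model.fit(X_train_aug, y_train_aug)   # train on the augmented data
# model.score(X_test, y_test)           # evaluate on real, held-out data only
```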
Setting up standards to address these challenges and problems
I wrote this section guided by this general discussion, this article focused specifically on AI applications to chemistry, and this overview of best practices for reproducibility in the AI sciences.
It turns out that researchers are well aware of all the problems discussed above, and have proposed a checklist of standards for reporting AI-based science. The list includes questions on data quality, modeling details, and risks of data leakage. There is also a call for research papers using AI to make their methods and data fully open. However, achieving full reproducibility is challenging in any computational science, particularly in AI.
It is worth noting that reproducibility does not guarantee correct results; it only ensures self-consistent ones. High-impact AI models created by big companies are often not immediately available, and researchers may be hesitant to release their code due to concerns about public scrutiny or simply because of IP issues. Despite these challenges, the push for transparency and rigorous standards in AI-based science continues, and it is essential.
While AI and ML have the potential to revolutionize scientific research, several signs point to misuse and bad practices that are detrimental. It would not hurt to cover these points in university courses, at least superficially as I have done here, while more specialized curricula could cover them in greater depth.
Identifying these problems was crucial, and now we enter the phase of seeking and implementing solutions. Essentially, this entails delineating rigorous standards and adequately training the researchers who use, and especially those who develop, AI systems.
Related literature
To write this article I relied mainly on these resources and on the examples and links inside them:
Six factors affecting reproducibility in life science research and how to handle them
Is AI leading to a reproducibility crisis in science?
Best practices in machine learning for chemistry – Nature Chemistry
A Step Toward Quantifying Independently Reproducible Machine Learning Research