AI and the Lack of Diversity in Data: Implications and the Path Forward for Rare Disease Research

Artificial intelligence (AI) and machine learning (ML) are poised to play a pivotal role in clinical trials by automating patient recruitment, data collection and analysis. However, most AI algorithms rely primarily on databases drawn from populations of North American and European origin, neglecting the genetic and morphological diversity found worldwide. In the case of rare diseases, this dearth of cross-cultural data poses a significant challenge to both diagnosis and clinical trial recruitment.

We spoke with genomics data scientist Harsha Rajasimha, Ph.D., Founder and Executive Chairman of IndoUSrare, about the risks of developing AI/ML algorithms based on biased data, as well as efforts underway to improve collaboration on the collection and sharing of global health data.

What are the challenges of the lack of diversity in clinical trial data, particularly as you’re trying to build or grow your AI algorithms to improve clinical trials?

Rajasimha: I’ve been working in the genetics space since 2000, when the first draft of the Human Genome Project was completed. What we have learned over the years is that most of the genomic sequencing data comes from Western, Caucasian populations, so there is a significant dearth of diverse genetic data in these databases.

Moving into the clinical trial space, where there are over 400,000 clinical trials ongoing or completed, the majority suffer from significant operational challenges: it costs two and a half billion dollars to bring one drug through the regulatory pipeline. Nine out of 10 drugs fail in what is called the “Valley of Death,” during phase 1, phase 2 and phase 3 trials.

What this means for rare diseases is that it takes 10 to 12 years from when a gene mutation or a biomarker is discovered in the laboratory to when a treatment option becomes available through the regulatory process. This is unsustainable. The drug development process is very expensive, super time consuming, and it results in really high drug prices, as the cost of research and development has to be recovered.

AI has the potential to significantly accelerate this process and be an important part of the solution. The challenge is that 80% or more of clinical trials have historically relied on the Western population when it comes to patient recruitment and setting up investigator sites. These sites are primarily set up in the U.S. and the European Union, with the rest of the world combined comprising less than 20% of investigator sites, historically. In addition, we know that clinical trials have relied primarily on academic medical center investigator sites, and these academic centers are concentrated in Boston, Houston and the Bay Area, meaning only about 2% of the population has an equitable chance of accessing these clinical trials. What that generates is a rich data set that is poor in diversity.

Thus far, 80% or more of clinical trial participants have been Caucasian male adults from affluent socioeconomic backgrounds. That’s why we are cautioning against this mad rush to build new AI and ML tools without paying attention to the data we’re using to train these models.

What can we do to improve access and the data collected? Are there any tools or global efforts available to help sponsors reach underserved populations, make them aware of the opportunities and enroll them?

Rajasimha: That is the question that is screaming out, and unfortunately the answers are not black and white. Right now, there is an increasing move toward globalization of clinical research, and the FDA has made it a top priority to increase diversity, equity, inclusion and access in clinical trials. It has also been seeking to establish a global clinical trial network.

What we are suggesting is that data scientists understand that it’s not the algorithm that’s a big deal, it’s the data, the domain knowledge and interpretation of the data that’s the big deal. We have a variety of AI and ML algorithms which are well established, and new algorithms can be crafted. That’s not a problem. But if you don’t have the data, those algorithms are no good. Part of the solution is technology based and part of it is overcoming regulatory barriers surrounding data sharing, data ownership and cross-border collaboration. But there is no blanket solution.

There may be policy-level changes that need to happen; there are also project-level deliberations that can happen. For example, a rare disease might have a global patient registry that includes data from patients all over the world in one central database sitting on a U.S.-based cloud server. But what regulation protects patients residing in India whose data is sitting in these databases? These questions are often not considered closely, and that can cause trouble down the road. So, we need both policy advocacy and cross-border collaboration to craft proper collaboration agreements among institutes and countries, with ethics committees and data protection and data sharing agreements. This is difficult work, but it has to be done if we are going to make this equitable.

So rare disease research may be a step ahead due to existing databases that collect global data?

Rajasimha: In a limited number of cases. Most of these databases are regionally oriented and typically funded by a government. They are very rare outside of the U.S. and Europe, and these efforts need to be encouraged through cross-border collaborations at the government level but also at the ground level among various stakeholders. That’s why we are bringing these stakeholders together at the Bridging Rare Summit this October 29-30. It’s a two-day immersive program with key opinion leaders from the U.S. and India representing medical research and patient advocacy groups and industry.

There are several rare disease conferences. We introduced this summit because we were not seeing a focus on cross-border collaboration. The rare disease revolution began in 1983 with the Orphan Drug Act, signed by President Ronald Reagan here in the U.S., and we are celebrating the 40th anniversary this year. We now have 1,100 orphan drugs approved by the FDA, and that covers less than 10% of all rare diseases. There are about 11,000 named rare diseases, and most of them have no approved treatment option. Most are still in research and development, and that means they need more global collaboration and data, because rare disease patient populations are very small in any single country. But investments in gene and cell therapies have been big in this space over the last few years, so there is a huge opportunity ahead of us. Multiple companies are trying to develop therapies for the same diseases, but there are not enough patients in any one country to enroll in all those clinical trials. We’ve got to engage and make sure that all countries are aligned and participating in this revolution if this industry is going to flourish and patients are going to have treatment options for these severe and debilitating illnesses.

In order to reach a global patient base, are you looking at the use of decentralized clinical trials (DCT) to reach a more diverse pool of subjects, the collection of real world data to support drug development, or is it a combination of both?

Rajasimha: It’s a combination of both. If you look at the patient journey with a rare disease, the goal is to have a good treatment option or even a cure like a gene therapy. But you often start with a small number of patients who have been diagnosed with a rare genetic mutation, and they go through a diagnostic odyssey, with many patients waiting five to seven years before they get their diagnosis. Then we form a patient registry study to understand the physiology of the disease, which is a necessary prerequisite before any clinical research or development can be conducted, or even investments can be made.

Once you’ve completed a patient registry and natural history study leading to the clinical trial phase, that’s where decentralization becomes valuable, not just in rare disease scenarios but also in common diseases: making it a patient-centric study and making sure the patients don’t have to travel to the clinical trial site repeatedly. This is critical in a global clinical trial, especially if the number of investigator sites is limited and/or sites are not available in certain countries.

Real world data captured in the context of the patient registry, natural history study or investigator-initiated trials is critical, but there also is data that can be gleaned from existing medical records, insurance claims and prescription drug data. All that real world evidence is important.

You spoke about diagnoses. What do you see in terms of AI’s potential role in improving diagnoses of rare diseases?

Rajasimha: AI has enormous potential to influence every step of the R&D process, starting with getting the proper diagnosis. Often, diagnosis comes through rapid newborn screening and a confirmatory test, which is the lowest hanging fruit. In the U.S., we have a universal screening panel that includes about 70 diseases. Other countries have similar programs, but most countries do not yet have a mandatory newborn screening program.

Next is diagnosis through genetic testing. If there is a causal gene mutation associated with a monogenic disease, we can sequence the genes, though that is only done if there’s a reason to do so, for example, if the child is having symptoms or developmental delays. It has become very cost-effective to conduct those tests, so we are often contacted by families who have already gone through sequencing; they know there is a gene mutation and want to know what it means. If we can identify other patients with a similar gene mutation, that becomes part of the research cohort.

AI can play a role in making the base diagnosis, and some tools already exist, such as GestaltMatcher, which looks at facial images. If you have good quality data from different races and ethnicities, then a model can be trained to identify a person with Down syndrome just by looking at images of the face, for example, because there are distinct craniofacial features. But to get to that AI model, you need to start with good quality, sufficiently large data sets to train the algorithms.
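The point about data quality across races and ethnicities can be made concrete: before trusting any diagnostic image model, one would want to check its accuracy separately for each demographic group it will be applied to. A minimal sketch of such a check (the predictions and group labels below are hypothetical, not taken from GestaltMatcher or any real tool):

```python
from collections import defaultdict

def per_group_accuracy(predictions, labels, groups):
    """Compute classification accuracy separately for each demographic group.

    A large gap between groups is a warning sign that the training
    data under-represented some populations.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for pred, label, group in zip(predictions, labels, groups):
        total[group] += 1
        if pred == label:
            correct[group] += 1
    return {g: correct[g] / total[g] for g in total}

# Hypothetical evaluation results for a facial-phenotype classifier,
# split across two demographic cohorts "A" and "B".
preds  = [1, 1, 0, 1, 0, 0, 1, 0]
truth  = [1, 1, 0, 0, 0, 1, 1, 1]
cohort = ["A", "A", "A", "A", "B", "B", "B", "B"]

print(per_group_accuracy(preds, truth, cohort))
# {'A': 0.75, 'B': 0.5}  (cohort B fares worse, suggesting biased training data)
```

The same per-group breakdown applies to any metric (sensitivity, specificity), and in practice would be run over held-out patients from each population rather than aggregate accuracy alone.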

You mentioned the Bridging Rare Summit. Are there other efforts underway to bring together leaders from different countries and healthcare organizations to improve these data sets?

Rajasimha: We are working with governments at both the executive and the legislative branch levels. One example is looking at census data: is it capturing race and ethnicity at the level of detail that is necessary? For example, the category “Asian” covers almost 3.5 billion people in the world, whereas indigenous Native Americans represent about 1% of the U.S. population. So we need to break some of this data down further so we understand which diseases are more prevalent in certain ethnic groups. Sickle cell anemia disproportionately affects African, Middle Eastern and Asian populations, and within Asia it is more likely to affect certain tribal populations, whereas cystic fibrosis and ALS disproportionately affect Caucasian populations. We need to make sure we are capturing data at the necessary level of granularity to train AI algorithms properly and enhance our understanding of which diseases affect which populations more or less.

We work with governments to educate them from our perspective on what kind of changes may be necessary to capture demographic data, medical history and family history data. We also work with the legislators to see how we can improve newborn screenings by adding more rare disease screenings, so there can be more cross-border thinking instead of nationalistic thinking, because rare diseases require cross-border collaboration.
