AI is good at finding complex associations in data. It can examine not just the individual factors that contribute to a problem but also the secondary and tertiary factors that help explain a condition. This ability makes AI particularly useful in bioinformatics.
Bioinformatics is the field that uses computational analysis to understand biology. Thanks to recent improvements in data-acquisition technology, many biological datasets have become available to researchers. These include human genomics data, transcriptomics (gene expression) data, and proteomics data from clinical samples (collectively, omics data), as well as data from cell assays and animal studies. In parallel, a great deal of effort is going into curating these data and storing them in publicly shared databases to advance science. For example, one of the largest efforts is a cancer genomics program, The Cancer Genome Atlas (TCGA), which generated multiple types of omics data from over 20,000 primary cancer and matched normal samples spanning 33 cancer types. Bioinformatics analysis tries to understand associations within such data and to build new biological hypotheses from them, with the eventual goal of solving problems in healthcare and biotechnology.
Traditionally, bioinformatics focused on univariate testing, where you compare one factor at a time between groups, e.g., whether gene A is significantly differentially expressed between cancer and non-cancer populations. This is helpful when the significant feature is critical to the problem and inhibiting that one feature would alter the underlying condition. However, biology is rarely that simple, and univariate tests are limited in how well they can capture the multiple, interacting factors associated with a problem.
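To make the idea of univariate testing concrete, here is a minimal sketch in Python that tests one gene at a time for a difference between two groups. Everything in it is an assumption made for illustration: the gene names, the synthetic "expression" values, and the choice of a permutation test with a Bonferroni correction (real studies often use other tests and corrections).

```python
import random

def mean(values):
    return sum(values) / len(values)

def permutation_p_value(group_a, group_b, n_perm, rng):
    """Two-sided permutation test on the difference in group means."""
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # randomly reassign samples to the two groups
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add-one correction so the p-value is never exactly zero.
    return (extreme + 1) / (n_perm + 1)

rng = random.Random(42)

# Hypothetical genes; only gene_A is simulated with a real group difference.
genes = ["gene_A", "gene_B", "gene_C", "gene_D"]
shift = {"gene_A": 3.0}

# Synthetic expression values: 20 "cancer" and 20 "normal" samples per gene.
cancer = {g: [rng.gauss(shift.get(g, 0.0), 1.0) for _ in range(20)] for g in genes}
normal = {g: [rng.gauss(0.0, 1.0) for _ in range(20)] for g in genes}

# One independent test per gene: this is the univariate part.
p_values = {g: permutation_p_value(cancer[g], normal[g], 2000, rng) for g in genes}

# Bonferroni correction: divide the significance level by the number of tests.
alpha = 0.05 / len(genes)
for g in genes:
    flag = "significant" if p_values[g] < alpha else "not significant"
    print(f"{g}: p = {p_values[g]:.4f} ({flag})")
```

Each gene is judged on its own here, so any effect that only shows up when genes are considered together is invisible to this kind of analysis.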
With TCGA data, for instance, AI can build a model that classifies the cancer types of 11,000 patients using expression data from more than 20,000 genes. Such a model predicts cancer type far more accurately than looking at individual gene expression levels alone.
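The gap between a multi-gene model and single-gene analysis can be illustrated with a toy example. The sketch below does not use TCGA data or any particular published model. It assumes two hypothetical genes whose combination (one high, the other low) determines the class, and it uses a simple k-nearest-neighbors classifier as a stand-in for "AI". All data are synthetic.

```python
import random

def make_samples(n, rng):
    """Synthetic expression values for two hypothetical genes.

    The label depends on the combination of the two genes (one high,
    the other low), not on either gene alone.
    """
    samples, labels = [], []
    for _ in range(n):
        a = rng.choice([-1.0, 1.0])
        b = rng.choice([-1.0, 1.0])
        samples.append((a + rng.gauss(0, 0.3), b + rng.gauss(0, 0.3)))
        labels.append(1 if a != b else 0)
    return samples, labels

def knn_predict(train_x, train_y, x, features, k=5):
    """Majority vote among the k nearest training samples, using only `features`."""
    dists = sorted(
        (sum((row[f] - x[f]) ** 2 for f in features), label)
        for row, label in zip(train_x, train_y)
    )
    votes = sum(label for _, label in dists[:k])
    return 1 if votes * 2 > k else 0

def accuracy(train_x, train_y, test_x, test_y, features):
    """Fraction of test samples classified correctly."""
    hits = sum(
        knn_predict(train_x, train_y, x, features) == y
        for x, y in zip(test_x, test_y)
    )
    return hits / len(test_y)

rng = random.Random(0)
train_x, train_y = make_samples(300, rng)
test_x, test_y = make_samples(200, rng)

# Same classifier, different inputs: one gene at a time vs. both together.
acc_gene_a = accuracy(train_x, train_y, test_x, test_y, features=[0])
acc_gene_b = accuracy(train_x, train_y, test_x, test_y, features=[1])
acc_both = accuracy(train_x, train_y, test_x, test_y, features=[0, 1])

print(f"gene A only: {acc_gene_a:.2f}")
print(f"gene B only: {acc_gene_b:.2f}")
print(f"both genes:  {acc_both:.2f}")
```

In this constructed case each gene alone is about as informative as a coin flip, while the two together separate the classes almost perfectly. Real expression data are far noisier and higher-dimensional, but the same principle is why models that combine many genes can outperform gene-by-gene comparisons.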
A similar approach, finding patterns in omics data that characterize certain patient populations, can be applied to drug response prediction. Immune checkpoint inhibitors are used for a broad range of cancer patients, although the reported response rate is only 20-40%. Identifying the population that responds to the treatment helps patients with treatment decisions and avoids unnecessary side effects. Many studies are underway to find specific biomarkers that enable personalized medicine by integrating multi-omics data, including genomics, transcriptomics, and epigenetics. AI can be used to make sense of such complex data.
AI can also predict the properties of molecules. DNA, RNA, and proteins are biological molecules essential for reproduction, metabolism, and maintaining cell homeostasis in living organisms. Using AI to model chemical structures and the affinity between molecules is useful for predicting drug efficacy or finding new drugs for a target disease protein. Predicting the structure of DNA, RNA, or proteins is another problem that computational scientists have long been trying to solve. Recently, AlphaFold2, a neural network-based AI model, showed near-experimental accuracy in predicting protein 3D structures. The model opened the door to using predicted structures to further investigate protein structural biology in bioinformatics.
One of the challenges AI faces in bioinformatics is the limited amount of curated data. Compared to other kinds of data, such as consumer behavior data (e.g., choice of movies), the amount of biological data is still small, and the data often come in different formats depending on the institute that generated them. Well-curated database systems with associated clinical metadata would help AI be used more widely in bioinformatics. With such efforts, AI can become a powerful tool in this space.
In the next few articles, I’d like to talk in more detail about examples of AI in bioinformatics.