In the last post, I touched briefly on a few examples of AI in bioinformatics. One area where we can leverage AI is diagnostics. Good diagnostics use simple test information from easy-to-collect samples (e.g., blood or urine) to guide subsequent decisions on patient treatment. When enough data with corresponding decisions is available, AI can choose the best combination of input data to accurately model those decisions.
Today, I’d like to discuss a diagnostics example in more detail: a step-by-step guide to using NeoPulse Manager to predict active TB cases.
For active tuberculosis (TB) diagnostics, we used routine blood test data. Active TB can also be diagnosed by chest X-ray, molecular testing, or microbiological culture, but these methods are not always available in under-resourced countries, or they take too long to return results. The goal of the project was to develop an affordable and rapid TB diagnostic that uses routine blood test results. We obtained data from Wu et al. (2019) and built a convolutional neural network to predict whether an individual has active TB. Age, gender, and 58 blood-derived readings were used as model inputs. The steps below describe how it was done.
Step 1: Data preparation.
You need a training dataset and a test dataset. The training set is used to train the model, and the test set is used to validate it. Ideally, an independent test set should be used, but here we used a subset of the data defined by the original publication. The data was first normalized so that all features have a (near) normal distribution and fall within a similar range, which suits machine learning. NeoPulse can visualize data distributions and normalize each feature in many ways (e.g., z-score, log-scale, or division by the mean).
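To make the normalization options concrete, here is a minimal sketch of the three transforms named above, written in plain Python. It is not NeoPulse code: the function names and the sample readings are hypothetical, and the log transform uses log(1 + x), which is one common choice and assumes non-negative readings.

```python
import math
import statistics


def z_score(values):
    """Center on the mean and scale by the sample standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


def log_scale(values):
    """Natural log of (1 + v); assumes every reading is non-negative."""
    return [math.log1p(v) for v in values]


def divide_by_mean(values):
    """Rescale so that the feature has a mean of 1."""
    mean = statistics.fmean(values)
    return [v / mean for v in values]


# Hypothetical readings for a single blood-derived feature.
readings = [4.0, 6.0, 8.0, 10.0, 12.0]

z_values = z_score(readings)
log_values = log_scale(readings)
scaled_values = divide_by_mean(readings)
```

A common practice is to compute the mean and standard deviation on the training set only and reuse them for the test set, so that no information from the test data leaks into training.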
Step 2: Model design and training.
Once the data looks good, it’s time for training. You can select your favorite machine learning algorithm from the drop-down menu or let “auto” choose one based on your data. By default, it creates 4 models with 20 iterations each (80 models total). Because this was a relatively small dataset (425 observations), training took only 5 minutes.
Step 3: Evaluation.
Now that you have models, let’s evaluate them with the test dataset you set aside in Step 1. The “A/B Testing” function in NeoPulse Manager makes it easy to select well-performing models and run them on the test data. You can visualize the results with metrics of your choice (e.g., sensitivity, specificity, or accuracy).
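For readers who want to see how these metrics are defined, the sketch below computes sensitivity, specificity, and accuracy from true and predicted labels. It is an illustration only, not the A/B Testing implementation; the labels are made up, with 1 standing for active TB and 0 for no active TB.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = active TB)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def diagnostic_metrics(y_true, y_pred):
    """Return sensitivity, specificity, and accuracy.

    Assumes both classes appear in y_true; otherwise a denominator is zero.
    """
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return {
        "sensitivity": tp / (tp + fn),  # share of active TB cases detected
        "specificity": tn / (tn + fp),  # share of non-TB cases correctly cleared
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }


# Made-up test labels and model predictions.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]

results = diagnostic_metrics(y_true, y_pred)
```

In a diagnostic setting, sensitivity and specificity are usually reported alongside accuracy, because accuracy alone can look good on imbalanced data even when many true cases are missed.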
Step 4: Repeat! (Only if the model needs improvement).
“Auto” mode usually does a decent job, but there is always room for improvement. Depending on your goal, you can go back to Step 1, Step 2, or both to improve your model. At Step 1, you can re-evaluate normalization, remove certain features, and/or further engineer features. At Step 2, you can choose other algorithms and/or modify parameters. These updates may require some previous experience, but the “Auto”-selected parameters give you a good starting point. In this TB example, we achieved 88% accuracy after slight modifications of the auto model, which outperformed the reported models.
Next step? As mentioned earlier, the test dataset should ideally be completely independent of the training set. We are currently seeking external datasets to test and update the model. Lastly, we can also examine which features (blood readings) contribute most to the model, in order to select the features that matter for the diagnostic. We believe that further improving the model and explaining it will lead to a new, affordable TB diagnostic. A similar approach can be applied to other infectious diseases and opens the door to treating patients more efficiently.
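One common way to examine which features a model relies on is permutation importance: shuffle one feature at a time and measure how much accuracy drops. The sketch below illustrates that general technique under stated assumptions; it is not necessarily the method we will use, and the toy model, feature values, and function names are all hypothetical.

```python
import random


def accuracy(model, rows, labels):
    """Fraction of rows for which the model's prediction matches the label."""
    correct = sum(1 for row, label in zip(rows, labels) if model(row) == label)
    return correct / len(labels)


def permutation_importance(model, rows, labels, n_repeats=10, seed=0):
    """Mean drop in accuracy when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    importances = []
    for j in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in rows]
            rng.shuffle(column)
            # Rebuild the rows with only feature j replaced by its shuffled values.
            shuffled = [row[:j] + [value] + row[j + 1:]
                        for row, value in zip(rows, column)]
            drops.append(baseline - accuracy(model, shuffled, labels))
        importances.append(sum(drops) / n_repeats)
    return importances


def toy_model(row):
    """Hypothetical stand-in for a trained model: uses feature 0, ignores feature 1."""
    return 1 if row[0] > 0.5 else 0


# Made-up data: two features per individual, labels produced by the toy model.
rows = [[0.1, 5.0], [0.9, 3.0], [0.2, 7.0], [0.8, 1.0], [0.3, 2.0], [0.7, 6.0]]
labels = [toy_model(row) for row in rows]

scores = permutation_importance(toy_model, rows, labels)
```

Here the second feature receives an importance of zero because the toy model never reads it, while shuffling the first feature can only lower accuracy. With a real model, features whose shuffling barely changes accuracy are candidates for removal.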
Wu J, Bai J, Wang W, Xi L, Zhang P, Lan J, Zhang L, and Li S. (2019). ATB discrimination: An in Silico Tool for Identification of Active Tuberculosis Disease Based on Routine Blood Test and T-SPOT.TB Detection Results. J Chem Inf Model. 59(11):4561-4568.