Abstract
Background and Objective: Colorectal cancer (CRC) is a common malignant tumor of the digestive system and is associated with high morbidity and mortality. Early prediction of colorectal adenoma (CRA), a precancerous disease in most patients with CRC, provides an opportunity to plan appropriate strategies for prevention, early diagnosis, and treatment. This study aimed to develop a machine learning model for predicting CRA that could help physicians identify high-risk patients, make informed decisions, and prevent CRC.
Methods: Patients who had undergone a colonoscopy at the Sixth People Hospital of Shanghai, China, from July 2018 to November 2018 were instructed to fill out a questionnaire. A classification model based on the gradient boosting decision tree (GBDT) was developed to predict CRA. This model was compared with three other models, namely, random forest (RF), support vector machine (SVM), and logistic regression (LR). The area under the receiver operating characteristic curve (AUC) was used to evaluate the performance of the models.
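For readers unfamiliar with the evaluation metric, the AUC can be read as the probability that a randomly chosen patient with CRA receives a higher predicted score than a randomly chosen patient without CRA, with ties counted as one half. The following is a minimal sketch of that definition, not the code used in this study; the function name `roc_auc` and the toy labels and scores are hypothetical and serve only as an illustration.

```python
def roc_auc(labels, scores):
    """Pairwise (Mann-Whitney) estimate of the area under the ROC curve.

    labels: 1 for a positive case (e.g., CRA), 0 for a negative case.
    scores: predicted scores, where higher means more likely positive.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive case ranked above negative case
            elif p == n:
                wins += 0.5   # a tie counts as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical toy data, not taken from the study.
labels = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.3, 0.6, 0.7, 0.2, 0.8]
auc = roc_auc(labels, scores)
print(f"AUC = {auc:.3f}")
```

In this toy example, 8 of the 9 positive-negative pairs are ranked correctly, so the AUC is 8/9. Under k-fold cross-validation, a value of this kind would typically be computed on each held-out fold and then averaged.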
Results: Among the 245 included patients, 65 had CRA. With 10-fold cross-validation, the AUCs of GBDT, RF, SVM, and LR were 0.8131, 0.74, 0.769, and 0.763, respectively. An online prediction service, the CRA Inference System, was also built to put the proposed solution into practice for patients with CRA.
Conclusion: Four classification models for CRA prediction were developed and compared, and the GBDT model showed the highest performance. Implementing a GBDT model for screening can reduce time and financial costs and help physicians identify high-risk groups for primary prevention.
Keywords: Colorectal adenoma, colorectal cancer, gradient boosting decision tree, prediction, clinical data, early prevention.
Graphical Abstract