Abstract
Background: SARS-CoV-2 has paralyzed mankind due to its high transmissibility and associated mortality, causing millions of infections and deaths worldwide. Searching for gene expression biomarkers in the host transcriptional response to infection may help to understand the mechanisms by which the virus causes COVID-19. This study proposes a methodology that integrates RNA-Seq datasets from patients with SARS-CoV-2 infection, patients with other respiratory diseases, and healthy controls.
Methods: The proposed pipeline exploits the functionality of the ‘KnowSeq’ R/Bioc package to integrate different data sources and obtain a significantly larger gene expression dataset, giving the results higher statistical significance and robustness than previous studies in the literature. A detailed preprocessing step was carried out to homogenize the samples, after which a clinical decision system for SARS-CoV-2 was built. The system relies on machine learning techniques, namely a feature selection algorithm and a supervised classifier. It uses the most differentially expressed genes among the different diseases (including SARS-CoV-2) to develop a four-class classifier.
Results: The resulting multiclass classifier can discern SARS-CoV-2 samples, reaching an accuracy of 91.5%, a mean F1-score of 88.5%, and a SARS-CoV-2 AUC of 94% using only 15 genes as predictors. A biological interpretation of the extracted gene signature reveals relations with processes involved in viral responses.
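To make the three reported metrics concrete, the following Python sketch shows one common way to compute accuracy, a mean (macro-averaged) F1-score, and a one-vs-rest AUC for a four-class problem. It is only an illustration: the function names and the toy labels are hypothetical, the abstract does not state which averaging scheme was used for the mean F1-score, and nothing here reproduces the study's data or results.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

def accuracy(cm):
    """Fraction of samples on the diagonal of the confusion matrix."""
    return np.trace(cm) / cm.sum()

def macro_f1(cm):
    """Unweighted mean of the per-class F1-scores (assumed averaging scheme)."""
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp
    fn = cm.sum(axis=1) - tp
    denom = 2 * tp + fp + fn
    f1 = np.divide(2 * tp, denom, out=np.zeros_like(tp), where=denom > 0)
    return f1.mean()

def one_vs_rest_auc(scores, is_positive):
    """AUC as the probability that a positive sample outscores a negative one
    (ties count as one half), e.g. SARS-CoV-2 versus all other classes."""
    scores = np.asarray(scores, dtype=float)
    is_positive = np.asarray(is_positive, dtype=bool)
    pos, neg = scores[is_positive], scores[~is_positive]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

# Hypothetical toy labels for four classes (0..3); not data from the study.
y_true = [0, 0, 1, 1, 2, 2, 3, 3]
y_pred = [0, 0, 1, 2, 2, 2, 3, 0]
cm = confusion_matrix(y_true, y_pred, n_classes=4)
print(accuracy(cm), macro_f1(cm))
```

The AUC helper takes the classifier's score for a single class of interest together with a boolean mask marking which samples truly belong to that class.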
Conclusion: This work proposes a COVID-19 gene signature composed of 15 genes, selected by applying the ‘minimum Redundancy Maximum Relevance’ feature selection algorithm. The integration of several RNA-Seq datasets was successful, yielding a considerably larger number of samples and therefore greater statistical significance than in previous studies. A biological interpretation of the selected genes is also provided.
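As a rough illustration of the idea behind ‘minimum Redundancy Maximum Relevance’ selection, the Python sketch below greedily adds the feature whose relevance to the class labels, minus its average redundancy with the features already chosen, is largest. It rests on several assumptions: relevance is measured with the correlation ratio and redundancy with the absolute Pearson correlation, whereas the original formulation of the algorithm uses mutual information; the function names and the synthetic data are hypothetical; and it is not the implementation used by the ‘KnowSeq’ package.

```python
import numpy as np

def correlation_ratio(x, y):
    """Share of the variance of x explained by the class labels y (0 to 1)."""
    grand = x.mean()
    between = sum((y == c).sum() * (x[y == c].mean() - grand) ** 2
                  for c in np.unique(y))
    total = ((x - grand) ** 2).sum()
    return between / total if total > 0 else 0.0

def mrmr(X, y, k):
    """Greedy mRMR (difference form): relevance minus mean redundancy.

    Assumes no column of X is constant, since the Pearson correlation
    is undefined for constant features.
    """
    n_features = X.shape[1]
    relevance = np.array([correlation_ratio(X[:, j], y)
                          for j in range(n_features)])
    corr = np.abs(np.corrcoef(X, rowvar=False))
    selected = [int(np.argmax(relevance))]  # start with the most relevant feature
    while len(selected) < k:
        candidates = [j for j in range(n_features) if j not in selected]
        redundancy = corr[np.ix_(candidates, selected)].mean(axis=1)
        scores = relevance[candidates] - redundancy
        selected.append(candidates[int(np.argmax(scores))])
    return selected

# Synthetic example (not gene expression data): four classes, 50 samples each.
rng = np.random.default_rng(0)
y = np.repeat(np.arange(4), 50)
f0 = y + rng.normal(0, 0.3, y.size)                  # strongly informative
f1 = f0 + rng.normal(0, 0.05, y.size)                # near-duplicate of f0
f2 = np.isin(y, [1, 2]).astype(float) + rng.normal(0, 0.3, y.size)  # complementary
f3 = rng.normal(0, 1, y.size)                        # pure noise
X = np.column_stack([f0, f1, f2, f3])
selected = mrmr(X, y, k=2)
print(selected)
```

In this toy setup the second pick is the complementary feature rather than the near-duplicate of the first, which is the behaviour the redundancy penalty is meant to produce.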
Keywords: COVID-19, RNA-Seq, machine learning, feature selection, gene signature, WHO.