Current Bioinformatics

ISSN (Print): 1574-8936
ISSN (Online): 2212-392X

Research Article

DEGoldS: A Workflow to Assess the Accuracy of Differential Expression Analysis Pipelines through Gold-standard Construction

Author(s): Mikel Hurtado, Fernando Mora-Márquez, Álvaro Soto, Daniel Marino, Pablo G. Goicoechea and Unai López de Heredia*

Volume 18, Issue 4, 2023

Published on: 21 March, 2023

Page: [296 - 309] Pages: 14

DOI: 10.2174/1574893618666230222122054

Abstract

Background: Non-model species that lack public genomic resources face an additional handicap in bioinformatic analyses, which parameter tuning and the use of alternative software can help to mitigate. In RNA-seq-based differential gene expression analysis in particular, parameter tuning can strongly affect the final results, and this effect should be evaluated. However, the lack of gold-standard datasets with known expression patterns hampers robust evaluation of pipelines and parameter combinations.

Objective: The presented workflow aims to identify the most accurate differential expression analysis pipeline among several alternatives. To this end, it implements an automatic procedure for constructing gold standards for simulation-based benchmarking.

Methods: The workflow, which is divided into four steps, simulates read libraries with known expression values to enable the construction of gold-standards for benchmarking pipelines in terms of true and false positives. We validated the workflow with a case study consisting of real RNA-seq libraries of radiata pine, a forest tree species with no publicly available reference genome.
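
To illustrate what benchmarking "in terms of true and false positives" can look like, the following is a minimal Python sketch, not the DEGoldS implementation (which the abstract describes as Bash and R scripts). It assumes the gold standard is simply the set of genes simulated as differentially expressed and that each pipeline returns a set of genes it calls differentially expressed. The function name `benchmark_calls` and the gene identifiers are hypothetical.

```python
def benchmark_calls(gold_standard, pipeline_calls, all_genes):
    """Score one pipeline's differential expression calls against a gold standard.

    gold_standard  -- genes simulated as differentially expressed (assumed known)
    pipeline_calls -- genes the pipeline reports as differentially expressed
    all_genes      -- every gene included in the simulation
    """
    gold = set(gold_standard)
    called = set(pipeline_calls)
    universe = set(all_genes)

    tp = len(gold & called)             # true positives: called and truly DE
    fp = len(called - gold)             # false positives: called but not DE
    fn = len(gold - called)             # false negatives: DE but missed
    tn = len(universe - gold - called)  # true negatives: not DE, not called

    # Guard against division by zero when nothing is called or nothing is DE.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn,
            "precision": precision, "recall": recall}


# Hypothetical toy data: ten simulated genes, four of them truly DE.
all_genes = [f"g{i}" for i in range(1, 11)]
gold = {"g1", "g2", "g3", "g4"}

# Two hypothetical pipelines (or parameter combinations) to compare.
pipeline_a = {"g1", "g2", "g3", "g7"}
pipeline_b = {"g1", "g2"}

result_a = benchmark_calls(gold, pipeline_a, all_genes)
result_b = benchmark_calls(gold, pipeline_b, all_genes)
print("pipeline A:", result_a)
print("pipeline B:", result_b)
```

In this toy case pipeline A recovers more of the simulated genes at the cost of one false positive, while pipeline B makes no false calls but misses half of the gold standard. This is the kind of trade-off that a benchmark against a gold standard makes visible.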

Results: The workflow is available as a freeware application (DEGoldS) consisting of sequential Bash and R scripts that can run on any UNIX platform. The workflow was able to construct a valid gold standard from real count data. Additionally, benchmarking showed that slight pipeline modifications produced remarkable differences in the outcome of differential expression analysis.

Conclusion: The presented workflow solves the issues associated with robust gold-standard construction for benchmarking in differential expression experiments and can accommodate a wide range of pipelines and parameter combinations.
