DEGoldS: A Workflow to Assess the Accuracy of Differential Expression
Analysis Pipelines through Gold-standard Construction

Mikel Hurtado; Fernando Mora-Márquez; Álvaro Soto; Daniel Marino; Pablo G. Goicoechea; Unai López de Heredia

doi:10.2174/1574893618666230222122054

Abstract

Background: Non-model species that lack public genomic resources pose an additional challenge for bioinformatic analysis, which parameter tuning and the use of alternative software may help to overcome. In RNA-seq-based differential gene expression analysis in particular, parameter tuning can strongly affect the final results, and this effect should be evaluated. However, the lack of gold-standard datasets with known expression patterns hampers robust evaluation of pipelines and parameter combinations.

Objective: The aim of the presented workflow is to identify the most accurate differential expression analysis pipeline among several alternatives. To this end, it implements an automatic procedure of gold-standard construction for simulation-based benchmarking.

Methods: The workflow, which is divided into four steps, simulates read libraries with known expression values to enable the construction of gold-standards for benchmarking pipelines in terms of true and false positives. We validated the workflow with a case study consisting of real RNA-seq libraries of radiata pine, a forest tree species with no publicly available reference genome.
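The benchmarking idea described in the Methods can be sketched in a few lines. The example below is a minimal illustration, not DEGoldS code: DEGoldS itself consists of Bash and R scripts, and Python is used here only for brevity. The function names (`build_gold_standard`, `score_pipeline`), the fraction of differentially expressed transcripts, and the fold change are all hypothetical choices made for the example. It assigns a known fold change to a random subset of transcripts and then scores a pipeline's calls against that known truth in terms of true and false positives.

```python
import random


def build_gold_standard(transcript_ids, de_fraction=0.1, fold_change=4.0, seed=0):
    """Assign a known fold change to a random subset of transcripts.

    Hypothetical helper: returns a dict mapping each transcript ID to its
    true fold change, where 1.0 means "not differentially expressed".
    """
    rng = random.Random(seed)
    ids = sorted(transcript_ids)          # fixed order keeps sampling reproducible
    n_de = round(len(ids) * de_fraction)  # number of truly DE transcripts
    de_ids = set(rng.sample(ids, n_de))
    return {t: (fold_change if t in de_ids else 1.0) for t in ids}


def score_pipeline(gold, called_de):
    """Compare the transcripts a pipeline calls DE against the gold standard."""
    truly_de = {t for t, fc in gold.items() if fc != 1.0}
    called = set(called_de) & set(gold)   # ignore IDs absent from the gold standard
    tp = len(called & truly_de)           # true positives
    fp = len(called - truly_de)           # false positives
    fn = len(truly_de - called)           # false negatives
    tn = len(gold) - tp - fp - fn         # true negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn,
            "precision": precision, "recall": recall}


if __name__ == "__main__":
    # Toy gold standard with two truly DE transcripts ("a" and "c").
    gold = {"a": 4.0, "b": 1.0, "c": 4.0, "d": 1.0}
    # A hypothetical pipeline that calls "a" and "b" as DE.
    print(score_pipeline(gold, ["a", "b"]))
```

Running `score_pipeline` once per pipeline or parameter combination against the same gold standard yields directly comparable counts, which is the kind of comparison the workflow is meant to support.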

Results: The workflow is available as a freeware application (DEGoldS) consisting of sequential Bash and R scripts that can run on any UNIX platform. The workflow was able to construct a valid gold-standard from real count data. Additionally, benchmarking showed that slight pipeline modifications produced remarkable differences in the outcome of differential expression analysis.

Conclusion: The presented workflow solves the issues associated with robust gold-standard construction for benchmarking in differential expression experiments and can accommodate a wide range of pipelines and parameter combinations.




Journal: Current Bioinformatics
Publisher Name: Bentham Science Publisher
DOI: https://dx.doi.org/10.2174/1574893618666230222122054
Print ISSN: 1574-8936
Online ISSN: 2212-392X
