Abstract
Background: Non-model species that lack public genomic resources are at a disadvantage in bioinformatics analyses, which parameter tuning and the use of alternative software could help to offset. In RNA-seq-based differential gene expression analysis in particular, parameter tuning can strongly affect the final results, and this effect should be evaluated. However, the lack of gold-standard datasets with known expression patterns hampers robust evaluation of pipelines and parameter combinations.
Objective: The aim of the presented workflow is to identify the most accurate differential expression analysis pipeline among several alternatives. To achieve this objective, an automatic procedure for constructing a gold standard for simulation-based benchmarking is implemented.
Methods: The workflow, which is divided into four steps, simulates read libraries with known expression values to enable the construction of gold standards for benchmarking pipelines in terms of true and false positives. We validated the workflow with a case study consisting of real RNA-seq libraries of radiata pine, a forest tree species with no publicly available reference genome.
Results: The workflow is available as a freeware application (DEGoldS) consisting of sequential Bash and R scripts that can run on any UNIX platform. The presented workflow proved able to construct a valid gold standard from real count data. Additionally, benchmarking showed that slight pipeline modifications produced marked differences in the outcome of differential expression analysis.
Conclusion: The presented workflow solves the issues associated with robust gold-standard construction for benchmarking in differential expression experiments and can accommodate a wide range of pipelines and parameter combinations.
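To illustrate what benchmarking "in terms of true and false positives" can look like in practice, the following minimal Python sketch compares the genes a pipeline calls differentially expressed against a gold standard. It is not part of DEGoldS, which is implemented in Bash and R; the function name `benchmark_calls`, the gene identifiers, and the derived metrics (precision and recall) are assumptions introduced here for illustration only.

```python
def benchmark_calls(gold_de, called_de, all_genes):
    """Score one pipeline's DE calls against a gold standard.

    gold_de   : genes known to be differentially expressed (hypothetical input)
    called_de : genes the pipeline reported as differentially expressed
    all_genes : every gene that was tested
    """
    gold_de, called_de, all_genes = set(gold_de), set(called_de), set(all_genes)

    tp = len(gold_de & called_de)              # called and truly DE
    fp = len(called_de - gold_de)              # called but not truly DE
    fn = len(gold_de - called_de)              # truly DE but missed
    tn = len(all_genes - gold_de - called_de)  # correctly left uncalled

    # Guard against empty denominators when nothing is called or nothing is DE.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0

    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall}


if __name__ == "__main__":
    # Toy data: six tested genes, three of them truly DE.
    genes = {"g1", "g2", "g3", "g4", "g5", "g6"}
    gold = {"g1", "g2", "g3"}
    pipeline_a = {"g1", "g2", "g4"}
    print(benchmark_calls(gold, pipeline_a, genes))
```

Running the same function over the calls of each candidate pipeline would give a table of counts that can be compared directly, which is the kind of comparison the gold standard is meant to enable.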