Hardware Performance Evaluation of De novo Transcriptome Assembly Software in Amazon Elastic Compute Cloud

Fernando Mora-Márquez; José Luis Vázquez-Poletti; Víctor Chano; Carmen Collada; Álvaro Soto; Unai López de Heredia

doi:10.2174/1574893615666191219095817

Abstract

Background: Bioinformatics software for RNA-seq analysis has high computational requirements in terms of the number of CPUs, RAM size, and processor characteristics. Specifically, de novo transcriptome assembly demands a large computational infrastructure because of the massive data size and the complexity of the algorithms employed. Comparative studies on the quality of the transcriptomes yielded by de novo assemblers have been published, but they lack a hardware efficiency-oriented approach that would help select the assembly hardware platform in a cost-efficient way.

Objective: We tested the performance of two popular de novo transcriptome assemblers, Trinity and SOAPdenovo-Trans (SDNT), in terms of cost-efficiency and quality to assess limitations, and provided troubleshooting and guidelines to run transcriptome assemblies efficiently.

Methods: We built virtual machines with different hardware characteristics (CPU number, RAM size) in the Amazon Elastic Compute Cloud of the Amazon Web Services. Using simulated and real data sets, we measured the elapsed time, cost, CPU percentage and output size of small and large data set assemblies.
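As a rough illustration of how such per-run metrics can be collected, the sketch below times a command, derives a CPU percentage from the child process's CPU time, and estimates cost from elapsed time. It is not the authors' measurement code: the function name `measure_run`, the `hourly_price_usd` parameter, and the assumption that cost scales linearly with elapsed time at a fixed hourly instance price are all hypothetical. It relies on the Unix-only `resource` module.

```python
import resource
import subprocess
import sys
import time


def measure_run(command, hourly_price_usd):
    """Run a command and report elapsed time, CPU percentage, and estimated cost.

    hourly_price_usd is a hypothetical instance price supplied by the caller;
    cost is assumed to be proportional to elapsed wall-clock time.
    """
    # CPU time already consumed by earlier child processes, to subtract later.
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    start = time.perf_counter()
    subprocess.run(command, check=True)
    elapsed_s = time.perf_counter() - start
    after = resource.getrusage(resource.RUSAGE_CHILDREN)

    # User plus system CPU seconds spent by the command that just finished.
    cpu_s = (after.ru_utime - before.ru_utime) + (after.ru_stime - before.ru_stime)
    return {
        "elapsed_s": elapsed_s,
        # Can exceed 100 when the command uses several cores.
        "cpu_percent": 100.0 * cpu_s / elapsed_s if elapsed_s > 0 else 0.0,
        "cost_usd": elapsed_s / 3600.0 * hourly_price_usd,
    }
```

A call such as `measure_run([sys.executable, "-c", "pass"], hourly_price_usd=3.6)` returns a dictionary with the three values; replacing the placeholder command with an assembler invocation and the price with the actual instance rate would give figures comparable across virtual machine types.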

Results: For small data sets, SDNT outperformed Trinity by an order of magnitude, significantly reducing the duration and cost of the assembly. For large data sets, Trinity performed better than SDNT. Both assemblers provide good-quality transcriptomes.

Conclusion: The selection of the optimal transcriptome assembler and provision of computational resources depend on the combined effect of size and complexity of RNA-seq experiments.

Keywords: Cloud computing, cost-efficiency, quality, RNA-seq, transcriptome, magnitude.




DOI: https://dx.doi.org/10.2174/1574893615666191219095817
Print ISSN: 1574-8936
Online ISSN: 2212-392X
Publisher Name: Bentham Science Publisher

Current Bioinformatics
