Recent Patents on Engineering

ISSN (Print): 1872-2121
ISSN (Online): 2212-4047

General Research Article

Geometric Feature of DNA Sequences

Author(s): Hongjie Xu*

Volume 18, Issue 9, 2024

Published on: 03 October, 2023

Article ID: e031023221607 Pages: 14

DOI: 10.2174/0118722121271190230928072933

Abstract

Background: The primary goal of molecular phylogenetics is to characterize the similarity/dissimilarity of DNA sequences. Existing sequence comparison methods, some of which are patented, are mostly alignment-based and remain computationally expensive.

Objective: In this patent study, we propose a novel alignment-free approach based on a previously proposed DNA curve representation without degeneracy.

Method: The method combines two important geometric elements that describe the global and local features of the curve, respectively. Together they yield a 24-dimensional vector, called the characterization vector, that numerically characterizes a DNA sequence. We then measure the similarity/dissimilarity of DNA sequences by the Euclidean distance between their characterization vectors.
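
The comparison step can be illustrated with a minimal sketch. It assumes the 24-dimensional characterization vectors have already been computed; the abstract does not specify how the geometric features are extracted, so that step is not reproduced here. The function names (`euclidean_distance`, `distance_matrix`, `most_similar`) and the toy vectors are hypothetical and serve only to illustrate the distance computation.

```python
import math

def euclidean_distance(u, v):
    """Euclidean distance between two equal-length numeric vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def distance_matrix(vectors):
    """Pairwise distances for a dict mapping sequence name -> vector."""
    names = list(vectors)
    return {(p, q): euclidean_distance(vectors[p], vectors[q])
            for p in names for q in names}

def most_similar(name, vectors):
    """Name of the other sequence whose vector is closest to `name`."""
    others = [n for n in vectors if n != name]
    return min(others,
               key=lambda n: euclidean_distance(vectors[name], vectors[n]))

# Toy 24-dimensional vectors (made-up values, NOT real characterization vectors).
toy = {
    "seq_a": [0.0] * 24,
    "seq_b": [1.0] * 24,
    "seq_c": [0.0] * 23 + [3.0],
}

if __name__ == "__main__":
    d = distance_matrix(toy)
    print(d[("seq_a", "seq_c")])       # distance between seq_a and seq_c
    print(most_similar("seq_a", toy))  # nearest neighbour of seq_a
```

A smaller distance indicates more similar sequences under this measure. Because each sequence is reduced to a fixed-length vector once, comparing n sequences costs n(n-1)/2 cheap vector distances rather than pairwise alignments.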

Results: We compare our approach with other existing algorithms on four data sets, including COVID-19, and find that our approach produces consistent results and is faster than the alignment-based methods.

Conclusion: The method presented in this study can assist in analyzing biological molecular sequences efficiently and will be helpful to molecular biologists.


© 2024 Bentham Science Publishers