Digitizing DNA Sequences Using Multiset-Based Nucleotide Frequencies for Machine Learning-Based Mutation Detection
DOI:
https://doi.org/10.31181/dmame7220241213
Keywords:
Multiset DNA structure, Multiset average frequency, Recurrent neural network, Gene mutations
Abstract
Investigating algebraic structures in a non-conventional framework extends mathematics toward practical applications in theoretical biology and computer science. One such algebraic structure is the multigroup, whose underlying set is a multiset. The genome is the entire set of DNA instructions found within a cell; it contains all the information an individual needs to develop and function. DNA and RNA are the hereditary materials that play a vital role in the metabolism of living things, especially in protein synthesis. In genomic databases, DNA sequences are stored as string or text data types. Machine learning algorithms, however, operate only on numerical data, so DNA sequence strings must be transformed into numerical values. This article is organized as follows. First, we prove that the standard genetic code is a multigroup and that the genome architecture of a whole population can be interpreted as a sum of multisets. Next, we describe how a numerical representation of the DNA bases relates to their algebraic representation. We then employ Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), and Bidirectional Long Short-Term Memory (BiLSTM) networks to identify changes between DNA sequences. Experimental results show that a GRU with multiset-based numerical values for the DNA bases achieves 98% accuracy on testing data. This novel technique will aid in the detection of mutations in complex diseases.
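To illustrate the string-to-number step described in the abstract, here is a minimal Python sketch. It treats a DNA sequence as a multiset of bases and replaces each base with its relative frequency. This is only one plausible reading of "multiset average frequency": the abstract does not give the exact formula, so the encoding rule, the function names (`base_multiset`, `encode_by_frequency`, `differing_positions`), and the toy sequences are all assumptions made for illustration.

```python
from collections import Counter


def base_multiset(seq):
    """Return the multiset of bases in seq as a base -> multiplicity mapping."""
    return Counter(seq.upper())


def encode_by_frequency(seq):
    """Replace each base with its relative frequency in the sequence.

    Assumed encoding: multiplicity of the base divided by sequence length.
    The paper's actual numerical values may be defined differently.
    """
    counts = base_multiset(seq)
    total = sum(counts.values())
    return [counts[base] / total for base in seq.upper()]


def differing_positions(ref, alt):
    """Return the 0-based positions where two equal-length sequences differ."""
    return [i for i, (a, b) in enumerate(zip(ref.upper(), alt.upper())) if a != b]


# Hypothetical toy sequences, not taken from the paper.
reference = "ATGCGATA"
variant = "ATGCGTTA"

print(base_multiset(reference))
print(encode_by_frequency(reference))
print(differing_positions(reference, variant))
```

Numerical vectors of this kind could then serve as input to a recurrent model such as a GRU. The direct position-by-position comparison shown here is only a baseline for checking labels, not the learned detection method the abstract reports.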
License
Copyright (c) 2024 Decision Making: Applications in Management and Engineering
This work is licensed under a Creative Commons Attribution 4.0 International License.