Digitizing DNA Sequences Using Multiset-Based Nucleotide Frequencies for Machine Learning-Based Mutation Detection

Authors

DOI:

https://doi.org/10.31181/dmame7220241213

Keywords:

Multiset DNA structure, Multiset average frequency, Recurrent neural network, Gene mutations

Abstract

Investigating algebraic structures in a non-conventional framework extends mathematics toward practical applications in theoretical biology and computer science. One such algebraic structure is the multigroup, whose underlying set is a multiset. The genome is the entire set of DNA instructions found within a cell and contains all the information an individual needs to develop and function. DNA and RNA are the hereditary materials that play a vital role in the metabolism of living things, especially protein synthesis. In genomic databases, DNA sequences are stored as string or text data. Machine learning algorithms, however, operate only on numerical data, so DNA sequence strings must be transformed into numerical values. This article is organized as follows. First, we prove that the standard genetic code is a multigroup and that the genome architecture of a whole population can be interpreted as a sum of multisets. Next, we describe how a numerical representation of DNA bases relates to its algebraic representation. We then employ Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), and Bidirectional Long Short-Term Memory (BiLSTM) networks to identify changes between DNA sequences. Experimental results show that the GRU with multiset-based numerical values for DNA bases achieves 98% accuracy on the testing data. This novel technique will aid in the detection of mutations in complex diseases.
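The string-to-number step described in the abstract can be illustrated with a minimal Python sketch. It is not the article's exact encoding: the abstract does not give the multiset average frequency formula, so the sketch assumes the simplest reading, in which a sequence is treated as a multiset of bases and each base is replaced by its relative frequency. The function names `base_frequencies`, `encode`, and `changed_positions` are hypothetical, and the position-wise comparison only stands in for the recurrent models used in the article.

```python
from collections import Counter

BASES = "ACGT"


def base_frequencies(sequence: str) -> dict[str, float]:
    """Treat the sequence as a multiset of bases and return relative frequencies.

    Assumption: relative frequency is used here as a stand-in for the
    article's multiset-based value, whose exact formula is not in the abstract.
    """
    counts = Counter(sequence.upper())  # multiset: base -> multiplicity
    total = sum(counts[b] for b in BASES)
    if total == 0:
        raise ValueError("sequence contains no A, C, G, or T")
    return {b: counts[b] / total for b in BASES}


def encode(sequence: str, freqs: dict[str, float]) -> list[float]:
    """Replace each base with its frequency; unknown symbols such as N map to 0.0."""
    return [freqs.get(b, 0.0) for b in sequence.upper()]


def changed_positions(reference: str, sample: str) -> list[int]:
    """Return the 0-based indices where two equal-length sequences differ."""
    if len(reference) != len(sample):
        raise ValueError("sequences must have the same length")
    pairs = zip(reference.upper(), sample.upper())
    return [i for i, (r, s) in enumerate(pairs) if r != s]


# Toy sequences chosen for illustration only; they are not from the article's data.
reference = "ATGCGATA"
sample = "ATGCGTTA"

freqs = base_frequencies(reference)
encoded = encode(reference, freqs)
print(freqs)
print(encoded)
print(changed_positions(reference, sample))
```

In a full pipeline the encoded vectors, rather than the raw strings, would be the input to a GRU, LSTM, or BiLSTM classifier.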

Published

2024-08-25

How to Cite

Anjum, S., Kousar, S., Kausar, N., Aydin, N., Olanrewaju, O. A., & Mncwango, B. (2024). Digitizing DNA Sequences Using Multiset-Based Nucleotide Frequencies for Machine Learning-Based Mutation Detection. Decision Making: Applications in Management and Engineering, 7(2), 516-529. https://doi.org/10.31181/dmame7220241213