Transmission Control Protocol Global Synchronization Problem in Wide Area Monitoring and Control Systems

Yahya Ahmed Yahya1 and Ahmad T. Al-Hammouri2
1Department of Information Technology, Zakho Technical Institute, Duhok Polytechnic University, Zakho, Duhok, Kurdistan Region, Iraq; 2Department of Network Engineering and Security, Jordan University of Science and Technology, Irbid, Jordan

ABSTRACT
The electrical power network is a significant element of the critical infrastructure of modern society. Wide area monitoring and control (WAMC) systems are becoming an increasingly important topic that motivates researchers to improve and develop them and to identify the problems that hinder progress toward such systems. WAMC is used to monitor and control the power network so that the power network can adapt to failures automatically. This work verifies whether a known Transmission Control Protocol (TCP) problem, called global synchronization, appears in WAMC systems and examines its impact on the utilization of the routers' buffers. A simulation model of a WAMC system was built in OMNeT++ to study the performance of TCP under two queuing algorithms for the transmission of phasor measurement unit measurements and to test whether the global synchronization problem occurs. Three scenarios were used to test for the presence of this problem in the system. It was found that global synchronization occurred in two scenarios, which in turn causes low utilization of the routers' buffers.

Index Terms: Global Synchronization, Phasor Measurement Units, Power Network, Transmission Control Protocol, Wide Area Monitoring and Control Systems

Corresponding author's e-mail: yahya.ahmed@dpu.edu.krd
Received: 10-03-2017; Accepted: 25-03-2017; Published: 29-08-2017
DOI: 10.21928/uhdjst.v1n2y2017.pp7-12; E-ISSN: 2521-4217; P-ISSN: 2521-4209
Copyright © 2017 Yahya and Al-Hammouri. This is an open access article distributed under the Creative Commons Attribution Non-Commercial No Derivatives License 4.0 (CC BY-NC-ND 4.0).
Original Research Article.

1. Introduction
The electric power grid is one of the most important topics in modern societies. The sharp increase in the demand for electricity, the electricity trade between neighboring countries, and the long distances over which electricity is transported motivate researchers and industry to propose and improve systems that monitor and control the electric power grid over a wide area. Hence, wide area monitoring and control (WAMC) systems have become an important topic. The communication network must transmit measurements with low latency and high accuracy [1]. Many factors affect these two requirements, for example, the bandwidth, the type of medium, and the protocol used for transmission. Phasor measurement unit (PMU) measurements can be transmitted over different types of transmission media, wired or wireless; however, the best medium that can be used is the fiber-optic cable [1], [2]. The reasons for choosing fiber optics are its advantages, including high data transfer rates, immunity to electromagnetic interference, and very large channel capacity [3]. Many protocols are used with WAMC systems, including the User Datagram Protocol (UDP), Multiprotocol Label Switching (MPLS), the Resource Reservation Protocol (RSVP),
and Synchronous Digital Hierarchy (SDH). These protocols can be used individually or in combination (more than one of these protocols working together as one protocol); for example, using UDP with MPLS and RSVP as one main protocol furnishes quality-of-service features. Surveying the literature indicates that many protocols have been used with WAMC systems, except TCP. However, the NASPInet standard [4], [5] mentions that TCP can be used as the transport layer protocol to deliver PMU measurements. The architecture used in most WAMC systems is shown in Fig. 1. The aim of this work is to study the effect of the global synchronization problem when TCP is used with WAMC systems.

Fig. 1. General wide area monitoring and control architecture [4]

2. TCP Global Synchronization Problem
TCP is one of several transport protocols [6], [7]. It is reliable and stream oriented. In addition, TCP is connection-oriented, meaning it establishes a connection when data need to be transferred. The data are sent in packets and in an ordered manner at the transport layer, and TCP supports flow and congestion control [7]. Researchers identified a problem in TCP called global synchronization, which can be defined as the pattern of every sender decreasing and increasing its transmission rate at the same time as the other senders [6], [7], [8].

As shown in Fig. 2, we built a topology in OMNeT++ to study the global synchronization problem. The topology consists of two senders, two receivers, and two routers. Each sender is connected to one of the two routers by a link with a bandwidth of 100 Mbps and a propagation delay of 3 ms. On the other side, each receiver is connected to the other router by a link with a bandwidth of 100 Mbps and a propagation delay of 3 ms. The two routers are connected by a link with a bandwidth of 10 Mbps and a propagation delay of 1 ms. The buffer in the two routers was drop tail with a 65-packet capacity.

Fig. 2. Simple topology to test global synchronization

After running the experiment, the resulting congestion windows (cwnd) are as shown in Fig. 3. The synchronization of the two cwnd flows appears clearly, because both cwnds have to deal with the queue reaching its limit at the same time.

Fig. 3. Congestion windows of the two senders

The resulting low utilization of the buffer can be seen more clearly in Fig. 4. The queue length oscillates between full (65 packets) and nearly empty (about 5 packets). This oscillating load is caused by the synchronization of the congestion windows.

Fig. 4. The queue buffer of the router

This problem has motivated researchers to propose many methods to solve it. Some researchers suggest adjusting router functionality or modifying the transmission protocol parameters, and others suggest increasing the router buffer capacity. However, these solutions can cause other, unexpected problems; for instance, increasing the router buffer size is likely to increase queuing delays. The most widely adopted algorithm to reduce global synchronization is the random early detection (RED) queuing policy.

In WAMC it is likely that several senders, or PMUs, start sending at the same time. Hence, this study shows how the global synchronization problem affects WAMC.
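To make the mechanism behind Figs. 3 and 4 concrete, the following minimal Python sketch models two idealized additive-increase/multiplicative-decrease senders sharing a 65-packet drop-tail queue. It is an illustration only, not the OMNeT++ model used in this work, and the per-round drain rate is an assumed value.

# Hypothetical model of TCP global synchronization: two AIMD senders share a
# bottleneck that drains a fixed number of packets per round and a drop-tail
# queue of 65 packets. Tail drops hit both flows in the same round, so both
# windows are cut together and the queue swings between full and nearly empty.

QUEUE_CAPACITY = 65     # packets, as in the drop-tail experiment above
DRAIN_PER_ROUND = 80    # packets served per round (assumed value)

def simulate(rounds=60):
    cwnd = [5.0, 20.0]  # the two senders' congestion windows (packets)
    queue = 0.0
    for r in range(rounds):
        queue = max(0.0, queue + sum(cwnd) - DRAIN_PER_ROUND)
        if queue > QUEUE_CAPACITY:              # overflow: both flows see drops
            queue = QUEUE_CAPACITY
            cwnd = [w / 2.0 for w in cwnd]      # multiplicative decrease, together
        else:
            cwnd = [w + 1.0 for w in cwnd]      # additive increase
        print(f"round {r:2d}  cwnd1={cwnd[0]:6.1f}  cwnd2={cwnd[1]:6.1f}  queue={queue:6.1f}")

if __name__ == "__main__":
    simulate()

Running the sketch shows both windows being halved in the same rounds, which is exactly the synchronized sawtooth and the oscillating queue occupancy reported in Figs. 3 and 4.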
3. Design of a WAMC System in OMNeT++
A. PMU Building
The first component of the WAMC system to be constructed was the PMU, written in C++. The PMU generates packets at a fixed period, and each packet contains three fields:
1. Sequence number of the packet: assigned systematically by giving a specific number to each packet.
2. Source ID: the number that identifies which PMU generated the packet.
3. Time stamp: the generation time of the packet, used to measure the time from when the packet is generated until it reaches the PDC.

B. PDC Level 1 (L1) Building
The second component of the WAMC system to be constructed was the PDC, also written in C++. The PDC is similar to a server, as its function is to collect all the measurements sent to it from the PMUs connected to it. The PDC assembles all the packets produced by the PMUs into one packet and sends it to PDC level 2. In more detail, for each packet arriving from a PMU, PDC L1 first checks whether a packet with this sequence number has already been sent before; if yes, the PDC deletes the packet. If no, the PDC checks whether this sequence number is being seen for the first time. If yes, the PDC attaches a time-out to this sequence number; the benefit of this time-out is that if the timer fires before the PDC has received the packets from all PMUs, the PDC stops waiting and sends only the packets that have arrived. After attaching the time-out, the PDC stores the packet in its buffer and waits for the remaining packets with the same sequence number. If the sequence number is not being seen for the first time, the PDC stores the packet in the buffer and checks whether all packets for that sequence number have been received or its time-out has expired; if not, it keeps waiting. Once this condition holds, the PDC encapsulates all packets with this sequence number into one packet with a new sequence number, a new source ID, and a new time stamp, and sends it to PDC L2.

C. PDC Level 2 (L2) Building
The third component of the WAMC system to be constructed was PDC L2, again written in C++. This PDC receives the above-mentioned single packet (produced in Subsection B) and then calculates the end-to-end latency and the packet delivery ratio.

4. Simulations and Results
In this section, three scenarios were created:
1. Twenty PMUs scenario (high traffic rate)
2. Nineteen PMUs scenario (medium traffic rate)
3. Eighteen PMUs scenario (low traffic rate).
Through the study of the three scenarios, it was noticed that in some cases of using the TCP protocol the global synchronization problem occurred. Therefore, the study focuses on the buffer of the router, which may cause low utilization of the link bandwidth.

A. Twenty PMUs Scenario (High Traffic Rate)
The WAMC topology in this scenario has 20 PMUs, one PDC L1, and one PDC L2. Fig. 5 shows the topology of this scenario. There are two routers in the topology, and the buffer capacity of each router is 100 packets. The link parameters used in all scenarios are listed in Table I. The size of the packet sent by each PMU is 512 B, the rate at which each PMU generates data packets is 120 packets/s, and the PDC time-out is 0.024 s. Table I shows the parameters of all scenarios.

Table I. Link parameters of all scenarios
  Link between           Data rate (bps)   Propagation delay (ms)
  PMUs and router            150 M                 3-6
  Router and PDC L1           10 M                  1
  PDC L1 and router 1        150 M                  3
  Router 1 and PDC L2        150 M                  3
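The parameters in Table I allow a quick check of the offered load on the 10 Mbps bottleneck link between the router and PDC L1. The short calculation below is only a back-of-the-envelope estimate that assumes roughly 40 bytes of TCP/IP header per 512-byte payload; the exact overhead in the OMNeT++ model may differ.

# Rough offered-load estimate for the router -> PDC L1 bottleneck link.
# Assumes 512-byte payloads at 120 packets/s per PMU plus ~40 bytes of
# TCP/IP header per packet (an assumption, not a figure from the paper).

PAYLOAD_BYTES = 512
HEADER_BYTES = 40
PACKETS_PER_SECOND = 120
BOTTLENECK_BPS = 10e6      # 10 Mbps link between the router and PDC L1

for pmus in (18, 19, 20):
    load_bps = pmus * PACKETS_PER_SECOND * (PAYLOAD_BYTES + HEADER_BYTES) * 8
    print(f"{pmus} PMUs: offered load = {load_bps / 1e6:.2f} Mbps "
          f"({100 * load_bps / BOTTLENECK_BPS:.0f}% of the bottleneck)")

Under this assumption the offered load is about 9.5 Mbps for 18 PMUs, 10.1 Mbps for 19 PMUs, and 10.6 Mbps for 20 PMUs, so only the 19- and 20-PMU scenarios exceed the bottleneck capacity, which is consistent with congestion, and hence global synchronization, appearing only in the high and medium traffic scenarios below.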
The router buffer occupancy when TCP is used with the FIFO queue algorithm is shown in Fig. 6. The buffer oscillates between 0 and 100 packets, which means that global synchronization occurred in this scenario. When TCP was used with the RED queue algorithm, the results are as shown in Fig. 7. The buffer still oscillates, but the oscillation does not exceed 50 packets, unlike before when TCP was used with FIFO.

Fig. 5. The topology in the 20 phasor measurement units scenario
Fig. 6. The buffer of the router when using Transmission Control Protocol with the FIFO queue algorithm
Fig. 7. The buffer of the router when using Transmission Control Protocol with the RED queue algorithm

B. Nineteen PMUs Scenario (Medium Traffic Rate)
The parameters of this scenario are as given for the 20 PMUs scenario; the only difference between the scenarios is the number of PMUs. Fig. 8 shows the topology of the 19 PMUs scenario. For the 19 PMUs scenario with the TCP protocol and FIFO, the buffer result is shown in Fig. 9. The buffer oscillates between 0 and 100, which means that global synchronization occurred in this scenario as well; however, the oscillation is lower than in the 20 PMUs scenario. When TCP was used with RED, the results are as shown in Fig. 10. The buffer oscillates, but the oscillation does not exceed 50 packets, unlike before with FIFO. The oscillation in this scenario is the same as in the 20 PMUs scenario.

Fig. 8. The topology in the 19 phasor measurement units scenario
Fig. 9. The buffer of the router when using Transmission Control Protocol with the FIFO queue algorithm
Fig. 10. The buffer of the router when using Transmission Control Protocol with the RED queue algorithm

C. Eighteen PMUs Scenario (Low Traffic Rate)
The parameters of this scenario are as given for the 20 and 19 PMUs scenarios; the only difference is the number of PMUs. Fig. 11 shows the topology of the 18 PMUs scenario. In the 18 PMUs scenario there is no large-scale oscillation, which means that global synchronization does not occur: the buffer occupancy only varies between 0 and about 10 or 11 packets, while the buffer size is 100 packets. Figs. 12 and 13 show the router buffer when TCP was used with the FIFO and RED queue algorithms, respectively.

Fig. 11. The topology in the 18 phasor measurement units scenario
Fig. 12. The buffer of the router when using Transmission Control Protocol with the FIFO queue algorithm
Fig. 13. The buffer of the router when using Transmission Control Protocol with the RED queue algorithm

5. Conclusion
In this study, a WAMC system was created in OMNeT++, and the TCP protocol with the FIFO and RED queue algorithms was applied to deliver the measurements and control information over the network. The global synchronization problem occurred with TCP when the PMUs sent their measurements in a synchronized manner in two scenarios (20 and 19 PMUs), in other words when the traffic rate is high or medium, whereas in the 18 PMUs scenario, when the traffic rate is low, global synchronization did not occur. According to this study, it is recommended not to use the TCP protocol in WAMC systems when the traffic rate is high.

References
[1] C. Moustafa and L. Nordström. "Investigation of communication delays and data incompleteness in multi-PMU wide area monitoring and control systems," Electric Power and Energy Conversion Systems (EPECS '09), International Conference on, IEEE, 2009.
[2] H. Erich, H. Khurana and T. Yardley.
"Exploring convergence for SCADA networks," Innovative Smart Grid Technologies (ISGT), IEEE PES, IEEE, 2011.
[3] C. Moustafa, K. Zhu and L. Nordstrom. "Survey on priorities and communication requirements for PMU-based applications in the Nordic region," PowerTech, 2009.
[4] C. Moustafa, A. Layd and L. Jordan. "PMU traffic shaping in IP-based wide area communication," Critical Infrastructure (CRIS), 2010 5th International Conference on, IEEE, 2010.
[5] Y. Yorozu, M. Hirano, K. Oka and Y. Tagawa. "Electron spectroscopy studies on magneto-optical media and plastic substrate interface," IEEE Translation Journal on Magnetics in Japan, vol. 2, pp. 740-741, Aug. 1987 (Digests 9th Annual Conf. Magnetics Japan, p. 301, 1982).
[6] Z. Lixia and C. David. "Oscillating behavior of network traffic: A case study simulation," Internetworking: Research and Experience, vol. 1, pp. 101-112, 1990.
[7] S. Chakchai. "Loss synchronization of TCP connections at a shared bottleneck link," Department of Computer Science and Engineering, Washington University, St. Louis, 2006.
[8] H. Sofiane and R. David. "Loss synchronization, router buffer sizing and high-speed TCP versions: Adding RED to the mix," IEEE 34th Conference on Local Computer Networks, 2009.

Building Kurdish Chatbot Using Free Open Source Platforms

Kanaan M. Kaka-khan
Department of Computer Science, University of Human Development, Iraq

ABSTRACT
A chatbot is a program that uses natural language understanding and processing technology to hold a human-like conversation. Nowadays, chatbots are able to interact with users in the majority of the world's languages. Unfortunately, bots that interact with Kurdish users are rare. This paper is an attempt to bridge the gap between chatbots and Kurdish users: it uses a free open source platform (Pandorabots) to build a Kurdish chatbot. A number of challenges for Kurdish chatbots are presented in the last section of this work.

Index Terms: Artificial Intelligence, Artificial Intelligence Markup Language, Chatbot, Pandorabots

Corresponding author's e-mail: kanaan.mikael@uhd.edu.iq
Received: 09-08-201; Accepted: 24-08-2017; Published: 30-08-2017
DOI: 10.21928/uhdjst.v1n2y2017.pp46-50; E-ISSN: 2521-4217; P-ISSN: 2521-4209
Copyright © 2017 Kaka-khan. This is an open access article distributed under the Creative Commons Attribution Non-Commercial No Derivatives License 4.0 (CC BY-NC-ND 4.0).
Original Research Article.

1. Introduction
A. Chatbot
A chatbot is a service, powered by rules and sometimes artificial intelligence, that you interact with via a chat interface [1], [2]. Chatbots range from simple systems that extract a response from a database when they match certain keywords to more sophisticated ones that use natural language processing techniques [3].

B. Needs for Chatbots
An extraordinary amount of attention has been devoted to chatbots within the tech community in recent years [4]. There is no doubt that the majority of businesses are going online, and to take a business online we have to locate where the people are. That place is now the zone of messenger applications, as noted by Peter Rojas: "People are now spending more time in messaging apps than in social media and that is a huge turning point. Messaging apps are the platforms of the future and bots will be how their users access all sorts of services" [5]. Any user's interaction with an app or web page can utilize a chatbot to improve the user's experience [6]. Fig. 1 shows the size of the top 4 messaging apps and social networks; the big 4 messaging apps are WhatsApp, Messenger, WeChat, and Viber, and the big 4 social networks are Facebook, Instagram, Twitter, and LinkedIn [7].

C. Applications of Chatbot
In the early days of chatbots, their use was almost restricted to conversation. The first chatbot in history was ELIZA, a program that represented a psychologist [8]. Over time, bots have come to serve many important applications; some of the most important applications of chatbots are listed below:
1. Customer service
2. Mobile personal assistants
3. Advertisements
4. Games and entertainment applications
5. Talking toys
6. Call centers.

The crucial aim of this work is to build a bot capable of working as a guide that sits on the UHD website and gives information about the University of Human Development to any user whenever asked.

2. Chatbot History
The concept of natural language processing in general, and of chatbots specifically, can be traced back to Alan Turing's question "Can machines think?", asked in 1950 [9]. Turing's test, as it is now called, simply consists of asking questions of human and machine subjects and trying to identify the human; we say the machine can think if the human and machine responses are indistinguishable. In 1966, ELIZA, the first chatbot, was created by Joseph Weizenbaum at MIT. To generate proper responses, ELIZA uses a set of pre-programmed rules to identify keywords and pattern-match those keywords in an input sentence [8]. In 1995, a new, more complex bot (A.L.I.C.E.) was created by Richard Wallace. ALICE makes use of the Artificial Intelligence Markup Language (AIML) to represent conversations as sets of patterns (inputs) and templates (outputs). ALICE won the Loebner Prize (a yearly chatbot competition) three times and was awarded the title of most intelligent chatbot [10]. Advances in natural language processing and machine learning have played important roles in improving chatbot technology; modern chatbots include Microsoft's Cortana, Amazon's Echo and Alexa, and Apple's Siri [11].

3. Related Works and Methodology
As in many natural language processing applications, there are several approaches to developing a chatbot: using a set of predefined rules [12], semi-automatically learning conversational patterns from data [13], and fully automatic chatbots (still under research). Each approach has its own merits and demerits: through the manual approach, more control over the language and the chatbot can be achieved, but it takes more effort to maintain a huge set of rules. The second approach, also called corpus-based, is challenged by the need to construct coherent personas using data created by different people [3].
Due to the lack of a Kurdish corpus (at least, none is available to me even if one exists), I chose manually written rules by making use of AIML, a popular markup language that represents conversations as sets of patterns (inputs) and templates (outputs). As in other NLP applications, related work on Kurdish chatbots is unfortunately rare. To the best of my knowledge, this is the first Kurdish chatbot created academically, so I am sometimes obliged to relate my work to Arabic or Persian. Most notably, in 2016, Dana and Habash developed Botta, the first Arabic dialect chatbot; Botta explores the challenges of creating a conversational agent that aims to simulate friendly conversations in the Egyptian Arabic dialect [3].

A playground and a programming language are the two basic requirements for creating chatbots. A playground can be defined as a sandbox or an integrated development environment for the programming language [1]. In this work, I chose Pandorabots as the playground (for creating, deploying, and talking with the bot) and AIML as the language for building the conversation; ALICE, an award-winning free chatbot, was created using AIML [12]. After logging into the Pandorabots playground with a Facebook account, the work proceeds in the following steps:
• Step 1: I gave "Kuri Zanko" as the bot name.
• Step 2: In the bot editor space, I created an AIML file named "uhd" to hold all the patterns (inputs) and templates (outputs).
• Step 3: I started writing each expected user input in a <pattern></pattern> tag and the bot's answer in a <template></template> tag; both pattern and template are enclosed in a <category></category>, which is the basic unit of knowledge in AIML [1].
• Step 4: After writing each category, I train (test) the bot to check whether it gives the correct answer.
• Step 5: After writing all the categories, the bot is published in the Pandorabots clubhouse (a public place where users can talk to the bots).

Fig. 1. Users of the top 4 messaging apps and social networks, in millions [7]

4. Result and Discussion
For simple and direct user input, the bot can give the answer easily, for example:
User: ساڵو
Bot: ساڵو لە بەڕێزتان،خۆتان بناسێنن

A. Pattern Matching
To match a user input, the bot searches through its AIML file (categories). It may happen that a user input does not match any of the patterns defined in the bot, so a default answer should be provided; this is called the ultimate default category:

<pattern>*</pattern>
<template>
ببورە بەڕێزم، وەاڵمی پرسیارەکەتم النیە
</template>

The star (*) indicates that the user input did not match any of the bot's patterns. Relying on one default answer is extremely tedious for the clients, which obliges us to use random responses so that the same unmatched user input receives different answers:

<random>
<li>ببورە بەڕێزم، وەاڵمی پرسیارەکەتم النیە</li>
<li>بەڕێزم پرسیارەکەت بەجۆرێکی تر بکەرەوە</li>
<li>بەڕێزم پرسیارەکەت ڕون نیە</li>
<li>ببورە لە پرسیارەکەت نەگەشتم</li>
</random>

These random responses give the user the sense of chatting with a human, not a bot.
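For readers without access to the Pandorabots playground, the two ideas just shown, an ultimate default category and randomized fallback answers, can be sketched in a few lines of Python. This is only an illustration of the matching logic, not the AIML engine that Pandorabots actually runs; the rule table holds the same Kurdish greeting and fallback strings used above.

import random

# Illustrative only: a tiny rule table in the spirit of AIML categories.
RULES = {
    "ساڵو": ["ساڵو لە بەڕێزتان،خۆتان بناسێنن"],
}

# Plays the role of the ultimate default category with <random> responses.
DEFAULT_ANSWERS = [
    "ببورە بەڕێزم، وەاڵمی پرسیارەکەتم النیە",
    "بەڕێزم پرسیارەکەت بەجۆرێکی تر بکەرەوە",
    "بەڕێزم پرسیارەکەت ڕون نیە",
    "ببورە لە پرسیارەکەت نەگەشتم",
]

def respond(user_input):
    answers = RULES.get(user_input.strip())
    if answers is None:                        # nothing matched: the "*" category
        return random.choice(DEFAULT_ANSWERS)
    return random.choice(answers)

print(respond("ساڵو"))                 # matched greeting
print(respond("any unmatched input"))  # falls back to a random default answer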
B. Wildcards
Wildcards are used to capture many inputs using only a single category [1]; through wildcards, the bot can be made more intelligent. There are many wildcards, but the star (*) and the caret (^) are the two used in this work:

<pattern>ناوم *</pattern>

In this example, the star (*) stands for any name given by the user.

<pattern>زانکۆی گەشەپێدان *</pattern>

In the second example, the star stands for any words or sentences that appear after the name "زانکۆی گەشەپێدان".

<pattern>^ کۆمپیوتەر ^</pattern>

The (^) wildcard lets the bot capture any input containing the word "کۆمپیوتەر" and give a proper answer. Wildcards should be used carefully because their priorities differ; Fig. 2 shows wildcard and exact matching priorities. A category with the # wildcard is matched first and the * wildcard is matched last. For example, even when a user types "ساڵو لە ئێوە", the response is taken from the "#ساڵو" pattern, not from the "ساڵو لە ئێوە" pattern.

C. Variables
Bot intelligence can also be achieved through variables. Variables can be used to store information about the bot and its users, which gives the user the sense that he or she is chatting with a human being. Fig. 3 shows a short conversation between my bot and a user.

D. Recursion
Recursion means writing a template that calls another category, which minimizes the number of categories in the bot's AIML file.

<pattern>های</pattern>
<template><srai>ساڵو</srai></template>

Through recursion there is no need to write a new category for the input "های"; we simply refer to the template of "ساڵو" using the <srai> tag, and the bot answers the user exactly as if he or she had said "ساڵو" to the bot.

Fig. 2. Chatbot simple flow diagram
Fig. 3. Wildcards priority

E. Context
To make the bot capable of a human-like conversation, it should remember the things that have been said previously. My bot is capable of remembering the last sentence it said. Figs. 4-6 show different conversations regarding context.

Fig. 4. A sample conversation between a user and the bot
Fig. 5. A sample conversation regarding context
Fig. 6. Detailed conversation between a user and the bot

F. Challenges
• Challenge 1: The first and greatest challenge for a Kurdish chatbot is the lack of a platform designed specifically for the Kurdish language. Kurdish structure differs greatly from English and other languages; Kurdish word order is SOV (subject + object + verb) [14]. The reason behind the slow progress in Arabic NLP is the complexity of the Arabic language [3], and the same holds for Kurdish. Hence, it is very tough to build a highly intelligent Kurdish bot using free open source platforms.
• Challenge 2: Dialectal variation. The Kurdish language has many different dialects, and the gap among dialects sometimes reaches a level at which speakers of one dialect do not understand another. This means it is quite tough to build a bot capable of chatting in all the different Kurdish dialects.
• Challenge 3: Normalization is one of the important processes in developing bots; it includes sentence splitting, correcting spelling errors, and person and gender substitution, for example:
wanna -> want to
isn't -> is not
how r u -> how are you
with you -> with me
The user may be bad at spelling and may type "how r u" instead of "how are you".
These changes (normalization and substitution) can be done easily for English and make the bot interact with the user like a human rather than a bot, while it is more difficult to do the same for Kurdish, because the bot components (AIML files, set files, and map files) already exist for English but not for Kurdish; it requires a vast effort from both computer science and linguistics people to build and maintain such files.
• Challenge 4: Although the majority of platforms claim language agnosticism, in practice we face issues for Kurdish due to its structure. For example, when a name such as "Alan" is given to the bot and the user later asks the bot about his name, it says "your name is Alan," while when the same name is given in Kurdish, "ئاالن", and the bot is asked for the name, it should say "تۆ ناوت ئاالنە"; a suffix "ە" appears with the name "ئاالن". This seems an easy task but really needs hard work.

5. Conclusion and Future Work
Chatbots are online human-computer dialog systems with natural language [15]. I have presented the first Kurdish chatbot and described some of the challenges facing Kurdish chatbots. Building a chatbot from scratch is extremely tough, time consuming, and costly; this led me to adopt a free open source platform (Pandorabots). This work aims to be a basic structure for the Kurdish dialect, providing future Kurdish bot masters with a base chatbot that contains basic files and general knowledge.

6. Biography
Kanaan M. Kaka-khan is an associate professor in the Computer Science Department at the University of Human Development, Sulaimaniya, Iraq, born in Iraq in 1982. He received his bachelor degree in computer science from Sulaimaniya University and his master degree in IT from BAM University, India. His research interests include natural language processing, machine translation, chatbots, and information security.

References
[1] "How to build a bot using the playground UI." Available: https://www.playground.pandorabots.com/en/tutorial. [Last accessed on 2017 Aug 25].
[2] M. Schlicht. "The complete beginner's guide to chatbots," Chatbots Magazine, Apr. 20, 2016. Available: https://www.chatbotsmagazine.com/the-complete-beginner-s-guide-tochatbots-8280b7b906ca. [Last accessed on 2017 Aug 25].
[3] D. Abu Ali and N. Habash. "Botta: An Arabic dialect chatbot," Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations, Osaka, Japan, pp. 208-212, Dec. 11-17, 2016.
[4] C. Jee. "Best uses of chatbots in the UK." Available: http://www.techworld.com/picture-gallery/apps-wearables/9-best-usesof-chatbots-in-business-in-uk-3641500. Jun. 08, 2017.
[5] A. Jain. "Chatbot survey 2017." Available: https://www.slideshare.net/mobileappszen/chatbots-survey-2017-chatbot-market-research-report. Feb. 08, 2017.
[6] J. Ondrejcka. "Chatbot applications and considerations." Available: http://ramseysolutions.com/chatbot-applications-andconsiderations. Sep. 19, 2016.
[7] BI Intelligence. "Messaging apps are now bigger than social networks." Available: http://www.businessinsider.com/themessaging-app-report-2015-11. Sep. 20, 2016.
[8] J. Weizenbaum. "ELIZA - a computer program for the study of natural language communication between man and machine," Communications of the ACM, vol. 9, no. 1, pp. 36-45, 1966.
[9] A. M. Turing.
"Computing machinery and intelligence," Mind, vol. 59, no. 236, pp. 433-460, 1950.
[10] R. S. Wallace. "The anatomy of A.L.I.C.E." Available: http://www.alicebot.org/anatomy.html. [Last accessed on 2017 Aug 25].
[11] M. Weinberger. "Why Amazon's Echo is totally dominating, and what Google, Microsoft, and Apple have to do to catch up." Available: http://www.businessinsider.com/amazon-echo-googlehome-microsoft-cortana-apple-siri-2017-1. Jan. 14, 2017.
[12] R. Wallace. The Elements of AIML Style, San Francisco: ALICE AI Foundation, 2003.
[13] B. A. Shawar and E. Atwell. "Using dialogue corpora to train a chatbot," in Proceedings of the Corpus Linguistics 2003 Conference, pp. 681-690, 2003.
[14] Kanaan and Fatima. "Evaluation of in Kurdish machine translation system," Proceedings of UHD 2017, the 4th International Scientific Conference, Sulaimanya, Iraq, pp. 862-868, Jun. 2017.
[15] J. Cahn. "Chatbot: Architecture, design, and development," University of Pennsylvania School of Engineering and Applied Science, Department of Computer and Information Science, Apr. 26, 2017.

Model Dependent Controller for Underwater Vehicle

Wesam M. Jasim
Department of Information Technology, College of Computer Science and Information Technology, University of Anbar, Ramadi, Iraq

ABSTRACT
In this work, a model dependent, feedback-based control design method is investigated for autonomous underwater vehicle control. The controller is designed taking into consideration the nonlinear terms of the vehicle dynamical model: the inertia, the hydrodynamic damping, and the gravitational and buoyancy terms. A model independent (PD) controller, which takes no nonlinear terms into consideration, is then also investigated. The stability analysis of the proposed model dependent feedback controller is obtained based on a Lyapunov function. The simulation results of the proposed controller are compared with those of the PD controller, and the comparison shows the validity of the proposed controller.

Index Terms: Model Dependent Controller, PD Controller, Underwater Vehicle

Corresponding author's e-mail: wmj_r@yahoo.com
Received: 10-03-2017; Accepted: 25-03-2017; Published: 29-08-2017
DOI: 10.21928/uhdjst.v1n2y2017.pp25-30; E-ISSN: 2521-4217; P-ISSN: 2521-4209
Copyright © 2017 Jasim. This is an open access article distributed under the Creative Commons Attribution Non-Commercial No Derivatives License 4.0 (CC BY-NC-ND 4.0).
Original Research Article.

1. Introduction
An AUV is a robot able to operate in six degrees of freedom with actuators and sensors, diving autonomously under the water to perform tasks. The dynamics of an underwater vehicle are nonlinear and subject to disturbances. To perform its tasks quickly and accurately, researchers face two important problems: identifying an accurate model and designing suitable control techniques. Researchers in the robotic control field have therefore been attracted to building algorithms that solve these problems. In this paper, the problem of building the control algorithm of the underwater vehicle is addressed.

Recently, different control techniques have been presented. The H∞ approach for linear parameter varying polytopic systems was addressed to guarantee the performance of the vehicle [1]. An optimal control with game theory was presented for the position control problem of the underwater vehicle in Patel et al. [2]. Several controllers were discussed in Ferreira et al. [3]; some of them were based on Lyapunov control theory, and the others were combined with linear or nonlinear control theory for underwater vehicle horizontal and vertical motions. A feedback control algorithm was presented in Vervoort [4] to stabilize the underwater vehicle with a linearized model. A linear quadratic regulator (LQR) algorithm was implemented in Prasad and Swarup [5] for underwater vehicle stabilization, combined with model predictive control (MPC) for position and velocity control; the simulation results showed a stable response compared with the performance of a sliding mode controller. An LQR controller was presented for the depth control problem of an underwater vehicle in Joo and Qu [6], and the simulation results show the success of the proposed algorithm. The authors of Mohd-Mokhtar et al. [7] present a PID controller for an identified underwater vehicle model; the simulation results show that the controller performs accurately when the identified model error was 98% compared with when the error was 70%.
Nonlinear control laws were developed in Elnashar [8] for the six degrees of freedom of an underwater vehicle under several motion strategies; the stability of the system was analyzed based on phase plane analysis. An adaptive signal for compensating unknown forces, combined with a tracking controller in a limited space, was proposed for an autonomous vehicle in Mukheriee et al. [9] without state velocity measurement. An MPC controller was presented for low-speed tracking control of an AUV in Steenson et al. [10]; the controller was tested in simulation, and the results were verified by testing the strategy in a tank at zero speed. The authors of Rathore and Kumer [11] proposed a PID controller for underwater vehicle steering control, with the PID parameters optimized using a genetic algorithm and the harmony search method; the simulation results show the robustness of the proposed controller. A sliding mode control strategy was proposed for underwater vehicle position control in Tabar et al. [12]; the controller was applied to overcome the effect of disturbances. Zhou et al. [13] proposed a state feedback sliding mode controller for a nonlinear dynamic system of an underwater vehicle with disturbances taken into consideration, and the simulation results show good performance.

In this paper, a model dependent controller for an autonomous underwater vehicle is proposed. The controller is developed to include the nonlinear dynamics of the vehicle. A model independent controller is then presented, and its results are compared with those of the former controller. In the following, Section II presents the underwater vehicle dynamical model, Section III provides the description of the designed nonlinear feedback control algorithm, Section IV provides simulation results, and our conclusion and future work are given in Section V.

2. AUV Modelling
The nonlinear dynamical model of a 6-DOF underwater vehicle can be described based on two reference frames: the fixed (inertial) reference frame I and the body (motion) reference frame B, shown in Fig. 1.

Fig. 1. Underwater vehicle frames
The dynamics and kinematics of the vehicle are expressed as follows [14]:

$$M\dot{v} + C(v)v + D(v)v + g(\eta) = \tau, \qquad \dot{\eta} = J(\eta)v \qquad (1)$$

where $M = M^T$ is a positive $\mathbb{R}^{6\times6}$ inertia matrix including the added masses, $C(v) = -C(v)^T$ is the $\mathbb{R}^{6\times6}$ Coriolis and centripetal matrix, $D(v)$ is a positive $\mathbb{R}^{6\times6}$ hydrodynamic damping matrix, $g(\eta)$ is the $\mathbb{R}^{6\times1}$ gravitational and buoyancy vector, $\tau = [\tau_x, \tau_y, \tau_z, \tau_k, \tau_m, \tau_n]^T$ is the $\mathbb{R}^{6\times1}$ force and torque input vector, $v = [u, v, w, p, q, r]^T$ is the linear and angular velocity vector, $\eta = [x, y, z, \phi, \theta, \psi]^T$ is the $\mathbb{R}^{6\times1}$ motion vector in surge, sway, heave, roll, pitch, and yaw, respectively, and $J(\eta)$ is the $\mathbb{R}^{6\times6}$ body-frame-to-inertial-frame transformation matrix

$$J(\eta) = \begin{bmatrix} J_1(\eta) & 0_{3\times3} \\ 0_{3\times3} & J_2(\eta) \end{bmatrix}$$

$$J_1(\eta) = \begin{bmatrix}
c\psi\,c\theta & -s\psi\,c\phi + c\psi\,s\theta\,s\phi & s\psi\,s\phi + c\psi\,c\phi\,s\theta \\
s\psi\,c\theta & c\psi\,c\phi + s\phi\,s\theta\,s\psi & -c\psi\,s\phi + s\theta\,s\psi\,c\phi \\
-s\theta & c\theta\,s\phi & c\theta\,c\phi
\end{bmatrix}, \qquad
J_2(\eta) = \begin{bmatrix}
1 & s\phi\,t\theta & c\phi\,t\theta \\
0 & c\phi & -s\phi \\
0 & s\phi/c\theta & c\phi/c\theta
\end{bmatrix}$$

where s, c, and t denote sine, cosine, and tangent.

Assuming that the vehicle is symmetric about its three planes, that it operates at low speed, that roll and pitch movements are neglected, that the body frame is located at the center of gravity, that no disturbance is considered, and that all the dynamic states can be decoupled, the dynamical system of Eq. (1) can be rewritten as

$$M\dot{v} + D(v)v + g(\eta) = \tau, \qquad \dot{\eta} = J(\eta)v \qquad (2)$$

where

$$M = \mathrm{diag}\!\left(m + X_{\dot{u}},\; m + Y_{\dot{v}},\; m + Z_{\dot{w}},\; I_x + K_{\dot{p}},\; I_y + M_{\dot{q}},\; I_z + N_{\dot{r}}\right)$$

$$D(v) = \mathrm{diag}\!\left(X_u + X_{uu}u,\; Y_v + Y_{vv}v,\; Z_w + Z_{ww}w,\; K_p + K_{pp}p,\; M_q + M_{qq}q,\; N_r + N_{rr}r\right)$$

and only four degrees of freedom are considered to control the vehicle, i.e., the (x, y, z) and ψ states.

3. Controller Design
In this section, the aim is to design a feedback control algorithm for the path following problem of the underwater vehicle. To this end, the first equation of Eq. (2) will be used. Our control scheme consists of two approaches. The first approach is to design a model dependent controller to find the desired control vector τ. The second approach is to design a model independent control algorithm to obtain the desired control vector τ. The controllers' stability analyses are guaranteed based on a Lyapunov function, as exponential and asymptotic stability, respectively. The main task is to drive the underwater vehicle toward the desired position η_d from the initial position so as to satisfy the following equilibrium condition:

$$\lim_{t\to\infty}\tilde{\eta} = \lim_{t\to\infty}(\eta_d - \eta) = 0 \qquad (3)$$

Now, the following theorem can be stated.

Theorem 1: Consider the dynamics of Eq. (2) under the feedback control law of the form

$$\tau = M\dot{v} + D(v)v + g(\eta) + \dot{\tilde{\eta}} + \dot{\tilde{v}} \qquad (4)$$

where $\dot{\tilde{\eta}} = -K_p\tilde{\eta}$, $\dot{\tilde{v}} = -K_d\tilde{v}$, $K_p$ and $K_d$ are diagonal matrices, $\tilde{\eta} = \eta_d - \eta$ is the motion vector error, and $\tilde{v}$ is the linear and angular velocity error. Then the closed-loop system of Eq. (2) and Eq. (4) is exponentially stable.
Proof: Consider the following Lyapunov candidate:

$$V = \tfrac{1}{2}\tilde{\eta}^T\tilde{\eta} + \tfrac{1}{2}\tilde{v}^T\tilde{v} \qquad (5)$$

Calculating the time derivative of the proposed Lyapunov function, we obtain

$$\dot{V} = \tilde{\eta}^T\dot{\tilde{\eta}} + \tilde{v}^T\dot{\tilde{v}} \qquad (6)$$

Substituting the values of $\dot{\tilde{\eta}}$ and $\dot{\tilde{v}}$ into Eq. (6), we get

$$\dot{V} = -\tilde{\eta}^T K_p \tilde{\eta} - \tilde{v}^T K_d \tilde{v} \le 0 \qquad (7)$$

Eq. (7) is non-positive provided that $K_p$ and $K_d$ are positive definite diagonal matrices, and it can be concluded, based on Barbalat's lemma [15], that the closed-loop system of Eq. (2) with the control law of Eq. (4) is globally asymptotically stable, which meets the condition of Eq. (3).

The model independent feedback control law, i.e., a PD controller without the effect of the hydrodynamic damping, the gravitational and buoyancy forces, and the inertia terms, is

$$\tau = -\Gamma_1\tilde{\eta} - \Gamma_2\tilde{v} \qquad (8)$$

where $\Gamma_1$ and $\Gamma_2$ are positive diagonal matrices.

4. Simulations
The model dependent controller of Eq. (4) has been applied to the four controlled states of the autonomous underwater vehicle presented in Singh and Chowdhury [16]. A MATLAB simulator of the vehicle was implemented with the following matrices:

$$M = \mathrm{diag}(99,\; 108.5,\; 126,\; 1.05,\; 1.002,\; 29.1)$$

$$D(v) = \mathrm{diag}(d_1,\; d_2,\; d_3,\; d_4,\; d_5,\; d_6)$$

with

$$d_1 = 10 + 227.18\,u,\quad d_2 = 405.41\,v,\quad d_3 = 10 + 227.18\,w,$$
$$d_4 = 0.05 + 5.21\,p,\quad d_5 = 0.025 + 3.22\,q,\quad d_6 = 1.603 + 12.937\,r,$$

and

$$g(\eta) = [0,\; 0,\; -19.6,\; 0,\; 0,\; 0]^T.$$

To validate the performance of the proposed model dependent controller, two paths are considered. The vehicle is then retested under the model independent PD controller of Eq. (8), and its results are compared with those of the model dependent controller of Eq. (4). First, the following desired path was tested:

$$x_d = 20\sin(t/10),\quad y_d = 20\cos(t/10),\quad z_d = 10,\quad \psi_d = \pi/2.$$

Then, the following desired path was tested:

$$x_d = 5t,\quad y_d = 3t,\quad z_d = 20,\quad \psi_d = \pi/4.$$

The vehicle was started from zero initial conditions in both cases.
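The following short sketch shows how the two control signals compared in this section could be assembled from the numerical model above. It is a minimal illustration only: the error conventions follow Eqs. (4) and (8) as reconstructed here, and the gain matrices are placeholder values, since the paper does not list the gains used in its simulations.

import numpy as np

# Numerical model from Section 4
M = np.diag([99.0, 108.5, 126.0, 1.05, 1.002, 29.1])
g_eta = np.array([0.0, 0.0, -19.6, 0.0, 0.0, 0.0])

def D_times_v(v):
    u, sway, w, p, q, r = v
    d = np.array([10 + 227.18 * u, 405.41 * sway, 10 + 227.18 * w,
                  0.05 + 5.21 * p, 0.025 + 3.22 * q, 1.603 + 12.937 * r])
    return d * v

# Placeholder diagonal gains (assumed, not taken from the paper)
Kp = np.diag([5.0] * 6)
Kd = np.diag([10.0] * 6)

def tau_model_dependent(v, v_dot, eta_err, v_err):
    # Eq. (4): tau = M v_dot + D(v) v + g(eta) + d/dt(eta_err) + d/dt(v_err),
    # with d/dt(eta_err) = -Kp eta_err and d/dt(v_err) = -Kd v_err
    return M @ v_dot + D_times_v(v) + g_eta - Kp @ eta_err - Kd @ v_err

def tau_pd(eta_err, v_err):
    # Eq. (8): model independent PD law, tau = -Gamma1 eta_err - Gamma2 v_err
    return -Kp @ eta_err - Kd @ v_err

# Example evaluation at an arbitrary operating point
v = np.array([0.5, 0.1, 0.0, 0.0, 0.0, 0.05])
print(tau_model_dependent(v, np.zeros(6), 0.1 * np.ones(6), np.zeros(6)))
print(tau_pd(0.1 * np.ones(6), np.zeros(6)))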
Figs. 2-5 show the typical results when the vehicle was commanded to follow the first desired path under the proposed controller compared with the PD controller, while Figs. 6-9 present the results obtained for the second path. Figs. 2 and 6 present the motion of the vehicle along the x-axis under the two controllers in the first and the second case, respectively. It can be seen that, under the proposed controller, the vehicle is faster in catching the desired path and shows smaller oscillation than under the PD controller. From Figs. 3, 4, 7, and 8, one can conclude that the vehicle moved along the y- and z-axes with very small error when the proposed controller was used, compared with some overshoot when the PD controller was used. The performance of the rotation angle ψ is shown in Figs. 5 and 9 for the first and the second case, respectively; no oscillation appears in either case, but the proposed controller performs faster than the PD one. It is quite obvious that the path-following performance of the vehicle under the proposed controller is much better than under the PD controller.

Fig. 2. Motion of the vehicle in the x-direction for the first path
Fig. 3. Motion of the vehicle in the y-direction for the first path
Fig. 4. Motion of the vehicle in the z-direction for the first path
Fig. 5. Rotation of the vehicle about the z-axis for the first path
Fig. 6. Motion of the vehicle in the x-direction for the second path
Fig. 7. Motion of the vehicle in the y-direction for the second path
Fig. 8. Motion of the vehicle in the z-direction for the second path
Fig. 9. Rotation of the vehicle about the z-axis for the second path

5. Conclusions
The design of a model dependent controller for an underwater vehicle has been addressed in this paper. The proposed controller includes the vehicle's nonlinear dynamic terms, and the stability of the control system was guaranteed through Lyapunov theory. The simulation results of the proposed controller show better performance than those of the PD controller. Our future work on this subject is to apply the proposed controller to the control problem of a swarm of underwater vehicles.

References
[1] E. Roche, O. Sename and D. Simon. "LPV/H∞ control of an autonomous underwater vehicle (AUV)," Proceedings of the European Control Conference, 2009.
[2] N. M. Patel, S. E. Gano and J. E. Renaud. "Simulation model of an autonomous underwater vehicle for design optimization," 45th AIAA/ASME/ASCE/AHS/ASC Structures, Structural Dynamics and Materials Conference, pp. 1-15, Apr. 2004.
[3] B. Ferreira, M. Pinto, A. Matos and N. Cruz. "Control of the MARES autonomous underwater vehicle," in OCEANS, pp. 1-10, Oct. 2009.
[4] J. H. A. Vervoort. "Modeling and control of an unmanned underwater vehicle," Master thesis, University of Technology Eindhoven, 2008.
[5] M. P. R. Prasad and A. Swarup. "Position and velocity control of remotely operated underwater vehicle using model predictive control," Indian Journal of Geo-Marine Sciences, vol. 44, no. 12, pp. 1920-1927, 2015.
[6] M. G. Joo and Z. Qu. "An autonomous underwater vehicle as an underwater glider and its depth control," International Journal of Control, Automation, and Systems, vol. 13, no. 5, pp. 1212-1220, 2015.
[7] R. Mohd-Mokhtar, M. H. R. Aziz, M. R. Arshad, A. B. Husaini and M. M. Noh. "Model identification and control analysis for underwater thruster system," Indian Journal of Geo-Marine Sciences, vol. 42, no. 8, pp. 992-998, 2013.
[8] G. A. Elnashar. "Dynamics modelling, performance evaluation and stability analysis of an autonomous underwater vehicle," International Journal of Modelling, Identification and Control, vol. 21, no. 3, pp. 306-320, 2014.
[9] K. Mukheriee, I. N. Kar and R. K. P. Bhatt. "Adaptive gravity compensation and region tracking control of an AUV without velocity measurement," International Journal of Modelling, Identification and Control, vol. 25, no. 2, pp. 154-163, 2016.
[10] L. V. Steenson, S. R. Turnock, A. B. Phillips, C. Harris, M. E. Furlong, E. Rogers and L. Wang. "Model predictive control of a hybrid autonomous underwater vehicle with experimental verification," in Proc. of the Institution of Mechanical Engineers, Part M: Journal of Engineering for the Maritime Environment, 2013.
[11] A. Rathore and M. Kumer. "Robust steering control of autonomous underwater vehicle: Based on PID tuning evolutionary optimization technique," International Journal of Computer Applications, vol. 117, no. 18, pp. 1-6, 2015.
[12] A. F. Tabar, M. Azadi and A. Alesaadi.
"Sliding mode control of autonomous underwater vehicles," International Journal of Computer, Electrical, Automation, Control and Information Engineering, vol. 8, no. 3, pp. 546-549, 2014.
[13] H. Y. Zhou, K. Z. Liu and X. S. Feng. "State feedback sliding mode control without chattering by constructing Hurwitz matrix for AUV movement," International Journal of Automation and Computing, vol. 8, no. 2, pp. 262-268, 2011.
[14] T. I. Fossen. Guidance and Control of Ocean Vehicles. New York: John Wiley & Sons, 1994.
[15] J. J. Slotine and W. Li. Applied Nonlinear Control. New Jersey: Prentice Hall, 1991.
[16] M. P. Singh and B. Chowdhury. Control of Autonomous Underwater Vehicles. Rourkela: Bachelor of Technology in Electrical Engineering, National Institute of Technology, 2011.

A Hybrid Simulated Annealing and Back-Propagation Algorithm for Feed-Forward Neural Network to Detect Credit Card Fraud

Ardalan Husin Awlla
Ministry of Education, Sulaimani 46001, Iraq

ABSTRACT
Due to the rise and rapid growth of e-commerce, the use of credit cards for online purchases has expanded significantly, and this has brought about an explosion in credit card fraud. As the credit card becomes the most prevalent method of payment for both online and ordinary purchases, cases of fraud associated with it are also rising. In reality, fraudulent transactions are scattered among genuine ones, and simple pattern-matching techniques are often not adequate to detect those frauds accurately. Implementation of effective fraud detection systems has therefore become essential for all credit card issuing banks to reduce their losses. Many existing systems based on artificial intelligence, fuzzy logic, machine learning, data mining, sequence alignment, genetic programming, and so on have advanced the detection of various fraudulent credit card transactions. A clear understanding of all these methodologies will certainly lead to an efficient credit card fraud detection framework.
This paper suggests an anomaly detection model based on a hybrid simulated annealing (SA) and back-propagation algorithm for a feed-forward neural network (FFNN), which combines the strong global searching capability of SA with the precise local searching ability of back-propagation FFNNs to improve the initial weights of the neural network toward a better fraud detection result.

Index Terms: Artificial Neural Network, Back-Propagation, Back-Propagation Feed-Forward Neural Network, Feed-Forward Neural Network, Simulated Annealing, Simulated Annealing-Back-Propagation Feed-Forward Neural Network

Corresponding author's e-mail: ardalan.husin@gmail.com
Received: 10-03-2017; Accepted: 25-03-2017; Published: 29-08-2017
DOI: 10.21928/uhdjst.v1n2y2017.pp31-36; E-ISSN: 2521-4217; P-ISSN: 2521-4209
Copyright © 2017 Awlla. This is an open access article distributed under the Creative Commons Attribution Non-Commercial No Derivatives License 4.0 (CC BY-NC-ND 4.0).
Original Research Article.

1. Introduction
The convenience of credit cards is common in modern society. Credit card use has expanded among clients because credit card payment is a key and convenient way to pay, used for both online and conventional shopping. Due to the expansion and fast advancement of fields such as e-commerce, the use of credit cards has also grown dramatically [1]. As credit card use grows, credit card fraud also increases. Fraud is characterized as a prohibited activity by a client for whom the account was not intended [2]: such clients use the credit card without any relationship to the cardholder and have no intention of repaying the purchases they have made. At present, commercial fraud is becoming a serious issue, and successful detection of credit card fraud is a difficult task for experts [3]. Identifying credit card fraud is hard when applying traditional methods; therefore, the development of credit card fraud detection models has recently grown in significance in both academia and industry. Credit card fraud detection is a classification and identification problem with a large number of non-linear situations, which makes it important to consider non-linear, integrated ways of solving the problem [4].

An artificial neural network (ANN) is a mathematical description of the network of neurons in the brain and shares related functionality, such as accepting inputs, processing them, and producing outputs [5]. It takes the form of a connected graph of nodes joined by weighted links, analogous to biological neurons. There are different ANN models, for example, the feed-forward neural network (FFNN), the multi-layer perceptron, the Kohonen network, and the adaptive resonance network. The first two work as classifiers, i.e., they can learn from patterns and the learning can be directly supervised, whereas the other nets learn from observation and later update the network weights, serving as unsupervised learning systems, as seen in clustering. In this paper, an FFNN has been developed for classification purposes. An FFNN allows information to pass from the input layer to the output layer along a feed-forward path through the hidden layer(s) [6]. All FFNNs, as stated, can be trained in a supervised way so that they learn the feature patterns present in the data. To achieve the desired accuracy of class prediction, proper training is compulsory. During training, the purpose is to make the network learn the features as well as possible, which is reflected in reducing the squared error (i.e., the squared difference between the calculated and the desired output). There are various algorithms to optimize such learning. Back-propagation (BP) is one of the standard traditional ANN training algorithms for supervised learning: the weights are adjusted and updated with the generalized delta rule to minimize the prediction error over the iterations, and the weight-update procedure propagates the errors back from the output layer into the hidden layer(s), thereby obtaining the optimal set of weights [7]. Simulated annealing (SA) is a probabilistic meta-algorithm for global optimization [8]. It parallels the physical process in which a solid is slowly cooled until its structure is frozen, which occurs at a minimum-energy configuration [9]. Similarly to the BP algorithm, in SA the weights move from configuration to configuration according to a rule until they reach the global minimum [10].
There are also various other optimization methods, such as evolutionary algorithms, for example the genetic algorithm, particle swarm optimization, and genetic programming (GP); these are beyond the scope of this paper. The principal purpose of this paper is to experimentally compare the performance of a hybrid of SA and BP against BP alone in the FFNN structure for detecting credit card fraud.

2. Feature Selection
The essential step in developing credit card fraud detection is extracting the key features, as they influence the recognition rate and the false alarms. By selecting suitable features, data storage requirements are also reduced, so training and classification over the data set become more efficient under a constant environment. The example dataset used here was obtained from a data mining blog. This dataset includes the transactions of 20,000 active credit card holders over recent months. The input fields include credit card ID, authentication type, current balance, average bank balance, book balance, total number of times the credit card was used, and eight distinct cardholder categories such as overdraft, average overdraft, number of locations of usage, and so on. The data set essentially gives the record of the cardholders' transactions without stating whether the transactions were legitimate or fraudulent. For a given cardholder, based on the following critical values computed from the dataset, we can identify which transactions are legitimate and which are fraudulent:
1. Based on credit card usage frequency: the frequency can be found as (total number of card uses) / (cardholder age); if the result is <0.2, this property is not relevant for fraud.
2. Based on the number of locations of credit card usage: the number of locations at which the credit card was used per day so far is obtained from the dataset; if the number of locations is <5, this property is not relevant for fraud.
3. Based on the credit card average overdraft: with respect to the card uses that have occurred so far, the average overdraft can be found as (number of overdrafts) / (total number of card uses); if the overdraft with respect to card uses is <0.02, this property is not relevant for fraud.
4. Based on the credit card book balance: the book balance can be found as (current book balance) / (average book balance); if the book balance is equal to or <0.25, this property is not relevant for fraud (Table I).

Table I. Sample of the dataset
  Transaction no.               1        2        3        4        5
  Credit card ID             11111    11112    11113    11114    11115
  Authentication type          111      112      113      114      115
  Current balance            20000    25000    15000   100000    15000
  Average bank balance       80000    55000    70000    60000    61000
  Book balance                0.25   0.4545    0.214   1.6666    0.245
  Total number card used        13       40       21       90       85
  Overdraft                      4       20        3       29       17
  Average overdraft          0.3076      0.5    0.142   0.3222      0.2
  Number of location usage       3        4        2       11        3
  Amount of transaction       9000    15000     8500    12000    19000
  Card holder age               25       64       50       21       43
  Average daily balance       2666     1833     2333     2000     2033
  Card frequency               0.52    0.625     0.16   4.2857     0.18
  Card holder marital status     0        1        0        1        1
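The four screening criteria above can be turned into simple feature flags. The sketch below only illustrates that logic, using field names taken from Table I; the thresholds are the ones stated above, and the exact semantics of the original blog dataset are assumed rather than documented.

# Illustrative computation of the four fraud-relevance flags described above.
# Field names follow Table I; the dataset's exact semantics are assumed.

def fraud_relevance_flags(record):
    frequency = record["total_cards_used"] / record["card_holder_age"]
    avg_overdraft = record["overdraft"] / record["total_cards_used"]
    book_balance = record["current_balance"] / record["average_bank_balance"]
    return {
        "frequency_relevant": frequency >= 0.2,
        "locations_relevant": record["locations_used"] >= 5,
        "overdraft_relevant": avg_overdraft >= 0.02,
        "book_balance_relevant": book_balance > 0.25,
    }

# Example: transaction no. 4 from Table I (all four flags come out True)
example = {"total_cards_used": 90, "card_holder_age": 21, "overdraft": 29,
           "current_balance": 100000, "average_bank_balance": 60000,
           "locations_used": 11}
print(fraud_relevance_flags(example))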
mse = (1/n) Σ_{i=1}^{n} (t_i − o_i)^2 (2)

where t_i is the target output and o_i the estimated output for instance i. in the next step, besides the mse, the network error is back-propagated from the output layer to the input layer, and the individual connection weights are updated using a "generalized Δ rule" that involves a learning rate (η) and a momentum constant (α) [11]. equations 3 and 4 give the weight-update rule, in which "w" denotes the weight on the connection between nodes "i" and "j" and "t" is the iteration index.

Δw_{i,j}^{t} = −η ∂mse/∂w_{i,j}^{t} + α Δw_{i,j}^{t−1} (3)

w_{i,j}^{t+1} = w_{i,j}^{t} + Δw_{i,j}^{t} (4)

to find the value of η that gives the least mse, a rigorous parametric search has been conducted in this research [11]. the η at which the error is minimal is the one used in combination with the sa algorithm. in this research, the momentum constant (α) is fixed at 0.9 for all cases to speed up the learning, and the epoch size is fixed at 1500.

b. simulated annealing algorithm
the critical parameter for sa is the temperature (t), which is the analogue of the temperature in a physical system. beginning at a high t, the algorithm decreases t continuously until the minimum t is reached, attaining a thermal-equilibrium state at each t [8]. at any t, the weights are perturbed randomly. a new set of weights is accepted as the current optimized set if its mse is lower than that of the previous set, or otherwise with a probability that allows the present set of weights to keep moving toward the global minimum. as in bp, the same cost function and transfer functions are used. in this research, it is assumed that once the number of accepted changes to the weight set exceeds 10, or the number of iterations exceeds 1500, the equilibrium state at the current t has been reached. the initial t is set to 10°c and the final t to 1°c, and t is reduced by a factor of 0.95; these values were chosen arbitrarily because it is difficult to obtain exact values for the initial t, the final t, and the rate of decrease of t. the implementation algorithm is as follows, where e(s) denotes the cost (evaluation) function.

sa algorithm
1) set the initial solution s
2) set the initial temperature t
3) while not terminated do
4) repeat k times
5) choose s′ randomly from the neighbourhood n(s_i)
6) δe = e(s′) − e(s_i)
7) if (δe ≤ 0) then
8) s_{i+1} ← s′
9) else, with probability e^{−δe/t},
10) s_{i+1} ← s′
11) end if
12) end repeat
13) decrease t
14) end do

table i: sample of the dataset
transaction no.             1       2       3       4       5
credit card id              11111   11112   11113   11114   11115
authentication type         111     112     113     114     115
current balance             20000   25000   15000   100000  15000
average bank balance        80000   55000   70000   60000   61000
book balance                0.25    0.4545  0.214   1.6666  0.245
total number card used      13      40      21      90      85
overdraft                   4       20      3       29      17
average overdraft           0.3076  0.5     0.142   0.3222  0.2
number of location usage    3       4       2       11      3
amount of transaction       9000    15000   8500    12000   19000
card holder age             25      64      50      21      43
average daily balance       2666    1833    2333    2000    2033
card frequency              0.52    0.625   0.16    4.2857  0.18
card holder marital status  0       1       0       1       1

fig. 1. description of the ffnn developed.

c. hybrid algorithms
to overcome the local-minimum problem of bp caused by the random initial weights of the network, various optimization algorithms have been tried by numerous researchers; these improve classification performance at the cost of longer implementation time.
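the annealing loop above maps directly onto code. the following is a minimal sketch in python, not the paper's implementation: the cost function is a stand-in for the ffnn training mse, and the gaussian perturbation used as the neighbourhood move and its step size are assumptions; the cooling factor 0.95, the 10-accepted-changes / 1500-iteration equilibrium rule, and the temperature range 10 down to 1 follow the values stated in the text.

```python
import math
import random

def simulated_annealing(initial_weights, cost, t_start=10.0, t_end=1.0,
                        cooling=0.95, max_changes=10, max_iters=1500,
                        step=0.1):
    """anneal a flat list of network weights against a cost function.

    cost: callable taking a weight list and returning its error
          (a placeholder here for the ffnn training mse).
    """
    s = list(initial_weights)
    best, best_e = list(s), cost(s)
    t = t_start
    while t >= t_end:
        changes = 0
        for _ in range(max_iters):                 # equilibrium loop at this t
            s_new = list(s)                        # neighbour move (assumed):
            k = random.randrange(len(s_new))       # perturb one random weight
            s_new[k] += random.gauss(0.0, step)
            delta_e = cost(s_new) - cost(s)
            if delta_e <= 0 or random.random() < math.exp(-delta_e / t):
                s = s_new                          # accept the new weight set
                changes += 1
                if cost(s) < best_e:
                    best, best_e = list(s), cost(s)
            if changes >= max_changes:             # assumed equilibrium rule
                break
        t *= cooling                               # cooling schedule
    return best

# toy usage with a stand-in quadratic cost instead of the real ffnn error
weights0 = [random.uniform(-1, 1) for _ in range(5)]
print(simulated_annealing(weights0, lambda w: sum(x * x for x in w)))
```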
in this paper, i hybridized two algorithms, joining the global-search sa algorithm with a local-search gradient algorithm, which overcomes the local-minimum problem with good generalization and fast convergence. the hybrid sa-bp is a training algorithm that combines the sa algorithm with the bp algorithm. sa is a global optimization algorithm with a powerful ability to explore the whole search space; its drawback is that the search for the global optimum solution is slow. in contrast, bp has an exact and fast local searching capability for finding the local optimum, but it tends to get stuck and fails to discover the global optimum in a complex search space. by joining sa and the gradient-based bp algorithm, a new algorithm referred to as the hybrid sa-bp algorithm is obtained, as shown in fig. 2. the suggested hybrid algorithm has two stages: in the first, global-search stage, the ffnn is trained using the sa algorithm for a few predefined temperatures or until the training error falls below some predefined value; training then switches to the second stage, which searches locally using a deterministic technique, the bp algorithm. in this paper, the sa-bp hybrid training algorithm is adopted as a strong alternative to the plain bp algorithm. the following steps give the pseudo-code for the hybrid sa-bp algorithm:
1. randomly initialize the weights of the ffnn shown in fig. 1
2. evaluate the weights using sa applied to the neural network, following a temperature annealing schedule
3. while the temperature is above its minimum value and the error is above the minimum error (otherwise select the best solution for the mlp and go to step 7)
4. select a moving method with some probability
5. try a new solution
6. evaluate it
7. select the best solution
8. initialize the parameters of the bp learning algorithm
9. initialize the weights of the mlp using the best solution found by sa
10. while the new epoch is less than or equal to the maximum epoch and the error has not converged to the minimum error do
11. update the weights using bp to minimize the error on the training data
12. end while
13. assess the classification performance with the test data
14. end while

5. experimental study
the records obtained are separated into two sections. the first section, about eight hundred records, is used to train the sa and bp neural network module. the second section, about two hundred records, is used to test the credit card fraud detection. the efficiency of the neural network relies on the number, type, and quality of the features and on the learning algorithm used to train it. hence, to evaluate the performance of a credit card fraud recognition strategy, we have to provide a quantitative measure. in our credit card fraud detection system, we mainly classify the records into two categories, normal and abnormal. hence, we need the numbers of true positives, true negatives, false positives, and false negatives to define the true-positive rate (tpr) and the false-positive rate (fpr), which can be calculated using the following equations [5], [6] (a short computation sketch is given after this paragraph).

tpr = tp / (tp + fn) (5)

the tpr measures the performance of the credit card fraud detection technique in terms of the probability that suspect data are correctly reported as abnormal. the fpr, in turn, measures the performance of the technique in terms of the probability that normal traffic is reported as abnormal.
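a minimal sketch of these two rates computed from confusion-matrix counts; equation 5 gives the tpr, while the fpr formula fp / (fp + tn) is the standard definition and is assumed here because the corresponding equation is not reproduced in the text. the counts in the example are hypothetical.

```python
def detection_rates(tp, fn, fp, tn):
    """compute detection metrics from confusion-matrix counts.

    tpr follows equation 5 in the text; the fpr formula is the standard
    definition (assumed, since the paper's second equation is not shown).
    """
    tpr = tp / (tp + fn)   # suspect records correctly flagged as abnormal
    fpr = fp / (fp + tn)   # normal records wrongly flagged as abnormal
    return tpr, fpr

# hypothetical counts for a 200-record test split
print(detection_rates(tp=85, fn=15, fp=8, tn=92))
```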
as introduced, the temperature parameter ranges from 10°c down to 1°c. the equilibrium state at each temperature is assumed after every 10 changes made to the set of weights or 1500 iterations; the momentum is 0.9, the learning rate is 0.7, the maximum error to reach is 0.01, and the weight and threshold values are randomly initialized before training. consequently, figs. 3 and 4 show the training and test results for the detection rate and the fpr of bp and sa-bp (table ii). the experimental results in fig. 3 clearly show that sa-bpffnn is more reliable in detecting credit card fraud than bpffnn. furthermore, fig. 4 shows that sa-bpffnn significantly reduces the false-positive rate compared with bpffnn.

fig. 2. structure of the hybrid algorithm for classification.
fig. 3. detection rate of sa-bpffnn and bpffnn.

6. conclusion
as the use of credit cards becomes increasingly common in every field of everyday life, credit card fraud has become much more rampant. to enhance the security of financial transaction frameworks in an automated and effective way, constructing an accurate and efficient credit card fraud detection framework is one of the key efforts for financial institutions. credit card fraud detection is a classification and recognition problem. this paper hybridizes the sa algorithm with a bpffnn for fraud detection, where the neural network learns from a large dataset used for training and for examining the detection results. the analysis showed that bp is a simple algorithm prone to local minima, while sa is a good global search and optimization algorithm; on this basis, the experimental results indicate that the accuracy of bpffnn is lower than that of the sa-bpffnn algorithm.

references
[1] n. s. halvaiee and m. k. akbari. "a novel model for credit card fraud detection using artificial immune systems." applied soft computing, vol. 24, pp. 40-49, nov. 2014.
[2] c. yin, a. h. awlla, z. yin and j. wang. "botnet detection based on genetic neural network." international journal of security and its applications, vol. 9, pp. 97-104, nov. 2015.
[3] v. van vlasselaer, c. bravo, o. caelen and b. baesens. "a novel approach for automated credit card transaction fraud detection using network-based extensions." decision support systems, vol. 75, pp. 38-48, jul. 2015.
[4] d. sanchez, m. a. vila, l. cerda and j. m. serrano. "association rules applied to credit card fraud detection." expert systems with applications, vol. 36, pp. 3630-3640, 2009.
[5] s. suganya and n. kamalraj. "a survey on credit card fraud detection." international journal of computer science and mobile computing, vol. 4, pp. 241-244, nov. 2015.
[6] j. bernal and j. torres-jimenez. "sagrad: a program for neural network training with simulated annealing and the conjugate gradient method." journal of research of the national institute of standards and technology, vol. 120, pp. 113-128, 2015.
[7] s. j. subavathi and t. kathirvalavakumar. "adaptive modified backpropagation algorithm based on differential errors." international journal of computer science, engineering and applications, vol. 1, no. 5, pp. 21-33, oct. 2011.
[8] a. t. kalai.
“simulated annealing for convex optimization.” mathematics of operations research, vol. 31, pp. 253-266, 2006. [9] c. m. tan, ed. simulated annealing. vienna, austria: in-teh is croatian branch of i-tech education and publishing kg, sep. 2008. [10] s. h. zhan, j. lin, z. j. zhang and y. w. zhong. “list-based simulated annealing algorithm for traveling salesman problem.” computational intelligence and neuroscience, vol. 2016, pp. 12, mar. 2016. [11] n. a. hamid, n. m. nawi, r. ghazali and m. n. m. salleh. “solving local minima problem in back propagation algorithm using adaptive gain, adaptive momentum and adaptive learning rate on classification problems,” international conference mathematical and computational biology. malacca, malaysia, pp. 448-455, apr. 2011. table ii summaries the result during training and testing desired output sa-bp yes 0 1 0 1 0 no 0 0 1 0 1 maybe 1 0 0 0 0 actual output sa_bp yes 0.033 0.9723 0.023 0.9865 0.0146 no 0.3687 0.0173 0.969 0.0075 0.867 maybe 0.5983 0.0104 0.008 0.006 0.1184 desired output bp yes 1 0 1 0 0 no 0 1 0 1 0 maybe 0 0 0 0 1 actual output bp yes 0.0035 0.9148 0.0253 0.9517 0.1277 no 0.4019 0.073 0.8770 0.0303 0.870 maybe 0.5946 0.0122 0.0977 0.0180 0.0023 fig. 4. false positive rate of sa-bpffnn and bpffnn tx_1~abs:at/tx_2:abs~at uhd journal of science and technology | jan 2023 | vol 7 | issue 2 1 1. introduction the sheep population in iraq in 2020 was about 7 million head [1]. most of this population (99.9%) is owned by the private sector [2] and is distributed all over the iraq. the native breeds include the awassi, arabi, karadi, and hamadni sheep. one of the important native species of sheep in kifri region is the awassi sheep, which is abundant in this region. the condition of herding in kifri city and the presence of a large nomadic population in this area indicates that most of the sheep grazing is done in the pastures and the ranchers tried to make the most of it in the hot seasons. because ticks spend a relatively short time of their life cycle on the host, and they spend a long time apart from the host on the surface of pastures. as the climate of the region becomes favorable for the growth and appearance of ticks during the period of livestock grazing in the pastures, various types of blood protozoa cause contamination and the sheep suffer from protozoan diseases, especially identification of blood protozoa infestation transmitted by vector tikes among awassi sheep herds in kifri city, kurdistan region of iraq mahmood ahmad hossein* department of animal production, collage of agricultural engineering science, university of garmian, kalar, as-sulaymaniyah, krg, iraq a b s t r a c t blood protozoan disease is a common disease among animals in the kifri city, kurdistan region of iraq that this disease is mostly transmitted by ticks. therefore, the present study aimed to investigate the level of blood protozoan and to identify vector ticks in the native breed sheep (awassi sheep) in kifri city. for this purpose, blood samples were taken from 150 sheep suspected suffering from protozoan infection according to their clinical symptoms. in the present study, we prepared blood slides from suspected sheep and stained with giemsa staining, and then at the same time, hard ticks were collected from the sheep’s body. then, the protozoan type was diagnosed and the vector tick species were identified by microscopically. the obtained results were statistically analyzed by the chi-square test. 
the results showed that 35 (23.33%) of that samples were infected with babesia protozoa as 25 samples (16.66%) were infected with babesia ovis, seven samples (4.66%) with babesia mutasi, and three samples (2%) with b. ovis and b. mutasi. no infestation with theileria and anaplasma species was found. rhipicephalus, hyalomma, dermacentor, and haemaphysalis ticks were isolated and identified from the studied sheep. the results showed that the presence of the rhipicephalus bursa tick is significantly (p < 0.05) related to the existence of babesiosis disease in sheep. this study concluded that most of the studied sheep in kifri city are infected with babesia protozoa, especially b. ovis. index terms: babesia ovis, babesia mutasi, kifri, rhipicephalus bursa, sheep corresponding author’s e-mail: dr. mahmood ahmad hossein, assistant professor, department of animal production, college of agricultural engineering science, university of garmian, kalar, as-sulaymaniyah, krg, iraq. e-mail: mahmood.ahmad@garmian.edu.krd received: 28-11-2022 accepted: 17-06-2023 published: 08-08-2023 access this article online doi: 10.21928/uhdjst.v7n2y2023.pp1-5 e-issn: 2521-4217 p-issn: 2521-4209 copyright © 2023 mahmood ahmad hossein. this is an open access article distributed under the creative commons attribution noncommercial no derivatives license 4.0 (cc by-nc-nd 4.0) o r i g i n a l re se a rc h a rt i c l e uhd journal of science and technology hossein: identification of blood protozoa infestation transmitted by vector ticks 2 uhd journal of science and technology | jan 2023 | vol 7 | issue 2 babesiosis. babesia ovis and babesia mutasi are among the most common causes of babesiosis in sheeps [3], [4]. babesia crassa from iraq, babesia foliata from india, and babesia taylori from pakistan have been reported as non-pathogenic babesia [5]. b. mutasi is found in southern europe, southern africa, the middle east, caucasus, southeast asia, mediterranean coastal areas, and other regions with warm and moderate climates [6], [7]. sheep and goats are considered the main hosts for them. haemaphysalis punctata, rhipicephalus bursa, rhipicephalus sanguineous, and ixodes ricinus ticks are vector parasites [8], [9]. sheep and goats are the main hosts of b. ovis. this parasite is spread throughout the tropical and subtropical regions, as well as in southern europe, the former soviet union, eastern europe, north africa, the equatorial region, and western asia [10], [11]. the vector of babesia ripe is cephalus bursa tick, which is a two-host tick [12]. the hyalomma anatolicum excavatum, i. ricinus, rhipicephalus turanicus, and rhipicephalus sanguineus ticks were also reported as vectors of b. ovis [8]. b. ovis is the most important cause of babesiosis in europe [13]. theileria hirci is the cause of malignant theileriosis in sheep and goats, and the ticks c. bursa and hyalomma anatomical are its vectors. these protozoa are found in lymphocytes and red blood cells of small ruminants. theileria ovis causes a mild disease in small ruminants and is transmitted by species of c. bursa tick. based on the results of the studies, diagnosis of parasites is possible by preparing slides from blood and lymphatic glands [14]. the disease caused by anaplasma ovis is called tropical anaplasmosis of small ruminants. the distribution of this parasite is related to the distribution of its most important carriers, including the rhipicephalus bursa in the mediterranean region and the rhipicephalus ortisi in the tropical regions of africa [6]. 
other studies suggested that the distribution of b. mutasi was reported to be limited to the northwestern regions of iraq [15]. mosqueda et al. also believe that sheep babesiosis caused by b. ovis is spread all over iraq and is considered an acute disease in iraqi sheep [16]. survey of seroepidemiology of b. ovis in sheep in climatic regions of iraq using indirect brilliant antibody test shows that 36% of sheep had a positive serum titer [17]. considering the economic losses due to protozoan diseases, especially babesiosis in sheep, paid for this. for this reason, the present study was conducted to investigate the contamination of blood protozoa and to identify the vector ticks in awassi sheep in kifri region. 2. materials and methods this study was conducted in the summer of 2020 in the villages of kifri city, kalar, kurdistan region of iraq. sampling carried out on 150 awassi sheep (39 male and 111 female sheeps) that were suspected of protozoan infestation and had the disease symptoms. general clinical examinations were performed on the sheep introduced by the owner. sampling was collected only from the sheep that had symptoms of illness such as depression, anorexia, high fever (40–41°c) or had jaundice, and urine nails and also had respiratory symptoms such as tachypnea and tachycardia. after sampling, one slide was prepared from each sample. the slides were dried in the air and sent to the laboratory. in the laboratory, the slides were stained with giemsa’s stain and then examined. if objects were observed in the desired slide, the parasites were measured in microns with a calibrated optical micrometer. to collect the tick sample, the target sheep was laid on the ground. then, first, the area below and around the tail were visually inspected, and in the second step, in the side, chest, around the chest, back of the legs, and ears, respectively. the ticks were collected by the angle they were attached to the host so that their oral appendages remain intact. then, they were transferred to the sampling container containing 10% formalin and the containers were labeled. during sampling, animal characteristics such as the area, the date of sampling, the animal owner, the number of samples and clinical symptoms, the presence or absence of jaundice, and blood from the animal’s urine were recorded in the sampling handbook. in this study, babesia and ticks species were identified morphologically based on the guidelines of william et al. [18] and zajac and conboy [19]. the data of the present study were analyzed using sas software. 3. results the results of the present study showed that 35 (23.33%) the samples were infected with babesia protozoa and that 25 samples (16.66%) were infected with b. ovis, seven samples (4.66%) with b. mutasi, three samples (2%) with b. ovis, and b. mutasi (fig. 1). in this study, the samples infected with babesia theileria and babesia anaplasma were not found. based on the results of our findings, b. mutasi is pear-shaped, 2.5–4 microns long and two microns wide, and b. ovis is mostly round and has 1–1.5-micron red blood cells on the sides. there is a hole in the center of the parasite, and thus, it takes the shape of a ring. pear-shaped bodies are relatively rare and are seen as pairs with open angles in the margin of red blood cells (figs. 2 and 3). hossein: identification of blood protozoa infestation transmitted by vector ticks uhd journal of science and technology | jan 2023 | vol 7 | issue 2 3 fig. 1. 
the rate of infection of babesia protozoa among native sheep in kifri city. out of 150 samples infected with babesia protozoa, 39 samples were from male sheep (26%), and 111 samples were from female sheep (74%) (table 1). out of 39 samples of male sheep infected by babesia protozoa, seven samples (4.66%) were infected with b. ovis. out of 111 samples of female sheep infected by babesia protozoa, 24 samples (68.58%) were infected with b. ovis, one sample (2.58%) with b. mutasi, and three samples (8.57%) with b. ovis and b. mutasi (table 1). out of 150 samples of infected sheep in this study, 96 samples of sheep were infected with ticks, and a total of 204 ticks were isolated from them. out of this number, 130 rhipicephalus ticks (63.72%) were found among hard ticks, and the highest percentage of sheep infection with ticks in kifri city is attributed to rhipicephalus ticks. in addition to rhipicephalus tick, other species of ticks were detected on the infected sheep that their infection percentages are as follows: hyalomma tick 51 samples (25%), dermacentor tick 13 samples (6.37%), and haemaphysalis tick 10 samples (4.9%) (fig. 4). out of 130 samples of rhipicephalus ticks, 112 samples of r. bursa, 17 samples of r. sanguineus, and one sample of r. turanicus were identified. thirteen samples of dermacentor tick belonged to the species dermacentor marginatus and ten samples of haemaphysalis tick belonged to the species haemaphysalis punctata. out of 51 hyalomma ticks, 26 samples were hyalomma asiaticum asiaticum, 17 samples were h. anatolicum anatolicum, seven samples were hyalomma marginatum and one sample was hyalomma atatolicum exquatum. the mean of intensity of ticks on each head of the sheep in kifri city was 1.36 ticks, and the mean of intensity of ticks on each head of the sheep infested with babesia protozoa was 2.7 ticks. 4. discussion b. ovis is highly pathogenic, especially in sheep and causes a severe infection that is characterized by fever, anemia, icterus, and hemoglobinuria with mortality rates ranging from 30% to 50% in the susceptible host during field infections [20], [21]. due to its severe effect on the homeotic system, it has caused significant losses among small ruminants, especially sheep in kifri city. therefore, the present study aimed to investigate the infestation of blood protozoa and to identify the vector ticks in awassi sheep in kifri region. the results of the present study showed that the sheep in kifri region are mostly infected with b. ovis species (16.66%), and the highest percentage of infection with external hard ticks is fig. 2. the blood film of sheep stained with giemsa contains the trophozoite of babesia mutasi (×100). fig. 3. the blood film of sheep stained with giemsa contains the trophozoite of babesia ovis (×100). hossein: identification of blood protozoa infestation transmitted by vector ticks 4 uhd journal of science and technology | jan 2023 | vol 7 | issue 2 related to rhipicephalus (63.72%). the results of the present study indicate the predominance of b. ovis species in sheep infected with babesia protozoa in the kifri area. these results are consistent with the results of tousli and rahbari [22], which reported that 41.6% of sheep in the kurdistan region of iran were infected with b. ovis. infestation with b. ovis is severe in some areas. the infection of sheep in greece with b. ovis was reported to be 52% [23]. furthermore, 72% of sheep in the samson region of turkey were infected with b. ovis [24]. 
as mentioned, the results obtained from this research are consistent with the results reported from iran and turkey, and the dominant species of this protozoan in these regions is b. ovis. one of the main reasons for this issue is the neighborhood of these areas. due to the closeness of these areas, there are a lot of transfers and sales of sheep between ranchers. paying attention to the fact that the information obtained from this research, from a statistical point of view, is mostly qualitative data. hence, if we calculate the probability of disease transmission by all the hard ticks found in the area in comparison with the disease transmission by the statistical population of rhipicephalus species by the chi-square test, there is a significant difference between the transmission of babesiosis disease by the rhipicephalus tick compared to its transmission by all other ixodidae ticks (dermacentor, haemaphysalis, and hyalomma ticks) in the region (p < 0.05). considering that the transmission of babesia disease by ticks has been proven, it can be assumed that the sheep that are infected with babesia and are tick-free; there is a possibility that the tick was separated from the host after feeding. furthermore, in cases where the animal shows the symptoms of the disease, but the protozoa have not been isolated from its blood, such a case cannot be a negative reason for babesiosis disease in this sheep. this probably indicates the presence of a small number of babesia protozoa inside the sheep erythrocytes, which makes their identification difficult at this stage. in this case, it is better to repeat the sampling with a longer time interval. there are different opinions about the severity and pathogenicity of the babesia species. the reason for these reports is probably the long-ter m contamination of livestock in the region and finally the creation of relative immunity against some strains of protozoa. therefore, there are strains with less intensity than any of the species of b. mutasi and b. ovis in different regions. however, in case of double infestation (b. mutasi and b. ovis), the disease will appear in a more severe form iqbal et al. [17]. the investigations carried out at the time of sampling as well as the results obtained in the present study showed that the seasonal abundance of ticks on sheep starts from the end of january and reaches its peak in the middle of march. it seems that due to the warm weather in the kifri region, the activity time of ticks is shorter and the maximum infection with babesia in sheep is in february. in totally, babesiosis in sheep specially caused by b. ovis can be considered as an emerging disease in kifri city. 5. conclusion our finding showed that the common blood protozoan that causes sheep infection is b. ovis in the kifri area. furthermore, the predominant tick among infected sheep in the study area is rhipicephalus tick, and the infection rate of the sheep with the tick was higher than babesiosis species in kifri area. table 1: distribution of absolute and relative frequency of sheep infected with babesia protozoa, separated by species of sheep and babesia species the number of samples (male and female animal) babesia species infected male sheep infected female sheep number % number % 150 babesia ovis 7 4.66 24 68.58 babesia mutasi 1 2.58 babesia ovis and babesia mutasi 3 8.57 fig. 4. frequency of hard ticks identified from infected sheep in the present study. 
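the chi-square comparison described in the discussion above (rhipicephalus versus the other ixodidae ticks as vectors of babesiosis) can be reproduced with any statistics package; the paper used sas. the following is a minimal sketch in python with scipy, purely illustrative: the 2 x 2 counts are invented, since the paper does not report the underlying cross-tabulation.

```python
from scipy.stats import chi2_contingency

# hypothetical 2 x 2 contingency table (invented counts, for illustration only)
# rows    : tick genus found on the sheep (rhipicephalus vs. other ixodidae)
# columns : babesia detected in the blood film (yes, no)
observed = [[25, 35],   # rhipicephalus-infested sheep
            [ 6, 34]]   # sheep carrying only other hard ticks

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi-square = {chi2:.2f}, dof = {dof}, p = {p_value:.4f}")
if p_value < 0.05:
    print("significant association between rhipicephalus infestation and babesiosis")
```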
hossein: identification of blood protozoa infestation transmitted by vector ticks uhd journal of science and technology | jan 2023 | vol 7 | issue 2 5 6. acknowledgment the authors would like to deeply thank the all ranchers who allowed us to gather specimens from their husbandry and equally grateful to the authorities of the head of veterinary lab of garmian university who allow us free access to laboratory facilities which led to the performing of the current research. references [1] fao. “quarterly bulletin of statistics”. vol. 1. fao, rome, italy, 2020, p. 234. [2] ministry of planning, “means and prospects and sheep and goat development in iraq”, 2022, p. 124. [3] q. liu, y. q. zhou and d. n. zhou. “semi-nested pcr detection of babesia orientalis in its natural hosts rhipicephalus haemaphysaloides and buffalo”. veterinary parasitology, vol. 143, pp. 260-266, 2007. [4] j. y. kim, s. h. cho, h. n. joo, m. s. r. cho, m. tsuji, i. j. park, g. t. chung, j. w. ju, h. i. cheun, h. w. lee, y. h. lee and t. s. kim. “first case of human babesiosis in korea: detection and characterization of a novel type of babesia sp. (ko1) similar to ovine babesia”. journal of clinical microbiology, vol. 45, pp. 20842087, 2015. [5] s. naz, a. maqbool, s. ahmed, k. ashraf, n. ahmed, k. saeed, m. latif, j. iqbal, z. ali, k. shafi and i. a. nagra. “prevalence of theileriosis in small ruminants lahore-pakistan”. journal of veterinary and animal science, vol. 2, pp. 16-20, 2012. [6] k. altay, m. aktas and n. dumanli. “detection of babesia ovis by pcr in rhipicephalus bursa collected from naturally infested sheep and goats”. research in veterinary science, vol. 85, pp. 116-119, 2007. [7] a. cakmack, a. inci and z. kararer. “seroprevalence of babesia ovis in sheep and goats on cankiri region”. acta parasitologica turcica, vol. 22, pp. 73-76, 2020. [8] e. j. l. soulsby. “helminth, arthropoda and protozoa of domesticated animals”. vol. 14. bailler tindall, london, 1982, pp. 456-471. [9] b. fivaz, t. petney and i. horak. “tick vector biology medicine and veterinary aspects”. vol. 45. springer-verlag, berlin heidelberg, 2020, p. 28. [10] b. a. allsopp, h. a. baylis, m. t. allsopp, t. cavalier-smith, r. p. bishop, d. m. carrington, b. sohanpal and p. spooner. “discrimination between six species of theileria using oligonucleotide probes which detect small subunit ribosomal rna sequences”. parasitology, vol. 107, pp. 157-165, 1993. [11] s. durrani, z. khan, r. m. khattak, m. andleeb, m. ali, h. hameed, a. taqddas, m. faryal, s. kiran, m. riaz, r. s. shiek, m. ali, f. iqbal and m. andleeb. “a comparison of the presence of theileria ovis by pcr amplification of their ssu rrna gene in small ruminants from two provinces of pakistan”. asian pacific journal of tropical disease, vol. 2, pp. 43-47, 2012. [12] a. inci, a. ica, a. yildirim and o. duzlu. “identification of babesia and theileria species in small ruminants in central anatolia (turkey) via reverse line blotting”. turkish journal of veterinary and animal sciences, vol. 34, pp. 205-210, 2010. [13] k. t. freiedhoff. “tick-borne disease of sheep and goats caused by babesia, theileria or anaplasma spp”. parassitologia, vol. 39, pp. 99-109, 1997. [14] d. nagore, j. garcía-sanmartín, a. l. garcía-pírez and r. a. juste and a. hurtado. “identification, genetic diversity and prevalence of theileria and babesia species in a sheep population from northern spain”. international journal for parasitology, vol. 34, pp. 10591067, 2004. [15] a. rafiai. 
“veterinary and comparative entomology”. current medicinal chemistry, vol. 19, pp. 1504-1518, 2012. [16] j. mosqueda, a. olvera-ramirez, g. aguilar-tipacamu and g. j. canto. “current advances in detection and treatment of babesiosis”. current medicinal chemistry, vol. 19, pp. 1504-1518, 2012. [17] f. iqbal, m. ali, m. fatima, s. shahnawaz, s. zulifqar, r. fatima, r. s. shaikh, a. s. shaikh, m. aktas and m. ali. “a study on prevalence and determination of the risk factors of infection with babesia ovis in small ruminants from southern punjab (pakistan) by pcr amplification”. parasite, vol. 18, pp. 229-234, 2011. [18] l. william, n. nicholson, n. richard and m. brown. in: “medical and veterinary entomology”. 3rd ed. georgia southern university, statesboro, ga, united states, 2019, pp. 51-65. [19] a. m. zajac and g. a. conboy. “veterinary clinical parasitology”. vol. 7. blackwell publishing ltd., uk, 2000, pp. 172-175. [20] s. kage, g. s. mamatha, j. n. lakkundi and b. p. shivashankar. “detection of incidence of babesia spp. in sheep and goats by parasitological diagnostic techniques”. journal of parasitic diseases, vol. 43, pp. 452-457, 2019. [21] z. s. dehkordi, s. zakeri, s. nabian, a. bahonar, f. ghasemi and f. noorollahi. “molecular and biomorphometrical identification of ovine babesiosis in iran”. iranian journal of parasitology, vol. 5, pp. 21-30, 2010. [22] m. tousli and s. rahbari. “investigation of seroepidemiology in sheep in different regions of iran”. veterinary journal, vol. 53, pp. 57-65, 1998. [23] b. papadopoulos, n. m. perie and g. uilenberg. “piroplasms of domestic animals in the macdonia region of greece. 1. serological cross-reactions”. veterinary parasitology, vol. 63, pp. 41-56, 1995. [24] a. clmak, s. dincer and z. karer. “studies on the serological diagnosis of babesia ovis infection in samsun area”. ankara üniversitesi veteriner fakültesi dergisi, vol. 38, pp. 242-251, 2018. . uhd journal of science and technology | may 2018 | vol 2 | issue 2 1 1. introduction in the past three decades, invasive life-threatening fungal infections have severely increased due to several reasons including broad-spectrum antibiotics, antagonistic surgery, and the use of immunosuppressive and antineoplastic agents [1]-[5]. until the 1940s, comparatively few antifungal agents were available for the treatment of fungal infections. in addition, development in the growth of new antifungals agents was lagged behind the antibacterial investigation, from the year 2000 number of agents existing to treat fungal infections has increased by 30%. nevertheless, still, only 15 agents are approved for clinical use at present [6], [7]. the most common human fungal infection is oral candidiasis (also called oral thrush), which is characterized by an overgrowth of candida species in the superficial epithelium of the oral mucosa [8], [9]. treatment for oral thrush varies, polyenes, allylamines, and azoles are three classes of antifungal agents that used most frequently for treatments of oral thrush [10]. nystatin and amphotericin-b both belong to the polyene’ class of antifungals drug. these class of drugs act by binding to ergosterol in the cell membranes of the fungal; then, this causes in the membrane depolarization and pores formation which increases permeability to proteins and (mono and divalent) cations, disrupting metabolism, and eventually causing cell death [11]. 
both antifungal agents are poorly absorbed by the gastrointestinal tract and are widely used for the topical treatment of oral candidal infections [12]. intravenous forms of amphotericin-b are used in the treatment of systemic fungal infections. similarly, nystatin has low oral bioavailability profile; therefore, it is generally used in inhibiting colonization with candida albicans in the gut or as a topical treatment for thrush [13], [14]. sweetened current antifungal drug recommendations to treat oral thrush in sulaimani city-iraq hezha o. rasul department of chemistry, college of science, university of sulaimani, iraq a b s t r a c t oral thrush or oral candidosis is one of the most widespread fungal infections of the mucous membranes in human. this study aims to evaluate the pattern of recommending three antifungal drugs as follows: nystatin, amphotericin b, fluconazole, and miconazole by the pharmacists and assistant pharmacists, which are used to treat oral thrush. a questionnaire was circulated to a random selection of pharmacies in sulaimani city of iraq between march 2017 and june 2017, and responses to the questionnaire were received from 101 pharmacies. the results were analyzed and demonstrated as the absolute and relative frequencies using statistical package for the social sciences program version 21. among the participants, 65.3% were male, and 34.7% were female. the participant’s age range was 21–70 years. the majority (52.3%) holds a postgraduate degree as their highest educational level, and they graduated after 2010. miconazole and nystatin (70.3%) were the most popular choices of an antifungal agent that pharmacists would use, followed by fluconazole (31.7%) and amphotericin-b (11.9%). index terms: amphotericin b, antifungal agents, fluconazole, nystatin corresponding author’s e-mail: hezha o. rasul, department of chemistry, college of science, university of sulaimani, iraq. e-mail: hezha.rasul@univsul.edu.iq received: 13-11-2017 accepted: 10-05-2018 published: 25-07-2018 access this article online doi: 10.21928/uhdjst.v2n2y2018.pp1-6 e-issn: 2521-4217 p-issn: 2521-4209 copyright © 2018 rasul. this is an open access article distributed under the creative commons attribution non-commercial no derivatives license 4.0 (cc by-nc-nd 4.0) o r i g i n a l r e s e a r c h a r t i c l e uhd journal of science and technology hezha o. rasul: current antifungal drug recommendations to treat oral thrush in sulaimani city-iraq 2 uhd journal of science and technology | may 2018 | vol 2 | issue 2 pastille has been developed to overcome the problem of the unpleasant taste of nystatin [15]. the azole antifungals (miconazole and fluconazole) work through inhibiting cytochrome p-450 enzyme in the fungal [16]. miconazole was the first available azole; fluconazole is a more recently found systemic antifungal agent, which has a long half-life and as a result can be administered in a single daily dose [17]. chlorhexidine is other antimicrobial agents that available for topical administration in oral candidiasis as mouthwash. it is effective against fungal yeasts, which can be used as an adjunctive therapy or as a primary treatment [18]. the aim of the present study was to examine the current practice of antifungal recommending pattern and attitude toward the treatment of oral candidiasis among pharmacists in sulaimani city-iraq during 2017. hence, this project will commence with the treatment of oral thrush by using different types and form of antifungal agents. 2. 
materials and methods a hard copy questionnaire circulated to a random selection of 120 pharmacies. a complete data from 101 participants were returned and integrated into the analysis with 84.1% response rate. data collection was carried out between march 2017 and june 2017, both males and females pharmacies were involved in the different street of the sulaimani city. the pharmacies were visited and asked questions based on their interest to take part in the study; each of these pharmacists was given an explanatory letter of a questionnaire (fig. 1). the questionnaire that was used for data collection in this study was specially created through a search of the relevant literature. the questionnaire was tested initially to estimate approximately the length of the questionnaire in minutes, verify the participant’s interpretation of questions, and develop the questionnaire consequently. these questionnaires were tested in independent data sets; however, these candidate questionnaires were excluded from the concluding analysis. however, the final version of the survey was conducted in sulaimani city. the final version of the questionnaire included eight questions and required approximately 2 min to complete. approved by the ethics committee of university of sulaimani (sulaimani, iraq) was obtained. the selfadministered questionnaire was composed of two sections. the first section of the questionnaire was comprised of seven questions about sociodemographic data, such as gender, age, university degree and year of the last qualification, workplace (private sector vs. public sector), professional practice, and country of the first-degree qualification. various antifungal drug options were integrated into the second section of the questionnaire about pharmacists’ recommendation to treat oral candidal infections. data from the completed questionnaires were entered into a computer database and analyzed using statistical package for the social sciences program version 21. following the statistical evaluation of data and the summarization of frequencies and percentages were produced. 3. results with the use of the hard copies of the questionnaires, different pharmacies have been participated in sulaimani city, and 101 questionnaires were returned completed (84.1% response rate), 65.3% were male, and 34.7% were female pharmacist as shown in table 1. the majority of participants (70.3%) graduated after 2010, while 19.8% graduated between 2000 and 2009. moreover, the participants, who graduated between 1990 and 1999 recorded 6.9%, with a lower proportion (2%) graduating between 1980 and 1989. only 1% graduated between 1970 and 1979. there were no respondents from earlier than 1970. the range of the participant’s age was 21–70 years; more than 70% were aged between 21 and 30 years. the majority (47.5%) holds table 1 sociodemographic data of the participated pharmacists sociodemographic data frequency (%) gender male 66 (65.3) female 35 (34.7) age 21–30 73 (72) 31–40 21 (21) 41–50 3 (3) 51–60 3 (3) 61–70 1 (1) first-degree graduation year after 2010 71 (70.3) 2000–2009 20 (19.8) 1990–1999 7 (6.9) 1980–1989 2 (2) 1970–1979 1 (1) educational level diploma 48 (47.5) undergraduate 30 (29.5) postgraduate (msc, phd) 23 (22.8) workplace private sector 60 (59.4) public sector 3 (3) both (private and public) 38 (37.6) professional practice pharmacist 54 (53.5) assistant pharmacist 47 (46.5) hezha o. 
rasul: current antifungal drug recommendations to treat oral thrush in sulaimani city-iraq uhd journal of science and technology | may 2018 | vol 2 | issue 2 3 a diploma degree as their highest educational level; while an undergraduate and postgraduate level of education observed as 29.5% and 22.8%, respectively. the participants were questioned about their workplace. approximately 60% of the respondents have worked in the private sector whereas public sector recorded only 3%. moreover, 37.6% of the participants were worked in both private and public sectors at the same time. more than half of the participants were pharmacists whereas 46.5% were an assistant pharmacist. the most popular antifungal recommended (table 2), in any form, was nystatin and miconazole each recorded 70.3%, followed by fluconazole and chlorhexidine as 31.7%. moreover, the recommendation for amphotericin was recorded 11.9%. the combination of using miconazole and hydrocortisone cream by the respondents were only 7.9%. however, many participants chose more than one type and/or form of an antifungal drug. in addition, the nature of the questionnaire determined the distinction between participants using simultaneous administration of chlorhexidine and participants using different antifungals for different manifestations of oral candidal infection. the participants who recommended chlorhexidine only 19.8% of them were using it as adjunctive therapy. with regard to the results of the questionnaire as mentioned earlier one of the most popular antifungals recommended was nystatin. in addition to that, the oral suspension was the most fig. 1. the questionnaire. hezha o. rasul: current antifungal drug recommendations to treat oral thrush in sulaimani city-iraq 4 uhd journal of science and technology | may 2018 | vol 2 | issue 2 popular form with 73% of those recommending nystatin considering this formulation. about 24% of those suggesting nystatin would consider recommending it in the form of an ointment. only 3% was observed for pastille form of nystatin suggestion. however, capsules were the most common form of fluconazole considered for recommendation (91%). a lozenge form of amphotericin drug was recommended by the participants more than oral suspension form (as shown in fig. 2). only 6% of respondents cited other treatment options, which included clotrimazole, terbinafine, econazole triamcinolone, and anginovag spray. 4. discussion the present study investigated the currently antifungal drugs recommendation at pharmacies in sulaimani city, iraq, in relation to the sociodemographic details as illustrated in a study by martínez-beneyto et al. [19]. the previous studies similar to this kind in the united kingdom and jordan were conducted; however, they were conducted among the general dental practitioners instead of pharmacists. the first study was undertaken in the uk in 1987 and reported in 1989 [20]. the second study that conducted in the uk reported in 2004 [21].furthermore, another study was undertaken in jordan in 2015 [22].in accordance with those studies like the present study, nystatin was the most popular antifungal agent recommended (70.3%). in addition, nystatin oral suspension was selected by 73% of the respondents who suggested nystatin. however, in this study, miconazole was recorded as one of the most frequently recommended antifungal agents also (70.3%). 
there has also been a visible increase in the proportion of participants recommending miconazole in the present survey compared to the previous studies, and it has now become more popular than amphotericin. in addition, miconazole and nystatin were also the commonly employed antifungals in studies that have been done by other researchers [19], [21], [22]. this is because these drugs may cause less intestinal irritation and other side effects. however, one of the limitations of using topical formulations of nystatin is high sucrose content, which may reduce the amount of practice in diabetes, steroid use, or an immunocompromised state [9]. the triazoles constitute fluconazole being suggested by 31.7% of the participants. fluconazole in the form of suspension and with different dosages has been used for the treatment of oropharyngeal candidiasis. the theoretical benefit of using topical fluconazole is that a higher concentration of the active drug is delivered to the oral mucosa without the untoward systemic side effects [23], [24]. however, most of the participants recommended capsule form of fluconazole 91% whereas only 9% of the respondents suggested oral suspension form of the drug. fluconazole oral suspension is fig. 2. different form of antifungal recommended by participants. table 2 choice of antifungal agents. numbers (%) of pharmacists choosing each antifungal (n=101) antifungal agentsa responses % of casesn (%) nystatin 71 (31.4) 70.3 amphotericin 12 (5.3) 11.9 fluconazole 32 (14.2) 31.7 chlorhexidine 32 (14.2) 31.7 miconazole oral gel 71 (31.4) 70.3 miconazole and hydrocortisone cream 8 (3.5) 7.9 total 226 (100) 223.8 adichotomy group tabulated at value 1 hezha o. rasul: current antifungal drug recommendations to treat oral thrush in sulaimani city-iraq uhd journal of science and technology | may 2018 | vol 2 | issue 2 5 administered in a dosage of 10 mg/ml aqueous suspension. various studies show that fluconazole is a very effective drug, and it has a rapid symptomatic response [25]. chlorhexidine mouth rinse formulations are widely used for decreasing the microbial burden in the oral cavity. for example, chlorhexidine gluconate with 0.2% concentration is used as an antiseptic oral rinse because of its activity against a broad range of oral microbial species including candida[26]. chlorhexidine should not be used simultaneously with nystatin as they interact and render each other ineffective, even though it is suggested as a practical addition to the antifungal agents [27]. in this study also, chlorhexidine was recommended by pharmacists and assistant pharmacist (31.7%) along with other antifungal agents as an adjunctive therapeutic agent. in this study, the result of amphotericin was less frequently recommended (11.9%), and 58% of the participants suggested lozenges form of the drug. this recommendation was very similar to the previous study which demonstrated by anand et al. [28]. miconazole in combination with hydrocortisone was recommended by 7.9% of the respondents. however, in general, the diagnosis of oral candidiasis is based on clinical features and symptoms in conjunction with a detailed medical history [29]. despite the above-mentioned results, this study has several limitations. the small sample size was the main limitation of this questionnaire. therefore, the future studies with larger sample size covering a wider data may provide better. 
furthermore, the possible improvement in the methodology could be the insertion of doctors’ recommendation and compare both results. differentiation between respondents recommending antifungals based on their knowledge or recommending it based on doctor’s prescription. 5. conclusion and recommendation in summary, nystatin and miconazole are the most popular antifungal agents prescribed in sulaimani city, iraq. there appears to be a trend toward the use of miconazole, particularly among more recent graduates. the majority of the participant suggested nystatin as a type of oral suspension and miconazole as an oral gel. we suggest that collecting more data in different cities concerning the use of antifungal drugs could turn into a strong motivation in the near future for the implementation of policies for prevention and treatment of oral thrush fungal infections. 6. acknowledgment the author would like to acknowledge the support obtained from all pharmacists and assistant pharmacists participated in this study. this work was supported by chemistry department in college of science at university of sulaimani. references [1] d. enoch. “invasive fungal infections: a review of epidemiology and management options”. journal of medical microbiology, vol. 55, no. 7, pp. 809-818, 2006. [2] p. eggimann, j. garbino and d. pittet. “epidemiology of candida species infections in critically ill non-immunosuppressed patients”. the lancet infectious diseases, vol. 3, no. 11, pp. 685-702, 2003. [3] m. tumbareloo, e. tacconelli, l. pagano, e. ortuabarbera, g. morace, r. cauda, g. leone and l. ortona. “comparative analysis of prognostic indicators of aspergillosis in haematological malignancies and hiv infection”. journal of infection, vol. 34, no. 1, pp. 55-60, 1997. [4] m. hudson. “antifungal resistance and over-the-counter availability in the uk: a current perspective”. journal of antimicrobial chemotherapy, vol. 48, no. 3, pp. 345-350, 2001. [5] s. sundriyal, r. sharma and r. jain. “current advances in antifungal targets and drug development”. current medicinal chemistry, vol. 13, no. 11, pp. 1321-1335, 2006. [6] j. maertens. “history of the development of azole derivatives”. clinical microbiology and infection, vol. 10, pp. 1-10, 2004. [7] g. thompson, j. cadena and t. patterson. “overview of antifungal agents”. clinics in chest medicine, vol. 30, no. 2, pp. 203-215, 2009. [8] a. melkoumov, m. goupil, f. louhichi, m. raymond, l. de repentigny and g. leclair. “nystatin nanosizing enhances in vitro and in vivo antifungal activity against candida albicans”. journal of antimicrobial chemotherapy, vol. 68, no. 9, pp. 2099-2105, 2013. [9] a. akpan. “oral candidiasis”. postgraduate medical journal, vol. 78, no. 922, pp. 455-459, 2002. [10] a. darwazeh and t. darwazeh. “what makes oral candidiasis recurrent infection? a clinical view”. journal of mycology, vol. 2014, pp. 1-5, 2014. [11] j. meis and p. verweij. “current management of fungal infections”. drugs, vol. 61, no. 1, pp. 13-25, 2001. [12] j. bagg. essentials of microbiology for dental students. oxford: oxford university press, 2006. [13] j. bolard. “how do the polyene macrolide antibiotics affect the cellular membrane properties?” biochimica et biophysica acta (bba) reviews on biomembranes, vol. 864, no. 3-4, pp. 257-304, 1986. [14] m. schäfer-korting, j. blechschmidt and h. korting. “clinical use of oral nystatin in the prevention of systemic candidosis in patients at particular risk”. mycoses, vol. 39, no. 9-10, pp. 329-339, 1996. [15] e. 
budtz-jörgensen and t. lombardi. “antifungal therapy in the oral cavity”. periodontology 2000, vol. 10, no. 1, pp. 89-106, 1996. [16] m. kathiravan, a. salake, a. chothe, p. dudhe, r. watode, m. mukta and s. gadhwe. “the biology and chemistry of antifungal agents”. bioorganic and medicinal chemistry, vol. 20, pp. 56785698, 2012. hezha o. rasul: current antifungal drug recommendations to treat oral thrush in sulaimani city-iraq 6 uhd journal of science and technology | may 2018 | vol 2 | issue 2 [17] m. martin. “the use of fluconazole and itraconazole in the treatment of candida albicans infections: a review”. journal of antimicrobial chemotherapy, vol. 44, no. 4, pp. 429-437, 1999. [18] t. meiller, j. kelley, m. jabra-rizk, l. depaola, a. baqui and w. falkler. “in vitro studies of the efficacy of antimicrobials against fungi”. oral surgery, oral medicine, oral pathology, oral radiology, and endodontology, vol. 91, no. 6, pp. 663-670, 2001. [19] y. martãnez-beneyto, p. lã³pez-jornet, a. velandrino-nicolã¡s and v. jornet-garcãa. “use of antifungal agents for oral candidiasis: results of a national survey”. international journal of dental hygiene, vol. 8, no. 1, pp. 47-52, 2010. [20] m. lewis, c. meechan, t. macfarlane, p. lamey and e. kay. “presentation and antimicrobial treatment of acute orofacial infections in general dental practice”. british dental journal, vol. 166, no. 2, pp. 41-45, 1989. [21] r. oliver, h. dhaliwal, e. theaker and m. pemberton. “patterns of antifungal prescribing in general dental practice”. british dental journal, vol. 196, no. 11, pp. 701-703, 2004. [22] m. al-shayyab, o. abu-hammad, m. al-omiri and n. dar-odeh. “antifungal prescribing pattern and attitude towards the treatment of oral candidiasis among dentists in jordan”. international dental journal, vol. 65, no. 4, pp. 216-226, 2015. [23] j. epstein, m. gorsky and j. caldwell. “fluconazole mouthrinses for oral candidiasis in postirradiation, transplant, and other patients”. oral surgery, oral medicine, oral pathology, oral radiology, and endodontology, vol. 93, no. 6, pp. 671-675, 2002. [24] m. martins. “fluconazole suspension for oropharyngeal candidiasis unresponsive to tablets”. annals of internal medicine, vol. 126, no. 4, p. 332, 1997. [25] c. garcia-cuesta, m. sarrion-perez and j. bagan. “current treatment of oral candidiasis: a literature review”. journal of clinical and experimental dentistry, pp. vol. 6, no. 5, e576-e582, 2014. [26] a. salem, d. adams, h. newman and l. rawle. “antimicrobial properties of 2 aliphatic amines and chlorhexidine in vitro and in saliva”. journal of clinical periodontology, vol. 14, no. 1, pp. 44-47, 1987. [27] p. barkvoll and a. attramadal. “effect of nystatin and chlorhexidine digluconate on candida albicans”. oral surgery, oral medicine, oral pathology, vol. 67, no. 3, pp. 279-281, 1989. [28] a. anand, m. ambooken, j. mathew, k. harish kumar, k. vidya and l. koshy. “antifungal-prescribing pattern and attitude toward the treatment of oral candidiasis among dentists in and around kothamangalam in kerala: a survey”. indian journal of multidisciplinary dentistry, vol. 6, no. 2, pp. 77, 2016. [29] h. terai, t. ueno, y. suwa, m. omori, k. yamamoto and s. kasuya. “candida is a protractive factor of chronic oral ulcers among usual outpatients”. japanese dental science review, vol. 54, no. 2, pp. 52-58, 2018. tx_1~abs:at/tx_2:abs~at uhd journal of science and technology | jan 2023 | vol 7 | issue 1 1 1. 
introduction vitamin d is one of the fat-soluble compounds that are divided into two for ms erg ocalciferol (d 2 ) and cholecalciferol (d 3 ) in relation to human health. vitamin d 2 is derived from the diet, such as cod liver oil and fatty fish while d 3 is synthesized in the skin from its precursor as exposed to ultraviolet irradiation [1]. vitamin d in the human body is converted to 25-hydroxy vitamin d (25(oh)d) which is a storage and circulating form of vitamin d, and then to an active form (1,25-dihydroxy vitamin d) by liver and kidney enzymes [2]. the classical function of vitamin d is enhancing calcium absorption from the gut to maintain optimum calcium and phosphorus concentration in the blood, which is required to maintain many physiological functions such as muscle contraction, blood clotting, and enzyme activation [3]. other biological activities of vitamin d have been proposed by different studies, including enhancing insulin production, responding to many immune and inflammatory triggers, and cell growth and differentiation [4]. over the last decades, huge numbers of articles have been published worldwide, confirming several vitamin d health benefits [5]. the action of vitamin d during pregnancy is still under study; however, vitamin d is an essential element for the development of healthy fetal bone during pregnancy [5]. vitamin d deficiency in pregnant women increases the risk of gestational diabetes mellitus and preeclampsia for the mother and increases the chances of being small for gestational age, neonatal rickets, and tetany prevalence of vitamin d deficiency among pregnant women in sulaimaneyah city-iraq hasan qader sofihussein department of pharmacy, sulaimani polytechnic university, sulaimani technical institutes, iraq a b s t r a c t hypovitaminosis d during pregnancy has a negative impact on the mother and infant’s health status. the main source of vitamin d is sunshine and ultraviolet b for most humans and food sources are often inadequate. the present work has been carried out to demonstrate the prevalence of vitamin d deficiency among pregnant women in the sulaimaneyah city/ kurdistan region of iraq. serum samples were collected from 261 pregnant women who attended the teaching maternity hospital and met inclusion criteria and were examined for 25-hydroxyvitamin d using the roche elecsys vitamin d 3 assay. different information included, including sociodemography, body mass index, and obstetric history, was collected using a specific questionnaire form. the study showed a high prevalence of hypovitaminosis d (71.3%) among pregnant women. high socioeconomic classes, blood group a-, and advanced gestational age have been significantly associated with higher vitamin d levels. vitamin d deficiency is prevalent in pregnant women in sulaimani city. because of the many risk factors of vitamin d deficiency and a series of health consequences, the government needs to take a step to address the problem, including raising awareness among the community about the burden of the situation and how to increase obtaining optimum vitamin d from different sources. index terms: vitamin d, pregnant women, hypovitaminosis d, sulaimaniyah corresponding author’s e-mail:  hasan.sofi@spu.edu.iq received: 28-09-2022 accepted: 13-11-2022 published: 02-01-2023 access this article online doi: 10.21928/uhdjst.v7n1y2023.1-6 e-issn: 2521-4217 p-issn: 2521-4209 copyright © 2022 sofihussein. 
this is an open access article distributed under the creative commons attribution non-commercial no derivatives license 4.0 (cc by-nc-nd 4.0) o r i g i n a l re se a rc h a rt i c l e uhd journal of science and technology sofihussein: vitamins d deficiency 2 uhd journal of science and technology | jan 2023 | vol 7 | issue 1 for offspring [6], [7]. several studies reported vitamin d deficiency in countries with plenty of sunshine for the majority of the time of the year such as india and saudi arabia [8], [9]. for the majority of people, getting exposure to sunshine between 09.00 am and 03.00 pm (depending on solar time) can be considered the main source of vitamin d [10]. although in high altitudes because of elevation in solar angle and ambient uvb levels are mostly low, getting an optimal vitamin d from the sunshine is unworkable, especially in the cooler season [11]. a high prevalence of vitamin d deficiency has been reported among pregnant chinese women [11]. vitamin d deficiency occurs as a result of long-term inadequate intake of vitamin d from food sources, impaired vitamin d absorption from the intestine, liver, or kidney diseases, which affect the metabolism of vitamin d to its active form and inadequate sun exposure. the vast majority of these cases can be corrected by determining underpinning factors associated with vitamin d deficiency during pregnancy [12]. studies concluded that taking vitamin d supplementation during pregnancy must be considered to protect pregnant women and offspring from complications due to vitamin d deficiency, [13]. in some countries, vitamin d supplementation is offered for free for pregnant women, unfortunately, it is not available for pregnant women in iraq. the present study was carried out to explore the prevalence of vitamin d deficiency among a group of pregnant women who were assumed to be a representative group of pregnant women in sulaimani city. moreover, the study also will try to investigate the association between vitamin d level age, body mass index, and blood groups. 2. methods 2.1. study design and population the design of the present work is a cross-sectional study carried out from december 2018 to february 2019. prespecified inclusion criteria include pregnant women with a gestational age of more than 24 weeks and not on vitamin d supplements even before pregnancy. furthermore, women with a pre-pregnancy bmi of more than 35 and pregnant age more than 40-years-old were excluded from the study. the study samples were drowned by a systematic random sampling method from all patients who met inclusion criteria and visited the antenatal care unit in the maternity teaching hospital in sulaimani city. totally, 261 pregnant women were successfully recruited to participate in the current crosssectional study. 2.2. data collection trained persons collected data using face-to-face interviews. the questionnaire was divided into three main parts 1, sociodemographic data such as age, address, occupation, and income. 2, obstetric history, such as gravidity, and parity 3, dietary history, such as the quantity of routine milk and fish consumption recorded. outdoor activity and exercise were considered. sun exposure was defined as exposure to sunshine directly with uncovering body parts and not behind windows. to control some confounding factors, which have an effect on the vitamin d level, this study excludes pregnant women with high bmi (more than 35 kg/m2), liver and kidney disease, and fat malabsorption disorders. 
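as a rough illustration of how the stated inclusion and exclusion criteria could be applied to a participant register before sampling, the sketch below uses pandas; the column names, the register itself, and the sampling step are hypothetical and are not taken from the study.

```python
import pandas as pd

def select_participants(register: pd.DataFrame, target_n: int = 261) -> pd.DataFrame:
    """Apply the study's stated criteria to a hypothetical participant register."""
    eligible = register[
        (register["gestational_age_weeks"] > 24)        # inclusion: gestational age > 24 weeks
        & (~register["on_vitamin_d_supplement"])        # inclusion: not on vitamin D supplements
        & (register["pre_pregnancy_bmi"] <= 35)         # exclusion: pre-pregnancy BMI > 35
        & (register["age_years"] <= 40)                 # exclusion: age > 40 years
        & (~register["liver_or_kidney_disease"])        # exclusion: liver or kidney disease
        & (~register["fat_malabsorption"])              # exclusion: fat malabsorption disorders
    ]
    # systematic sampling: keep every k-th eligible visitor up to the target size
    k = max(len(eligible) // target_n, 1)
    return eligible.iloc[::k].head(target_n)
```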
the blood sample was taken from each eligible pregnant woman and centrifuged at 5000 rpm for 5 min; the serum was then separated and stored at −80°c in a deep freezer until it was used for measuring serum 25-hydroxyvitamin d (25(oh)d). the serum vitamin d level was measured on a roche cobas e411 immunoassay analyzer using the roche elecsys vitamin d 3 assay (roche diagnostics, mannheim, germany). a serum level of <20 ng/ml was considered vitamin d deficiency, between 20 ng/ml and 30 ng/ml was considered insufficiency, and more than 30 ng/ml was regarded as the optimal level. content validity was determined through a panel of 12 experts, and reliability was assessed using a correlation coefficient of 0.884, which was statistically adequate. a pilot study was conducted with 20 pregnant women who attended the maternity teaching hospital. 3. results in total, 261 pregnant women were recruited for the present study. more than 93% of the participants were aged between 20 and 40 years, 3.4% were under 20 years old, and the rest (1.3%) were above 40 years. more than 44% of the pregnant women were classified as overweight according to the body mass index category, 32.6% had normal weight, 16.1% were categorized as obese, 5% had morbid obesity, and 1.3% were underweight. the majority of the participants had an o+ blood group (39.1%); in addition, 25.3% were a+ and the rest had other blood groups. the majority of the pregnant women (77.8%) identified themselves as housewives. nearly half of the participants (46.3%) had graduated from secondary school and only 29.1% had postgraduate degrees. two hundred and eighteen (83.5%) of the 261 participants were from the urban area of sulaimani city (table 1). table 1 shows the demographic data expressed as number (%) and median; the chi-square test was used for categorical variables and the t-test for continuous variables, and differences were considered statistically significant at p < 0.05 (bmi: body mass index). more than 70% of the cases got married at ages 20–29 years. the majority of the participants were in the second (55.9%) or third (43.0%) trimester of pregnancy, and only 1.1% had a gestational age of fewer than 20 weeks. of the 261 participants, 170 (65.1%) practised hijab (fully covered clothing) and 34.9% wore partly covering clothes. about 67.5% of the participants had more than one pregnancy and 32.5% were primigravida. the majority of the pregnant women were primipara (77.3%) and 22.7% had a history of more than one childbirth (table 2). table 2 shows the distribution of the study sample according to reproductive history, expressed as number (%) and median; the chi-square test was used for categorical variables and the t-test for continuous variables, with differences considered statistically significant at p < 0.05. the results showed a high prevalence of vitamin d deficiency among pregnant women (71.3%); a further 18.0% had insufficient levels (mean = 24.46 ng/ml, sd = 2.80), and 10.7% of the participants had sufficient serum levels of 25-hydroxyvitamin d (mean = 48.29 ng/ml, sd = 20.12) (table 3). table 3 shows the serum 25(oh)d levels expressed as frequency, percent, and mean.
vitamin d <20 ng/ml was considered deficient, 20–30 ng/ml insufficient, and above 30 ng/ml optimal; differences were considered statistically significant at p < 0.05. according to table 4, the mean vitamin d level was almost the same across the age groups (<20 years = 16.4, 20–29 years = 16.9, 30–39 years = 16.09), with the exception of ages above 40 years, which was 26.4 ± 23.2. likewise, the positive blood groups had similar mean serum vitamin d levels (a+ = 19.36, b+ = 15.70, ab+ = 18.70, o+ = 14.60). higher vitamin d levels were seen among participants with blood group a– (mean = 33.53 ng/ml, sd = 38.7); the b– and o– blood groups had means of 8.52 ± 1.60 and 11.58 ± 8.06, respectively. a significant association was found between blood group and vitamin d status (p = 0.009). furthermore, the results showed that participants with higher socioeconomic status had higher vitamin d levels, with a significant association (p = 0.007). there were no significant differences in vitamin d status among participants with different bmi, and no significant differences in vitamin d levels between pregnant women with different employment status, educational levels, or residency. table 4 demonstrates the association between serum vitamin d levels and sociodemographic variables; differences were considered statistically significant at p < 0.05. as shown in table 5, no significant association was found between serum vitamin d levels and age at marriage. a significant association was found between gestational age and vitamin d status (p = 0.000); higher gestational age was associated with higher vitamin d levels. pregnant women with partly covered clothes had significantly higher vitamin d concentrations (mean = 19.04 ± 18.16, p = 0.049). vitamin d levels did not show any significant correlation with gravida or para, and the type of delivery had no impact on the vitamin d level.

table 1: distribution of the study sample according to sociodemographic characteristics (n = 261)
variable                         frequency   percent
age (mean 28.8±4.96 years)
  <20 years                      9           3.4
  20–29 years                    127         48.6
  30–39 years                    122         46.7
  40 years and more              3           1.3
blood group
  a+                             66          25.3
  b+                             50          19.2
  ab+                            21          8.0
  o+                             102         39.1
  a–                             7           2.7
  b–                             4           1.5
  ab–                            0           0.0
  o–                             11          4.2
bmi (mean 26.64±4.62 kg/m2)
  underweight                    4           1.5
  normal                         85          32.6
  overweight                     117         44.8
  obese                          42          16.1
  morbid obese                   13          5.0
occupation
  employee                       58          22.2
  non-employed                   203         77.8
educational status
  illiterate                     6           2.3
  read and write                 12          4.6
  primary school graduate        44          16.9
  secondary school graduate      121         46.3
  postgraduate                   76          29.1
  others                         2           0.8
residency
  urban                          218         83.5
  sub-urban                      37          14.2
  rural                          6           2.3

table 3: vitamin d distribution
class          frequency   percent   mean    sd      95% ci for mean   minimum   maximum
deficient      186         71.3      9.91    4.91    9.20–10.62        0.0       19.80
insufficient   47          18.0      24.46   2.80    23.64–25.28       20.50     29.8
sufficient     28          10.7      48.29   20.12   40.49–56.09       30.90     98.0
total          261         100.0
table 4: the association of vitamin d status with sociodemographic data
variable                         mean±sd        std. error   f-test   p-value   significance
age
  <20 years                      16.4±9.48      3.16         0.731    0.534     not significant
  20–29 years                    16.9±16.8      1.49
  30–39 years                    16.09±11.8     1.07
  40 years and more              28.4±23.2      13.37
blood group
  a+                             19.36±17.7     2.19         2.910    0.009     significant
  b+                             15.70±12.6     1.79
  ab+                            18.70±12.5     2.72
  o+                             14.60±10.0     0.99
  a–                             33.53±38.7     14.66
  b–                             8.52±1.60      0.80
  ab–                            0              0
  o–                             11.98±8.06     2.43
socioeconomic status
  low class                      13.29±10.3     1.52         5.09     0.007     significant
  middle class                   16.63±14.6     1.04
  high class                     26.57±19.1     4.78
bmi
  underweight                    13.42±10.8     5.44         0.468    0.759     not significant
  normal                         16.94±17.6     1.91
  overweight                     17.31±14.07    1.29
  obese                          15.95±10.69    1.67
  morbid obese                   12.07±6.56     1.81
employment
  employee                       14.08±9.58     1.29         -1.529   0.127     not significant
  non-employed                   17.38±15.5     1.09
educational status
  illiterate                     17.18±18.7     7.66         0.578    0.717     not significant
  read and write                 19.80±9.6      2.77
  primary school graduate        15.53±17.1     2.58
  secondary school graduate      16.05±14.0     1.27
  high education                 18.01±14.3     1.64
  others                         5.29±0.55      0.39
residency
  urban                          17.24±14.9     1.01         1.862    0.157     not significant
  sub-urban                      12.56±10.8     1.78
  rural                          20.4±17.7      7.22

table 2: distribution of the study sample according to reproductive history
variable                                  frequency   percent
age at marriage (mean 22.18±4.14 years)
  less than 20 years                      62          23.7
  20–29 years                             183         70.1
  30 years and over                       16          6.2
gestational age (mean 29.6±4.39 weeks)
  less than 20 weeks                      3           1.1
  20–29 weeks                             146         55.9
  30–39 weeks                             112         43.0
dressing
  partly covered                          91          34.9
  fully covered                           170         65.1
gravida
  equal to one                            85          32.5
  more than one                           176         67.5
para
  one and less                            202         77.3
  more than one                           59          22.7

in this group of participants, vitamin d levels significantly increased as the pregnancy progressed (p = 0.000). likewise, pregnant women with partly covered clothes had significantly higher vitamin d levels (p = 0.049) (table 5). 4. discussion nowadays, vitamin d attracts the attention of many researchers, as many studies have elucidated its role in various mechanisms in the body. the serum 25(oh)d level can precisely measure vitamin d status because it reflects both exogenous intake and endogenous production. this work can be regarded as the first study conducted in sulaimani city, iraq, focusing on the prevalence of vitamin d deficiency among pregnant women. the rate of vitamin d deficiency among the pregnant women included in the present study was very high (71.3%), and an optimal vitamin d level (25(oh)d ≥30 ng/ml) was observed in only 10.7% of pregnant women. similar observations were reported in several studies carried out among south asian pregnant women [8], [9], [11]. exposing the skin to ultraviolet b can be considered the main source of vitamin d; optimal vitamin d levels would therefore be expected among people who live in countries at or near the equator, but this expectation is not supported by the results of studies. despite plenty of sunshine, vitamin d deficiency is highly prevalent in this area. several factors have a significant impact on vitamin d synthesis, including geographical region, season, time of day, weather, air pollution, skin pigmentation, and the use of sunscreen [14]. a number of these factors may apply to this region: because of cultural and religious beliefs, most body parts are covered with clothes, which may partly limit skin exposure to sunlight and negatively affect vitamin d synthesis.
although there are few studies in the region, similar observations were reported in saudi arabia, where a relatively high prevalence of vitamin d deficiency was recorded among the whole population and among women, both pregnant and non-pregnant [15], [16]. given the closeness of culture, region, and beliefs, these results support our conclusions and interpretation of vitamin d deficiency in sulaimani city. this signifies that a sunny climate does not automatically provide optimum vitamin d for residents. in this study, serum vitamin d levels were significantly higher among pregnant women who do not practise hijab (covering all body parts except the face and hands). in one study, participants were divided into three groups: 1. receiving only dietary advice for vitamin d from a healthcare professional, 2. taking vitamin d supplementation along with dietary advice, and 3. receiving a combination of dietary advice, supplementation, and exercise in a sports centre. the results showed a negligible change in serum vitamin d in the first group, a 70% rise in the second group, and a 300% increase over baseline in the third group [16]. unfortunately, outdoor exercise and activity are not common among women in the region, which may be another critical reason behind the widespread vitamin d deficiency. a strong inverse relationship has been observed between vitamin d levels and obesity [17], [18]. obesity (bmi ≥30) may increase the risk of vitamin d deficiency because increased subcutaneous fat sequesters more vitamin d and changes its release into the bloodstream [19]. in contrast, no relationship between body mass index and serum vitamin d concentration was observed in this study. higher levels of vitamin d were seen among participants with blood group a– and lower levels among participants with blood group b–. furthermore, participants with higher socioeconomic status had significantly higher serum levels of vitamin d; it has been demonstrated that the diet of women with low socioeconomic status is high in phytate and low in calcium, leading to an increased demand for vitamin d. the exact duration of sun exposure needed to obtain optimum vitamin d levels has not been established, because the amount of vitamin d a person can obtain differs with latitude, season, skin pigmentation, and age. however, some studies recommend that, to obtain optimal vitamin d through sunlight, the skin (face, arms, legs, or back) should be exposed to direct sunshine twice a week for 30 min between 10:00 am and 03:00 pm [20], [21].

table 5: the association of vitamin d status with reproductive history
variable                         mean±sd        std. error   f-test   p-value   significance
age at marriage
  <20 years                      18.90±19.6     2.49         0.991    0.373     not significant
  20–29 years                    15.89±12.7     0.93
  30 years and over              16.64±10.8     2.71
gestational age
  <20 weeks                      8.05±1.85      1.07         2.063    0.000     significant
  20–29 weeks                    17.37±15.8     1.30
  30–39 weeks                    15.93±12.8     1.21
clothing
  partly covered                 19.04±18.16    1.90         1.99     0.049     significant
  fully covered                  15.36±12.06    0.92
gravida
  equal to one                   16.05±14.9     1.61         -0.460   0.646     not significant
  more than one                  16.94±14.4     1.08
para
  equal to one                   17.40±15.4     1.08         1.550    0.122     not significant
  more than one                  14.07±10.6     1.39
type of delivery
  normal vaginal delivery        15.23±10.4     1.17         0.391    0.677     not significant
  assisted delivery              15.24±14.8     3.98
  caesarean section              16.96±13.6     1.63
the dietary reference intake of vitamin d is 400 iu for women during pregnancy. 5. conclusion because of the combined risk factors for vitamin d insufficiency among pregnant women in this region, the government must inform the public about the magnitude of the problem and the impact of vitamin d deficiency on overall health. this can be accomplished by educating individuals about the benefits of receiving vitamin d from sunlight and offering free vitamin d supplementation through pre-conceptional counselling. because of the high frequency of vitamin d insufficiency in this region, as well as the huge impact of vitamin d deficiency on health status, the findings of this study should be regarded more seriously. at present, folic acid is the sole supplement offered to pregnant women in the sulaimani region’s prenatal care facility. references [1] b.w. hollis. circulating 25-hydroxyvitamin d levels indicative of vitamin d sufficiency: implications for establishing a new effective dietary intake recommendation for vitamin d. journal of nutrition, vol. 135, pp. 317-322, 2005. [2] b. w. hollis and c. l. wagner. vitamin d supplementation during pregnancy: improvements in birth outcomes and complications through direct genomic alteration. molecular and cellular endocrinology, vol. 453, pp. 113-130, 2017. [3] r. bouillon and t. suda. vitamin d: calcium and bone homeostasis during evolution. bonekey reports, vol. 3, p. 480, 2014. [4] h. wolden-kirk, c. gysemans, a. verstuyf, m. chantal. extraskeletal effects of vitamin d. endocrinology and metabolism clinics of north america, vol. 41, no. (3), pp. 571-594, 2012. [5] l. s. weinert and s. p. silveiro. maternal-fetal impact of vitamin d deficiency: a critical review. maternal and child health journal, vol. 19, no. 1, pp. 94-101, 2014. [6] n. principi, s. bianchini, e. baggi and s. esposito. implications of maternal vitamin d deficiency for the fetus, the neonate and the young infant. european journal of nutrition, vol. 52, no. 3, pp. 859-867, 2013. [7] e. e. delvin, b. l. salle, f. h. glorieux, p. adeleine and l. s. david. vitamin d supplementation during pregnancy: effect on neonatal calcium homeostasis. the journal of pediatrics, vol. 109, pp. 328-334, 1986. [8] n. a. al-faris. high prevalence of vitamin d deficiency among pregnant saudi women. nutrients, vol. 8, no. 2, 6-15, 2016. [9] h. j. w. farrant, g. v. krishnaveni, j. c. hill, b. j. boucher, d. j. fisher, k. noonan and c. osmond. vitamin d insufficiency is common in indian mothers but is not associated with gestational diabetes or variation in newborn size. european journal of clinical nutrition, vol. 63, no. 5, pp. 646-652, 2009. [10] a. r. webb and o. engelsen. calculated ultraviolet exposure levels for a healthy vitamin d status. photochemistry and photobiology, vol. 82, pp. 1697-1703, 2006. [11] c. yun, j. chen, y. he, d. mao, r. wang, y. zhang and x. yang. vitamin d deficiency prevalence and risk factors among pregnant chinese women. public health nutrition, vol. 20, no. 10, pp. 1746-1754, 2017. [12] a. dawodu and h. akinbi. vitamin d nutrition in pregnancy: current opinion. international journal of womens health, vol. 5, pp. 333343, 2013. [13] c. palacios, l. m. de-regil, l. k. lombardo and j. p. peña-rosas. vitamin d supplementation during pregnancy: an updated metaanalysis on maternal outcomes. journal of steroid biochemistry and molecular biology, vol. 164, pp. 148-155, 2016. [14] m. f. holick and t. c. chen. vitamin d deficiency: a worldwide problem with health consequences. 
the american journal of clinical nutrition, vol. 87, no. 4, pp. 1080-1086, 2008. [15] m. al-zoughool, a. alshehri, a. alqarni, a. alarfaj and w. tamimi. "vitamin d status of patients visiting health care centers in the coastal and inland cities of saudi arabia". journal of public health and development series, vol. 1, pp. 14-21, 2015. [16] m. tuffaha, c. el bcheraoui, f. daoud, h. a. al hussaini, f. alamri, m. al saeedi, m. basulaiman, z. a. memish, m. a. almazroa, a. a. al rabeeah and a. h. mokdad. "deficiencies under plenty of suns: vitamin d status among adults in the kingdom of saudi arabia, 2013". north american journal of medical sciences, vol. 7, pp. 467-475, 2015. [17] h. alfawaz, h. tamim, s. alharbi, s. aljaser and w. tamimi. "vitamin d status among patients visiting a tertiary care centre in riyadh, saudi arabia: a retrospective review of 3475 cases". bmc public health, vol. 14, p. 159, 2014. [18] a. h. al-elq, m. sadat-ali, h. a. al-turki, f. a. al-mulheim and a. k. al-ali. "is there a relationship between body mass index and serum vitamin d levels?" saudi medical journal, vol. 30, pp. 1542-1546, 2009. [19] s. konradsen, h. ag, f. lindberg, s. hexeberg and r. jorde. "serum 1,25-dihydroxy vitamin d is inversely associated with body mass index". european journal of nutrition, vol. 47, pp. 87-91, 2008. [20] m. f. holick, t. c. chen, z. lu and e. sauter. "vitamin d and skin physiology: a d-lightful story". journal of bone and mineral research, vol. 22, pp. v28-v33, 2007. [21] m. f. holick. "sunlight and vitamin d for bone health and prevention of autoimmune diseases, cancers and cardiovascular disease". the american journal of clinical nutrition, vol. 80, pp. 1678s-1688s, 2004.

performance analysis and prediction of student performance to build effective students using data mining techniques
sirwan m. aziz1, ardalan h. awlla2
1department of computer science, darbandikhan technical institute, spu, darbandikhan, kurdistan region, iraq, 2department of information technology, kurdistan technical institute, sulaimani heights, behind kurdsat tv, 46001 sulaimania, kurdistan region – iraq
abstract: in this period of computerization, schooling has also remodeled itself and is no longer restricted to the old lecture technique. the everyday quest is to discover better approaches to make it more successful and productive for students. these days, masses of data are gathered in educational databases; however, they remain unutilized. to obtain the required benefits from such a large amount of information, effective tools are required. data mining is a developing and capable tool for analysis and prediction. it has been applied effectively in fraud detection, marketing, promotion, forecasting, and loan assessment, but it is at an incipient stage in the area of education. in this paper, data mining techniques have been applied to construct a classification model to predict the performance of students. the cross-industry standard process for data mining was followed, and the decision tree algorithm was used as the main data mining tool to build the classification model. index terms: classification, data mining, decision tree, naïve bayes, student performance. doi: 10.21928/uhdjst.v3n2y2019.pp10-15, e-issn: 2521-4217, p-issn: 2521-4209. copyright © 2019 aziz and awlla; this is an open access article distributed under the creative commons attribution non-commercial no derivatives license 4.0 (cc by-nc-nd 4.0). review article, uhd journal of science and technology. corresponding author's e-mail: ardalan h. awlla, department of information technology, kurdistan technical institute, sulaimani heights, behind kurdsat tv, 46001 sulaimania, kurdistan region – iraq. e-mail: ardalan.awlla@kti.edu.krd. received: 10-05-2019; accepted: 10-06-2019; published: 20-06-2019.
1. introduction over the past decade, the number of higher education universities and institutes has multiplied manifold, and massive numbers of graduates and postgraduates are produced every year. universities and institutes may comply with quality standards in their pedagogy; nevertheless, they face the problems of dropout students, low achievers, and jobless graduates. understanding and analyzing the variables behind poor overall performance is a complex and continuous process, hidden in past and present data gathered from educational performance and students' behavior. effective tools are required to analyze and predict the performance of students scientifically. although universities and institutions gather a huge amount of student information, these data remain unutilized and do not inform any decision or policy making to enhance the performance of students. if universities could identify the circumstances behind low performance earlier and predict students' behavior, this knowledge would help them take proactive actions to improve the performance of such students. this would be a win for all stakeholders of universities and institutions, i.e., administration, educators, students, and parents. students would be able to identify their shortcomings in advance and improve themselves, and teachers would be in a position to plan their lectures according to the needs of students and give better direction to such students. data mining comprises a set of methods that can be used to extract relevant and interesting knowledge from data. data mining has numerous tasks, for instance, prediction, classification, association rule mining, and clustering. classification strategies are supervised learning procedures that assign data objects to predefined class labels; classification is one of the most useful data mining techniques for creating models from an input dataset, and the constructed models are then used to predict future data patterns. various algorithms are used for data classification, for instance, naïve bayes classifiers and decision trees. with classification, the created model can predict a class for given data based on what was learned from historical data (a small sketch of this train-then-predict workflow is given below).
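the following is a minimal sketch of that train-then-predict workflow, using scikit-learn rather than the weka toolchain actually used in this paper; the records and attribute values are invented for illustration only.

```python
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import CategoricalNB

# step 1 (learning): encode a few labelled records and fit two classifiers
train_x = [["never", "agree", "home"],                      # time, parent live, accommodation
           ["once a week", "disagree", "dormitory"],
           ["never", "agree", "home"],
           ["more than twice a week", "natural", "dormitory"]]
train_y = ["first trial", "second trial", "first trial", "second trial"]

encoder = OrdinalEncoder()
x_encoded = encoder.fit_transform(train_x)
tree = DecisionTreeClassifier(criterion="entropy").fit(x_encoded, train_y)
bayes = CategoricalNB().fit(x_encoded, train_y)

# step 2 (classification): predict the class of a new, unlabelled record
new_student = encoder.transform([["never", "agree", "dormitory"]])
print(tree.predict(new_student), bayes.predict(new_student))
```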
decision tree is a standout amongst the most utilized methods since it makes the decision tree from the records given utilizing clear conditions depending principally on the calculation of the gain ratio, which gives naturally a type of weights to attributes utilized, and the researcher can certainly distinguish the best attributes on the anticipated target. due to this procedure, a decision tree would be worked with classification rules created from it. another classification method is naïve bayes classifier that is utilized to predict a target class. it relies on in its calculations on probabilities, particularly bayesian theorem. due to this use, the outcome from this classifier is more precise and efficient, and more delicate to new data added to the dataset. investigation and prediction with the assistance of data mining systems have demonstrated imperative outcomes in the area of predicting consumer conduct, fraud detection, financial marketplace, loan assessment, intrusion detection, bankruptcy prediction, and forecast prediction. it may be extremely powerful in education system also. it is a very effective tool to uncover hidden patterns and valuable information, which otherwise may not be identified and hard to discover and recognize with the assistance of statistical techniques. in general, this paper tries to use data mining ideas, especially classification, to assist the universities and institutions directors and decision makers by assessing student’ data to think about the primary characteristics that may influence the student’ performance. this paper is organized as follows in section 2; literature review is discussed, in section 3 an entire detail of the study is introduced, in section 4 modeling and experiments are discussed, and in section 5 results and discussion presented. finally, section 6 presents our conclusions. 2. literature review all researches conducted previously, discover some huge areas in the education sector, where expectation by data mining has gained benefits; like, finding some students with weak points [1], select the points that students such as the exact course [2] evaluation of college [3], overall student evaluation [4], [5], class teaching language behavior [6], expecting students’ retraction [7], [8], plan for course registration [9], guessing the enrollment headcount [10], and cooperate activity evaluation [11]. some researchers indicate that there have been strong relationships between the student’s personality likings and their work characteristics [12]. it is detected that there is detailed expertise needed to have once graduates to gain occupation and that these expertise are important to academic education generally. characteristics such as sensitive cleverness, self-management development, and life work experience also are significant reasons for work development [13]. employers try to differentiate the highest and lowest importance with soft skill and academic reputation [14]. using machine learning techniques to predict the performance of a student in upcoming courses [15]. overview of the data mining techniques that have been used to predict students’ performance [16]. the performance of the students is predicted using the behaviors and results of previous passed out students [17]. 3. building the classification model the fundamental target of the planned methodology is to fabricate the classification model that tests certain attributes that may influence student performance. 
to achieve this goal, the cross-industry standard process for data mining (crisp-dm) was used to construct the classification model. it comprises six phases: business understanding, data understanding, data preparation, modeling, evaluation, and deployment, as seen in fig. 1. 3.1. data classification preliminaries in general, data classification is a two-step process. in the first step, known as the learning step, a model that describes the intended classes or concepts is constructed by examining a set of training instances; each instance is assumed to belong to a predefined class. in the second step, the model is tested using a different dataset to assess its classification accuracy. if the accuracy of the model is considered adequate, the model can be used to classify future data instances for which the class label is not known. ultimately, the model acts as a classifier within the decision-making process. many strategies can be used for classification, for instance, bayesian techniques, neural networks, rule-based algorithms, and decision trees. decision tree classifiers are well-known procedures because building the tree does not require any parameter setting or domain knowledge and is suitable for exploratory data discovery. a decision tree can deliver a model with rules that are comprehensible and interpretable, and it has the benefit of being simple for decision makers to explain, to match against their domain knowledge, and to use to justify their decisions. examples of decision tree classifiers are c4.5/c5.0/j48 and nbtree. the c4.5 method is a member of the decision tree family that can deliver both decision trees and rule sets, and it grows a tree so as to improve prediction accuracy; c4.5, c5.0, and j48 are among the most famous and effective decision tree classifiers. c4.5 builds an initial tree using a divide-and-conquer algorithm, and the complete description of the algorithm can be found in data mining and machine learning books, for example, c4.5: programs for machine learning. weka contains a collection of machine learning and data mining algorithms for data analysis and predictive modeling, together with a graphical user interface for simple access to these functions. it was developed at the university of waikato in new zealand and is written in java. weka provides tools for classification, regression, clustering, association rules, data pre-processing, and visualization. 3.2. data collection process and data understanding the idea of the paper was to apply a classification model for predicting performance using a dataset from a particular educational institute, so that other factors related to the studying environment, administration, conditions, and peers would have a comparable impact on all students, making the effect of the gathered attributes more evident and easier to classify. the data were collected from three different educational institutes. to gather the necessary data, a questionnaire was organized and delivered either by email or manually to the students of these institutions; it was then also shared on the web, to be filled in by students at any university or institution.
the survey was filled by 130 students, from the first, the second, the third institutions, and the rest from a few different institutions using the net questionnaire. in the questionnaire, several attributes have been asked that may expect the performance class. the rundown of the gathered attributes is presented in table 1. 3.3. data preparation after the surveys were gathered, the method of preparing the data was completed. first, the information inside the questionnaires has been conveyed to (arff) to be appropriate with the weka data mining tool. 3. 4. business understanding we have defined a classification model to predict if a student might show excellent performance. this issue is interesting since there are many universities/institutions interested in recognizing students with outstanding performance. for the input records for the prediction, the model use the data describing pupil conduct and the data defined student behavior as described in the previous table. the dataset includes 260 instances. our model class label is a binary attribute, which separated students passed from first attempt exam (label value 1), for the students passed from second attempt exam (label value 0).fig. 1. cross-industry standard process for data mining. aziz and awlla: using data mining to predict student performance uhd journal of science and technology | jul 2019 | vol 3 | issue 2 13 the fundamental usage of this model could identify wellperforming students on a course. individuals who ought to gain this model would be: 1. instructors, for the qualification of students who can work together with; 2. students, for checking if there is a requirement for more attempt to accomplish better outcomes; 3. business people, for early attractive with students who are probably going to end up outstanding on a selected subject. 4. modeling and experiment after the data had been arranged, the classification model has been created. utilizing the decision tree method on this technique, the gain ratio measure is used to signify the weight of influences of every attribute at the tested class, and thus, the ordering of tree nodes is specified. the results are discussed in the below section. referring to the analysis of earlier studies, and as defined in table (1), a set of attributes has been selected to be tested against their influence on student performance. these attributes consist of (1) personal information such as gender, love, sleeping, (2) education environment such as number of punishment, coming time to class, (3) parent information such as parent’s education levels, and parent’s financial levels. these attributes were used to predict student performance. three types of the technique have been applied to the dataset reachable to construct the classification model. the techniques are the naïve bayes classifier and decision tree with version id3 (j4.8 in weka). the experiment, accuracy was assessed using 10-folds pass-validation, and hold-out technique. table 2 shows the accuracy rates for each of those techniques. the time attributes, which is student attendance to the class, have the maximum gain ratio, which made it the starting node and most efficient attribute. other attributes cooperate inside the decision tree were parent lives, which is student’s parent live, father, parent education, study, and accommodation. rest of other attributes appeared in other parts of the decision tree. 
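for reference, the gain ratio that c4.5-style trees use to rank attributes is the information gain of a split divided by its split information; the function below computes it for a categorical attribute. this is a generic illustration of the standard c4.5 measure, not code from the paper, and the toy records are invented.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def gain_ratio(rows, attribute, target):
    """C4.5 gain ratio of a categorical `attribute` with respect to the `target` class."""
    class_labels = [row[target] for row in rows]
    groups = {}                               # partition the class labels by attribute value
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    n = len(rows)
    info_gain = entropy(class_labels) - sum(len(g) / n * entropy(g) for g in groups.values())
    split_info = entropy([row[attribute] for row in rows])  # intrinsic information of the split
    return info_gain / split_info if split_info > 0 else 0.0

# invented records: attendance separates the classes, so it earns the higher gain ratio
rows = [{"time": "never", "parent live": "agree", "pass": "yes"},
        {"time": "never", "parent live": "disagree", "pass": "yes"},
        {"time": "once a week", "parent live": "agree", "pass": "no"},
        {"time": "more than twice a week", "parent live": "disagree", "pass": "no"}]
print(gain_ratio(rows, "time", "pass"), gain_ratio(rows, "parent live", "pass"))
```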
the tree demonstrated that each of these attributes has some impact on student performance, but the most influential attributes were time, parent live, father, accommodation, and parent education, as seen in fig. 2, according to the dataset collected from the three institutes and universities. in other words, students who are never late to class, whose parents live together harmoniously, whose father is alive, who stay at home rather than in a dormitory, and whose parents are educated tend to pass the exam on the first attempt. the death of a mother also affects a student's performance, but in iraqi kurdistan a father's death affects performance more, because fathers are usually the main financial providers for the family. where love (a romantic relationship) is considered, students who are not in a relationship perform better than those who are. furthermore, the home study (homework) attribute does not have a large effect: if a student does not have a good environment, then no matter how many hours he or she studies, it does not change performance much, as shown in fig. 3. the tree produced using the c4.5 algorithm showed that the time attribute is the most influential attribute. the naïve bayes classifier does not expose the weight of each attribute incorporated into the classification; however, it was used for comparison with the results generated by c4.5. as shown in table 2, the accuracy percentages range from about 36% to 45%, which is low.

table 1: description of attributes used for predicting student performance
attribute          description                                                          possible values
gender             student's gender                                                     male, female
time               coming time to class                                                 never, once a week, twice a week, more than twice a week
punishment         number of punishments                                                i have never been punished, about twice, more than 3 times, very often
family             total number of family members                                       between 3 and 5, between 6 and 10, more than 10
parent live        my father and mother live harmoniously                               strongly agree, agree, natural, disagree
parent education   parents' education levels                                            up to university, up to diploma, up to secondary school, up to primary, did not go to school
parent financial   parents' financial levels                                            strongly agree, agree, natural, disagree
environment        the community around supports building of classrooms, library, toilets, etc.   strongly agree, agree, natural, disagree
encouragement      our teachers inspire us to work hard                                 strongly agree, agree, natural, disagree
absent             our teachers are never absent without a good reason                  strongly agree, agree, natural, disagree
help               our teachers are available and willing to assist us in our studies   strongly agree, agree, natural, disagree
father             is your father alive?                                                yes, no
mother             is your mother alive?                                                yes, no
love               do you have a romantic relationship?                                 yes, no
accommodation      do you stay at home or in a dormitory?                               home, dormitory
work               are you working alongside your study?                                yes, no
study              how many hours do you study per day?                                 about 1 h, about 2 h, about 3 h, more than 3 h
sleeping           are you sleeping well?                                               yes, no
pass               did you pass on the first trial or the second trial?                 yes, no

table 2: accuracy rate for predicting performance
method        10-fold cross-validation (%)   hold-out (60%)
c4.5 (j4.8)   42.3                           48.1
naïve bayes   40.7                           44.2
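the two evaluation protocols reported in table 2 (10-fold cross-validation and a 60% hold-out split) can be reproduced along the following lines; the sketch uses scikit-learn rather than weka, and the synthetic data merely stand in for the encoded questionnaire, so the printed accuracies are not the paper's results.

```python
import numpy as np
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import CategoricalNB
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
x = rng.integers(0, 4, size=(260, 18))   # 260 students, 18 ordinal-encoded attributes (synthetic)
y = rng.integers(0, 2, size=260)         # 1 = passed on the first attempt, 0 = second attempt

for name, clf in [("c4.5-like tree", DecisionTreeClassifier(criterion="entropy", random_state=0)),
                  ("naive bayes", CategoricalNB())]:
    cv_accuracy = cross_val_score(clf, x, y, cv=10).mean()            # 10-fold cross-validation
    x_tr, x_te, y_tr, y_te = train_test_split(x, y, train_size=0.6,   # 60% hold-out split
                                              random_state=0, stratify=y)
    holdout_accuracy = accuracy_score(y_te, clf.fit(x_tr, y_tr).predict(x_te))
    print(f"{name}: 10-fold cv = {cv_accuracy:.1%}, hold-out = {holdout_accuracy:.1%}")
```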
due to deep of the tree produced by j4.8 in weka, the visualization tree image is not clear here, we could show only a part of it, but if anyone is interested, they can download the dataset from the link [25] to do the experiment in weka. 5. results and discussion the study has shown that numerous elements may have a high impact on students’ performance. a standout among the best is the student attendance in class. various family factors also seemed to have an influence on the students’ performance. a parent living together is one of the greatest positive factors in performance. it means if students’ parents live within a good relationship, students’ performance also increase because experiment fig. 2. a decision tree generated by the c4.5 algorithm for predicting performance. fig. 3. high impact factors on students’ performance. aziz and awlla: using data mining to predict student performance uhd journal of science and technology | jul 2019 | vol 3 | issue 2 15 indicates that those students whose parents live together harmoniously are passed in the first trial exam. in addition, some other attributes after time and parents life attributes are father and mother education. if a student is never late to class, their parents are living together, their father is not dead and their parents have bachelor degree are in the second rank passed in the first trail, as explained before here in our community in iraqi kurdistan, fathers usually take financial responsibility of family not mothers, it means students do not need to work, otherwise students should work to pay for their life. the rank attribute has shown an interesting influence on performance; it was not included as a high-efficiency factor. it was noticed in the experiment, the study factor. this is natural as no matter how long a student might study or prepare himself if they are not living in a good and secure house; they still perform very poorly in the exams. 6. conclusion and future work this paper has focused on the probability of constructing a classification model for predicting student performance. numerous attributes had been tested, and a number of them are found powerful on the performance prediction. the student attendance in class was the strongest attribute, then the parent living together harmoniously, father and mother education level, with the moderate impact of student performance. the student punishment, sleeping hours, and family members did not show any clear effect on student performance while the no love relationship, parent strong financial status and student encouragement to study beside teachers have shown some effect for predicting the student performance. for universities and institutes, this model, or an enhanced one, can be utilized in predicting the newly applicant student performance. as future work, it is recommended to gather more appropriate data from several universities and institutions to have the right performance rate for students. when the proper model is collected, the software could be created to be used by the universities and institutions, including the rules generated for foreseeing the performance of students. references [1] a. hicheur, a. cairns, m. fhima and b. gueni. “towards customdesigned professional training contents and curriculums through educational process mining.” immm; 2014. the fourth international conference on advances in information mining and management, 2014. [2] b. n. a. abu, a. mustapha and k. nasir. 
“clustering analysis for empowering skills in graduate employability model.” australian journal of basic and applied sciences, vol. 7, no. 14, pp. 21-28, 2013. [3] p. k. srimani and malini m. patil. “a classification model for edumining”. psrc-icics conference proceedings, 2012. [4] y. he and z. shunli. “application of data mining on students’ quality evaluation. intelligent systems and applications (isa)”. 2011 3rd international workshop on. ieee, 2011. [5] s. yoshitaka, s. tsuruta and r. knauf. “success chances estimation of university curricula based on educational history, self-estimated intellectual traits and vocational ambitions”. advanced learning technologies (icalt). 2011 11th ieee international conference on. ieee, 2011. [6] p. u. kumar and s. pal. “a data mining view on class room teaching language.” international journal of computer science, vol. 8, no. 2, pp. 277-282, 2011. [7] v. dorien, n. de cuyper, e. peeters and h. de witte. “defining perceived employability: a psychological approach.” personnel review, vol. 43, no. 4, pp. 592-605, 2014. [8] a. s. svetlana, d. zhang and m. lu. “enrollment prediction through data mining”. information reuse and integration, 2006 ieee international conference on. ieee, 2006. [9] p. a. alejandro. “educational data mining: a survey and a data mining-based analysis of recent works.” expert systems with applications, vol. 41, no. 4, pp. 1432-1462, 2014. [10] e. a. s. bagley. “stop talking and type: mentoring in a virtual and face-to-face environmental education environment.” ph. d thesis. university of wisconsin-madison, madison, 2011. [11] j. bangsuk and c. f. tsai. “the application of data mining to build classification model for predicting graduate employment.” international journal of computer science and information security, vol. 10, pp. 1-7, 2013. [12] m. backenköhler and v. wolf. “student performance prediction and optimal course selection: an mdp approach” international conference on software engineering and formal methods, pp. 40-47, 2017. [13] a. m. shahiri, w. husainand and n. a. rashid. “a review on predicting student’s performance using data mining techniques.” procedia computer science, vol. 72, pp. 414-422, 2015. [14] p. shruthi and b. p. chaitra. “student performance prediction in education sector using data mining” international journal of advanced research in computer science and software engineering, vol. 6, no. 3, pp. 212-218, 2016. [15] p. l. dacre, p. qualter and p. j. sewell. “exploring the factor structure of the career edge employability development profile.” education training, vol. 56, no. 4, pp. 303-313. [16] s. saranya, r. ayyappan and n. kumar. “student progress analysis and educational institutional growth prognosis using data mining.” international journal of engineering sciences and research technology, vol. 3, pp. 1982-1987, 2014. [17] a. e. poropat. “a meta-analysis of the five-factor model of personality and academic performance”. psychological bulletin, vol. 135, no. 2, pp. 322-338, 2009. tx_1~abs:at/tx_2:abs~at uhd journal of science and technology | jan 2023 | vol 7 | issue 1 71 1. introduction cloud computing is a new technology for a large-scale environments. hence, it faces many challenges and the main problem of cloud computing is load balancing which lowering the performance of the computing resources [1]. management is the key to balancing performance and management costs along with service availability. 
when cloud data centres (cdcs) are configured and utilized effectively, they offer huge benefits of computational power while reducing cost and saving energy. cloud computing has three types of services: infrastructure as a service (iaas), platform as a service (paas), and software as a service (saas). fundamental resources can be accessed through iaas. paas provides the application runtime environment, besides development and deployment tools. saas enables the provision of software applications as a service to end users. virtual entities are created for all hardware infrastructure elements. virtualization is a technique that allows multiple operating systems (oss) to coexist on a single physical machine (pm). these oss are separated from one another and from the underlying physical infrastructure by a special middleware abstraction known as virtual machine (vm). the software that manages these multiple vms on pm is known as the vm kernel [2]. with the help of virtualization technology, cdcs are able to share a few hpc resources and their services among many users, but virtualization limitations of load balancing and performance analysis processes and algorithms in cloud computing asan baker kanbar1,2*, kamaran faraj3,4 1technical college of informatics, sulaimani polytechnic university , sulaimani 46001, kurdistan region, iraq, 2department of computer science,cihan university sulaimaniya, sulaimaniya 46001, kurdistan region, iraq, 3department of computer science, university of sulaimani, sulaimani, 46001, kurdistan region, iraq, 4department of computer engineering, collage of engineering and computer science, lebanse frence university, erbil, iraq a b s t r a c t in the modern it industry, cloud computing is a cutting-edge technology. since it faces various challenges, the most significant problem of cloud computing is load balancing, which degrades the performance of the computing resources. in earlier research studies, the management of the workload to address all resource allocation challenges that caused by the participation of a large number of users has received important attention. when several people are attempting to access a given web application at once, managing all of those users becomes exceedingly difficult. one of the elements affecting the performance stability of cloud computing is load balancing. this article evaluates and discusses load balancing, the drawbacks of the numerous methods that have been suggested to distribute load among nodes, and the variables that are taken into account when determining the best load balancing algorithm. index terms: cloud computing, load balancing, task scheduling, resource allocation, task allocation, performance stability corresponding author’s e-mail:  asan baker kanbar, assistant lecturer, department of computer science, cihan university sulaimaniya, sulaimaniya 46001, kurdistan region, iraq, asan. e-mail: asan.baker@sulicihan.edu.krd received: 21-11-2022 accepted: 02-03-2023 published: 18-03-2023 access this article online doi: 10.21928/uhdjst.v7n1y2023.pp71-77 e-issn: 2521-4217 p-issn: 2521-4209 copyright © 2023 kanbar and faraj. 
this is an open access article distributed under the creative commons attribution non-commercial no derivatives license 4.0 (cc by-nc-nd 4.0) o r i g i n a l r e s e a r c h a r t i c l e uhd journal of science and technology kanbar and faraj: limitations of load balancing algorithms in cloud computing 72 uhd journal of science and technology | jan 2023 | vol 7 | issue 1 increases the complexity of resource management. task scheduling is one of the key issues considered for efficient resource management. it aims to allocate incoming task(s) to available computing resources, and it belongs to the set of np-class problems. therefore, heuristic and meta-heuristicbased approaches are commonly used to generate scheduling solutions while optimizing one or more goals such as makespan, resource utilization, number of active servers, through-put, temperature effects, energy consumption, etc. customers in the cloud can access resources at any time through the web and only pay for the services they use. with the dramatic increase in cloud users, decreasing task completion time is beneficial for improving user experience. the primary goals of task scheduling are to reduce task completion time and energ y consumption while also improving resource utilization and load balancing ability [3]. improving load balancing ability contributes to fully utilizing vms to prevent execution efficiency from decreasing due to resource overload or waste caused by excessive idle resources. various algorithms have been proposed to balance the load between multiple cloud resources, but there is currently no algorithm that can balance the load in the cloud without degrading performance. load balancing is a method used to improve the performance of networking by distributing the workload among the various resources involved in computing network tasks. the load here can be processor capacity, memory, network load, etc. load balancing optimizes resource usage, reduces response time, and avoids system overload by distributing the load across several components. many researchers are working on the problem of load balancing, and as a result of their research, many algorithms are proposed every day. in this paper, we overview some of the optimistic algorithms that have shown some improvement in load balancing and increased the level of performance. besides we will also show the limitations of these algorithms. 2. load blancing in cloud computing load balancing is performed for resource allocation and managing load in each data center, as illustrated in fig. 1. load balancing in a cloud computing environment has a significant impact on performance; good load balancing can make cloud computing more efficient and improve user satisfaction. load balancing is a relatively new technology that allows networks and resources to deliver maximum throughput with a minimum response time. good load balancing helps to optimize the use of available resources, thus minimizing resource consumption. by sharing traffic between servers, you can send and receive data without experiencing significant delays. different types of algorithms can be used to help reduce traffic load between available servers. a basic example of load balancing in everyday life can be related to websites. without load balancing, users may experience delays, timeouts, and the system may become less responsive. by dividing the traffic among servers, data can be sent and received without significant delay. load balancing is done using a load balancer (fig. 
2), where each incoming request is redirected and transparent to the requesting client. based on specified parameters such as availability and current load, the load balancer uses various scheduling algorithms to find which server should handle the request and sends the request to the selected server. to make a final decision, the load balancer obtains information about the candidate server’s state and current workload to validate its ability to respond to this request [4]. fig. 1. model of load balancing. kanbar and faraj: limitations of load balancing algorithms in cloud computing uhd journal of science and technology | jan 2023 | vol 7 | issue 1 73 3. challenges in cloud computing load balancing before we could review the current load balancing approaches for cloud computing, we need to identify the main issues and challenges involved and that could affect how the algorithm would perform. here, we discuss the challenges to be addressed when attempting to propose an optimal solution to the issue of load balancing in cloud computing. these challenges are summarized in the following points. 3.1. cloud node distribution many algorithms have been proposed for load balancing in cloud computing; among them, some algorithms can provide efficient results in small networks or networks with nodes close to each other. such algorithms are not suitable for large networks because they cannot produce the same efficient results when applied to larger networks. the development of a system to regulate load balancing while being able to tolerating significant delays across all the geographical distributed nodes is necessary [5]. however, it is difficult to design a load balancing algorithm suitable for spatially distributed nodes. some load-balancing techniques are designed for a smaller area where they do not consider the factors such as network delay, communication delay, distance between the distributed computing nodes, distance between user and resources, and so on. nodes located at very distant locations are a challenge, as these algorithms are not suitable for this environment. thus, designing loadbalancing algorithms for distantly located nodes should be taken into account [6]. it is used in large-scale applications such as twitter and facebook. the ds of the processors in the cloud computing environment is very useful for maintaining system efficiency and handling fault tolerance well. the geographical distribution has a significant impact on the overall performance of any real-time cloud environment. 3.2. storage/replication a full replication algorithm does not take efficient storage utilization into account. this is because all replication nodes store the same data. full replication algorithms impose higher costs because more storage capacity is required. however, partial replication algorithms may store partial datasets in each node (with some degree of overlap) depending on each node’s capabilities (such as power and processing capacity) [7]. this can lead to better usability, but it increases the complexity of load balancing algorithms as they try to account for the availability of parts of the dataset on different cloud nodes. 3.3. migration time cloud computing follows a service-on-demand model, which means when there is a demand for a resource, the service will be provided to the required client. therefore, while providing services based on the needs of our customers, we sometimes have to migrate resources from remote locations due to the unavailability of nearby locations. 
3.4. point of failure

controlling the load balancing and collecting data about the various nodes must be designed in a way that avoids having a single point of failure in the algorithm. if the algorithm's patterns are properly designed, they can also help provide effective and efficient techniques to address load balancing problems. using a single controller to balance the load is a major difficulty because its failure might have severe consequences and lead to overloading and under-loading issues. this difficulty must be addressed in the design of any load balancing algorithm [8]. distributed load balancing algorithms seem to offer a better approach, but they are much more complex and require more coordination and control to work properly.

fig. 2. load balancer.

3.5. system performance

a high-complexity algorithm does not automatically mean high system performance. a load balancing algorithm should always be simple to implement and easy to operate. if the complexity of the algorithm is high, the implementation cost will also be higher, and even after implementing the system, performance will decrease because of the added delays introduced by the algorithm itself.

3.6. algorithm complexity

in terms of implementation and operation, the load balancing algorithm should preferably not be too complicated. higher implementation complexity leads to more complex procedures, which can cause performance problems. furthermore, when algorithms require more information and more communication for monitoring and control, the resulting delays cause further problems and reduce efficiency. therefore, to reduce the overhead on cloud computing services, load-balancing algorithms should be as simple and effective as possible [9].

3.7. energy management

a load balancing algorithm should be designed so that its operational cost and energy consumption are low. increasing energy consumption is one of the biggest issues facing cloud computing today. even with energy-efficient hardware architectures that slow down processor speed and turn off machines that are not in use, energy management remains difficult. hence, to achieve better results in energy management, the load balancing algorithm should be designed according to an energy-aware job scheduling methodology [10].

3.8. security

security is one of the top-priority problems of cloud computing. the cloud is always vulnerable in one way or another to security attacks such as ddos attacks. while balancing the load, many operations such as vm migration take place, and during these operations there is a high probability of security attacks. hence, an efficient load balancing algorithm must be robust enough to withstand security attacks without itself introducing new vulnerabilities.

4. related works and limitations of used algorithms and processes

the authors in [11] proposed a hybrid optimization algorithm for load balancing: firefly optimization combined with enhanced multi-criteria particle swarm optimization (pso), called fimpso. to initialize the population in pso, the firefly algorithm is used, since it gives the optimal solution. only two parameters are considered here, namely task arrival time and task execution time. the results are evaluated taking into account parameters such as run time, resource consumption, reliability, makespan, and throughput. limitations: hybrid algorithms require high latency to run. in particular, pso falls into a local optimum when processing a large number of requests, and its convergence speed is low. overloading occurs here because more iterations are needed to reach the optimal solution. a generic pso scheduling loop of the kind these hybrid methods build on is sketched below.
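to ground the pso-based schedulers discussed in this section, the following is a minimal, generic pso loop that maps tasks to vms by minimizing makespan over an execution-time matrix. the encoding, fitness function, and parameter values are illustrative assumptions for exposition; this is not the published fimpso or apdpso procedure.

```python
import random

def makespan(assignment, etc):
    """Fitness: longest per-VM completion time. etc[t][v] = time of task t on vm v."""
    load = [0.0] * len(etc[0])
    for task, vm in enumerate(assignment):
        load[vm] += etc[task][vm]
    return max(load)

def pso_schedule(etc, particles=20, iters=100, w=0.7, c1=1.5, c2=1.5):
    tasks, vms = len(etc), len(etc[0])
    decode = lambda p: [int(x) % vms for x in p]  # continuous position -> vm index
    pos = [[random.uniform(0, vms) for _ in range(tasks)] for _ in range(particles)]
    vel = [[0.0] * tasks for _ in range(particles)]
    pbest = [p[:] for p in pos]
    pbest_fit = [makespan(decode(p), etc) for p in pos]
    best = pbest_fit.index(min(pbest_fit))
    gbest, gbest_fit = pbest[best][:], pbest_fit[best]
    for _ in range(iters):
        for i in range(particles):
            for d in range(tasks):
                r1, r2 = random.random(), random.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            fit = makespan(decode(pos[i]), etc)
            if fit < pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], fit
                if fit < gbest_fit:
                    gbest, gbest_fit = pos[i][:], fit
    return decode(gbest), gbest_fit

# toy instance: 6 tasks, 3 vms, random execution times
etc = [[random.uniform(1, 10) for _ in range(3)] for _ in range(6)]
print(pso_schedule(etc))
```

the slow-convergence and local-optimum behaviour criticized above shows up directly in this loop: every extra iteration re-evaluates all particles, and once pbest and gbest agree the velocities shrink and the search stalls.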
in paper [12], the authors propose a three-layer cooperative fog to reduce bandwidth cost and delay in cloud computing environments. the article formulates a composite objective function combining bandwidth cost reduction and load balancing, considering both link bandwidth and server cpu processing levels. a weight is assigned to every objective of the composite objective function to determine its priority: the minimum bandwidth cost has higher priority and runs first on the layer-1 fog, while the load balancer receives the priority it uses to reduce latency. a milp (mixed-integer linear programming) formulation is used to minimize the composite objective function. two types of resources are used: a network resource (bandwidth) and a server resource (cpu processing). limitations: this work is not suitable for real-time applications because selecting the bandwidth and cpu takes a long execution time. it focuses only on reducing bandwidth costs and load balancing, so finding the optimal solution takes a long time. since priority is based on minimum bandwidth utilization, in large-scale environments many regions use the minimum-bandwidth path and congestion occurs; tasks then take much longer to execute, which also reduces the qos values.

the authors in [13] proposed task offloading and resource allocation for an iot-fog-cloud architecture based on energy and time efficiency. the etcora algorithm is used to improve energy efficiency and request completion time. it performs two tasks: computational offloading selection and transmission power allocation. three layers are presented in this work. the first tier contains the iot devices. the second tier is the fog tier, which consists of fog servers and controllers located in different geographic locations. the third tier is the cloud tier, which consists of cloud servers. however, the entire workload is offloaded within the fog layer, so the fog layer also becomes overloaded. when, in many regions, requests arrive from users at the same time, the fog layer cannot control the load balancing; all users in the region then access the cloud server, which triggers load balancing.

the authors in [14] proposed probabilistic load balancing to avoid congestion due to vm migration and to minimize congestion across migrations. for vm migration, the paper takes into account the distance between the source pm and the destination pm. the architecture features a vm migration controller, stochastic demand forecasting, hotspot detection, and the vms and pms themselves. load balancing is addressed by profiling resource demand, detecting hotspots, and migrating away from hotspots. resource demand profiling tracks vm resource utilization for cpu, memory, network bandwidth, and disk i/o, and is used to send periodic updates to the balancer. to discover hotspots, the resource allocation status is periodically re-derived from the vms' and pms' resource demands. the hotspot migration process then uses a hotspot migration algorithm; a simplified version of this hotspot check is sketched below.
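the following sketch illustrates the hotspot idea in its simplest form: a physical machine is flagged when its forecast utilization exceeds a threshold, and the smallest vm whose removal brings the machine back under the threshold is proposed for migration. the threshold value, the data layout, and the selection rule are assumptions made for illustration, not the exact procedure of [14].

```python
from typing import Dict, List, Optional, Tuple

HOT_THRESHOLD = 0.85  # assumed cpu-utilization threshold for a hotspot

def find_hotspots(pm_vms: Dict[str, List[Tuple[str, float]]]) -> List[str]:
    """pm_vms maps a PM name to its VMs as (vm_name, forecast share of PM cpu)."""
    return [pm for pm, vms in pm_vms.items()
            if sum(share for _, share in vms) > HOT_THRESHOLD]

def pick_vm_to_migrate(vms: List[Tuple[str, float]]) -> Optional[str]:
    """Prefer the smallest VM whose removal resolves the hotspot."""
    total = sum(share for _, share in vms)
    resolving = [(share, name) for name, share in vms
                 if total - share <= HOT_THRESHOLD]
    if resolving:
        return min(resolving)[1]
    # fall back to the largest VM if no single migration is enough
    return max((share, name) for name, share in vms)[1]

cluster = {
    "pm-1": [("vm-a", 0.50), ("vm-b", 0.45)],  # 0.95 -> hotspot
    "pm-2": [("vm-c", 0.30), ("vm-d", 0.20)],  # 0.50 -> healthy
}
for pm in find_hotspots(cluster):
    print(pm, "->", pick_vm_to_migrate(cluster[pm]))  # pm-1 -> vm-b
```

a real controller would also score candidate destination pms (for example by distance and link congestion, as [14] does), which is exactly where the migration-time concern from section 3.3 re-enters.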
the authors in [15] proposed a static load balancing algorithm based entirely on discrete pso for distributed simulations in cloud computing. for static load balancing, adaptive pbest discrete pso (apdpso) is used. pso updates the particle velocity and position vectors, and a distance metric is used to update these vectors from the pbest and gbest values. the non-dominated sorting genetic algorithm ii (nsga-ii) is one of the evolutionary algorithms that preserves the optimal solutions; in each iteration, nsga-ii performs three important processes: selection, mutation, and crossover. however, pso suffers from local optima and poor convergence when handling a large number of requests, resulting in increased latency.

in paper [16], the authors proposed multi-goal task scheduling based on sla and processing time, which is suitable for cloud environments. the article proposes two scheduling algorithms, the threshold-based task scheduling (tbts) algorithm and the service level agreement load balancing (sla-lb) algorithm. tbts schedules tasks in batches against a threshold (expected time of completion) generated from the etc values. sla-lb follows an online model that dynamically schedules a task based on deadline and budget criteria; it is used to find the required system so as to reduce the makespan and increase cloud usage. the paper discusses performance metrics such as makespan, penalty, achieved cost, and vm utilization, and the results show that the proposed method is superior to existing algorithms in terms of both scalability and vm usage. however, the threshold value is based on the completion time; if the completion time increases, the threshold value inflates, which reduces the sla and qos values.

the authors in [17] proposed a multi-agent system for dynamic consolidation of vms with optimized energy efficiency in cloud computing. the proposed system eliminates the centralized point of failure: a decentralized approach is presented using gossip control (gc) within a multi-agent framework, where gc has two protocols, gossip and the contract network protocol. with the help of gc, dvms (dynamic vm consolidation) was developed and compared against two sercon strategies, a centralized strategy and an ecocloud distributed strategy. sercon is used to minimize the server count and vm migrations. during consolidation, ecocloud considers two processes, the migration procedure and the allocation procedure. the gc-based strategy works best with respect to sla violations and power consumption.

in paper [18], the authors proposed a multi-objective feeder reconfiguration problem using cloud theory for wind power uncertainty. they used the cloud theory property of qualitative-quantitative bidirectional transformation to solve the multi-objective feeder reconfiguration problem with the backward and forward cloud generator algorithm, and a fuzzy decision-making algorithm is used to select the best solution.

the authors in [19] proposed an approach to perform cloud resource provisioning and scheduling based on meta-heuristic algorithms. to design the supporting model for the autonomic resource manager that schedules applications effectively, the binary pso (bpso) algorithm is used. the work consists of three consecutive phases: the user-level phase, the cloud provisioning and scheduling phase, and the infrastructure-level phase. finally, an experimental evaluation was performed by modifying the bpso algorithm's transfer function to achieve high exploration and exploitation; the role such a transfer function plays is illustrated below.
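binary pso keeps a continuous velocity per bit and converts it into a probability of setting or flipping that bit through a transfer function; changing this function changes the balance between exploration and exploitation. the s-shaped and v-shaped forms below are standard textbook variants shown only to illustrate what modifying the transfer function means, not the specific modification made in [19].

```python
import math
import random

def s_shaped(v: float) -> float:
    """Sigmoid transfer: probability that the bit is set to 1."""
    return 1.0 / (1.0 + math.exp(-v))

def v_shaped(v: float) -> float:
    """V-shaped transfer: probability that the current bit is flipped."""
    return abs(math.tanh(v))

def update_bit_s(bit: int, velocity: float) -> int:
    return 1 if random.random() < s_shaped(velocity) else 0

def update_bit_v(bit: int, velocity: float) -> int:
    return 1 - bit if random.random() < v_shaped(velocity) else bit

# a large positive velocity drives the s-shaped rule toward 1,
# while under the v-shaped rule it makes flipping the current bit very likely
print(update_bit_s(0, 4.0), update_bit_v(0, 4.0))
```

v-shaped functions tend to preserve good bits longer, since they only flip a bit when the velocity is large, which is one common way of trading exploration for exploitation in bpso variants.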
the authors in [20] introduced intelligent scheduling and provisioning of resources for cloud computing environments. this work addresses the existing problems of poor quality of service, increased execution time, and high cost of service through an intelligent optimization framework that schedules user jobs using spider monkey optimization, resulting in faster execution time, lower cost, and better qos for the user. however, job sensitivity (i.e., risk or non-risk) is not considered, resulting in poor qoe.

the authors in [21] considered multi-objective optimization for energy-efficient allocation of virtual clusters in cloud data centers (cdcs). the paper describes four optimization goals related to virtual clusters and data centers: availability, power consumption, average resource utilization, and resource load balancing. the architecture contains three layers: the core layer, the aggregation layer, and the edge layer. the core layer contains the core switches, which fan out into many aggregation switches; the aggregation layer contains the edge switches and pms; the edge switches are connected to the pms, and the pms host the vm clusters. if an edge switch fails, the pms and vm clusters behind it cannot be accessed.

5. discussion

the existing works address the issues of task scheduling and load balancing in the iot-fog-cloud environment. much of the research aims to reduce makespan, energy consumption, and latency during task scheduling, allocation, and vm migration for load balancing. however, the existing works consider only a limited set of features for task scheduling and allocation, which leads to poor scheduling and qos. in addition, some works select the target vm for migration by considering only the load, which is not enough for optimal vm migration; the lack of significant features also increases the frequency of migrations, which causes high overload and latency during load balancing. furthermore, the existing works use optimization algorithms with slow convergence, such as pso and the genetic algorithm, for task scheduling and vm migration, which leads to high latency and overload during load balancing in the iot-fog-cloud environment. hence, these issues need to be addressed to provide efficient task scheduling and load balancing.

6. conclusion

cloud computing is growing rapidly and users are demanding more and more services, which is why load balancing in cloud computing has become such an important research area. the load on the cloud grows enormously with the expansion of new applications [22]; to handle this load caused by huge numbers of requests and to increase the quality of service, many load balancing techniques are used. in this paper, we surveyed many cloud load balancing techniques, focusing on the limitations of each, to help researchers propose new methods that overcome these limitations and solve the problem of load balancing in the cloud computing environment.
table 1 summarizes the related works, the load balancing algorithms and processes they use, and their limitations.

table 1: summary of related works and limitations of used algorithms and processes
(columns: reference | task classification | task scheduling | load balancing | task allocation | algorithm/process used | limitations)

[11] | x | x | ✓ | x | firefly improved multi-objective particle swarm optimization (fimpso) | needs a large number of iterations and has a low convergence rate
[12] | x | x | ✓ | x | when traffic exceeds a region's capacity, the fog layer performs load balancing | long execution time; task requirements not considered, resulting in sla violations
[13] | x | x | ✓ | ✓ | etcora algorithm for energy-efficient offloading | constraints on increasing the scalability of tasks in a particular region
[14] | x | x | ✓ | x | resource requirements for each task are profiled and offloaded | throughput affected by frequent migrations; high migration time because of long congestion on the link
[15] | x | x | ✓ | x | load balancing based on adaptive pbest discrete pso (apdpso) to reduce communication costs | blind-spot problem; throughput affected by frequent migrations
[16] | x | ✓ | ✓ | x | tbts and sla-lb to dynamically perform scheduling and load balancing | ineffective determination of the threshold value increases complexity
[17] | x | x | ✓ | x | gossip-control-based dynamic virtual machine consolidation for effective load balancing | lack of data security during migration; high migration time because of long congestion on the link
[18] | x | ✓ | x | ✓ | bpso-based resource provisioning and scheduling | high privacy leakage during data migration
[19] | x | x | ✓ | x | cloud-theory-based optimization of load in order to improve qos | increased latency and end-to-end delay because of centralized processing in the cloud layer
[20] | x | ✓ | x | ✓ | arps framework based scheduling and provisioning of resources | high congestion due to data overloading
[21] | x | x | x | ✓ | virtual-clustering-based multi-objective task allocation carried out using ibbbo | long execution time; slow convergence of the ibbbo algorithm

references

[1] a. b. kanbar and k. faraj. "regional aware dynamic task scheduling and resource virtualization for load balancing in iot-fog multicloud environment". future generation computer systems, vol. 137, part c, pp. 70-86, 2022.
[2] h. nashaat, n. ashry and r. rizk. "smart elastic scheduling algorithm for virtual machine migration in cloud computing". the journal of supercomputing, vol. 75, pp. 3842-3865, 2019.
[3] h. gao, h. miao, l. liu, j. kai and k. zhao. "automated quantitative verification for service-based system design: a visualization transform tool perspective". international journal of software engineering and knowledge engineering, vol. 28, no. 10, pp. 1369-1397, 2018.
[4] s. v. pius and t. s. shilpa. "survey on load balancing in cloud computing". in: international conference on computing, communication and energy systems, 2014.
[5] p. jain and s. choudhary. "a review of load balancing and its challenges in cloud computing". international journal of innovative research in computer and communication engineering, vol. 5, no. 4, pp. 9275-9281, 2017.
[6] p. kumar and r. kumar. "issues and challenges of load balancing techniques in cloud computing: a survey". acm computing surveys, vol. 51, no. 6, pp. 1-35, 2019.
[7] m. alam and z. a. khan.
“issues and challenges of load balancing algorithm in cloud computing environment”. indian journal of science and technology, vol. 10, no. 25, pp. 1-12, 2017. [8] h. kaur and k. kaur. “load balancing and its challenges in cloud computing: a review”. in: m. s. kaiser, j. xie and v. s. rathore, editors. information and communication technology for competitive strategies (ictcs 2020). lecture notes in networks and systems, vol. 190. springer, singapore, 2021. [9] g. k. sriram. “challenges of cloud compute load balancing algorithms”. international research journal of modernization in engineering technology and science, vol. 4, no. 1, p. 6, 2022. [10] h. chen, f. wang, n. helian and g. akanmu. “user-priority guided min-min scheduling algorithm for load balancing in cloud computing”. in: proceeding national conference on parallel computing technologies (parcomptech), ieee. pp. 1-8, 2013. [11] a. f. devaraj, m. elhoseny, s. dhanasekaran, e. l lydia and k. shankar. “hybridization of firefly and improved multi-objective particle swarm optimization algorithm for energy efficient load balancing in cloud computing environments”. journal of parallel and distributed computing, vol. 142, pp. 36-45, 2020. [12] m. m. maswood, m. r. rahman, a. g. alharbi and d. medhi. “a novel strategy to achieve bandwidth cost reduction and load balancing in a cooperative three-layer fog-cloud computing environment”. ieee access, vol. 8, pp. 113737-113750, 2020. [13] h. sun, h. yu, g. fan and l. chen. “energy and time efficient task offloading and resource allocation on the generic iot-fog-cloud architecture”. peer-to-peer networking and applications, vol. 13, no. 2, pp. 548-563, 2020. [14] l. yu, l. chen, z. cai, h. shen, y. liang and y. pan. “stochastic load balancing for virtual resource management in datacenters”. ieee transactions on cloud computing, vol. 8, pp. 459-472, 2020. [15] z. miao, p. yong, y. mei, y. quanjun and x. xu. “a discrete psobased static load balancing algorithm for distributed simulations in a cloud environment”. future generation computer systems, vol. 115, no. 3, pp. 497-516, 2021. [16] d. singh, p. s. saikrishna, r. pasumarthy and d. krishnamurthy. “decentralized lpv-mpc controller with heuristic load balancing for a private cloud hosted application”. control engineering practice, vol. 100, no. 4, p. 104438, 2020. [17] n. m. donnell, e. howley and j. duggan. “dynamic virtual machine consolidation using a multi-agent system to optimise energy efficiency in cloud computing”. future generation computer systems, vol. 108, pp. 288-301, 2020. [18] f. hosseini, a. safari and m. farrokhifar. “cloud theory-based multi-objective feeder reconfiguration problem considering wind power uncertainty”. renewable energy, vol. 161, pp. 1130-1139, 2020. [19] m. kumar, s. c. sharma, s. s. goel, s. k. mishra and a. husain. “autonomic cloud resource provisioning and scheduling using meta-heuristic algorithm”. neural computing and applications, vol. 32, pp. 18285-18303, 2020. [20] m. kumar, a. kishor, j. abawajy, p. agarwal, a. singh and a. y. zomaya. “arps: an autonomic resource provisioning and scheduling framework for cloud platforms”. ieee transactions on sustainable computing, vol. 7, no. 2, pp. 386-399, 2021. [21] x. liu, b. cheng and s. wang. “availability-aware and energyefficient virtual cluster allocation based on multi-objective optimization in cloud datacenters”. ieee transactions on network and service management, vol. 17, no. 2, pp. 972-985, 2020. [22] a. b. kanbar and k. faraj. 
"modern load balancing techniques and their effects on cloud computing". journal of hunan university (natural sciences), vol. 49, no. 7, pp. 37-43, 2022.

uhd journal of science and technology | may 2018 | vol 2 | issue 2

1. introduction

this paper presents a rule-based machine translation (rbmt) system for the kurdish language. the goals of this paper are two-fold: first, we build an mt system using a free/open-source platform (apertium); second, we evaluate the translation of the proposed system against the "inkurdish" translation of the same set of data through a manual evaluation method.

the kurdish language belongs to the group of indo-european languages. the kurdish dialects are divided, according to linguistic and geographical facts, into four main dialects: north kurmanji, middle kurmanji, south kurmanji, and gurani [1]. kurdish is written using four different scripts, which are modified persian/arabic, latin, yekgirtu (unified), and cyrillic. the latin script uses a single character per letter, while persian/arabic and yekgirtu in a few cases use two characters for one letter. the persian/arabic script is even more complex with its right-to-left and concatenated writing style [2].

mt, perhaps the earliest nlp application, is the translation of text units from one natural language to another using computers [3]. achieving error-free translation is a difficult task; instead, steady improvement toward completely automatic, high-quality, and general-purpose translation is required. better mt evaluation metrics will surely help the development of better mt systems [4]. mt evaluation has both automatic and manual (human) evaluation methods; the human evaluation criteria include fluency, adequacy, intelligibility, fidelity, informativeness, task-oriented measures, and post-editing. the automatic evaluation criteria include precision, recall, f-measure, edit distance, word order, part-of-speech tags, sentence structures, phrase types, named entities, synonyms, paraphrase, semantic roles, and language models. for this work, the manual evaluation method has been used to evaluate the accuracy of both systems.

english to kurdish rule-based machine translation system

kanaan m. kaka-khan, department of computer science, university of human development, kurdistan region, iraq

a b s t r a c t

machine translation (mt) is gaining ever more attention as a solution to overcome language barriers in information exchange and knowledge sharing. in this paper, we present a rule-based mt system developed to translate simple english sentences to kurdish. the system is based on the apertium free open-source engine, which provides the environment and the required tools to develop an mt system. the developed system is used to translate simple sentences, compound sentences, phrases, and idioms from english to kurdish. the resulting translation is then evaluated manually for accuracy and completeness compared with the result produced by the popular inkurdish english-to-kurdish mt system. the results show that our system is more accurate than the inkurdish system. this paper contributes toward the ongoing effort to achieve full machine-based translation in general and english-to-kurdish mt in particular.

index terms: apertium, inkurdish, machine translation, morphological, rule-based machine translation

corresponding author's e-mail: kanaan m. kaka-khan, department of computer science, university of human development, kurdistan region, iraq.
e-mail: kanaan.mikael@uhd.edu.iq

received: 11-08-2018 accepted: 17-08-2018 published: 02-09-2018
access this article online
doi: 10.21928/uhdjst.v2n2y2018.pp32-39
e-issn: 2521-4217 p-issn: 2521-4209
copyright © 2018 kaka-khan. this is an open access article distributed under the creative commons attribution non-commercial no derivatives license 4.0 (cc by-nc-nd 4.0)

review article

we have used a platform called apertium. apertium defines itself as a free/open-source mt platform, initially aimed at related-language pairs but expanded to deal with more divergent language pairs, providing a language-independent mt engine and tools to manage the linguistic data [5]. apertium originated as one of the mt engines in the opentrad project, which was funded by the spanish government and developed by the transducens research group at the universitat d'alacant. at present, apertium has released 40 stable language pairs. being an open-source project, apertium provides tools for potential developers to build their own language pair and contribute to the project. although translators without borders (twb) claimed to have developed offline mt engines for sorani and kurmanji using apertium, specifically for translating content for refugees, their work had not been published academically. although apertium was founded initially to provide an english/catalan converter, it can also be used for right-to-left languages with additional effort, specifically in creating the transfer rules.

the rest of this paper is organized in the following way: next, we present an mt survey in section 2. we describe the methodology in section 3. we then show and explain the results in section 4, followed by the conclusion in the last section.

2. mt survey

2.1. general mt survey

very early mt systems date back to the 1950s [6]. the development of computers with high storage and performance on one side, and the availability of bilingual and multilingual corpora on the other, led to rapid mt development from the 1990s onward [7]. in 1993, the ibm watson research group made several important contributions to mt, such as designing five statistical mt models and the techniques to estimate the model parameters using bilingual corpora [8]. in 2003, franz josef och presented minimum error rate training for statistical mt systems [9] and koehn et al. proposed the statistical phrase-based mt model [10]; in 2005, koehn and monz presented a shared task of building statistical mt systems for four european languages [11], and david chiang proposed a hierarchical phrase-based smt model that is learned from a bitext without syntactic information [12]; menezes et al. used global reordering and dependency trees to build english-to-spanish statistical mt in 2006 [13]. in 2007, koehn et al. achieved a milestone by developing moses, an open-source smt software toolkit [14]; around the same time, to improve word alignment and language model quality across different languages, hwang et al. utilized shallow linguistic knowledge [15]; sánchez-martínez and forcada described an unsupervised method for the automatic inference of structural transfer rules for a shallow-transfer mt system in 2009 [16]. in 2011, khalilov and fonollosa designed a new syntax-based reordering technique to address the problem of word ordering [17].
the fast development of deep learning has played a great role in mt research, which has evolved from conventional models to example-based models (nirenburg, 1989 [18]), statistical models (carl and way, 2003 [19]), hybrid models (koehn and knight, 2009 [20]), and, in recent years, neural models (bahdanau et al., 2014 [21]). neural mt (nmt) is a recent hot topic that takes automatic translation in a very different direction from the traditional phrase-based smt methods. in the traditional model, the different mt components are trained separately, while the nmt components are trained jointly, utilizing artificial neural networks to increase translation performance through a two-step recurrent neural network of encoder and decoder [21]-[23].

2.2. kurdish mt survey

unfortunately, little effort has been made for kurdish mt so far. in 2011, safeen ghafour proposed a project called speekulate; speekulate can be considered a theoretical piece of research, a multiuse translator [24]. in 2013, the first english-to-kurdish (sorani) mt system was released under the name "inkurdish" for translating english text to the kurdish language [25]. in 2016, google translate added support for 13 new languages including kurdish (the kurmanji dialect), bringing the total number of supported languages to 103 [26]. twb has developed offline mt engines for sorani and kurmanji, specifically for translating content for refugees [27]. in 2017, kanaan and fatima evaluated the "inkurdish" mt system using different automatic evaluation metrics in order to identify its weaknesses [28]; hassani suggested a method for mt between two kurdish dialects (kurmanji and sorani) using bidialectal dictionaries, and his results showed that the translated texts were rated as understandable in 71% and 79% of cases for kurmanji and sorani, respectively, and as only slightly understandable in 29% of cases for kurmanji and 21% for sorani [2].

3. methodology

the nature of the language and the availability of resources play important roles in selecting an mt approach. fig. 1 describes the four different categories of machine translation approaches.

3.1. direct translation

direct translation involves a word-by-word translation approach. no intermediate representation is produced.

3.2. rule-based translation

rbmt systems parse the source text and produce an intermediate representation. the target language text is generated from the intermediate representation.

3.3. corpus-based translation

the advantages of this approach are that such systems are fully automatic and require less human labor. however, they require sentence-aligned parallel text for each language pair and cannot be used for language pairs for which such corpora do not exist.

3.4. knowledge-based translation

this kind of system is centered around a "concept" lexicon representing a domain.

the rule-based approach has been chosen for the proposed system; the reason to choose a rule-based instead of a statistical system is the unavailability of sufficiently large corpora [29]. we use rbmt, which is suitable for languages for which there are very little data [27]; despite being spoken by about 30 million people in different countries, kurdish is among the less-resourced languages [2]. hence, rbmt is a suitable choice for kurdish mt.
rbmt models transform the input structure to produce a representation that matches the target language rules. the approach has three components (fig. 2): analysis, to produce the structure of the source language; transfer, to map the representation of the source language onto a representation of the target language; and generation, which uses the target-level structure to generate the target language text.

fig. 1. machine translation approaches [1].
fig. 2. rule-based (transfer-based) machine translation diagram [2].

after completing the prototype of the system, 500 different random data items (simple sentences, complex sentences, proverbs, idioms, and phrases) were given to both systems. then, the output of both systems was given to an annotator (a kurdish native speaker who is an english specialist) to evaluate the results through the manual evaluation method. the aim of the evaluation is to determine the translation accuracy of both systems in terms of both meaning and grammatical correctness. the evaluation has been designed around 5 categories, scored from 5 to 1: highly accurate, the translation is very near to the reference, it conveys the content of the input sentence, and no post-editing is required; accurate, the translation conveys the content of the input sentence, and little post-editing is required; fairly accurate, while the translation generally conveys the meaning of the input sentence, it suffers from word-order problems, tense problems, or untranslated words; poorly accurate, while the translation somehow conveys the meaning of the input sentence, it does not convey the input sentence content accurately; and completely inaccurate, the content of the input sentence is not conveyed at all by the translation, which just translates the words individually.

4. proposed system configuration

our system basically works on dictionaries and transfer rules; at a basic level, we maintain three main dictionaries:

1. kurdish morphological dictionary: this file describes the rules of how words in the kurdish language are inflected; it is named apertium-kur.kur.dix.
2. english morphological dictionary: this file describes the rules of how words in the english language are inflected; it is named apertium-eng.eng.dix.
3. bilingual dictionary: this file describes correspondences between words and symbols in the kurdish and english languages; it is named apertium-kur-eng.kur-eng.dix.

we also maintain a file for the transfer rules between the two languages. these rules govern word reordering in the target language:

• english to kurdish transfer rules: this file contains the rules governing how english will be changed into kurdish; it is named apertium-eng-kur.kur-eng.t1x.

in spite of the possibility of translating kurdish texts to english, we present only english-to-kurdish translation in this work.

4.1. terms used in the system

before creating the dictionaries and rules, some related terms are explained briefly. the first term is lemma: a lemma is the form of a word stripped of any grammatical information; for example, book is the lemma of (booked, booking, etc.), and be is the lemma of was. the second term is symbol: a grammatical label, for example singular and plural nouns, first person, present indicative, etc. tags are used for symbols, <n> for noun, <pl> for plural, etc.; a toy illustration of this lemma-plus-tags representation is given below.
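as a toy illustration (not apertium's actual data files), a morphological analysis can be viewed as a surface form mapped to a lemma plus a sequence of symbols. the tag names follow the examples given in the text (<n>, <sg>, <pl>); the verb tags and the dictionary contents are hypothetical.

```python
# surface form -> (lemma, symbols); tag names follow the text's examples
analyses = {
    "book":  ("book", ["n", "sg"]),
    "books": ("book", ["n", "pl"]),
    "was":   ("be",   ["vbser", "past"]),  # hypothetical verb tags
}

def analyse(surface: str) -> str:
    """Render an analysis as lemma followed by its tags, e.g. book<n><pl>."""
    lemma, tags = analyses[surface]
    return lemma + "".join(f"<{t}>" for t in tags)

print(analyse("books"))  # book<n><pl>
print(analyse("was"))    # be<vbser><past>
```

generation is simply the reverse lookup, from lemma plus tags back to a surface form, which is what compiling a dictionary in the opposite direction provides, as described in the next subsections.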
paradigm is another related term, referring to the inflection pattern of a particular group of words, for example happy: happ(y, ier, iest). instead of storing many similar entries, we can store one and then state that a second word inflects like the first, for example "shy inflects like happy". paradigms are defined in <pardef> tags and used through <par> tags.

4.2. basic tags in the kurdish and english dictionaries

the <dictionary></dictionary> tag is the start and end point that contains all the other tags within the xml file. the <alphabet></alphabet> tag defines the set of letters that will be used in the dictionary:

<alphabet>abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ</alphabet> for the english dictionary.

<alphabet>ئ ـ ا ب پ ت ج چ ح خ د ر ڕ ز ژ س ش ع غ ف ڤ ق ک گ ل ڵ م ن ـه ە و ۆ وو ی ێ</alphabet> for the kurdish dictionary.

symbol definitions: the symbol names can be written out in full or abbreviated, for example noun (n) in singular (sg) and plural (pl) (fig. 3). then, we define a section <section></section> for the paradigms <pardefs></pardefs> (fig. 4).

fig. 3. tags used for symbols.
fig. 4. skeleton for morphological dictionary.

this is the basic skeleton of the morphological dictionaries; the words are then entered through entries such as <e><p><l/><r><s n="n"/><s n="sg"/></r></p></e>, where e stands for entry, p for pair, l for left, and r for right. compiling the entries left to right produces analyses from words, and compiling them right to left produces words from analyses. the final step is compiling and running the dictionary. both the english (apertium-eng.eng.dix) and kurdish (apertium-kur.kur.dix) morphological dictionaries are created in the same manner.

4.3. bilingual dictionary

this dictionary describes mappings between words. the basic skeleton is the same as for a monolingual dictionary, but we need to add an entry to translate between the two words: <e><p><l>university<s n="n"/></l><r>زانکۆ<s n="n"/></r></p></e>. we compile the bilingual dictionary left to right to produce the kurdish → english dictionary and right to left to produce the english → kurdish dictionary.

4.4. transfer rules

this file contains the rules governing how english will be changed into kurdish; the basic skeleton of the transfer rules is shown in fig. 5. the <rule> tag defines a rule. the <pattern> tag means "apply this rule if this pattern is found" (here the pattern consists of a single noun defined by the category item nom). patterns are matched longest-match first; the first pattern that matches determines the rule that is executed. for each pattern there is an associated action, which produces an associated output, out; the output is a lexical unit (lu). the <clip> tag allows a user to select and manipulate attributes and parts of the source language (side="sl") or target language (side="tl") lexical item. the transfer rules file also needs to be compiled and tested. a small scripted illustration of the bilingual entry format from section 4.3 is given below.
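as a quick sanity check of the entry format described in section 4.3, the following snippet assembles the example bilingual entry with python's standard xml library and prints it. the element structure mirrors the <e>/<p>/<l>/<r>/<s> tags quoted in the text; the helper function itself is purely illustrative and is not part of the apertium toolchain.

```python
import xml.etree.ElementTree as ET

def bilingual_entry(src: str, tgt: str, symbol: str = "n") -> ET.Element:
    """Build an <e><p><l>src<s/></l><r>tgt<s/></r></p></e> entry element."""
    e = ET.Element("e")
    p = ET.SubElement(e, "p")
    left = ET.SubElement(p, "l")
    left.text = src
    ET.SubElement(left, "s", {"n": symbol})
    right = ET.SubElement(p, "r")
    right.text = tgt
    ET.SubElement(right, "s", {"n": symbol})
    return e

entry = bilingual_entry("university", "زانکۆ")
print(ET.tostring(entry, encoding="unicode"))
# <e><p><l>university<s n="n" /></l><r>زانکۆ<s n="n" /></r></p></e>
```

the same bilingual file can then be compiled in either direction to obtain the english → kurdish or kurdish → english dictionary, as described above.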
5. results and discussion

after completing the prototype of the proposed system, it was tested against different sets of data: first against individual words, and then against simple sentences, complex sentences, phrases, proverbs, and idioms; some examples are shown in fig. 6.

fig. 5. skeleton for transfer rules.
fig. 6. samples of proposed system translation.

fig. 6 shows a random sample of data translated by our proposed system; we tried to maintain a rich corpus that involves a vast number of individual words, phrases, idioms, proverbs, etc., in order not to have untranslated words in the output. the second part of this work is the evaluation of the proposed system's results against the "inkurdish" mt system results for the same set of data using the manual evaluation method. table 1 shows a sample of data translated by both systems. the nonsensical output of inkurdish for paragraphs and long texts obliged us to limit the evaluation to the basic level (simple and compound sentences, idioms, proverbs, and phrases); the sample contains a couple of random examples of each data set. the evaluation was made by a neutral annotator (a kurdish native speaker who is an english specialist) according to the five categories defined before. a detailed explanation of both the computational and the linguistic issues is outside our main aim; we focused on the accuracy differences between the two systems, plus touching on some general translation issues found while experimenting with the data sets.

the inkurdish mt system suffers severely from several issues. it is unable to link verbs to objects in sentences, and in spite of having all the different meanings of a specific verb
in our previous work “evaluation of inkurdish mt system,” we addressed the issues of this mt system in details; hence, we tried to bridge the gaps of inkurdish system in our proposed system and this is the reason of clear differences between inkurdish accuracy average and proposed system accuracy average; the most common inkurish issue is lack of rich corpus specifically to deal with phrases, idioms, and proverbs (1.46, 1.15, and 1.25, respectively) (table 2); during table 1 sample of data sets with their translations dataset source text inkurdish translation proposed system translation simple sentence i go to university ۆکناز ۆب مۆڕەد نم ۆکناز ۆب مۆڕەئ نم evaluation 5 4 the kids are playing in the backyard .هكەدايكاب ەل نەكەد ىراي ناكەڵادنم یەچخاب ەل نەکەد یرای ناکەڵادنم .ەوەتشپ evaluation 3 4 complex sentence i have been playing football since i am 6 years old ٦ منم ىەتەوەل ىنيبەد ڵۆڕ مێپ ىپۆت نم .ناڵاس یەتەوەل مەکەئ یپۆت یرای نم ناڵاس ٦ منەمەت evaluation 1 3 while he was watching the movie, the power switched off ،دركەد اشامەت ىەكەميلف وەئ ىەتاك وەل .ەوەيدناژوك اناوت یاشامەت (ەکەڕوک)وەئ یەتاک وەل ابەراک یوزەت ،درکەد یەکەملیف ەوەیاژوک evaluation 3 4 proverbs better late than never زيگرەه ەل گنەرد رتشاب نتشەگەن ەل نتشەگ گنەرد ەرتشاب evaluation 2 5 actions speak louder than words .ەشو ەل نەكەد ەسق رتزرەب ىگنەد راك ەسق کەن ەترەش رادرک evaluation 2 4 idioms the english test was a piece of cake ىكێيەچراپ ىزيلگنيئ ىەوەندركيقات .ووب كێك رۆز یزیلگنیئ یەوەندرک یقات ووب ناسائ evaluation 2 4 you can kill two birds with one stone ڵەگەل تيژوكب ەدنڵاب وود تيناوتەد ۆت .درەب كەي وود درەب کەی ەب تیناوتەئ ۆت .تیژوکب ەدنڵاب evaluation 2 3 phrases thanks i am pretty good مشاب کەیەداڕ ات نم ساپوس مشاب رۆز نم ساپوس evaluation 3 5 we could have dinner at macdonald. how does that sound? .دلەنۆدكەم ەل ناميناوت وێش نيۆخب ەمێئ ؟تاكەد ەگنەد وەئ نۆچ دڵانۆدکام ەل وێش نیناوتەئ ەمێئ ؟ەگنەد وەئ ەنۆچ ،نیۆخب evaluation 2 3 table 2 translation accuracy average for both systems dataset inkurdish accuracy average proposed system accuracy average simple sentence 3.12 3.56 complex sentence 2.45 2.78 proverbs 1.25 2.22 idioms 1.15 2.42 phrases 1.46 2.13 kanaan m. kaka-khan: english to kurdish rule-based machine translation system 38 uhd journal of science and technology | may 2018 | vol 2 | issue 2 experimenting the data with inkurdish, it did not translate even one idiom or proverb, and it gives a literal translation instead. 6. conclusion mt remains to be one of the most challenging aspects of nlp. despite the ongoing efforts to achieve full machinebased translation, little progress has been achieved; due to language structure and composition complexity. open-source platforms have provided the environment and tools required to develop reliable mt systems, especially for language with poor resources such as kurdish. we have presented a mt system to translate english to kurdish developed using an open-source platform. the resulting translation is compared with the result generated by inkurdish popular english to kurdish mt system. the result shows clear differences between inkurdish mt system and our mt system in terms of translation accuracy. the result also shows that rbmt and manual mt evaluation are suitable choices, for poorly resourced languages. biography kanaan m.kaka-khan is an associate professor in the computer science department at human development university, sulaimaniya, iraq. born in iraq 1982. 
kanaan m.khan had his bachelor degree in computer science from sulaimaniya university and master degree in it from bam university, india. his research interest area include: natural language processing, mt, chatbot, and information security. references [1] f. h. khorshid. kurdish language and the geographical distribution of its dialects. ishbeelia press, baghdad, 1983. [2] h. hassani. kurdish interdialect machine translation. in proc of the fourth workshop on nlp for similar languages, varieties and dialects (vardial), 2017, pp. 63-72. [3] t. siddiqui and u. s tiwary. natural language processing and information retrieval. oxford university press. second impression 2010, oxford, 2010. [4] c. liu, d. dahlmeier and h. t. ng. better evaluation metrics lead to better machine translation, in: proc emnlp, 2011. [5] apertuim. about apertuim, 2018. available: www.apertium.org/ index.eng.html?dir=eng-spa#translation. [aug. 7, 2018]. [6] w. weaver. translation. machine translation of languages: fourteen essays. mit press, cambridge, ma, 1955. [7] j. b. marin˜o, r. e. banchs, j. m. crego, a. gispert, p. lambert, j. a. r. fonollosa and m. r. costa-jussa`. n-gram based machine translation. computational linguistics, vol. 32, no. 4, pp. 527-549, 2006. [8] p. f. brown, v. j. d. pietra, s. a. d. pietra and r. l. mercer. the mathematics of statistical machine translation: parameter estimation. computational linguistics, vol. 19, no. 2, pp. 263-311, 1993. [9] f. j. och. minimum error rate training for statistical machine translation. in proc. of acl, 2003. [10] p. koehn, f. j. och and d. marcu. statistical phrase-based translation. in proc. of the 2003 conference of the north american chapter of the association for computational linguistics on human language technology, association for computational linguistics. vol. 1, pp. 48-54, 2003. [11] p. koehn and c. monz. shared task: statistical machine translation between european languages. in proc. of the acl workshop on building and using parallel texts, 2005. [12] d. chiang. a hierarchical phrase-based model for statistical machine translation. in proc. of the 43rd annual meeting of the association for computational linguistics (acl), 2005, pp. 263-270. [13] a. menezes, k. toutanova and c. quirk. microsoft research treelet translation system: naacl 2006 europarl evaluation. in proc. of wmt, 2006. [14] p. koehn, h. hoang, a. birch, c. callison-burch, m. federico, n. bertoldi, b. cowan, w. shen, c. moran, r. zens, c. j. dyer, o. bojar, a. constantin and e. herbst. moses: open source toolkit for statistical machine translation. in proc. of the 45th annual meeting of the acl on interactive poster and demonstration sessions, association for computational linguistics, 2007b, pp. 177-180. [15] y. s. hwang, a. finch and y. sasaki. improving statistical machine translation using shallow linguistic knowledge. computer speech and language, vol. 21, no. 2, pp. 350-372. [16] f. sa´nchezmart´inez and m. l. forcada. inferring shallow-transfer machine translation rules from small parallel corpora. journal of artificial intelligence research, vol. 34, pp. 605-635, 2009. [17] m. khalilov and j. a. r. fonollosa. syntax-based reordering for statistical machine translation. computer speech and language, vol. 25, no. 4, 761-788, 2011. [18] s. nirenburg. knowledge based machine translation. machine translation, vol. 4, no. 1, pp. 5-24, 1989. [19] m. carl and a. way. recent advances in example-based machine translation. kluwer acadmic publishers, dordrecht/boston/ london, 2003. 
[20] p. koehn and k. knight. statistical machine translation, november 24. us patent no. 7,624,005, 2009. [21] d. bahdanau, k. cho and y. bengio. neural machine translation by jointly learning to align and translate. corr, vol. abs/1409.0473, p. 9, 2014. [22] k. h. cho, b. van merrienboer, d. bahdanau and y. bengio. on the properties of neural machine translation: encoder-decoder approaches. corr, vol. abs/1409.1259, p.15, 2014. [23] k. wolk and k. marasek. neural-based machine translation for medical text domain. based on european medicines agency leaflet texts. procedia computer science, vol. 64, p. 2-9. 2015. [24] s. ghafour. “speeculate”. 2011. available: www.kurditgroup.org/ sites/default/files/speekulate_0.pdf. [aug. 4, 2018]. [25] inkurdish translator. 2013. available: www.inkurdish.com/. [aug. 4, 2018]. [26] ekurd daily-editorial staff. “google translate adds support for kurdish language”, 2016. available: www.ekurd.net/googletranslate-kurdish-language-2016-02-18. [aug. 4, 2018]. [27] translators without borders. “translators without borders kanaan m. kaka-khan: english to kurdish rule-based machine translation system uhd journal of science and technology | may 2018 | vol 2 | issue 2 39 developed the world’s first crisis-specific machine translation system for kurdish languages”, 2016. available: www. translatorswithoutborders.org/translators-without-bordersdevelops-worlds-first-crisis-specific-machine-translation-systemkurdish-languages/. [aug. 4, 2018]. [28] k. m. kaka-khan and f. jalal. evaluation of in kurdish machine translation system. presented in the 4th international scientific conference of university of human development, apr. 2017. journal of university of human development, vol. 3, no. 2, pp. 862-868, jun. 2017. [29] w. linda. rule-based mt approaches such as apertium and gramtrans. universitetet i tromsø, norway 21.10.2008, 2008. tx_1:abs~at/tx_2:abs~at uhd journal of science and technology | may 2018 | vol 2 | issue 1 1 1. introduction it is commonly agreed that certain people look “young for their age” or “old for their age.” moreover, the two common processes that influence skin aging are the skin aging that genetically determined and happens by passing time which is named chronologically (or intrinsic) skin aging process, while premature (or extrinsic) skin aging process is triggered by environmental factors. recognized environmental factors that lead to premature skin aging process are sun exposure, air pollution, and smoking. the extrinsic skin aging rate is different notably among ethnic groups and individuals, whereas for the intrinsic rate of skin aging this occurrence does not relate [1]-[4]. over the past decades, there has been a major growth in the instances of skin disease all over the world. vital public health implications will occur when individual exposure to high cumulative levels of ultraviolet (uv) radiation [5], [6]. unprotected and excessive sun exposure can damage skin cells, influence the normal growth of the skin and as a result cause several skin diseases such as burning and tanning. in addition, severe skin problems can occur when humans are exploring knowledge and self-care practice toward skin aging and sun protection among college students in sulaimani city-iraq hezha o. rasul1, diary i. tofiq1, mohammad y. saeed2, rebaz f. 
hamarawf1 1department of chemistry, college of science, university of sulaimani, iraq, 2department of medicine, college of medicine, university of sulaimani, iraq a b s t r a c t several studies have been performed internationally to assess the understanding and self-care exercise of people in the direction of sun exposure and sun protection measures, as self-care is an essential pillar of public health. nevertheless, limited data on these factors are available from the middle east. the aim of this study was to investigate the students’ awareness of skin aging and sun-protection measures among college students. for this purpose, a cross-sectional questionnaire was specially designed; a random sample of the students in the different college of the university of sulaimani was selected. data were collected between january and may 2017. the relationship between the skin cancer awareness and different sociodemographic characteristics was produced by applying multiple logistic regressions. the questionnaires were distributed to 450 college students. a total of 413 questionnaires had been completely responded and covered within the data analysis, with a response rate of 91.7%. 41% of the respondents were females and 61.0% of the participants were aged between 18 and 21 years old. 47% have been privy to the association between sun exposure and skin aging. the respondents had been more likely to be aware of the connection between sun exposure and skin cancer (p < 0.03). the respondents from the third class of undergraduates were more likely to be familiar (p < 0.04). staying under the shade during the outdoor activity was reported by more than 90% of our participants and is positioned as the most frequently used sun protection method. index terms: skin aging, skin cancer, skin care, sunscreen corresponding author’s e-mail: hezha o. rasul, department of chemistry, college of science, university of sulaimani, iraq. e-mail: hezha. rasul@univsul.edu.iq received: 11-09-2017 accepted: 05-02-2018 published: 25-05-2018 access this article online doi: 10.21928/uhdjst.v2n1y2018.pp1-7 e-issn: 2521-4217 p-issn: 2521-4209 copyright © 2018 hezha o. rasul, et al. this is an open access article distributed under the creative commons attribution non-commercial no derivatives license 4.0 (cc by-nc-nd 4.0) o r i g i n a l r e s e a r c h a r t i c l e uhd journal of science and technology hezha o. rasul, et al.: skin aging and sun protection knowledge among college students 2 uhd journal of science and technology | may 2018 | vol 2 | issue 1 exposed to large quantities of solar uv radiation, including skin aging, pigmentary changes, and skin cancer [7]-[11]. uv radiation is the most important environmental factor which leads to premature skin aging [2]. human beings are exposed to massive portions of uv radiation in part through numerous sources which include living and traveling in sunny climates and outdoor activity, additionally due to thinning of the ozone layer within the stratosphere [5]. the harmful effect of uv radiation on the skin look regarding facial aging was previously discovered in the end 19th century by the two dermatologists unna and dubreuilh [12], [13]. harry daniell, in 1971, discovered the associations between cigarette smoking and skin aging [14]. moreover, moderate alcohol consumption has also been shown to correlate with skin appearance [15]. 
recent observation reported that air pollution is another significant environmental factor, which influences the skin appearance and leads to skin aging intrinsically [16]. skin cancer has increased gradually during the past 50 years. epidemiological studies demonstrate that skin cancer is developed by the sun, which is considered as the main considerable environmental factor which influences the skin. to reduce the skin cancer occurrence, the first step needs to be done increasing levels of awareness and self-care knowledge of the sun’s harmful effects and how to better protect from solar emission [5], [17], [18]. it is essentially important to focus on educational level; this is with the purpose of changing behavioral patterns and protecting people against the dangerous effects of the sun [17]. education plays a key role in raising awareness [19], [20]. several types of research have been studied in different countries to determine people knowledge levels about the sun effect on facial aging and awareness level concerning sun protection [21]. in local skin care hospital, we have observed that many patients do not protect themselves from the sun and report unhealthy attitudes toward this subject. exploring deficits in sun protection awareness and self-care practice toward different environmental factors can serve as a starting point for primary prevention interventions. identifying knowledge and self-care practices of the public regarding skin aging, exposure and protection of the sun have been studied in several countries. nevertheless, there is no study regarding this issue in sulaimani city-iraq. the purpose of the following study is to find out the levels of knowledge and self-care practice in regard to skin aging, sun exposure, and protection among college students. in addition, we will present the student’s knowledge about various environmental factors such as air pollution, smoking, and drinking alcohol on skin aging. 2. materials and methods a cross-sectional survey was carried out between january 2017 and may 2017, both males and females students were involved at different colleges of sulaimani university. from each college, several departments were selected randomly. a total of 413 questionnaires were collected. data collection was performed by several trained students. the data collection process in this study was carried out using the questionnaire, which was specially created throughout a search of appropriate literature [1], [5], [7]. the re-designed questionnaire was tested initially in sulaimani center for dermatology and venereal disease-teaching hospital to estimate approximately the length of the questionnaire in minutes, verify the participant’s interpretation of questions and develop the questionnaire consequently. these questionnaires were tested in independent data sets, but these candidate questionnaires were excluded from the concluding analysis. however, the final version of the survey was conducted in the university of sulaimani. the final version of the questionnaire included 24 questions and required approximately 5 minutes to complete. approval from the ethics committee of university of sulaimani, sulaimani, iraq, was obtained. the self-administered questionnaire was composed of three sections. the first section of the questionnaire comprised nine questions about personal information, such as university level, residence (urban vs. rural), gender, age, marital status, weight, high, smoking, and drinking alcohol. 
various questions were integrated into the second part of the questionnaire about the student’s knowledge concerning the factor of skin aging, sun’s benefits and harmful effects on the skin and use of sun protection methods. the data were analyzed using statistical package for the social sciences program (spss) version 21. after that the statistical assessment of data and the summarization of frequencies and percentages, multiple logistic regressions were used. statistical significance was defined as p < 0.05. 3.results the questionnaire was distributed to 453. a complete data from 413 participants were returned and integrated into the analysis with a 91.7% response rate. there was a various range of age of the students who participated in the survey, 61.0% (252/413) of the respondents were aged 18-21 years old, while, and 34.1% (141/413) of the students were aged between 22 and 25. only 4.8% (20/413) of the participants were aged over 25 years. in addition, 58.6% (242/413) of the students who contributed to the survey were male, hezha o. rasul, et al.: skin aging and sun protection knowledge among college students uhd journal of science and technology | may 2018 | vol 2 | issue 1 3 while 41.4% (171/413) of the participants were female. the sociodemographic characters of the research population are depicted in table i. the level of awareness among students regarding unprotected exposure to the sun caused skin damage is illustrated in table ii. this study has indicated that the majority of respondents were mindful that excessive sun exposure causes skin burn (87.9%, 363/409). 47% (194/402) reported that sun exposure can cause skin aging, while almost more than a half of respondents (52.5%, 217/404) were responsive of the relationship between skin cancer and sun exposure. nevertheless, the level of knowledge of students regarding the impact of sunlight on skin was explored; understanding of the relationship between synthesis of vitamin d and sun exposure was the most well-known benefit of the students, with 76.1% of the male and 70.7% of the female citing it. the male and female participants were reported (74.4% and 70.6%, respectively) regarding the association between sun exposure and treatment in some skin conditions. the relationship between positive psychological effects sun exposure was recorded 63% in male respondents, while only 37.0% of the female participants were aware of this relation, as shown in table iii. the logistic regression models were used to assess the relationship between the demographic factors influencing awareness of the connection between sun exposure, and skin cancer is shown in table iv. students aged 18-21 years reported higher rates of skin cancer knowledge (p < 0.025). similarly, respondents from class 3 were more likely to be linked with the understanding of the association between sun exposure and skin cancer (p < 0.037). however, there was no significant dissimilarity found in awareness between students rooted in their gender, marital status, or area of residence (rural vs. urban). the sun protection behaviors among respondents are summarized in table v. 56% of study students reported that they were protecting themselves during the daytime against the effects of the sun by wearing sun protection cream (232/401), wearing sunglasses (61.9%, 256/401), and light-colored cotton clothes (83.1%, 343/407). wearing “a hat” was found to be the least frequently used the technique of sun protection (40.2%, 166/391). 
staying in the shade (90.6%, 374/406) and staying inside (87.9%, 363/404) were the preferred methods of protection during outdoor activities; in other words, 87.9% of participants reported that they tried to stay inside to protect their skin from the sunlight. data on the use of anti-aging skin products among our respondents showed that about 55% of the female students reported never using anti-aging cream, and only 45.1% (60/133) of the females and 31.1% (60/183) of the males had ever used a sunscreen anti-aging product. the students were also asked about the importance of looking after their skin; as illustrated in fig. 1, most of the students reported that it is important to look after their skin. the participants were also asked about their concern for various issues relating to premature skin aging. according to the data, stress was the issue of greatest concern, and the majority of students reported being less concerned about sun exposure, as shown in table vi. perceptions of the key factors of aging were also recorded, and poor diet was considered the main factor of aging among the students (33.1%), as shown in table vii.

table i. sociodemographic data of the 413 participants
| characteristics | count (%)a |
| gender: male | 242 (58.6) |
| gender: female | 171 (41.4) |
| age (years): 18-21 | 252 (61.0) |
| age (years): 22-25 | 141 (34.1) |
| age (years): over 25 | 20 (4.8) |
| marital status: single | 377 (91.3) |
| marital status: married | 32 (7.7) |
| residence: urban | 249 (60.3) |
| residence: rural | 162 (39.2) |
| education (undergraduate): class 1 | 137 (33.2) |
| education (undergraduate): class 2 | 113 (27.4) |
| education (undergraduate): class 3 | 65 (15.7) |
| education (undergraduate): class 4 | 85 (20.6) |
athe denominator is different among variables due to missing values

table ii. awareness of negative effects of the sunlight among respondents: what damage does excessive sun exposure cause? n (%)a
| damage | yes | no | don't know |
| skin burn | 363 (87.9) | 18 (4.4) | 28 (6.8) |
| skin aging | 194 (47.0) | 74 (17.9) | 134 (32.4) |
| skin cancer | 217 (52.5) | 62 (15.0) | 125 (30.3) |
athe denominator is different among variables due to missing values

table iii. students' levels of knowledge about beneficial effects of the sunlight
| factor | male: yes, n (%) | male: no, n (%) | female: yes, n (%) | female: no, n (%) |
| synthesis of vitamin d | 181 (76.1) | 57 (23.9) | 118 (70.7) | 49 (29.3) |
| treatment in some skin conditions | 177 (74.4) | 61 (25.6) | 115 (70.6) | 48 (29.4) |
| positive psychological effects | 133 (63.0) | 104 (56.5) | 78 (37.0) | 80 (43.5) |

4. discussion
information about public knowledge and behaviors regarding skin aging and protection measures among kurdish people is scarce, and no official study on this topic in kurdistan was found after a broad literature review. diverse climate conditions can be found in different regions of iraq. the northern part of iraq, kurdistan, has a cooler atmosphere than the southern part, with distinctly high temperatures in the daytime and low temperatures during the night. from june to september, daytime temperatures reach 44°c or higher throughout the area; the south has higher temperatures, which can reach 48°c during summer. in our local clinical dermatology center, we have noticed that sun-related skin diseases such as sunburn are more common in the summer period because the weather in summer is exceptionally hot.
there has been a marked increase in the incidence of skin cancer over the past few decades. in addition, it has been shown that developing this condition is related to the accumulation of sun exposure over a lifetime [22]. it has been reported that, with the implementation of sun protection measures and proper behaviors, approximately 80% of skin cancer cases can be prevented; nevertheless, the occurrence of skin cancer is still increasing [23]. this study has indicated that more than 90% of our respondents commonly stayed under the shade to avoid the harmful effects of sun exposure. gray et al. and ergin et al. achieved a similar result in their 2012 and 2011 studies, respectively [5], [24]. in contrast, kokturk et al. and kaymak et al. found staying inside at peak times to be the most commonly practiced method of avoiding the harmful effects of the sun, with 53% and around 45% for women and men [18], [25]. approximately 52% of the respondents reported awareness of the link between sun exposure and the hazard of skin cancer, which is comparable to previous studies carried out in saudi arabia by alghamdi et al. and al robaee [7], [26]. however, this level of awareness is lower than in similar studies conducted in western communities; for instance, the relationship between sun exposure and skin cancer was recognized by 92.5% of study participants in malta, 92% in the united states, 90% in australia, and 85% in canada [27]. the outcome of this study has shown that around 88% of study participants were familiar with the link between the sun and skin burn, while 47% were aware of the connection between sun exposure and skin aging. in this study, the participants were also questioned about their knowledge of the benefits of sunlight; a slight gender difference was noted for the positive effect of the sun on the synthesis of vitamin d and on the treatment of several skin conditions. it was found that 76.1% of males were aware of the positive effect on the synthesis of vitamin d compared with 70.7% of females, and the corresponding figures for the treatment of some skin conditions were 74.4% for males and 70.6% for females.

fig. 1. how important is it to look after the skin?

table iv. logistic regression analyses of skin cancer awareness and sociodemographic characteristics
| variable | p value | odds ratio | 95% ci (lower bound-upper bound) |
| male | 0.082 | 0.624 | 0.367-1.061 |
| age 18-21 years | 0.025 | 5.992 | 1.255-28.603 |
| single | 0.196 | 0.514 | 0.188-1.410 |
| urban | 0.268 | 1.333 | 0.802-2.217 |
| class 3 | 0.037 | 0.427 | 0.192-0.949 |
ci: confidence interval

table v. respondents' applied sun protection methods: which of the protection methods do you often use during the daytime? n (%)a
| method | regularly | never | sometimes |
| wear sun protection cream | 81 (19.6) | 169 (40.9) | 151 (36.6) |
| wear hat | 24 (5.8) | 225 (54.5) | 142 (34.4) |
| wear sunglasses | 89 (21.5) | 14 (35.1) | 167 (40.4) |
| wear light cotton clothes | 92 (22.3) | 64 (15.5) | 251 (60.8) |
| stay under shade | 137 (33.2) | 32 (7.7) | 237 (57.4) |
| stay inside | 113 (27.4) | 41 (9.9) | 250 (60.5) |
athe denominator is different among variables due to missing responses
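the odds ratios and confidence intervals in table iv were produced with spss; purely as an illustration, the multiple logistic regression described in the methods could be re-created along the lines of the sketch below. this is not the authors' analysis script, and the data file and variable names (aware_cancer, gender, age_group, marital, residence, class_year) are hypothetical placeholders for the coded questionnaire responses.

```python
# illustrative sketch (not the authors' spss workflow): fit a binary logistic
# regression of skin-cancer awareness on sociodemographic predictors and report
# odds ratios with 95% confidence intervals, as in table iv.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# hypothetical coded survey data: one row per respondent
df = pd.read_csv("survey_responses.csv")

# aware_cancer: 1 = aware of the sun exposure / skin cancer link, 0 = not aware
model = smf.logit(
    "aware_cancer ~ C(gender) + C(age_group) + C(marital) + C(residence) + C(class_year)",
    data=df,
).fit()

# exponentiated coefficients give odds ratios; exponentiated interval bounds give the 95% ci
ci = model.conf_int()
summary = pd.DataFrame({
    "p_value": model.pvalues,
    "odds_ratio": np.exp(model.params),
    "ci_lower": np.exp(ci[0]),
    "ci_upper": np.exp(ci[1]),
})
print(summary.round(3))
```

a predictor is then read as significant when its p value falls below the 0.05 threshold stated in the methods.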
regardless of the reasonably superior information and awareness among our study participants that the sunlight predisposes people to several skin disorders, including skin aging, sunburn, and skin cancer, the rate of sunscreen attentiveness was low. this study reported that more than half of the participants wear sun protection cream. in addition, 40.9% of the respondents reported that they have never used sunscreen. the finding regarding the use of sunscreen has been cited by several studies on this subject matter [26], [28], [29]. moreover, around 62% of students stated that they wear sunglasses as one of the sun protection methods. nikolaou et al. reported that in mediterranean population sunglasses was the most regularly used sun protection with the number of 83.4% [30]. as we have shown, protective clothes were used as sun protection among students, as 83.1% of respondents reported wearing light cotton clothes, and 40.2% of them reported using a hat during outdoor activities. certainly, this rate of sun protection utilizes and knowledge among the kurdish people as reported by this study is quite alarming and should spotlight the interest in this concern with regard to health education programs and future studies. additional learning is needed as the knowledge only is not sufficient to make a transform in approach. universities are ideal environment because of their existing infrastructure to help students attaining the essential healthy behaviors. sun protection awareness and ideas can be integrated into the existing areas of study programs. nevertheless, this study has several potential limitations that should be reserved when interpreting the results. an expediency sample of students from only one university was surveyed. therefore, caution must be kept in mind in expanding our findings to other universities, especially universities situated in other geographical regions. another limitation to these findings is the reality that the students were asked to report their answers as yes or no with offered statements about sun exposure harmful effect including skin aging, skin burn, and skin cancer, which may prejudice responses and direct to a mistaken evaluation of the proportion of the public who have true and factual information of the sun side effects. finally, the results of this study limited by cross-sectional character, which means that commands of effects can only be hypothesized. table vii key factors of skin aging main factor of aginga responses n (%) percent of cases (%) sun 97 (14.9) 24.3 weather 156 (24.0) 39.1 pollution 181 (27.9) 45.4 poor diet 215 (33.1) 53.9 total 649 (100.0) 162.7 adichotomy group tabulated at value 1 table vi participants concern about issues relating to skincare n (%) concerned not concerned missing-value premature aging caused by sun exposure 248 (60.0) 159 (38.5) 6 (1.5) stress 347 (84.0) 63 (15.3) 3 (0.7) lack of sleep 303 (73.4) 106 (25.7) 4 (1.0) smoking 339 (82.1) 70 (16.9) 4 (1.0) drinking alcohol 299 (72.4) 108 (26.2) 6 (1.5) hezha o. rasul, et al.: skin aging and sun protection knowledge among college students 6 uhd journal of science and technology | may 2018 | vol 2 | issue 1 5. conclusion and recommendation this study has specified a low level of public knowledge and self-care practice among the college students regarding skin aging, the harmful effects of sun exposure and sun protection methods. 
in addition, this study has discovered that sun protection measure is commonly inadequate among students and on a regular basis only a small part of participants uses sunscreen. therefore, this research highlights the requirement for the media, further studies and future well-being education programs to be utilized with the purpose of developing the implementation of sun protection behaviors including wearing sunscreen regularly and wearing protective clothes among the general public. 6. acknowledgment the authors would like to acknowledge kale rahim and lano hiwa from the university of sulaimani for the data collection. we also thank the staff of the sulaimani center for dermatology and venereal disease (teaching hospital) for their generous help. references   [1] h. rexbye. “influence of environmental factors on facial ageing”.  age and ageing, vol. 35, no. 2, pp. 110-115, 2006.   [2]  a. vierkötter and j. krutmann. “environmental influences on skin  aging and ethnic-specific manifestations”. dermato-endocrinology, vol. 4, no. 3, pp. 227-231, 2012. [3] b. gilchrest and j. krutmann. skin aging. springer, berlin, 2006.   [4]  r. halder and c. ara. “skin cancer and photoaging in ethnic skin”.  dermatologic clinics, vol. 21, no. 4, pp. 725-732, 2003. [5] e. yurtseven, t. ulus, s. vehid, s. köksal, m. bosat and k. akkoyun. “assessment of knowledge, behaviour and sun protection practices among health services vocational school students”. international journal of environmental research and public health, vol. 9, no. 12, pp. 2378-2385, 2012. [6] o. tekbas, d. evci, and u. ozcan. “danger increasing with approaching summer: sun  related  uv  rays”.  taf preventive medicine bulletin, vol. 4, no. 2, pp. 98-107, 2005. [7] k. alghamdi, a. alaklabi and a. alqahtani. “knowledge, attitudes and practices of the general public toward sun exposure and protection: a national survey in saudi arabia”. saudi pharmaceutical journal, vol. 24, no. 6, pp. 652-657, 2016. [8] b. armstrong and a. kricker. “the epidemiology of uv induced skin cancer”. journal of photochemistry and photobiology b: biology, vol. 63, no. 1-3, pp. 8-18, 2001.   [9]  r.  mackie.  “effects  of  ultraviolet  radiation  on  human  health”.  radiation protection dosimetry, vol. 91, no. 1, pp. 15-18, 2000. [10] m. mabruk, l. toh, m. murphy, m. leader, e. kay and g. murphy. “investigation of the effect of uv irradiation on dna damage: comparison between skin cancer patients and normal volunteers”.  journal  of  cutaneous  pathology,  vol.  36,  no.  7,  pp. 760-765, 2009. [11] l. scerri and m. keefe. “the adverse effects of the sun on the skin–a review”. maltese medical journal, 7, pp.26-31, 1995. [12] p. unna. the histopathology of the diseases of the skin. clay, edinburgh, 1896. [13] w. dubreuills, des hyperkératoses circonscrites. masson, paris, 1896. [14]  h.  daniell.  “smoker’s  wrinkles”.  annals of internal medicine, vol. 75, no. 6, pp. 873, 1971. [15]  e. sherertz and s. hess. “stated age”. new england journal of medicine, vol. 329, no. 4, pp. 281-282, 1993. [16] a. vierkötter, t. schikowski, u. ranft, d. sugiri, m. matsui, u. krämer and j. krutmann. “airborne particle exposure and extrinsic skin aging”. journal of investigative dermatology, vol. 130, no. 12, pp. 2719-2726, 2010. [17]  t.  filiz,  n.  cınar,  p.  topsever  and  f.  ucar.  “tanning  youth: knowledge, behaviors and attitudes toward sun protection of high school students in sakarya, turkey”. journal of adolescent health, vol. 38, no. 4, pp. 469-471, 2006. [18]  y.  
kaymak,  o.  tekbaş  and  s.  işıl.  “knowledge,  attitudes  and  behaviours  of  university  students  related  to  sun  protection”.  journal of turkish dermatology, vol. 41, pp. 81-85, 2007. [19] p. cohen, h. tsai and j. puffer. “sun-protective behavior among high-school  and collegiate athletes  in  los angeles, ca”. clinical  journal of sport medicine, vol. 16, no. 3, pp. 253-260, 2006. [20] t. owen, d. fitzpatrick and o. dolan. “knowledge, attitudes and behaviour in the sun: the barriers to behavioural change in  nothhern  ireland”.  the ulster medical journal, vol.73, no. 2, pp. 96-104, 2004. [21] a. geller, l. rutsch, k. kenausis, p. selzer and z. zhang. “can an hour or two of sun protection education keep the sunburn away? evaluation of the environmental protection agency’s sunwise school  program”.  environmental health, vol. 2, no. 1, pp. 1-9, 2003. [22] r. bränström, s. kristjansson, h. dal and y. rodvall. “sun exposure  and  sunburn  among  swedish  toddlers”.  european journal of cancer, vol. 42, no. 10, pp. 1441-1447, 2006. [23] n. sendur. “nonmelanoma skin cancer epidemiology and prevention”. turkiye klinikleri journal of internal medical sciences, vol. 1, pp. 80-84, 2005. [24] a. ergin, i. ali and i.b. mehmet. “assessment of knowledge and behaviors of mothers with small children on the effects of the sun on  health”.  the pan african medical journal, vol. 4, pp. 72-78, 2011. [25]  a.  köktürk,  k.  baz  and  r.  buğdaycı.  “dermatoloji  polikliniğine  başvuran hastalarda güneşten korunma bilinci ve alışkanlıkları”,  türk klinical dermatology, vol. 12, pp. 198-203, 2002. [26] a. al robaee. “awareness to sun exposure and use of sunscreen by  the  general  population”.  bosnian journal of basic medical sciences, vol. 10, no. 4, pp. 314-318, 2010. [27] s. aquilina, a. gauci, m. ellul and l. scerri. “sun awareness in maltese  secondary  school  students”.  journal of the european academy of dermatology and venereology, vol. 18, no. 6, pp. 670675, 2004. [28] k. wesson and n. silverberg. “sun protection education in the hezha o. rasul, et al.: skin aging and sun protection knowledge among college students uhd journal of science and technology | may 2018 | vol 2 | issue 1 7 united states: what we know and what needs to be taught”. cutis, vol. 71, pp. 71-74, 2003. [29] e. thieden, p. philipsen, j. heydenreich and h. wulf. “uv radiation exposure related to age, sex, occupation, and sun behavior based on time-stamped personal dosimeter readings. archives of dermatology, vol. 140, no. 2, 2004. [30] v. nikolaou, a. stratigos, c. antoniou, v. sypsa, g. avgerinou, i. danopoulou, e. nicolaidou and a. katsambas. “sun exposure behavior and protection practices in a mediterranean population:  a  questionnaire-based  study”.  photodermatology, photoimmunology and photomedicine, vol. 25, no. 3, pp. 132-137, 2009. . uhd journal of science and technology | august 2017 | vol 1 | issue 2 37 1. introduction in this paper, we propose a rejuvenation framework that addresses software aging on abstract level. the changing world demands faster and better alignment of software systems with business requirements to cope with the rising demand for better and faster services. this simply means that a perfectly untouched functioning software ages just because it has not been touched. the aging phenomenon occurs in software products in similar ways to human; parnas [1] draw correlations between the aging symptoms in human and software. 
as demands for functionality grow software complexity rises, and as a result software, underperformance and malfunctioning became apparent [2]. software aging is a known phenomenon with recognized symptoms such as increase in failure rate [3]. researchers have identified a number of causes of software aging, for example, accumulation of errors over time during system operation. one other cause is “weight gain” as in human, software gains weight as more codes are added to an application to accommodate new functionalities, and consequently, the system loses performance. there are numerous examples where software aging has caused electronic accidents in complex systems such as in billing and telecommunication switching systems [4]. beside the causes researchers in the field have identified a number of aging indicators such as increased rate of resource (e.g., memory) consumption [5]. another aging indicator is how robust a system is against security attacks if observed over time. this is because security attack techniques are becoming more sophisticated by day. however, more is needed to be done to address the aging phenomena. grottke et al. [5] claim that the conceptual aspect of software aging has not been paid adequate attention by researchers to cover the fundamentals of software aging. currently, addressing software aging is mostly done using reengineering techniques such as: a simple software rejuvenation framework based on model driven development hoger mahmud department of computer science, college of science and technology, university of human development, iraq a b s t r a c t in the current dynamic-natured business environment, it is inevitable that today’s software systems may not be suitable for tomorrow’s business challenges which indicate that the software in use has aged. although we cannot prevent software aging, we can try to prolong the aging process of software so that it can be used for longer. in this paper, we outline a conceptual software rejuvenation framework based on model driven development approach. the framework is simple but effective and can be implemented in a recursive five step process. we have illustrated the applicability of the framework using a simple business case study which highlights the effectiveness of the framework. this work adds to the existing literature on software aging and its preventative measures. it also fills in the research gap which exists about software aging caused by changing requirements. index terms: model driven development, software aging, software rejuvenation framework corresponding author’s e-mail: hoger.mahmud@uhd.edu.iq received: 02-07-2017 accepted: 22-08-2017 published: 30-08-2017 access this article online doi: 10.21928/uhdjst.v1n2y2017.pp37-45 e-issn: 2521-4217 p-issn: 2521-4209 copyright © 2017 mahmud. this is an open access article distributed under the creative commons attribution non-commercial no derivatives license 4.0 (cc by-nc-nd 4.0) o r i g i n a l re se a rc h a rt i c l e uhd journal of science and technology hoger mahmud: a simple software rejuvenation framework based on model driven development 38 uhd journal of science and technology | august 2017 | vol 1 | issue 2 1. forward engineering concerns with moving from highlevel abstraction to physical implementation of a system 2. reverse engineering concerns with analyzing a system to identify components and connectors of that system to represent the system in a different form or higher level of abstraction 3. 
redocumentation deals with creation or revision of semantically equivalent representation within the same abstract level 4. design recovery concerns with reproducing all required information about a system so that a person can understand what the program does 5. re s t r u c t u r i n g c o n c e r n s w i t h t r a n s f o r m i n g a representation of a system to a different one, without any modification to the functionality of the system. reengineering can facilitate the examination of a system and learn more about it so that appropriate changes can be made. however, it is not the ideal solution for software upgrade as the process is extremely time-consuming and resource expensive. in this paper, we present a conceptual software rejuvenation framework based on model driven development (mdd) techniques capable of addressing software aging with less time and resource. the framework is most effective where the software aging is due to changing business requirements which in effect requires the addition or omission of functionalities. we have illustrated the applicability of the framework through a simple business case study which supports the effectiveness of the framework. this work contributes to the field of software aging by presenting a novel conceptual framework to software developers that can be utilized to dilute software aging. the rest of this paper is organised as follows, in section 2 we provide a brief background about software aging and rejuvenation and in section 3 we present some related works. in section 4, we outline the frame work and in section 5, we illustrate the applicability of the framework using a simple business case study. in section 6 and 7, we discuss, conclude, and provide some recommendations. 2. background in this section, we provide a brief background to both software aging and software rejuvenation with the aim to provide better understanding of the proposed framework later in section 4. a. software aging software aging was first introduced by huang et al. [6] and since then the interest in the topic has risen among academics and industries. complex systems rely on an intricate architectural setup to function, if the structure is slowly destroyed by maintaining and updating the system software aging becomes inevitable [7]. it is a known fact that a system maintainer can mess up perfectly fine functioning software through changing codes or inserting incorrect codes which is known as “ignorant injection” [8]. to provide a focus view of research areas on software aging cotroneo et al. [9] have analyzed more than 70 papers in which they have concluded that overall there are two major categories of research into understanding software aging the first is model-based analysis and the second is measurement-based analysis. several measureable techniques have been proposed to detect software aging such as “aging indicators” and “time series analysis.” the techniques are used to collect data about resources used in a system, and then, analyze it to see if the consumption rate has increased over time which is a sign of aging [3]. as for the causes of software aging, there are two major classes, the first is known as “ignorant surgery” and the second is known as “lack of movement.” fig. 1 shows the major contributors to the two classes of software aging causes. b. software rejuvenation to keep critical systems functioning correctly software rejuvenation is recognized as an effective technique [10]. 
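before turning to rejuvenation techniques, the measurement-based analysis mentioned in the previous subsection can be made concrete with a small sketch: sample an aging indicator (here, memory consumption) over time and flag a sustained upward trend. the readings, threshold, and helper function below are hypothetical illustrations and are not taken from the surveyed studies.

```python
# minimal sketch of a measurement-based aging-indicator check: fit a linear trend
# to periodic memory-usage samples and flag possible aging when the slope is
# clearly positive. the readings below are hypothetical; a real monitor would
# collect them from the running system (e.g., via os counters).
import numpy as np

def aging_trend(samples_mb, interval_s, threshold_mb_per_hour=1.0):
    """return (slope in mb/hour, True if the indicator suggests aging)."""
    t_hours = np.arange(len(samples_mb)) * interval_s / 3600.0
    slope, _intercept = np.polyfit(t_hours, samples_mb, deg=1)
    return slope, slope > threshold_mb_per_hour

# hypothetical hourly resident-memory readings of a long-running service (mb)
readings = [512, 515, 521, 524, 530, 533, 541, 548, 552, 560]
slope, is_aging = aging_trend(readings, interval_s=3600)
print(f"trend: {slope:.2f} mb/hour, aging suspected: {is_aging}")
```

once such a trend is confirmed, one of the rejuvenation actions discussed next (for example, a scheduled restart or a resource flush) can be planned before failures occur.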
the objective of software rejuvenation is to rollback a system continuously to maintain the normal operation of the system and prevent failures. according to cotroneo et al. [3] application-specific and application-generic are two main classes of software rejuvenation techniques in which the former works on specific system features and the latter works on the whole system (e.g., system restart). to further elaborate on the two main classes, researchers have provided a number of examples for both; flushing of kernel, file system defragmentation and resource reprioritization are examples of application specific rejuvenation and application restart, cluster failover, and operating system reboot are examples of application generic rejuvenation [3]. fig. 2 illustrates the two classes of software rejuvenation techniques. 3. related work there have been a number of attempts to tackle software aging similar to what we propose here. the authors of huang et al. [6] present a model-based rejuvenation approach for billing applications and okamura and dohi [10] proposes dynamic software rejuvenation policies by extending models presented in pfening et al. [11]. the approach is case hoger mahmud: a simple software rejuvenation framework based on model driven development uhd journal of science and technology | august 2017 | vol 1 | issue 2 39 specific and cannot be applied to a domain; this, however, has similarities with what we are proposing since they also use models to rejuvenate software. saravakos et al. [12] proposes the use of continuous time markov chains to model and analyze software aging and rejuvenation to better understand causes of aging which helps putting in place mitigating measures. this approach is suitable to treat symptoms of aging that happens for technical reasons rather than changes in requirements. dohi et al. [13] models optimal rejuvenation schedule using semi-markov processes to maximize availability and minimize cost. the focus here is aging caused due to processing attributes; however, unlike this work we focus on the functionality attributes of a system garg et al. [14]. adopts the periodic rejuvenation technique proposed by huang et al. [6] and uses stochastic petri net to model stochastic behavior of software aging. beside modeling techniques, others have used techniques such a time triggered rejuvenation technique used by salfner and wolter [15] and software life-extension technique used by machida et al. [16] to counteract software aging in which they take preventative measures to ease software aging and allow more time for system rethink. huang et al. [6] proposes a proactive technique to counteract software aging with the aim to prevent failure using periodic preemptive rollback of running applications. to detect symptoms of aging techniques such as machine learning is used to analyze data through adopting artificial intelligent algorithms (e.g., classifiers) [17]. garg et al. [18] discuss measures for software aging symptom detection with the aim to diagnose and treat the aging taking place, others have used pattern recognition techniques to detect aging symptoms [17]. these works propose how to detect symptoms of software aging without proposing a suitable mechanism to treat the symptoms. all the related works presented so fare address software aging from technical and performance viewpoint and none consider aging caused as a result of changing requirements. 
this allows us to claim that our framework contributes to the software aging and rejuvenation literature by filling in this gap and take a new direction in tackling software aging. 4. framework outline the base of our conceptual rejuvenation framework is mdd technique [19], [20]. france et al. [21] claim that abstract design languages and high-level programming languages can provide automated support for software developers in terms of solution road map that fast-forward system developments. following their direction we use model driven development (mdd) techniques to design a rejuvenation framework to tackle requirement-based software aging. mdd simply means constructing a model of the system with fine details before transferring it into code. it provides the mapping functions between different models for integration and model reusing purposes [22]. mdd is a generic framework that can accommodate both application specific and application generic classes of software rejuvenation. mayer et al. [23] states mdd is ideal for visualizing systems and not losing the semantic link between different components of the system at the same time. it is inevitable that extensive manual coding in developing a system escalates human errors in the system; this issue can be addressed through code automation which is the ultimate aim of mmd. building and rebuilding system is an expensive process that requires time and resource; model driven aims at using, weaving and extending models to maintain, develop and redevelop systems. experts in the field claim that mdd improves quality as models are continuously refined and reduce costs by automating the development process [22]. this process changes models from being expenses to fig. 1. major causes of software aging fig. 2. software rejuvenation techniques hoger mahmud: a simple software rejuvenation framework based on model driven development 40 uhd journal of science and technology | august 2017 | vol 1 | issue 2 important assets for businesses. researchers have identified the conceptual gap between problem domain and implementation as a major obstacle in the way of developing complex systems. models have been utilized to bridge the gap between problem domain abstractions and software implementation through code generation tools and automated development support [2]. models can serve many purposes such as: 1. simplifying the concept of a complex system to aid better understanding of the problem domain and system transformation to a form that can be analyzed mechanically [24] 2. models are platform and language independent 3. automatic code generation using models reduce human errors 4. for new requirements only the change in model is required this reduces the issue explained previously known as weight gain. a. framework steps we propose a five step recursive software rejuvenation framework to address the issue of software aging. as mentioned the framework is based on model driven software development which is implemented in the following steps: 1. first developers gather system requirements which is one of the must do tasks in every software development 2. developers design the entire system in great details using tools such as unified modeling language (uml) 3. the complete design is fed into code generators such as code cooker (http://codecooker.net) and eclipse uml to java generator to generate system codes 4. 
software codes are integrated, tested, and finalised, this step is necessary since a code generator tool capable of generating 100% of the code is yet to exist. this limitation is discussed in section 6 5. in the final step where the new product is delivered and installed. fig. 3 illustrates the five steps explained in a recursive setting, i.e., when a new feature is required to be added to the system to address a new requirement the system is upgraded through the model rather than though code injection. the models are kept as assets and refined as new requirements come in, the next section provide more inside as to how the framework works. 5. case study to illustrate the applicability of the framework we present a simple none-trivial business case study specific to kurdistan region. mr. x is a supermarket owner in the city of sulaymaniyah who sells domestic goods and he employs 10 people in his supermarket. currently, his shop is equipped with electronic point of sale (epos) systems to record transactions and the form of payment by customers is cash only. electronic payment is not feasible due to unavailability of electronic payment systems in the region’s banks. his current epos system is capable of performing the following functionalities: 1. store individual item details such as name, price, barcode, and expiry dates 2. store information about employees such as name, address, date of birth, and telephone numbers 3. retrieve and match barcodes on products to display and record item details 4. calculate total price and print out customer receipts 5. record all transactions and generate various reports such as daily sales report, weekly sales report, and sale by item report 6. the administration side of the system is managed through a user management subsystem which allows adding, deleting, updating, and searching on users. the system also contains a product management subsystem that allows managing products through adding, deleting, updating, and searching on item. we make an assumption that in the next 6 months electronic payment systems (epayment) will become available in kurdistan for businesses to use. now mr. x would like to gain an edge over his competitors and add epayment system to his current epos system. fig. 4 is the uml use case diagram for the current epos system in mr. x’s supermarket which shows the use cases than can be performed by each actor. fig. 3. model driven development-based software rejuvenation framework hoger mahmud: a simple software rejuvenation framework based on model driven development uhd journal of science and technology | august 2017 | vol 1 | issue 2 41 fig. 5 is the future use case diagram for the new system which shows the addition of a new actor called “customer” and a new use case called “pay electronically” coted in yellow. now, we assume developers of the system had the framework in mind when they developed the system and have kept a design model of the system similar to the one illustrated in fig. 6 which shows a uml class diagram design model of the epos system. mr. x now goes back to them and request that the new functionality (electronic payment) to be added to the system. using the framework the developers refine the uml class diagram model (new classes coted in yellow) to accommodate the new requirement and produce a new design similar to the one shown in fig. 7. the new design is now ready to be fed into code generators to generate the codes for the new system. using the framework the developers have performed a rejuvenation process on mr. 
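to make step 3 of the framework more tangible in the context of this case study, the fragment below sketches the kind of skeleton a code generator might emit from a class model such as the one in fig. 6, approximated here in python rather than the java produced by tools like the eclipse uml to java generator. the class and attribute names are illustrative assumptions and are not the actual contents of the paper's diagrams.

```python
# hypothetical skeleton of the kind of code a generator could emit from the epos
# class model (fig. 6); names are illustrative only.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Item:
    name: str
    price: float
    barcode: str
    expiry_date: str

@dataclass
class Transaction:
    items: List[Item] = field(default_factory=list)

    def total(self) -> float:
        # calculate the total price for the customer receipt
        return sum(item.price for item in self.items)

class CashPayment:
    def pay(self, transaction: Transaction) -> bool:
        # cash-handling logic completed manually in step 4
        return True

# rejuvenation via the model: the new requirement is added as a new class in the
# refined design (fig. 7) and the code is regenerated, rather than patched by hand.
class ElectronicPayment:
    def pay(self, transaction: Transaction) -> bool:
        # would call an e-payment gateway once such services become available
        return True
```

the point is that the epayment requirement enters the system through the model, a new payment class and "pay electronically" use case, and the operating epos is only swapped once the regenerated code has been integrated and tested in steps 4 and 5.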
x’s system without touching fig. 4. current electronic point of sale unified modeling language use case diagram fig. 5. future electronic point of sale unified modeling language use case diagram hoger mahmud: a simple software rejuvenation framework based on model driven development 42 uhd journal of science and technology | august 2017 | vol 1 | issue 2 the current operating epos. it is important to point out that the framework tackles software rejuvenation conceptually and on abstract level which means we are bypassing all the technicalities of implementation and testing processes. during our search, we did not come across any related work that considers design for software rejuvenation rather than an actual system, which indicates that our approach is unique. however, it has to be said that although being unique is an advantage, it has made it difficult for us to compare the applicability of our framework with other existing frameworks. 6. discussion researchers in the field have concluded that software aging is inevitable and as software ages it loses its ability to keep up. in this paper, we have proposed a five step recursive software rejuvenation framework based on model driven software development approach. to illustrate the applicability of the framework we have outlined a simple business scenario and explained how the framework rejuvenates the current system in use by the business. the framework will provide the following advantages over existing rejuvenation techniques: 1. the model is used to redevelop the system without taking the old system out of operation which leads to reduction in down time (unavailability) which otherwise lead to lose of customers and profits 2. using models to maintain and update software gives the development process an edge as models are language independent and can be used to develop systems in the state of the art programming languages which in turn ease software aging as the technology used in the development is current [25] 3. as codes are generated automatically human errors are reduced, which is one of the contributors of software aging 4. redevelopment costs and times are reduced as developments are automated. the objective of software rejuvenation is to rollback a system continuously to maintain the normal operation of the system and prevent failures. however, software rejuvenation increases system downtime as the system is taken out of operation while the rejuvenation process is performed. knowing when to perform rejuvenation process on a system is a crucial factor recognized by researchers to minimize cost and maximize availability [3]. the framework we have proposed addresses this issue by working on the system on design level without terminating the system operation while the rejuvenation solution is finalized. it is important to stress that the framework is conceptual and requires further research as there are a number of limitations that need be addressed to make the framework fully applicable. the limitations can be summarized as follows: fig. 6. current electronic point of sale unified modeling language class diagram hoger mahmud: a simple software rejuvenation framework based on model driven development uhd journal of science and technology | august 2017 | vol 1 | issue 2 43 1. available software modeling tools such as uml 2.0 which is an industry standard currently does not provide the ability to model systems from user-defined viewpoint [21] 2. 
mdd is not widely used [22] although it has gained momentum with a potential for industry wide adaptation 3. once the models are developed and finalized there comes the issues of translating it completely into code as a tool to generate 100% codes from a model not yet exist 4. the issue of measuring the quality of models is realized by researchers to tackle this issue france and rumpe [2] suggests that modeling methods should come with modeling criteria that modelers can use as a guide for system modeling. however, such criterions are yet to be presented by modeling language and tool developers such as developers of uml (www.omg.org) 5. in the course of developing a system, many different models are created at varying abstract levels which creates model tracking, integration, and management issues and the current modeling tools are not sophisticated enough to deal with the issues. despite all the limitations, we believe the fundamental concept behind the framework has great potentials to be advanced and implemented in the future. 7. conclusion and recommendations software aging is inevitable which occurs as a result of changing requirements, ignorant injections, and weight gain. researchers have proposed a number of different approaches to tackle software aging; however, nearly all approaches are trying to address the aging caused by technical update or software malfunction. in this paper, we have outlined a framework for software rejuvenation that uses mdd approach as base for the rejuvenation process. the framework addresses software aging from a change in business requirement point of view which is different from what current researchers are proposing. it is simple, effective, and applicable as demonstrated by applying it to a simple business case study. fig. 7. future electronic point of sale unified modeling language class diagram hoger mahmud: a simple software rejuvenation framework based on model driven development 44 uhd journal of science and technology | august 2017 | vol 1 | issue 2 the foundation concept developed in this paper contributes to the field of software aging and paves the way for looking at software aging in a different angle. now to delay software aging, we recommend a number of quick mitigating actions as follows: 1. characterize the changes that are likely to occur over the lifetime of a software product, and the way to achieve this characterization is by applying principles such as object orientation 2. design and develop the software code in a way that changes can be carried out; to achieve this concise and clear documentation is the key 3. reviewing and getting a second opinion on the design and documentation of a product helps in prolonging the lifetime of a software product. when the aging has already occurred there are things we could do to treat it such as: 1. prevent the aging process to get worse by introducing and creating structures whenever changes are made to the product 2. as changes are introduced to a product a review and update of the documentation is often a very effective step in slowing the aging process 3. u n d e r s t a n d i n g a n d a p p l y i n g t h e p r i n c i p l e o f modularization is a good way to ease the future maintenance of a product 4. combining different versions of similar functions into one system can increase efficiency of a software product and reduce the size of its code which is one the causes of software aging. references [1] d. l. parnas. 
“software aging.” in proceedings of the 16th international conference on software engineering, 1994, pp. 279-287. [2] r. france and b. rumpe. “model-driven development of complex software: a research roadmap.” in 2007 future of software engineering. washington, dc, usa: ieee computer society, 2007, pp. 37-54. [3] d. cotroneo, r. natella, r. pietrantuono, and s. russo. “a survey of software aging and rejuvenation studies.” acm journal on emerging technologies in computing systems (jetc), vol. 10, no. 1, pp. 8, 2014. [4] a. avritzer and e. j. weyuker. “monitoring smoothly degrading systems for increased dependability.” empirical software engineering, vol. 2, no. 1, pp. 59-77, 1997. [5] m. grottke, r. matias, and k. s. trivedi. “the fundamentals of software aging.” in software reliability engineering workshops, 2008. issre wksp 2008. ieee international conference on, 2008, pp. 1-6. [6] y. huang, c. kintala, n. kolettis, and n. d. fulton. “software rejuvenation: analysis, module and applications.” in fault-tolerant computing, 1995. ftcs-25. digest of papers, twenty-fifth international symposium on, 1995, pp. 381-390. [7] c. jones. “the economics of software maintenance in the twenty first century.” unpublished manuscript, 2006. available: http://www. compaid.com/caiinternet/ezine/capersjones-maintenance.pdf. [last accessed on 2017 may 15]. [8] r. l. glass. “on the aging of software.” information systems management, vol. 28, no. 2, pp. 184-185, 2011. [9] d. cotroneo, r. natella, r. pietrantuono, and s. russo. “software aging and rejuvenation: where we are and where we are going.” in software aging and rejuvenation (wosar), 2011 ieee third international workshop on, 2011, pp. 1-6. [10] h. okamura and t. dohi. “dynamic software rejuvenation policies in a transaction-based system under markovian arrival processes.” performance evaluation, vol. 70, no. 3, pp. 197-211, 2013. [11] a. pfening, s. garg, a. puliafito, m. telek, and k. s. trivedi. “optimal software rejuvenation for tolerating soft failures.” performance evaluation, vol. 27, pp. 491-506, 1996. [12] p. saravakos, g. gravvanis, v. koutras, and a. platis. “a comprehensive approach to software aging and rejuvenation on a single node software system.” in proceedings of the 9th hellenic european research on computer mathematics and its applications conference (hercma 2009), 2009. [13] t. dohi, k. goseva-popstojanova and k. s. trivedi. “statistical nonparametric algorithms to estimate the optimal software rejuvenation schedule.” in dependable computing, 2000. proceedings. 2000 pacific rim international symposium on, 2000, pp. 77-84. [14] s. garg, a. puliafito, m. telek and k. s. trivedi. “analysis of software rejuvenation using markov regenerative stochastic petri net.” in software reliability engineering, 1995. proceedings, sixth international symposium on, 1995, pp. 180-187. [15] f. salfner and k. wolter. “analysis of service availability for timetriggered rejuvenation policies.” journal of systems and software, vol. 83, no. 9, pp. 1579-1590, 2010. [16] f. machida, j. xiang, k. tadano and y. maeno. “software lifeextension: a new countermeasure to software aging.” in software reliability engineering (issre), 2012 ieee 23rd international symposium on, 2012, pp. 131-140. [17] k. j. cassidy, k. c. gross and a. malekpour. “advanced pattern recognition for detection of complex software aging phenomena in online transaction processing servers.” in dependable systems and networks, 2002. dsn 2002. proceedings. international conference on, 2002, pp. 
478-482. [18] s. garg, a. van moorsel, k. vaidyanathan and k. s. trivedi. “a methodology for detection and estimation of software aging.” in software reliability engineering, 1998. proceedings. the ninth international symposium on, 1998, pp. 283-292. [19] s. beydeda, m. book, v. gruhn, g. booch, a. brown, s. iyengar, j. rumbaugh and b. selic. model-driven software development, vol. 15. berlin: springer, 2005. [20] j. p. tolvanen and s. kelly. “model-driven development challenges and solutions.” modelsward, vol. 2016, p. 711, 2016. [21] r. b. france, s. ghosh, t. dinh-trong and a. solberg. “modeldriven development using uml 2.0: promises and pitfalls.” computer, vol. 39, no. 2, pp. 59-66, 2006. [22] s. j. mellor, t. clark and t. futagami. “model-driven development: guest editors’ introduction.” ieee software, vol. 20, no. 5, pp. 1418, 2003. hoger mahmud: a simple software rejuvenation framework based on model driven development uhd journal of science and technology | august 2017 | vol 1 | issue 2 45 [23] p. mayer, a. schroeder and n. koch. “mdd4soa: model-driven service orchestration.” in enterprise distributed object computing conference, 2008. edoc’08. 12th international ieee, 2008, pp. 203-212. [24] d. harel, b. rumpe. “modeling languages: syntax, semantics and all that stuff (or, what’s the semantics of semantics?).” in technical report mcs00-16, weizmann institute, rehovot, israel, 2004. [25] n. b. ruparelia. “software development lifecycle models.” sigsoft software engineering notes, vol. 35, no. 3, pp. 8-13, 2010. tx_1~abs:at/tx_2:abs~at 58 uhd journal of science and technology | july 2022 | vol 6 | issue 2 1. introduction in the field of agriculture, many practices particularly the using of chemicals are applied for improving crops quality and quantity, however, although their positive effects, these applications are not empty of undesirable effects on environment, public health, and plant growth. using modern biotechnological approaches, including, electricity current, laser, magnetic field, high voltage, ultraviolet and radiation with gamma or x-ray on different plants material are gaining interest to develop plants growth and yield, and characterized by cheapness and safety on health and environment, therefore the scientists try to make this century a biophysical century, where most of the physical factors depend on increasing energy balance and increase material transport through membranes for improving the growth and the development of crops [1]-[3]. photosynthetic pigments and stomata characteristics of cowpea (vigna sinensis savi) under the effect of x-ray radiation ikbal muhammed albarzinji1, arol muhsen anwar1, hawbash hamadamin karim2, mohammed othman ahmed3 1department of biology, faculty of science and health, koya university, koya koy45, kurdistan region f.r. iraq, 2department of physics, faculty of science and health, koya university, koya koy45, kurdistan region f.r. iraq, 3department of horticulture, college of agricultural engineering sciences, university of raparin, kurdistan regionf.r. iraq a b s t r a c t this study was conducted in the field and laboratories of the faculty of science and health-koya university by exposing the seeds of cowpea plant (vigna sinensis savi) var. california black-eye to x-ray radiation in two different locations (in target or 30 cm out of target) inside the radiation chamber, for four different exposure times (0, 5, 10, or 20 min), to study the effect on some characteristics of seedling components. 
results show that the exposure location to x-ray had non-significant effects on cowpea leaves content of photosynthetic pigments, whereas each of time of exposure with interaction between location and time of exposure had significant effects on chlorophyll a, total chlorophylls, and total carotenoids pigments. regarding the x-ray effects on stomata characteristics, the results detect that there were non-significant differences between the location of exposure on stomata number on abaxial leaves surfaces and stomata length on adaxial leaves surfaces, whereas a significant effects on number of stomata on the adaxial leaves surfaces, abaxial stomata length, abaxial, and adaxial stomata width were detect. exposing cowpea seeds to x-ray radiation in the target of the radiation source increased significantly stem and leave dry matter percent compared with the one out of the target location, whereas increasing the time of exposure decreased the percent of dry matter of stem and leaves. it is concluded that exposing cowpea seeds to x-ray leads to changes in photosynthetic pigments, stomata characteristics, and plant dry matter content. index terms: vigna sinensis savi, x-ray radiation, pigments, stomata traits corresponding author’s e-mail:  ikbal muhammed albarzinji, department of biology, faculty of science and health, koya university, koya koy45, kurdistan region f.r. iraq. e-mail: ikbal.tahir@koyauniversity.org received: 16-07-2022 accepted: 07-09-2022 published: 24-09-2022 access this article online doi: 10.21928/uhdjst.v6n2y2022.pp58-64 e-issn: 2521-4217 p-issn: 2521-4209 copyright © 2022 albarzinji, et al. this is an open access article distributed under the creative commons attribution non-commercial no derivatives license 4.0 (cc by-nc-nd 4.0) o r i g i n a l re se a rc h a rt i c l e uhd journal of science and technology albarzinji, et al.: pigments and stomata of cowpea under x-ray uhd journal of science and technology | july 2022 | vol 6 | issue 2 59 ionizing radiations are those have wavelengths <100 nm [4]. these radiations are charged high-energy particles (highenergy photons and electrons). two types of ionizing radiations there are: gamma radiations and x-rays, the first is emitted from inside the nucleus, whereas x-ray is radiated from outside the nucleus [5]. there are many applications of x-ray radiation in different fields of plant studies, for example panchal et al. [6] used x-ray for imaging of inner features of a seed sample to identify unseen defects or contaminants. other studies were conducted to investigate the effects of x irradiation on physiological characteristics of different plants, such; rezk et al. [7] found that low dose of x-ray 5 gray (gy) caused increasing in all morphological criteria, total photosynthesis pigments, enzymatic and non-enzymatic antioxidants significantly in two genotypes of okra plants as compared with control treatments, while the doses (higher than 5 gy) caused a considerable decreased in the studied parameters. similarly, singh [8] study shows promoting in chlorophyll development for 60 s x-ray pretreated as it compared to 90 and 120 s pre-treatment for seeds of cicer arietinum, vigna radiata, vigna mungo and vicia faba plants. dhamgaye et al. [9] irradiated seeds of phaseolus vulgaris cv. rajmah using synchroton x-ray beam at 0.5–10 gy, the overall growth of 10 days old seedlings raised from irradiated seeds was substantially reduced at irradiation doses of 2 and 5 gy. same authors dhamgaye et al. [10] irradiated seeds of p. vulgaris cv. 
rajmah using synchrotron x ray at doses of 1, 10, and 20 gray where, the percent of relative water and protein content was significantly decreased at 10 and 20 gy dose in 4–8 days old seedling, and a decrease in photosynthesis pigments chlorophyll and carotenoids content is observed in shoot tissue when 1 and 10 gy where used. mortazavi et al. [11] accelerated the growth of newly grown plants of p. vulgaris (pinto) by irradiated them with x-rays for 6 days. arena et al. [12] found that exposure of dwarf bean (p. vulgaris l.) plants to different doses of x-rays (0.3, 10, 50, and 100 gy) showed that young leaves exhibited a reduction of area and an increase in specific mass and dry matter content. at higher doses of x-rays (50 and 100 gy) total chlorophyll (a+b) and carotenoid (xanthophylls + carotenoids) content were significantly lower (p < 0.01) compared to lower doses and in control leaves. significant reduction in transpiration was detected in v. faba irradiated by x-ray, this reduction was associated with inhibition of stomatal opening from the 9th to 16th day after irradiation. the osmotic pressure of epidermal cells in irradiated plants appeared to be slightly higher than that of epidermal cells of non-irradiated plants. however, the slight osmotic pressure changes of epidermal cells in irradiated plants did not appear to be a major factor contributing to inhibition of stomatal opening in irradiated plants under the growth conditions of the experiments [13]. the aim of this work was to investigate the effects of seed exposure to x-rays on some of the physiological properties of emerged cowpea plants, because these changes has subsequent effects on the photosynthetic activity and cause a direct effect on the agronomic features of the plant. 2. materials and methods 2.1. plant materials and studied characteristics this work was conducted in the department of biology/koya university, erbil-iraq. the seeds of cowpea plant (vigna sinensis savi) var. california black-eye were exposed to a single dose of x-ray radiation by the xrd tube (from the company of panalytical b.v. lelyweg1, the netherlands) where the highest radiation level was less than 1 sieverts/h measured at the tube surface. 20 seeds for each experimental unit were putted in the device source to exposed to x-ray at the advanced physics laboratory in physic department at same faculty. the experiment was conducted in complete randomize design (crd) where the location of exposure considers as the first factor by exposure the seeds to x-rays either in the target point of the device or 30 cm from the target point in the base of the device chamber, whereas the times of exposure 0, 5, 10, or 20 min were considered as the second factor, where the time zero is considered as the control treatment used for each location. after seeds were exposed to x-ray they planted in 5 kg. soil pots, because of an initial increase in photosynthesis rate during leaf expansion and followed by a decrease on maturation [14], at the end of the vegetative growth stage, fourth leaf of five plants from each experimental unit were taken, and the photosynthetic pigments chlorophylls a, chlorophyll b, total chlorophylls and total carotenoids were estimated as it mentioned in lichtenthaler and wellburn [15] were leaf material was collected and mixture ratio was 50 ml 80% acetone: 1 g leaves sample. 
samples were ground with a mortar and pestle and filtered through filter paper, and the extracts were placed in 25 ml dark glass vials to avoid evaporation and photo-oxidation of the pigments. the absorbance of each extract was then measured with a spectrophotometer at wavelengths of 663, 646, and 470 nm, and chlorophyll a, chlorophyll b, and total carotenoids were estimated as follows:

chlorophyll a = (12.21 × a663) − (2.81 × a646)
chlorophyll b = (20.13 × a646) − (5.03 × a663)
total carotenoids = (1000 × a470 − 3.27 × chl a − 104 × chl b)/229

where a is the absorbance at the indicated wavelength, chl a is chlorophyll a (mg/l), and chl b is chlorophyll b (mg/l). to convert the concentrations from mg/l to mg/g fresh weight, each value was multiplied by (extraction volume/sample weight × 1000), and total chlorophylls were calculated as the sum of chlorophyll a and chlorophyll b.

for the stomata study, the lasting impressions method [16] was used. about one square centimeter of each leaf surface was painted with clear nail polish; after the polish had dried, it was covered with clear cellophane tape and peeled off. the leaf impressions were taped onto slides, labeled as adaxial or abaxial surfaces, and examined at ×40 under a light microscope (dm 300, leica microsystems, china). the number of stomata visible in the lens field was counted for the adaxial and abaxial leaf surfaces, and stomatal guard cell length and width on both surfaces were measured in micrometers (μm) with a scaled ocular lens. because the percent of dry matter content reflects photosynthetic activity, it was determined for stem and leaves by dividing the stem or leaf dry weight by the corresponding fresh weight and multiplying by 100, as reported by al-sahaf [17].

2.2. the statistical analysis
the study was conducted as a factorial experiment in a crd with three replications. analysis of variance was used to test the differences among the treatments of each factor and their interactions using the sas software, and duncan's multiple range test was used to compare treatment means when the f-value was significant at p ≤ 0.05 [18].

3. results and discussion
table 1 shows that the location of exposure to x-ray had non-significant (p > 0.05) effects on the leaves' content of photosynthetic pigments of the cowpea plant, whereas the time of exposure had significant (p ≤ 0.05) effects on chlorophyll a and total chlorophylls: seeds exposed to x-ray for 10 min showed significant (p ≤ 0.05) increases in chlorophyll a and total chlorophylls, to 2.64 and 5.42 mg/g fresh weight, compared with the other exposure times. the interactions between location and time of exposure revealed that exposure for 10 min out of target increased chlorophyll a significantly to 3.01 mg/g fresh weight compared with the other interaction treatments, and increased chlorophyll b and total chlorophylls to 3.14 and 6.15 mg/g fresh weight compared with the 5 min exposure out of target only, whereas the same interaction increased total carotenoids significantly to 1.15 mg/g fresh weight compared with 0.94 mg/g fresh weight for the 20 min exposure out of target interaction treatment only.
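the pigment values discussed above follow directly from the absorbance equations of lichtenthaler and wellburn [15] given in section 2.1. the short python sketch below is a minimal illustration of that calculation, assuming the conversion factor is extraction volume (ml) divided by (sample weight (g) × 1000); the absorbance readings, extract volume, and sample mass in the example are illustrative placeholders, not measurements from this study.

```python
# illustrative calculation of photosynthetic pigments from spectrophotometer
# readings, following lichtenthaler and wellburn [15].
# absorbance values, extract volume, and sample mass are placeholders.

def pigments_mg_per_g_fw(a663, a646, a470, extract_volume_ml=25.0, sample_mass_g=1.0):
    """return chlorophyll a, chlorophyll b, total chlorophylls, and total
    carotenoids in mg/g fresh weight."""
    # concentrations in the extract (mg/l)
    chl_a = 12.21 * a663 - 2.81 * a646
    chl_b = 20.13 * a646 - 5.03 * a663
    carotenoids = (1000.0 * a470 - 3.27 * chl_a - 104.0 * chl_b) / 229.0

    # convert mg/l in the extract to mg/g fresh weight:
    # (extract volume in litres) / (tissue mass in grams)
    factor = (extract_volume_ml / 1000.0) / sample_mass_g
    return (chl_a * factor, chl_b * factor,
            (chl_a + chl_b) * factor, carotenoids * factor)

if __name__ == "__main__":
    chl_a, chl_b, total_chl, car = pigments_mg_per_g_fw(a663=0.85, a646=0.42, a470=0.60)
    print(f"chl a = {chl_a:.3f}, chl b = {chl_b:.3f}, "
          f"total chl = {total_chl:.3f}, carotenoids = {car:.3f} mg/g fw")
```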
in general, ionizing radiation may have different effects on plant metabolism, growth and table 1: effects of location and time of exposure cowpea seed to x‑radiation, and their interactions on chlorophyll a, b, total chlorophylls and total carotenoids treatments chlorophyll (mg/g fresh weight) a chlorophyll (mg/g fresh weight) b total chlorophylls (mg/g fresh weight) total carotenoids (mg/g fresh weight) location of exposure in target (l1) 2.23a 2.33a 4.56a 1.05a out of target (l2) 2.32a 2.38a 4.70a 1.04a time of exposure (min) t0 2.20b 2.27a 4.47ab 1.07a t5 2.11b 1.96a 4.07b 1.01a t10 2.64a 2.78a 5.42a 1.09a t20 2.16b 2.40a 4.56ab 0.99a interactions between location and exposure time l1×t0 2.20b 2.27ab 4.47ab 1.07ab l1×t5 2.27b 2.27ab 4.54ab 1.02ab l1×t10 2.27b 2.42ab 4.70ab 1.03ab l1×t20 2.18b 2.34ab 4.52ab 1.05ab l2×t0 2.20b 2.27ab 4.47ab 1.07ab l2×t5 1.96b 1.65b 3.60b 0.99ab l2×t10 3.01a 3.14a 6.15a 1.15a l2×t20 2.13b 2.46ab 4.59ab 0.94b means that followed by same letters within column are differ non‑significantly at p≤5% according to the duncan multiple range test albarzinji, et al.: pigments and stomata of cowpea under x-ray uhd journal of science and technology | july 2022 | vol 6 | issue 2 61 reproduction, depending on radiation dose, plant species, developmental stage, and physiological traits [12]. our results disagree with the al-enezi and al-khayri [19] results that suggested that photosynthesis pigments chlorophyll a and carotenoids are more sensitive to x-ray than chlorophyll b, whereas we found that chlorophyll b and total carotenoids were less sensitive to x-irradiation compared to chlorophyll a and total chlorophylls. changes in photosynthetic pigments were studied by arena et al. [12] whom confirmed that the decrease in the levels of x-ray (0.3 gy) caused an increase in photosynthetic pigments in bean plants, whereas the high levels (50 and 100 gy) caused a decrease in these pigments, these findings also agree with that of rezk et al. [7] which recorded in two okra genotypes leaves, where the content of photosynthetic pigment improved significantly with increasing the doses of x-ray to 5 gy comparing with untreated plants, also more increase in the radiation doses, encourage the reduction in photosynthetic pigments compared to the control plants. changes in chlorophyll content as a response to x-ray is either toward an increase or a decrease direction, the increase may due to the increase in chlorophyll biosynthesis and/or delaying its degradation [20], whereas the decrease may due to pigment breakdown due to increase of reactive oxygen species [21] and changes in the chloroplast such chloroplast swelling, thylakoid dilation, and breakdown of chloroplast outer membrane [22]. regarding x-radiation effects on the stomata characteristics it was shown that there were non-significant (p > 0.05) differences between the location of exposure on number of stomata on abaxial leaves surfaces and stomata length on adaxial surfaces of leaves (table 2 and figs. 1 and 2). the seeds exposed directly to the source of x-ray (in target) decreased number of stomata on the adaxial leaves surfaces to 148.33 stomata/mm2, whereas abaxial stomata length increased to 11.08 micrometer and abaxial with adaxial stomata width also increased significantly to 7.00 and 7.58 micrometer, respectively, compared to 180.00 stomata/mm2, 9.92, 5.42, and 5.50 micrometer for plants out of target. 
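the mean-separation letters reported in table 1 (and in the tables that follow) come from the factorial anova and duncan's multiple range test described in section 2.2, which the authors carried out in sas. the python sketch below is only a rough equivalent run on synthetic placeholder data; duncan's test is not available in statsmodels, so tukey's hsd is shown as a readily available stand-in, not as the procedure actually used in the paper.

```python
# illustrative two-factor (location x exposure time) anova for a crd experiment,
# roughly mirroring the analysis of section 2.2 (originally done in sas).
# the data below are synthetic placeholders, not the study's measurements.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# 2 locations x 4 exposure times x 3 replications
rng = np.random.default_rng(1)
rows = [(loc, t, rep, rng.normal(2.3, 0.3))
        for loc in ("l1", "l2")
        for t in ("t0", "t5", "t10", "t20")
        for rep in (1, 2, 3)]
df = pd.DataFrame(rows, columns=["location", "time", "rep", "chlorophyll_a"])

# two-way anova with interaction, as in a factorial crd
model = smf.ols("chlorophyll_a ~ C(location) * C(time)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))

# mean separation: the paper used duncan's multiple range test;
# tukey's hsd is used here only because it ships with statsmodels
df["treatment"] = df["location"] + "x" + df["time"]
print(pairwise_tukeyhsd(df["chlorophyll_a"], df["treatment"], alpha=0.05))
```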
10 min of seed exposure to x-ray increased stomata number on both abaxial and adaxial leaves surfaces compared to other exposure times except the control treatment. exposure time had non-significant (p > 0.05) effect on stomata length and width on abaxial leaves surfaces, whereas increasing time of exposure to 20 min increased the stomata length significantly (p ≤ 0.05) compared to 5 and 10 min only, whereas it increased the stomata width significantly compared to all other treatments. from the results of interaction within location and time of exposure, it was clear from the results (table 2), that treating seeds for 10 min in the x-ray target had the more significant effects for abaxial leaves surfaces in increasing stomata number to 540.00 stomata/ mm2, and the stomata length and width to 11.67 and 8.00 table 2: effects of seeds exposure to x‑radiation on some characteristics of cowpea (vigna sinensis savi) plants stomata treatments stomata number/mm2 stomata length (micrometer) stomata width (micrometer) abaxial leaves surface adaxial leaves surface abaxial leaves surface adaxial leaves surface abaxial leaves surface adaxial leaves surface location of exposure in target (l1) 455.00a 148.33b 11.08a 10.58a 7.00a 7.58a out of target (l2) 455.83a 180.00a 9.92b 11.67a 5.42b 5.50b time of exposure (min) t0 526.67a 176.67ab 10.67a 12.33a 6.00a 6.00bc t5 398.33b 146.67bc 10.33a 9.67b 6.17a 5.17c t10 536.67a 201.67a 10.17a 10.17b 6.00a 6.67b t20 360.00b 131.67c 10.83a 12.33a 6.67a 8.33a interactions between location and exposure time l1×t0 526.67ab 176.67bc 10.67ab 12.33ab 6.00abc 6.00bc l1×t5 403.33bc 143.33bc 10.67ab 10.33bcd 7.33ab 6.00bc l1×t10 540.00a 156.67bc 11.67a 8.33d 8.00a 8.67a l1×t20 350.00c 116.67c 11.33a 11.33abc 6.67ab 9.67a l2×t0 526.67ab 176.67bc 10.67ab 12.33ab 6.00abc 6.00bc l2×t5 393.33c 150.00bc 10.00ab 9.00cd 5.00bc 4.33c l2×t10 533.33a 246.67a 8.67b 12.00ab 4.00c 4.67cd l2×t20 370.00c 146.67bc 10.33ab 13.33a 6.67ab 7.00b means that followed by same letters within column are differ non‑significantly at p≤5% according to the duncan multiple range test albarzinji, et al.: pigments and stomata of cowpea under x-ray 62 uhd journal of science and technology | july 2022 | vol 6 | issue 2 micrometer, respectively, in coincides with the treatment 20 min exposure time in the target of radiation source for adaxial leaves surfaces which increased stomata width to 8.67 and 9.67 micrometer, respectively. the present observations showed changes in stomata characteristics under x-ray radiation compared with that not treated. these changes in stomata dimensions under x-ray may due to change in osmotic pressure of epidermal cells which prevent the development of sufficient osmotic pressure in guard cells to open to the same extent as occurs in non-irradiated plants, so the average stomatal opening of x-ray irradiated plants was significantly less compared to non-irradiated plants [13]. stomatal aperture depends on the genotype of plants and is regulated by many internal and external factors [23]. from table 3 results, it is shown that exposure seeds to x-ray in target source increases significantly each of stem and leaves dry matter percent to 10.75 and 14.00% compared to that is out of target location (9.13 and 11.88%), respectively, which agrees with al-enezi and al-khayri [24] whom found an increase in fresh and dry weights of date palm (phoenix dactylifera l.) leaf tissues with increasing the x irradiation dose from 0 to 1500 rad, it also agrees with the results of arena et al. 
[12], who found that a high dose of x-rays (50 gy) significantly (p < 0.001) increased leaf dry matter content in young faba bean leaves compared with the control leaves. regarding the time of exposure, 5 min of exposure to x-ray increased the percent of stem and leaves dry matter content significantly (p ≤ 0.05) to 13.75 and 15.75% compared with the other treatments, whereas increasing the time of exposure to 20 min decreased the percent of stem and leaves dry matter significantly (p ≤ 0.05) to 8.00% and 10.5%, respectively. regarding the interactions between the location and time of exposure, the percent of stem and leaves dry matter content increased significantly to 15.00% and 17.0% for plants emerged from seeds exposed to x-ray for 5 min in the source target, whereas the lowest values were recorded for seeds exposed to 20 min of x-ray out of target. the shortest exposure time thus gave the largest significant increase, which resembles the effect of low doses of x-radiation that stimulate cellular activities and growth, whereas higher doses may cause chromosomal abnormalities [25]. hence, longer x-ray exposure times affect plant growth, which is reflected in the stem and leaves percent of dry matter.

fig. 1. lower (abaxial) leaf surfaces of vigna sinensis savi showing stomata at ×400 for (a) the control, (b) in-target 5 min, (c) in-target 10 min, (d) in-target 20 min, (e) out of target 5 min, (f) out of target 10 min, and (g) out of target 20 min.

fig. 2. upper (adaxial) leaf surfaces of vigna sinensis savi showing stomata at ×400 for (a) the control, (b) in-target 5 min, (c) in-target 10 min, (d) in-target 20 min, (e) out of target 5 min, (f) out of target 10 min, and (g) out of target 20 min.

table 3: effects of location and time of exposure of cowpea seed to x-radiation, and their interactions, on stem and leaf dry matter
treatments | stem dry matter (%) | leaves dry matter (%)
location of exposure
  in target (l1) | 10.75a | 14.00a
  out of target (l2) | 9.13b | 11.88b
time of exposure (min)
  t0 | 8.50b | 12.00bc
  t5 | 13.75a | 15.75a
  t10 | 9.50b | 13.50b
  t20 | 8.00b | 10.50c
interactions between location and exposure time
  l1×t0 | 8.50cd | 12.00c
  l1×t5 | 15.00a | 17.00a
  l1×t10 | 10.00c | 14.00bc
  l1×t20 | 9.50c | 13.00bc
  l2×t0 | 8.50cd | 12.00c
  l2×t5 | 12.50b | 14.50b
  l2×t10 | 9.00c | 13.00bc
  l2×t20 | 6.50d | 8.00d
means followed by the same letters within a column differ non-significantly at p ≤ 0.05 according to the duncan multiple range test.

4. conclusions
we can conclude that exposing cowpea seeds to x-ray radiation had stimulating effects on photosynthetic pigments and stomata characteristics, either as increases or decreases depending on the treatment. the location of exposure had non-significant effects on the photosynthetic pigments, whereas it affected stomata characteristics and dry matter content. the best exposure time differs according to the studied characteristic. further studies are recommended on the effects of x-rays on wet seeds and seedlings at different radiation doses.

references
1. g. vasilevski. "perspectives of the application of physiological methods in sustainable agriculture". bulgarian journal for plant physiol, special issue, vol. 3-4, pp. 179-186, 2003.
2. m. k. al-jebori and i. m. al-barzinji. "exposing potato seed tuber to high voltage field i.
effects on growth and yield”. journal of iraqi agricultural sciences, vol. 39, no. 2, pp. 1-11, 2008. 3. i. m. al-barzinji and m. k. al-jubouri. “effects of exposing potato tuber seeds to uv radiation on growth, yield and yield quality”. research and reviews: journal of botany, vol. 5, no. 2, pp. 19-26, 2016. 4. k. h. ng. “non-ionizing radiations-sources, biological effects, emissions and exposures”. proceedings of the international conference on non-ionizing radiation at uniten (icnir2003). electromagnetic fields and our health, 2003. 5. environmental protection agency. “radiation: facts, risks and realities”. environmental protection agency, washington, d.c, united states, 2012. available from : https://www.epa.gov/sites/ default/files/2015-05/documents/402-k-10-008.pdf [last accessed on 2022 sep 23]. 6. k. p. panchal, n. r. pandya, s. albert and d. j. gandhi. “a x-ray image analysis for assessment of forage seed quality”. international journal of plant, animal and environmental sciences, vol. 4, no. 4, pp. 103-109, 2014. 7. a. a. rezk, j. m. al-khayri, a. m. al-bahrany, h. s. el-beltagi and h. i. mohamed. “x-ray irradiation changes germination and biochemical analysis of two genotypes of okra (hibiscus esculentus l.)”. journal of radiation research and applied sciences, vol. 12, no. 1, pp. 393-402, 2019. 8. j. singh. “studies in bio-physics: effect of electromagnetic field and x-rays on certain road side legume plants at saharanpur”. international journal of scientific and research publications, vol. 3, no. 12, pp. 1-9, 2013. 9. s. dhamgaye, v. dhamgaye and r. gadre. “growth retardation at different stages of bean seedlings developed from seeds exposed to synchrotron x-ray beam”. advances in biological chemistry, vol. 8, no. 2, pp. 29-35, 2018. 10. s. dhamgaye, n. gupta, a. shrotriya, v. dhamgaye and r. gadre. “biological effects of seed irradiation by synchrotron x-ray beam in young bean seedlings”. advances in biological chemistry, vol. 9, no. 2, pp. 88-97, 2019. 11. s. m. mortazavi, l. a. mehdi-pour, s. tanavardi, s. mohammadi, s. kazempour, s. fatehi, b. behnejad and h. mozdarani. “the biopositive effects of diagnostic doses of x-rays on growth of phaseolus vulgaris plant: a possibility of new physical fertilizers”. asian journal of expermental sciences, vol. 20, no. 1, pp. 27-33, 2006. 12. c. arena, v. de micco and a. de maio. “growth alteration and leaf biochemical responses in phaseolus vulgaris exposed to different doses of ionising radiation”. plant biology, vol. 16, no. suppl 1, pp. 194-202, 2014. 13. r. m. roy. “transpiration and stomatal opening of x-irradiated broad bean seedlings”. radiation botany, vol. 14, no. 3, pp. 179-184, 1974. 14. c. z. jiang, s. r. rodermel and r. m. shibles. “photosynthesis, rubisco activity and amount, and their regulation by transcription albarzinji, et al.: pigments and stomata of cowpea under x-ray 64 uhd journal of science and technology | july 2022 | vol 6 | issue 2 in senescing soybeen leaves”. plant physiology, vol. 101, no. 1, pp. 105-112, 1993. 15. k. lichtenthaler and a. r. wellburn. “determination of total carotenoids and chlorophylls a and b of leaf extracts in different solvents”. biochemical society transactions, vol. 11, no. 5, pp. 591-592, 1983. 16. r. priyanka and r. m. mishra. “effect of urban air pollution on epidermal traits of road side tree species, pongamia pinnata (l.) merr”. isro journal of environmental science, toxicology and food technology, vol. 2, no. 6, pp. 2319-2402, 2013. 17. f. h. al-sahaf. 
“applied plant nutrition. university of baghdad. ministry of higher education and scientific research”. dar alhikma press, iraq, p. 260, 1989. 18. a. h. reza. “design of experiments for agriculture and the natural sciences”. 2nd ed. chapman and hall/crc, new york, pp. 452, 2006. 19. n. a. al-enezi, and j.m. al-khayri. “alterations of dna, ions and photosynthetic pigments content in date palm seedlings induced by x-irradiation”. international journal of agricultural and biology, vol. 14, no. 3, pp. 329-336, 2012a. 20. a. a. aly, r. w. maraei, and s. ayadi. “some biochemical changes in two egyptian bread wheat cultivars in response to gamma irradiation and salt stress”. bulgarian journal of agricultural science, vol. 24, no. 1, pp. 50-59, 2018. 21. l. r. dartnell, m. c. storrie-lombardi, c. w. mullineaux, a. v. ruban, g. wright, a. d. griffiths, j. p. muller and j. m. ward. “degradation of cyanobacterial biosignatures by ionizing radiation”. astrobiology, vol. 11, no. 10, pp. 997-1016, 2011. 22. h. h. latif and h. i. mohamed. “exogenous applications of moringa leaf extract effect on retrotransposon, ultrastructural and biochemical contents of common bean plants under environmental stresses”. south african journal of botany, vol. 106, pp. 221-231, 2016. 23. l. taiz and e. zeiger. “plant physiology”. 3rd ed. sinauer associates publications, sunderland, massachusetts, p. 690, 2002. 24. n. a. al-enezi and j. m. al-khayri. “effect of x-irradiation on proline accumulation, growth and water content of date palm (phoenix dactylifera l.) seedlings”. journal of biological sciences, vol. 12, no. 3, pp. 146-153, 2012b. 25. d. o. kehinde, k. o. ogunwenmo, b. ajeniya, a. a. ogunowo and a. o. onigbinde. “effects of x-ray irradiation on growth physiology of arachis hypogaea (var. kampala)”. chemistry international, vol. 3, no. 3, pp. 296-300, 2017. _goback tx_1~abs:at/tx_2:abs~at uhd journal of science and technology | jan 2023 | vol 7 | issue 1 7 1. introduction digital image processing (dip) is significant in many areas, particularly medical image processing, image in-painting, pattern recognition, biometrics, content-based image retrieval, image de-hazing, and multimedia security [1], [2]. it is becoming more important for analyzing medical images and identifying abnormalities in these images. computeraided diagnosis (cad) systems based on image processing have emerged as an intriguing topic in the field of medical image processing research. a cad system is a computerbased system that assists medical professionals in diagnosing diseases, in particular cancers, using medical images such as x-ray, magnetic resonance imaging (mri), computed tomography (ct), ultrasound, and microscopic images [3]. the aim of developing autonomous cad systems is to extract the targeted illnesses with a high accuracy and at a lower cost and time consumption. preprocessing, segmentation, feature extraction, and classification are the four basic phases of each cad system. a feature is an important factor to categorize the disease in the cancer detection systems. feature extraction is the process of transforming raw data into a set of features [4]. there are numerous types of cancers such as breast cancer, brain tumors, lung cancer, skin cancer, and blood cancer. this paper focuses on the early detection of the cancerous cells in the breast. breast cancer is one of the most frequent kinds of cancer among females worldwide. there are currently no strategies for preventing breast cancer. 
the difficulty of radiologist interpretation of mammogram images can computer-aided diagnosis for the early breast cancer detection miran hakim aziz1, alan anwer abdulla2,3 1applied computer, collage of medicals and applied sciences, charmo university, chamchamal, sulaimani, kurdistan region, iraq, 2department of information technology, college of commerce, university of sulaimani, sulaimani, iraq, 3department of information technology, university college of goizha, sulaimani, iraq a b s t r a c t the development of the use of medical image processing in the healthcare sector has contributed to enhancing the quality/accuracy of disease diagnosis or early detection because diagnosing a disease or cancer and identifying treatments manually is costly, time-consuming, and requires professional staff. computer-aided diagnosis (cad) system is a prominent tool for the detection of different forms of diseases, especially cancers, based on medical imaging. digital image processing is a critical in the processing and analysis of medical images for the disease diagnosis and detection. this study introduces a cad system for detecting breast cancer. once the breast region is segmented from the mammograms image, certain texture and statistical features are extracted. gray level run length matrix feature extraction technique is implemented to extracted texture features. on the other hand, statistical features such as skewness, mean, entropy, and standard deviation are extracted. consequently, on the basis of the extracted features, support vector machine and k-nearest neighbor classifier techniques are utilized to classify the segmented region as normal or abnormal. the performance of the proposed approach has been investigated through extensive experiments conducted on the well-known mammographic image analysis society dataset of mammography images. the experimental findings show that the suggested approach outperforms other existing approaches, with an accuracy rate of 99.7%. index terms: computer-aided diagnosis, medical image, breast cancer, gray level run length matrix, classifier technique corresponding author’s e-mail:  dr. alan anwer abdulla, assistant prof., department of information technology, college of commerce, university of sulaimani, sulaimani, iraq, department of information technology, university college of goizha, sulaimani, iraq. e-mail: alan. abdulla@univsul.edu.iq received: 17-09-2022 accepted: 11-12-2022 published: 12-01-2023 access this article online doi: 10.21928/uhdjst.v7n1y2023.pp7-14 e-issn: 2521-4217 p-issn: 2521-4209 copyright © 2022 aziz and abdulla. this is an open access article distributed under the creative commons attribution non-commercial no derivatives license 4.0 (cc by-nc-nd 4.0) o r i g i n a l re se a rc h a rt i c l e uhd journal of science and technology aziz and abdulla: early breast cancer detection 8 uhd journal of science and technology | jan 2023 | vol 7 | issue 1 be alleviated by employing the early-stage breast cancer detection method. thus, early diagnosis of this condition is critical in its treatment and has a significant influence in minimizing mortality. the most effective way of detecting breast cancer in its early stages is to analyze mammography images [5]. breast cancer is a disorder in which the cells of the breast proliferate uncontrollably. the kind of breast cancer is determined by which cells in the breast develop into cancer. breast cancer can start in any part of the breast. 
it can spread outside of the breast through blood and lymph arteries. breast cancer is considered to have metastasized when it spreads to other regions of the body [6]. in general, a breast is composed of three major components: lobules, ducts, and connective tissue (fig. 1) [6]. the lobules are the milk-producing glands. ducts are tubes that transport milk to the nipple. the majority of breast cancers start in the lobules or ducts [6]. connective tissue joins or separates and supports all other forms of bodily tissue. it contains of cells surrounded by a fluid compartment termed the extracellular matrix (ecm), as do all other forms of tissue. however, connective tissue varies from other kinds in that its cells are loosely instead of densely packed inside the ecm [7]. the aim of this study is developing a cad system for the early detection of breast cancer. the developed cad system has the advantages of increasing accuracy rate, reducing time consumption, and reducing cost in comparison with manually detecting system. the main contributions of the proposed approach are segmenting the breast region properly as well as extracting the most significant features, and this leads to increase the accuracy rate and reduce mistake rate of wrongly treating patients. the proposed system includes the following steps: a pre-processing step for enhancing the image quality, a segmentation step for segmenting the breast region from the other components of mammography images, and a feature extraction step for extracting the most influential features. finally, the classification step is conducted, which helps the system decide whether a cell is cancerous or noncancerous. the rest of the paper is structured as follows. section 2 provides a summary of past efforts from the literature. section 3 presents the proposed cad system. section 4 shows the results of experiments. finally, section 5 gives the conclusion. 2. literature review in medical image processing, the cad system is a computerbased system that helps clinicians in their last decision about different diseases, especially cancers. the whole process is about extracting significant information from medical images such as: mri, ct, and ultrasounds. several cad systems have been developed for identifying different diseases including: breast cancer, tumor detection, and lung cancer. this study concentrates on breast cancer. the processing and analysis of breast mammogram images plays a significant role in the early diagnosis of breast cancer. this section reviews the most influential as well as relevant current efforts on the early breast cancer detection using dip. the main obstacle in this field of research is reducing the rate of breast cancer detection errors. in general, most of the cad systems for the early breast cancer detection consist of the following steps: image enhancement, image segmentation, feature extraction, feature selection, and classification. in 2010, eltoukhy et al. suggested an algorithm for the breast cancer detection using a curvelet transform technique at multiple scales [8]. different scales of the largest curvelet coefficients are extracted and investigated from each level as a classification feature vector. this algorithm is reached an accuracy rate of 98.59% at scale 2. srivastava et al., in 2013, introduced a cad system for the early breast cancer diagnosis using digital mammographic images [9]. contrast-limited histogram equalization technique is utilized for the enhncement purposes. 
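contrast-limited adaptive histogram equalization of the kind mentioned here, usually combined with a median filter, is a common enhancement step in these cad pipelines and is close in spirit to the median filter plus adaptive histogram equalization used later in the proposed approach. the opencv sketch below is a minimal illustration; the file name, kernel size, and clahe parameters are assumptions, not settings reported by any of the cited works.

```python
# illustrative mammogram enhancement: median filtering to suppress noise,
# followed by clahe for local contrast enhancement.
# file name, kernel size, and clahe parameters are assumptions.
import cv2

img = cv2.imread("mdb001.pgm", cv2.IMREAD_GRAYSCALE)   # mias images are pgm files

denoised = cv2.medianBlur(img, 3)                       # 3x3 median filter

clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
enhanced = clahe.apply(denoised)

cv2.imwrite("mdb001_enhanced.png", enhanced)
```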
consequently, three-class fuzzy c-means is used for the segmentation process. the texture features such as geometric/shape, wavelet-based, and gabor were extracted. the minimum redundancy maximum relevance feature selection method was utilized to select the fewest redundant and most relevant characteristics. finally, support vector fig. 1. major components of the breast [6]. aziz and abdulla: early breast cancer detection uhd journal of science and technology | jan 2023 | vol 7 | issue 1 9 machine (svm), k-nearest neighbor (knn), and artificial neural network (ann) classifier techniques were used for classifying cancerious and non-canceroius cells. furthermore, svm provides better results in comparison to the knn and ann. this technique is achieved an accuracy rate of 85.57% for the 10-fold cross-validation using mammographic image analysis society (mias) dataset of images. vishrutha et al., in 2015, developed a strategy for combining wavelet and texture information that leads to increase the accuracy rate of the developed cad system for the early breast cancer diagnosis [10]. the mammogram images were pre-processed using median filter. in addition, the label and the black background are removed on the bases of sum of each column’s intensities. consequently, if the total intensity of a column falls below a certain level/threshold, the column will be removed. the resulted images from the pre-processing step were utilized as input for the region growth technique used to determine the region of interest (roi) as a seqmentation step. discrete wavelet transform technique was used to extract features from the seqmented images/regions. finally, svm classifier technique was utilized to categorize the mammogram images as benign or malignant with an accuracy rate of 92% using mini-mias dataset of images. in 2017, pashoutan et al. developed a cad system for the early breast cancer diagnosis [11]. for the pre-processing step, cropping begins by employing coordinates and an estimated radius of any artifacts introduced into images to get to the roi where bulk and aberrant tissues are found. moreover, histogram equalization and median filter were used to enhance the contrast of the images. edge-based segmentation and region-based segmentation methods are that the two main methods were used for the segmentation purposes. furthermore, four different techniques were utilized for extracting features, such as wavelet transform, gabor wavlet transform, zernike moments, and gray-level cooccurance matrix (glcm). eventually, using the mias dataset, this technique reached an accuracy rate of 94.18%. hariraj et al., in 2018, developed a cad system for the breast cancer detection [12]. in the pre-processing step, fuzzy multilayer was used to eliminate background information such as labels and wedges from images. moreover, thresholding was used to transform the grayscale image to the binary image. furthermore, morphological technique was implemented on the binary image to remove undesirable tiny items. regarding to the segmentation step, k-means clustering was utilized. for the feature extraction purposes, certain shape and texture features were extracted such as: diameter, perimeter, compactness, mean, standard deviation, entropy, and correlation. finally, the fuzzy multi-layer svm classifier technique provides better accuracy rate of 98% out of other tested classifier techniques using mini-mammographic mias dataset of images. 
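the thresholding-to-binary and morphological cleanup steps described for this system resemble the threshold-based segmentation used later in the proposed approach. a minimal scikit-image sketch is given below; the threshold value, minimum object size, and file names are illustrative assumptions only.

```python
# illustrative threshold-based segmentation with morphological cleanup,
# similar in spirit to the binarization + small-object removal described above.
# threshold value, object/hole sizes, and file names are assumptions.
import numpy as np
from skimage import io, img_as_float, morphology

img = img_as_float(io.imread("mdb001.pgm", as_gray=True))   # intensities in [0, 1]

t = 0.7                                # threshold on normalized intensity
binary = img > t                       # white = candidate breast/mass pixels

# remove small bright specks (labels, artifacts) and fill small holes
cleaned = morphology.remove_small_objects(binary, min_size=500)
cleaned = morphology.remove_small_holes(cleaned, area_threshold=500)

roi = img * cleaned                    # keep original intensities inside the mask
io.imsave("mdb001_roi.png", (roi * 255).astype(np.uint8))
```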
sarosa et al., in 2019, designed a breast cancer diagnosis technique by investigating glcm and backpropagation neural network (bpnn) classification technique [13]. histogram equalization was utilized for the pre-processing and enhancing the images. consequently, glcm was used to extract features from the pre-processed images. finally bpnn was used to determine whether the input image is normal or abnormal. the suggested approach was evaluated using a mias dataset of images and it achieved an accuracy rate of 90%. in 2019, arafa et al. introduced a technique for the breast cancer detection [14]. in the pre-processing step, just the area including the breast region is automatically picked and artifacts as well as pectoral muscle were removed. the gaussian mixture model (gmm) was utilized to extract the roi. moreover, texture, shape, and statistical features were extracted from the roi. for the texture feature, glcm was utilized. furthermore, the following shape features such as circularity, brightness, compactness, and volume were extracted. regarding to the statistical features, mean, standard deviation, correlation, skewness, smoothness, kurtosis, energy, and histogram were extracted. finally svm classifier technique was used to classify segmented roi into normal, abnormal, benign, and malignant. this proposed technique was evaluated using mias dataset of images and it achieves an accuracy of 92.5%. farhan and kamil developed a cad system for classifying the input mamogram images into normal or abnormal, in 2020, [15]. at the beginning, contrast limited adaptive histogram equalization (clahe) method was used to improve all mammogram images. in addition, the histogram of oriented gradient, glcm, as well as the local binary pattern (lbp) techniques was used to extract features. finally, svm and knn classifier techniques were used for classifying cancerious and non-canceroius cells. the best accuracy rate of 90.3%, using mini-mias dataset, was obtained when glcm and knn were used. in 2020, eltrass and salama developed a technique for breast cancer diagnosis [16]. as a pre-processing step, the mammography image was translated into a binary image, aziz and abdulla: early breast cancer detection 10 uhd journal of science and technology | jan 2023 | vol 7 | issue 1 and then all regions are sorted to identify the mammogram’s greatest area, that is, breast region. in addition, all artifacts and pectoral muscle were eliminated. this cad system utilized the expectation maximization technique for the segmentation purposes. wavelet-based contourlet transform technique was used to extract features. finally, svm classifier technique was used and an accuracy rate of 98.16% was achieved using mias dataset. saeed et al., in 2020, designed a classifier model to aid radiologists in providing a second opinion when diagnosing mammograms [17]. in the pre-processing step, median filter was used to remove noise and minor artifacts. hybrid bounding box and region growing algorithm was used to segment the roi. for the features extraction, two types of features were extracted which are: (1) statistical features such as mean, standard deviation skewness, and kurtosis and (2) texture features such as lbp and glcm. consequently, svm was used to categorize mammography images as normal or abnormal in the first level, and benign or malignant in the second level. this proposed technique used mais dataset to evaluate the performance, and an accuracy of 95.45% was obtained for the first level and 97.26% for the second level. 
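first-order statistical descriptors of this kind (mean, standard deviation, skewness, entropy, and so on) also form part of the feature set of the proposed approach in section 3. the sketch below shows one plausible way to compute them over a segmented region; the masking convention and the synthetic example data are assumptions for illustration.

```python
# illustrative first-order statistical features (mean, standard deviation,
# skewness, entropy) computed over a segmented region of interest.
# the masking convention and example arrays are assumptions.
import numpy as np
from scipy.stats import skew, entropy

def statistical_features(image, mask):
    """image: 2-d grayscale array; mask: boolean array marking the roi."""
    pixels = image[mask].astype(float)
    hist, _ = np.histogram(pixels, bins=256, range=(0, 255), density=True)
    return {
        "mean": pixels.mean(),
        "std": pixels.std(),
        "skewness": skew(pixels),
        "entropy": entropy(hist + 1e-12, base=2),   # shannon entropy of the histogram
    }

# example with synthetic data
rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(64, 64)).astype(float)
roi_mask = np.zeros((64, 64), dtype=bool)
roi_mask[16:48, 16:48] = True
print(statistical_features(img, roi_mask))
```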
mu’jizah and novitasari in 2021, developed a cad system for the breast cancer diagnosis [18]. at the beginning, certain pre-processing techniques, such as gaussian filter and canny edge detection technique, were implemented to enhance the visual quality of the input images. the thresholding method was also used for the segmentation purposes. to extract features, glcm was used as texture feature, and area, perimeter, metric, as well as eccentricity were extracted as shape feature. finally, for the classification step, svm was used and an accuracy rate of 98.44% was obtained using mini-mias dataset of images. recently, in 2022, holi produced a breast cancer detection system [19] which used a median filter and clahe for enhancing the input image. then, chebyshev distancedfuzzy c-means clustering was used to segment the preprocessed image. the augmented local vector pattern, shape features, and glcm were used to extract features. the classification step was conducted using knn classifier technique. this proposed technique was achieved an accuracy rate of 97% using mias dataset of images. the remainder of this paper concerns with the extension and further refinement of the strategy of using dip to increase the accuracy rate for the early breast cancer detection. 3. proposed approach the microscopic image of breast is called a mammogram, which consists of three parts/regions. the breast part appears on a mammogram in colors of gray and white, while the mammog ram backdrop is often black. in addition, a lump or tumor appears as a concentrated white area. tumors may be either malignant or benign [20]. the most significant step of each cad system for the breast cancer detection is extracting/cropping the roi from the other parts of the mammogram image. this section describes the proposed approach which involves the following steps: 1. pre-processing: in this step, certain techniques are applied such as region-props to delete the label from the mammogram images, and median filter as well as adaptive histogram equalization to enhance the image quality (fig. 2). 2. segmentation: to segment the roi from other parts of the input image, the thresholding segmentation technique is applied on image (d) in fig. 2, and the resulted image is a binary image, see image (a) in fig. 3. the threshold-based segmentation approach is an effective segmentation technique that divides an image based on the intensity value of each pixel. it is used to segment an image into smaller portions using a single color value to generate a binary image, with black representing the background and fig. 2. pre-processing step:(a) original mammogram image, (b) label removed, (c) resulted image after the median filter has been applied on image (b), and (d) resulted image after histogram equalization has been applied on image (c). dc ba aziz and abdulla: early breast cancer detection uhd journal of science and technology | jan 2023 | vol 7 | issue 1 11 white representing the objects [21]. the threshold t value can be selected either manually or automatically based on the characteristics of the image. in the proposed approach, t = 0.7 was used, which provides the optimum accuracy results. in the next section, all the tested values for the t are illustrated in table 5. 3. feature extraction: texture features and statistical features are extracted from the segmented image, that is, image (b) in fig. 3. the extracted features are summarized in table 5. furthermore, all the extracted features are fused for the classification purposes. 4. 
classification: svm and knn classification techniques were applied to the extracted features to distinguish normal cells from abnormal cells. svm and knn were chosen because these two classifier techniques are the most commonly used in this field of research. for both classifiers, k-fold cross-validation with k = 5, 10, 15, and 20 was investigated. fig. 4 illustrates the block diagram of the proposed approach.

4. experimental results
the primary goal of the proposed cad approach is to classify breast cells as normal or abnormal. extensive experiments are carried out in this part of the study to evaluate how well the suggested approach works in terms of accuracy rate. in addition, the proposed approach is assessed against the findings of earlier research.

4.1. dataset
the tested input images come from the mias dataset, which is in the public domain and widely recognized. the mias dataset contains 322 original images, 206 normal and 116 abnormal, in the pgm format [22]. all of the images have the same resolution of 1024 by 1024 pixels. the mias dataset has been used to assess the performance of the proposed cad approach.

4.2. results
using the svm and knn classifier techniques, the accuracy rate for each set of extracted features is assessed. tables 2 and 3 present the accuracy rates of the statistical and glrlm features separately using svm and knn, respectively. in all the evaluation tests, different values of k-fold have been considered. in addition, the accuracy rate has been calculated using the following formula [23]:

accuracy rate = (tp + tn)/(tp + tn + fp + fn) (1)

where tp, tn, fp, and fn refer to true positive, true negative, false positive, and false negative, respectively. further investigation has been conducted by fusing the extracted features, namely, the statistical and glrlm features. the 11 extracted features, listed in table 1, are utilized to evaluate the effectiveness of the proposed cad approach in distinguishing between normal and abnormal cells. moreover, k-fold cross-validation with various values of k is used in the evaluation process to measure the accuracy.

table 1: extracted features
type of features | name of the feature
statistical features | skewness, mean, entropy, standard deviation
texture features: gray level run length matrix | short run emphasis, long run emphasis, gray level non-uniformity, run percentage, run length non-uniformity, low gray level run emphasis, high gray level run emphasis

table 2: svm-based accuracy rate for the extracted features separately
features | 5 k-fold | 10 k-fold | 15 k-fold | 20 k-fold | average (%)
statistical | 99.1 | 99.2 | 98.8 | 99.3 | 99.1
glrlm | 98.1 | 99.3 | 99.1 | 99.1 | 98.1
svm: support vector machine, glrlm: gray level run length matrix

table 3: knn-based accuracy rate for the extracted features separately
features | 5 k-fold | 10 k-fold | 15 k-fold | 20 k-fold | average (%)
statistical | 94.4 | 94.7 | 97.5 | 97.5 | 96
glrlm | 97.2 | 98.1 | 97.6 | 98.3 | 97.8
knn: k-nearest neighbor, glrlm: gray level run length matrix

fig. 3. segmentation step: (a) binary image, (b) based on the binary image in (a), the roi is selected in the original image.
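the classification and evaluation protocol summarized above (svm and knn, k-fold cross-validation with k = 5, 10, 15, and 20, and the accuracy of eq. (1)) can be expressed compactly with scikit-learn. the sketch below is an illustration of that protocol rather than the authors' implementation; the synthetic feature matrix stands in for the 11 fused features and the normal/abnormal ground truth.

```python
# illustrative svm/knn evaluation with k-fold cross-validation and the
# accuracy metric of eq. (1); the feature matrix is a synthetic placeholder
# standing in for the 11 fused statistical + glrlm features per image.
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=322, n_features=11, random_state=0)

classifiers = {
    "svm": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
}

for name, clf in classifiers.items():
    for k in (5, 10, 15, 20):
        # scikit-learn's "accuracy" is (tp + tn) / (tp + tn + fp + fn), as in eq. (1)
        scores = cross_val_score(clf, X, y, cv=k, scoring="accuracy")
        print(f"{name}, {k}-fold: mean accuracy = {scores.mean():.3f}")
```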
training and testing have been done using k-fold cross-validation, which divides the data automatically into training and testing sets depending on the value of k. based on the investigation conducted in this study, the svm classifier technique provides the higher accuracy rate (table 4). tables 5 and 6 illustrate the findings of further tests done by comparing the results of the proposed approach to the results of four existing approaches: two of the existing works used the svm classifier technique and the remaining two used knn, and all four tested cad systems used only 5-fold cross-validation to evaluate their performance. table 7 illustrates the time consumption of the whole process in our system. according to the results presented in tables 5 and 6, the best accuracy rate is achieved by the proposed approach, and it outperforms all the tested existing approaches. moreover, in eltrass and salama [16], the total time consumption is reported as 2.26267 s, while the time consumption of our proposed approach is 2.004 s; the time consumption of the proposed approach is broken down by stage in table 7. more investigations have been done to find the optimum value of the thresholding t used for the segmentation purposes. based on the results presented in table 8, it is quite obvious that the best accuracy rate was achieved when t = 0.7.

fig. 4. block diagram of the proposed approach.

5. conclusions
since detecting a disease/cancer and identifying treatments manually is costly, time consuming, and requires professional staff, the evolution of the application of medical image processing in the healthcare field has contributed to an improvement in the quality/accuracy of disease diagnosis (or early detection). meanwhile, medical image processing techniques can accurately extract target diseases/cancers at higher accuracy and lower cost. breast cancer is one of the leading causes of mortality among women compared with all other cancers; therefore, early detection of breast cancer is necessary to reduce fatalities, and early detection of breast cancer cells may be achieved using recent machine learning approaches. the primary objective of developing a cad system for mammogram images is to aid physicians and diagnostic experts by providing a second perspective, which increases confidence in the diagnostic process. this study was focused on the development of an efficient cad system for the early breast cancer detection. the testing findings reveal that the proposed cad approach obtained an accuracy rate of 99.7% and outperforms the existing approaches.
to improve the performance of the proposed approach, the following are points of potential plans that extend our work in the future: (1) more filters and image processing techniques will be tested for pre-processing purposes to table 4: accuracy rate of the proposed cad approach cross validation 5k 10k 15k 20k average (%) svm 99.7 99.8 99.4 99.7 99.7 knn 98.4 99.1 98.8 98.8 98.9 cad: computer‑aided diagnosis, svm: support vector machine, knn: k‑nearest neighbor table 5: accuracy rate of the tested approaches using svm accuracy rate (%) proposed mu’jizah and novitasari[18] eltrass and salama [16] 99.7 98.4 98.1 table 6: accuracy rate of the tested approaches using k‑nearest neighbor accuracy rate (%) proposed farhan and kamil[15] holi [19] 98.9 90.3 97 table 7: time consumption of the proposed computer‑aided diagnosis system stage times in second pre-processing 0.371 segmentation 0.298 feature extraction 0.061 classification 1.274 total 2.004 s table 8: investigating the optimum value for thresholding t thresholding values knn (%) svm (%) 0.1 89 90.8 0.2 89.7 91.1 0.3 89.8 91.9 0.4 94.1 94.7 0.5 96.3 96 0.6 97.6 99.1 0.7 98.9 99.7 0.8 98.4 99.3 0.9 98.6 99.5 knn: k‑nearest neighbor, svm: support vector machine aziz and abdulla: early breast cancer detection 14 uhd journal of science and technology | jan 2023 | vol 7 | issue 1 enhance the image quality, (2) different techniques will be tested to improve segmenting purposes, and (3) different kinds of features should be tested and investigated. references [1] a. a. abdulla. “efficient computer-aided diagnosis technique for leukaemia cancer detection”. the institution of engineering and technology, vol. 14, no. 17, pp. 4435-4440, 2020. [2] a. a. abdulla and m. w. ahmed. “an improved image quality algorithm for exemplar-based image inpainting”. multimedia tools and applications, vol. 80, pp. 13143-13156, 2021. [3] h. arimura, t. magome, y. yamashita and d. yamamoto. “computer-aided diagnosis systems for brain diseases in magnetic resonance images”. algorithms, vol. 2, no. 3, pp. 925-952, 2009. [4] g. kumar and p. k. bhatia. “a detailed review of feature extraction in image processing systems”. international conference on advanced computing and communication technologies acct, pp. 5-12, 2014. [5] t. t. htay and s. s. maung. “early stage breast cancer detection system using glcm feature extraction and k-nearest neighbor (k-nn) on mammography image”. 2018-the 18th international symposium on communications and information technologies, pp. 345-348, 2018. [6] centers for disease control and prevention. “what is breast cancer?”. centers for disease control and prevention, united states. 2021. available from: https://www.cdc.gov/cancer/breast/ basic_info/what-is-breast-cancer.html [last accessed on 2022 dec 18]. [7] j. vasković. “overview and types of connective tissue.” medical and anatomy experts, 2022. available from: https://www.kenhub. com/en/library/anatomy/overview-and-types-of-connective-tissue [last accessed on 2022 dec 20]. [8] m. m. eltoukhy, i. faye and b. b. samir. “breast cancer diagnosis in digital mammogram using multiscale curvelet transform”. computerized medical imaging and graphics, vol. 34, no. 4, pp. 269-276, 2010. [9] s. srivastava, n. sharma, s. k. singh and r. srivastava. “design, analysis and classifier evaluation for a cad tool for breast cancer detection from digital mammograms”. international journal of biomedical engineering and technology, vol. 13, no. 3, pp. 270300, 2013. [10] s. c. satapathy, b. n. biswal, s. k. 
udgata and j. k. mandal. “proceedings of the 3rd international conference on frontiers of intelligent computing: theory and applications (ficta) 2014”. advances in intelligent systems and computing, vol. 327, pp. 413-419, 2014. [11] s. pashoutan, s. b. shokouhi and m. pashoutan. “automatic breast tumor classification using a level set method and feature extraction in mammography.” 2017 24th iranian conference on biomedical engineering and 2017 2nd international iranian conference on biomedical engineering icbme 2017, pp. 1-6, 2018. [12] v. hariraj, w. khairunizam, v. vijean and z. ibrahim. “fuzzy multilayer svm classification”. international journal of mechanical engineering and technology (ijmet), vol. 9, pp. 1281-1299, 2018. [13] s. j. a. sarosa, f. utaminingrum and f. a. bachtiar. “breast cancer classification using glcm and bpnn”. international journal of advances in soft computing and its applications, vol. 11, no. 3, pp. 157-172, 2019. [14] a. arafa, n. el-sokary, a. asad and h. hefny. “computer-aided detection system for breast cancer based on gmm and svm”. arab journal of nuclear sciences and applications, vol. 52, no. 2, pp. 142-150, 2019. [15] a. h. farhan and m. y. kamil. “texture analysis of breast cancer via lbp, hog, and glcm techniques”. iop conference series: materials science and engineering, vol. 928, no. 7, p. 072098, 2020. [16] a. s. eltrass and m. s. salama. “fully automated scheme for computer-aided detection and breast cancer diagnosis using digitised mammograms”. iet the institution of engineering and technology, vol. 14, no. 3, pp. 495-505, 2020. [17] e. m. h. saeed, h. a. saleh and e. a. khalel. “classification of mammograms based on features extraction techniques using support vector machine”. computer science and information technologies, vol. 2, no. 3, pp. 121-131, 2020. [18] h. mu’jizah and d. c. r. novitasari. “comparison of the histogram of oriented gradient, glcm, and shape feature extraction methods for breast cancer classification using svm”. journal of technology and computer systems, vol. 9, no. 3, pp. 150-156, 2021. [19] g. holi. “automatic breast cancer detection with optimized ensemble of classifiers”. international journal of advanced research in engineering and technology (ijaret), vol. 11, no. 11, pp. 2545-2555, 2020. [20] v. r. nwadike. “what does breast cancer look like on a mammogram?”. 2018. available from: https://www. medicalnewstoday.com/articles/322068 [last accessed on 2022 dec 16]. [21] k. bhargavi and s. jyothi. “a survey on threshold based segmentation technique in image processing”. international journal of innovative research and development, vol. 3, no. 12, pp. 234-239, 2014. [22] j. suckling, j. parker, d. dance, s. astley, i. hutt, c. boggis, i. ricketts, e. stamatakis, n. cerneaz, n, s. kok, p. taylor, d. betal and j. savage. “the mammographic image analysis society digital mammogram database”. international congress series, vol. 1069, pp. 375-378, 1994. [23] r. murtirawat, s. panchal, v. k. singh and y. panchal. “breast cancer detection using k-nearest neighbors, logistic regression and ensemble learning”. proceedings of the international conference on electronics and sustainable communication systems, icesc 2020, pp. 534-540, 2020. . uhd journal of science and technology | jan 2019 | vol 3 | issue 1 1 1. introduction the shewashok oil field was discovered in 1930. 
the first well was drilled in 1960 and the second was drilled in 1978, but, due to political circumstances, oil was not extracted until 1994 where the production was 44,027 barrels/day in that year. then production reached 140,000 barrels a day by 2016 [1]. a total of 31 wells are drilled, and currently, more wells are drilling, but the field has rarely been studied scientifically, especially regarding ecological aspects. air, water, and food are the basic needs of most of the living organisms to survive. the quality of consumed water, air, and food may transfer to the consumer body organisms. with gas flaring in the oil field, toxic gases and particles are released into the atmosphere [2]. quite possibly the particles contain heavy metals due to that they are driven from hydrocarbons and come from deep geological layer formations, obviously living organisms consume this contaminated air as the source of their respiration. furthermore, diet is the most critical pathway of transferring the trace elements to mammal’s organisms and store in the tissues; therefore, laboratory testing of animal tissues can be environmental impacts of shewashok oil field on sheep and cow meat using vital trace elements as contamination bioindicators mamoon qader salih1, rawaz rostam hamadamin2, rostam salam aziz3 1department of oil and gas, mad institute, arbil-koya road, erbil, kurdistan region, iraq, 2department of basic education, koya university, daniel mitterrand boulevard, koya koy45 ab64, kurdistan region – iraq, 3department of geography, koya university, daniel mitterrand boulevard, koya koy45 ab64, kurdistan region – iraq a b s t r a c t ambient environment is built based on the interaction of living and non-living organism and chemical and physical compounds, and thus, oil field emissions, effluents, and its general waste can be a part of environmental condition of certain area. this study is to investigate the environmental impacts of oil field on sheep and cow meat around shewashok oilfield. it has been performed at the laboratories of the department of medical microbiology, koya university, by detecting and measuring heavy metals and vital trace elements as contamination indicators. 20 meat samples of domestic animals (cow and sheep) in both control and affected area were collected for the purpose of detecting the concentration of heavy metals in the animals. the samples dried and digested with concentrated hno 3 and concentrated h 2 o 2 . the concentration of heavy metals of the sample digested domestic animal was determined using inductively coupled plasma–optical emission spectroscopy. this study shows that iron, cobalt, copper, zinc, arsenic, manganese, aluminum, mercury, and chromium were detected in all the meat samples. overall, this study confirms that the cow and sheep meat are still safe to eat in both locations because only al, fe, and hg were found danger in both sheep and cows’ meat in comparison with allowed limits of the world health organization 2017, and all other trace elements are complying with the global standards. 
index terms: cows’ and sheep meat, environmental pollution, oil field, shewashok, trace elements corresponding author’s e-mail: rawaz rostam hamadamin, department of basic education, koya university, daniel mitterrand boulevard, koya koy45 ab64, kurdistan region – iraq, e-mail: rawaz.rostam@koyauniversity.org received: 20-09-2018 accepted: 24-01-2019 published: 25-01-2019 access this article online doi: 10.21928/uhdjst.v3n1y2019.pp1-8 e-issn: 2521-4217 p-issn: 2521-4209 copyright © 2019 salih, et al. this is an open access article distributed under the creative commons attribution non-commercial no derivatives license 4.0 (cc by-nc-nd 4.0) o r i g i n a l r e s e a r c h a r t i c l e uhd journal of science and technology maamun qadir salih, et al.: shewashok oil field impacts on environment 2 uhd journal of science and technology | jan 2019 | vol 3 | issue 1 a vital bioindicator for environmental pollution [3]-[5]. some nigerian studies showed that, during drilling, oil production, refining, and gas flaring, harmful elements can add to air, soil, and both surface and groundwater [6], [7]. if air, water, and soil quality is not acceptable by standards, then vegetation, plants, and fruit quality can alter [7]. in general, contamination of air, water, and soil can transfer to plants then to animals by ingestion and then to human. in the study area, a research showed that groundwater is already not complying with national and international standards [1]. however, air, soil, and agriculture crops have not been studied yet. not all the trace elements are heavy metal but all the heavy metals are trace elements and toxic out of their limits. therefore, some of the trace elements are essential for life, although some of them can cause a high risk to the health [8], [9]. in general, the metals can be classified into three main groups: potential toxic such as cadmium and mercury; probably essential such as manganese and silicon; and essential metals such as cobalt, copper, zinc, and iron [8]-[10]. the toxicity effects are referred to specific types of metals which are not beneficial to human health; contrary, it causes severe toxicological effect if body receives an amount out of safe limit [8]. it may not be easy to prevent intake of trace elements by human, as industries significantly develop on a sustained speed around the world, a large amount of metals streaming into the environment. moreover, yet, most of the heavy metals are permanently circling in the environment because they are indecomposable materials and these can integrate with daily essentials such as food and water, and hence, they make their way into the human tissues through the food chain [8], [11]. meat is considered as an essential source of human nutrition. the chemical composition of meat depends on the quality of animal feeding; this may potentially accumulate toxic minerals and represent one of the sources of critical heavy metals [8], [10]. the risk associated with the exposure to heavy metals present in food and food products has aroused widespread concern in human health [11]. however, improvement in food production and processing technology achieved, but food contamination with various environmental pollutants also increased, especially trace elements and heavy metals among them. 
in light of the above, the current study aims to evaluate vital trace elements (al, as, cu, cr, co, fe, hg, mn, and zn) in raw cow and sheep meat produced in iraqi kurdistan and to assess their level of danger and toxicity to consumers. the samples were collected from two sites: an area surrounding the shewashok oil field and an area in the north of erbil. the two sets of samples are compared with each other and then evaluated against the who standards for heavy metals and trace elements.

1.1. study area
the samples were collected from the north of iraq, in erbil province (fig. 1). the region has a mediterranean climate, with cool, wet winters, hot and dry summers, and mild springs and autumns; its annual average precipitation is 450 mm, with some variation from the mountains to the plains [12]. two locations were selected from the province for sampling: the focus location, the shewashok oil field (called the study area group in this article) in the southeast of erbil, and a second location in the north of erbil, which is the main arable and livestock-farming area of the province. the animals feed on the rearing resources available in the region, which means that meat quality is affected by the ambient environmental conditions.

2. materials and methods
data collection, preparation, and analysis followed the stages below.

2.1. sample collection
the materials used for the study included field and laboratory materials. the experimental work was performed at the laboratories of the department of medical microbiology, koya university. the samples were collected from the slaughterhouses of arbil city ("control area") and koya city ("study area"). 20 meat samples were collected from each of the cows and sheep of the study area to determine the concentration of trace elements; in parallel, 20 samples were collected from each of the cows and sheep of the control area.

2.2. summary of the samples collected at both locations
• number of samples collected: 80 samples.
• samples collected from study-area sheep and cows: a total of 40 samples (20 of each).
• samples collected from control-area sheep and cows: a total of 40 samples (20 of each).
• trace elements analyzed: iron, cobalt, copper, zinc, arsenic, manganese, aluminum, mercury, and chromium.

2.3. materials and chemicals used
2.3.1. materials
cylinder, funnel, beaker, filter paper, watch glass, pipette, volumetric flask, conical flask, balance, bottles (250 and 500 ml), hot plate, oven, centrifuge, hood, gloves, tissues, bio hand (alcohol for cleaning), plastic bags, operation blades, parafilm, solution storage bottles, falcon tubes, cutter, and tongue depressors were used.

2.3.2. chemicals
nitric acid, hydrogen peroxide, distilled water, and deionized water were used, together with a vacuum cleaner for cleaning materials and the inductively coupled plasma (icp) instrument.

2.3.3. digestion procedure for the determination of trace elements in sheep and cow meat samples by icp–optical emission spectroscopy (oes)
the collected samples were decomposed by the wet digestion method for the determination of various metals. the samples were washed with distilled water to remove any contaminant particles, cut into small pieces using a clean scalpel, and dried in an oven at 100°c.
weigh 1 g of dried sample using a sensitive balance and transfer it into a 250 ml digestion beaker or flask. digest the sample by adding 10 ml of concentrated hno3 and mix well. heat the digestion mixture on a hot plate at 100 ± 10°c for 30 min inside the fume chamber (hood), then repeat the heating process once more with a further 10 ml of the acid. cool the mixture to room temperature and add 2 ml of concentrated h2o2. heat the beaker or flask again carefully until dryness. leave to cool, then dissolve the mixture in distilled or deionized water until a clear solution is obtained. filter the sample solution through cellulose filter paper into 25 ml digestion tubes. the filtrate was diluted to 25 ml with distilled or deionized water, and the solution was heated to dissolve any precipitate. transfer the samples into laboratory polyethylene bottles and store them until analysis. a blank digestion was prepared by the same procedure for the control samples. finally, the elements in the sample solutions were analyzed by icp/icp-oes; the final measurement volume of the sample solutions should be 5 ml [13]-[15].

fig. 1. study area map showing the sampling locations. source: kurdistan region of iraq, ministry of planning, information directorate and the preparation of maps, map of erbil in 2016, scale 1:250,000.

2.3.4. icp-oes
the well-known icp–mass spectrometry technique was used to test the samples for heavy metals at a modern scientific laboratory. among the trace elements, only 17 critical heavy metals were examined, owing to their negative impacts on living organisms [16].

2.4. statistical analysis
for the first section of the discussion, data were expressed as mean ± standard error of the mean, and the statistical package for the social sciences (version 20) software was used to analyze the results. differences in mean values between the two groups were analyzed by t-test, and p < 0.05 was considered statistically significant.

2.5. comparison of the study observations with the who standards for trace elements
for the second section of the discussion, only the study-area data (excluding the control area) were compared with the who 2017 guideline limits for trace elements to establish the level of contamination in this study on a global scale.

3. results and discussion
in recent years, much attention has been given to the contamination of food products, among them animal meats. the level of trace elements in meat depends on factors such as the environmental conditions of the animals' grazing location. the results of the current study are discussed in two sections: the first compares the study area (the shewashok oil field) with the control area (the north of erbil), whereas the second compares the study area with the who standards.

3.1. first section: comparison of study area with control area
table 1 shows that the aluminum concentrations in the control and study groups of sheep samples are 254.6 and 404.5 ppb, respectively, meaning the study area is higher than the control group by about 1.5 times. table 2 shows that the aluminum values in the cow samples of the control group and study area are 186.2 and 278.7 ppb, respectively; again, the study area is higher than the control group, by about 1.4 times.
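the approximate "times higher" factors quoted above and in the following paragraphs come directly from dividing the group means reported in tables 1 and 2; the minimal sketch below only illustrates that arithmetic, using the reported means as inputs (the printed ratios are exact quotients, whereas the text quotes approximate factors):

```python
# ratio of study-area mean to control-area mean, using means from tables 1 and 2.
table_means_ppb = {
    # label: (control mean, study mean)
    "al (sheep, table 1)": (254.6, 404.5),
    "al (cow, table 2)": (186.2, 278.7),
    "co (cow, table 2)": (0.271, 1.242),
    "cu (sheep, table 1)": (492.6, 1038.0),
}

for label, (control, study) in table_means_ppb.items():
    print(f"{label}: study/control = {study / control:.2f}")
```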
both locations have a similar value for aluminum, but in comparison with the who 2017 guidelines which are 200 ppb, both locations are higher than allowed limit that is due to the type of animal diet in both the groups [17]. arsenic is also very toxic to animals, because it affects their body through gastrointestinal tract and the cardiovascular system. symptoms of arsenic poisoning in animals include watery diarrhea, severe colic, dehydration, and cardiovascular collapse [13]. table 1 presents that the value of arsenic in sheep samples of control area and study area is 8.005 ppb and 6.256 ppb, respectively, and table 2 presents that the value in cow samples of control area and study area is 8.015 ppb and 7.478 ppb, respectively. both sample sheep and cows of control group are higher than the study area and it is due to the contamination of pasture by industrial emissions [14]. previous study shows a high concentration of arsenic in the meat of cattle and goats in bieszczady mountains [18]. all samples of both locations in this study are within the allowed limit of the who which is10 ppb. tables 1 and 2 show that the result of chromium in both the locations had high differences between the control group and study area. the value of the control group of both sheep and cows’ meat samples showed zero, but the study area location of the samples showed 0.752 and table 1: trace element concentration in control and study groups of sheep meat elements control group (ppb) study group (ppb) p value al 254.6±48.51 404.5±126.3 0.275 fe 1941±295.2 474.1±121.2 0.0001 hg 26.12±0.434 26.91±0.484 0.229 mn 159.5±31.21 179.7±28.88 0.638 zn 1006±100.9 1080±128.8 0.654 as 8.005±0.789 7.478±1.010 0.683 co 0.000±0.000 0.266±0.116 0.028 cr 0.000±0.000 0.752±0.347 0.037 cu 492.6±61.65 1038±253.8 0.043 results expressed as mean±se table 2: trace element concentrations in control and study groups of cow meat elements control group (ppb) study group (ppb) p value al 186.2±31.59 278.7±41.19 0.08 fe 1356±154.9 3720±534.3 0.0001 hg 26.49±0.455 26.78±0.585 0.699 mn 104.9±22.35 110.0±12.45 0.842 zn 685.9±90.73 1688±264.4 0.001 as 8.015±0.812 6.256±0.950 0.171 co 0.271±0.127 1.242±0.344 0.012 cr 0.000±0.000 6.692±4.636 0.157 cu 922.2±268.9 134.3±28.96 0.006 results expressed as mean±se maamun qadir salih, et al.: shewashok oil field impacts on environment uhd journal of science and technology | jan 2019 | vol 3 | issue 1 5 6.692 ppb, respectively, of sheep and cow sample; the high value of chromium in the study area is due to the release of chromium into the environment due to natural gas flaring during oil processing [19], [20]. this result was supported by the assessment of heavy metal pollution and contaminations in the cattle meat [13], [21]; however, the samples of both the locations had a lower value of chromium than allowed limit 50 ppb according to the who guideline. for cobalt, tables 1 and 2 show that the value between both locations in the sheep meat sample is 0.000 and 0.266 ppb, respectively, in control and study group, and for the cows’ meat sample, the recorded value in control and study group is 0.271 and 1.242 ppb, respectively; the study area was higher than control area by approximately 4.6 times, it might due to soil contamination, also pasture lands is recognized as a source of co, it can occur as a result of animal treading or soil splash on short pasture during heavy rain [22]. however, all samples of both the locations are within the allowed limit of the who which is 3 ppb. 
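the p values in tables 1 and 2 come from the two-group t-test described in section 2.4 (the study used spss 20); a minimal sketch of the same comparison with open-source tools is shown below, using hypothetical per-sample concentrations because the individual measurements are not reported in the paper:

```python
# minimal sketch of the two-group comparison from section 2.4, using
# hypothetical per-sample concentrations (ppb); the study used spss 20,
# but an independent-samples t-test gives the same kind of statistic.
import numpy as np
from scipy import stats

control = np.array([210.0, 245.3, 198.7, 301.2, 260.1])  # hypothetical control-group values
study = np.array([350.4, 420.8, 390.2, 510.6, 330.5])    # hypothetical study-group values

def summarize(x):
    """return mean and standard error of the mean, as reported in tables 1 and 2."""
    return x.mean(), x.std(ddof=1) / np.sqrt(len(x))

t_stat, p_value = stats.ttest_ind(control, study)  # two-sided t-test
print("control mean, se:", summarize(control))
print("study mean, se:", summarize(study))
print("significant at p < 0.05:", p_value < 0.05)
```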
mercury is volatile liquid metal, found in rocks and soils, and also is present in air as a result of human activities as the use of mercury compounds in the production of fungicides, paints, cosmetics, papers pulp, etc. the highest concentrations were found in soils from urban locations; mercury may induce neurological changes and some diseases [23]. table 1 shows that the sample of sheep meat had a high value of mercury contents of samples in the control and study area ranged between 26.12 ppb and 26.91 ppb, respectively, and also the sample of cows’ meat like sheep meat had a high amount in both location control group (26.49 ppb) and study area (26.78 ppb). a previous study findings comply with this finding as mercury recorded high in beef meat from algeria [14]. zinc is another essential element in our diet, but the excess may be harmful, and the provisional tolerable weekly intake (ptwi) zinc for meat is 700 mg/week/person [21]. the minimum and maximum levels of zn were detected in both the location of control group and study area of sheep samples which was recorded between 1006 and 1080 ppb respectively, and for the cows’ sample, was recorded 685.9 and 1688 ppb, respectively, in both location of control group and study area, and none of the samples exceeded the recommended limit 3000 ppb according to the who guideline. moreover, the difference between both positions of zinc metal is non-significant in sheep samples but for the cows’ sample had a highly significant; however, the meat sample of cows and sheep in study area location showed a higher value when compared to control group because of the high intake of zinc by animals, due to several factors, first of all having excessive amounts of zinc in animal’s food, pastures lands contaminated with smoke that polluted by zinc, surfaces painted with high-zinc paints where animals could lick them and finally food transport in galvanized containers that already containing zinc when manufactured [24], [25]. iron deficiency causes anemia and meat is the source of this metal; however, when their intake is excessively elevated, the essential metal can produce toxic effects [26]. table 1 shows that the iron value of control group and study area for sheep was 1941 and 474.1 ppb, respectively, and the amount of the control is more elevated than study group by 4 times, which recorded among the sheep meat samples [8]. table 2 shows that the value of iron in cows’ meat sample was 1356 and 3720 ppb, respectively, of the control group and study area, and both the locations are higher than allowed limit 300 ppb according to the who guideline. that is due to the type of feeding which contains dry plants that may be very rich with mentioned elements, or the consumed water is containing a high level of fe. tables 1 and 2 show that the value of manganese element in cows’ meat sample in control group and study area is 104.9 and 110 ppb, respectively, also in sheep meat sample, the amount of manganese of control and study area is 159.5 and 179.7 ppb, respectively, and the values of the study area is higher than the control group. although copper is essential for good health, the ptwi copper for fresh meat has been proposed as 14 mg/week/ person [13]. however, very high intakes can cause health problems such as liver and kidney damage [25]. determination of the cu content in food is also an important subject concerning human consumption [27], [28]. 
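the ptwi figures quoted above can be turned into a rough estimate of how much meat per week would deliver the full tolerable intake at the concentrations measured in this study; the sketch below assumes the ptwi values as stated in the text and treats ppb as µg of the element per kg of meat (illustrative only, not a dietary assessment):

```python
# rough weekly-intake check against the ptwi values quoted in the text.
# assumes ppb means ug of the element per kg of meat; illustrative only.
PTWI_MG_PER_WEEK = {"zn": 700.0, "cu": 14.0}  # per person, as stated in the text

def kg_of_meat_to_reach_ptwi(element, concentration_ppb):
    """kg of meat per week that would deliver the full ptwi of one element."""
    mg_per_kg_meat = concentration_ppb / 1000.0  # 1 ppb = 1 ug/kg = 0.001 mg/kg
    return PTWI_MG_PER_WEEK[element] / mg_per_kg_meat

# copper in study-area sheep meat (table 1): 1038 ppb
print(round(kg_of_meat_to_reach_ptwi("cu", 1038), 1), "kg/week")   # about 13.5
# zinc in study-area cow meat (table 2): 1688 ppb
print(round(kg_of_meat_to_reach_ptwi("zn", 1688), 1), "kg/week")   # about 414.7
```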
table 1 shows that the copper values of the control group and study area for sheep are 492.6 and 1038 ppb, respectively; the study area is higher than the control area by 2.1 times. the results of the present study indicate that the copper values in the study area were relatively high compared with the who guideline. this is because the metal enters through feed material from burning zones and transport exhaust products and ultimately passes into the tissues; excessive ingestion of copper by animals can also occur in various situations, such as grazing immediately after fertilization, pastures grown on soils containing a high concentration of copper, supply of wheat treated with antifungal drugs containing copper, and pasture contaminated by smoke from foundries [21]. this result is compatible with studies in other countries such as sweden, where high values of copper were found in cattle meat [28], [29]. regarding copper in cows, table 2 shows that the copper values of the control group and study area are 922.2 and 134.3 ppb, respectively; cows' meat from the control group had a higher cu concentration than the study group. in this study, copper values for both animals and locations are within the allowed limit of 1000 ppb of the who 2017, except for the study-area sheep samples, which slightly exceed it. in general, both areas are quite similar for cows and sheep because the values are close, which may be a result of the similarity of the geographic features and the absence of any effective physical barrier between the two locations.

3.2. section two: comparing the study area with the who 2017 standards
most of the elements, such as mn, zn, as, co, and cr, are within the who standards in both cows' and sheep meat; only cu is just above the who limit, by 38 ppb, in sheep meat samples, whereas it is within the standard in cow samples (table 3). al and fe both exceed the who guideline: al by 204.5 ppb and fe by 174.1 ppb in sheep samples, and al by 78.7 ppb and fe by 3420 ppb in cow samples. furthermore, hg is outside the who accepted range, with a large difference between the samples and the standard value, being more than 4 times higher than the standard (table 3). this simple comparison shows that most of the elements, such as mn, zn, as, co, and cr, are within the who standards, which means that they pose no health risks to consumers [30], [31]. cu, which is an essential trace element for the human body, is just above the who limit only in the sheep meat samples, but it probably does not cause a serious health risk, as the exceedance is negligible. cu can increase in the animal body if the consumed vegetable leaves are contaminated with cu [8]. both al and fe clearly exceed the who guideline; as discussed in the first section, the high value of al is due to the type of diet of both animals in the study area [17]. fe is higher than the who standards in both animals' meat samples, but it is very high in the cows' meat samples, as shown above. both excessive and deficient fe intake can lead to health disorders [32]. fe is a naturally occurring element, but the extremely high value read in the cow samples may be due to human intervention through the quality of the air, water, or food consumed by the animals; however, there is as yet no study of the air, water, or vegetation quality of the study area.
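the comparison in section 3.2 reduces to subtracting each who 2017 limit from the corresponding study-area mean; a short sketch of that check is given below, using the table 3 values (for elements whose limit is given as a range, the upper bound is used):

```python
# flag study-area means that exceed the who 2017 limits listed in table 3.
who_limit_ppb = {"al": 200, "fe": 300, "hg": 6, "mn": 400, "zn": 3000,
                 "as": 10, "co": 3, "cr": 50, "cu": 1000}
study_sheep_ppb = {"al": 404.5, "fe": 474.1, "hg": 26.91, "mn": 179.7, "zn": 1080,
                   "as": 7.478, "co": 0.266, "cr": 0.752, "cu": 1038}

for element, measured in study_sheep_ppb.items():
    excess = measured - who_limit_ppb[element]
    status = f"exceeds limit by {excess:.1f} ppb" if excess > 0 else "within limit"
    print(f"{element}: {measured} ppb -> {status}")
```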
furthermore, hg, which is a toxic element [8], [9], is outside the who accepted range, with a large gap between the samples and the standard value. a previous study supports this finding, as mercury was also recorded high in beef meat in north algeria [14]. hg is a naturally occurring element, but a high value in the body can have a detrimental effect on the health of consumers [33], such as damage to the nervous system, liver, and eyes, and deformation of infants; other symptoms of mercury toxicity are headache, fatigue, anxiety, lethargy, and loss of appetite.

4. conclusion
the present findings indicate that the trace elements iron, cobalt, copper, zinc, arsenic, manganese, aluminum, mercury, and chromium were detected in all the samples. only hg, al, and fe, in both sheep and cows' meat, presented high values in both groups in comparison with the allowed limits of the who 2017. overall, however, this study confirms that the cow and sheep meat is still safe to eat in the study area, because only al, fe, and hg were found to exceed the limits, while all other elements comply with the global standards.

5. acknowledgment
our special thanks and much appreciation go to the workers in the slaughterhouses of erbil and koya for their support and cooperation with the data collection. we would like to thank garmian university for testing the samples and koya university for the use of its laboratories.

table 3: comparison of study area with the who 2017 standards
elements | who (ppb) | study group (sheep) (ppb) | study group (cow) (ppb)
al | 200 | 404.5 | 278.7
fe | 100-300 | 474.1 | 3720
hg | 1-6 | 26.91 | 26.78
mn | 100-400 | 179.7 | 110.0
zn | 3000 | 1080 | 1688
as | 10 | 7.478 | 6.256
co | 3 | 0.266 | 1.242
cr | 50 | 0.752 | 6.692
cu | 1000 | 1038 | 134.3
this table is based on tables 1 and 2 and the who standards for heavy metals 2017.

references
[1] a. y. ali, n. j. hamad, and r. r. hamadamin. "assessment of the physical and chemical properties of groundwater resources in the shewashok oil field". koya university journal, vol. 45, pp. 163-183, 2018.
[2] f. i. ibitoye. "ending natural gas flaring in nigeria's oil fields". journal of sustainable development, vol. 7, no. 3, p. 13, 2014.
[3] m. durkalec, j. szkoda, r. kolacz, s. opalinski, a. nawrocka and j. zmudzki. "bioaccumulation of lead, cadmium and mercury in roe deer and wild boars from areas with different levels of toxic metal pollution". international journal of environmental research, vol. 9, no. 1, pp. 205-212, 2015.
[4] q. zhou, j. zhang, j. fu, j. shi and g. jiang. "biomonitoring: an appealing tool for assessment of metal pollution in the aquatic ecosystem". analytica chimica acta, vol. 606, no. 2, pp. 135-150, 2008.
[5] s. stankovic, p. kalaba and a. r. stankovic. "biota as toxic metal indicators". environmental chemistry letters, vol. 12, no. 1, pp. 63-84, 2014.
[6] c. n. nwankwo and d. o. ogagarue. "effects of gas flaring on surface and ground waters in delta state, nigeria". journal of geology and mining research, vol. 3, no. 5, pp. 131-136, 2011.
[7] k. ihesinachi and d. eresiya. "evaluation of heavy metals in orange, pineapple, avocado pear and pawpaw from a farm in kaani, bori, rivers state nigeria". international journal of environmental research and public health, vol. 1, pp. 87-94, 2014.
[8] world health organization. "trace elements in human nutrition and health". world health organization, geneva, 1996.
[9] a. mehri and r. f.
marjan. “trace elements in human nutrition: a review”. international journal of medical investigation, vol. 2, pp. 115-28, 2013. [10] r. munoz-olives and c. camara. speciation related to human health. in: l. ebdon, l. pitts, r. cornelis, h. crews, o. f. donard and p. quevauviller, editors. “trace element speciation for environment food and health”. the royal society of chemistry, cambridge, pp. 331-353, 2001. [11] food and agriculture organization. standard for contaminants and toxins in consumer products human and animal. in: “codex alimentarius”. food and agriculture organization, geneva, pp. 193, 1995. [12] a. naqshabandy. “regional geography of kurdistan-iraq”. 1st ed. braiaty center, erbil, pp. 74 -78, 1998. [13] k. sathyamoorthy, t. sivaruban, and s. barathy. “assessment of heavy metal pollution and contaminants in the cattle meat”. journal of industrial pollution control, vol. 32, no. 1, pp. 350-355, 2016. [14] b. badis, z. rachid and b. esma. “levels of selected heavy metals in fresh meat from cattle, sheep, chicken and camel produced in algeria”. annual research and review in biology, vol. 4, no. 8, p. 1260, 2014. [15] o. akoto, n. bortey-sam, s. m. nakayama, y. ikenaka, e. baidoo, y. b. yohannes, h. mizukawa and m. ishizuka. “distribution of heavy metals in organs of sheep and goat reared in obuasi: a gold mining town in ghana”. international journal of environmental science and technology, vol. 2, no. 2, pp. 81-89, 2014. [16] m. bettinelli, g. beone, s. spezia, and c. baffi. “determination of heavy metals in soils and sediments by microwave-assisted digestion and inductively coupled plasma optical emission spectrometry analysis”. analytica chimica acta, vol. 424, no. 2, pp. 289-296, 2000. [17] o. miedico, m. iammarino, g. paglia, m. tarallo, m. mangiacotti and a. e. chiaravalle. “environmental monitoring of the area surrounding oil wells in val d’agri (italy): element accumulation in bovine and ovine organs”. environmental monitoring and assessment, vol. 188, no. 6, p. 338, 2016. [18] j. krupa and j. swida. “concentration of certain heavy metals in the muscles, liver and kidney of goats fattened in the beiszczady mountains”. animal science, vol. 15, pp. 55-59, 1997. [19] m. malarkodi, r. krishnasamy, r. kumaraperumal and t. chitdeshwari. “characterization of heavy metal contaminated soils of coimbatore district in tamilnadu”. agronomy journal, vol. 6, pp. 147-151, 2007. [20] agency for toxic substances and disease registry. “toxicological profile for chromium. agency for toxic substances and disease registry”. u.s. department of health and human services. public health service, united states, pp. 263-278, 2012. [21] p. trumbo, a. a. yates, s. schlicker and m. poos. “dietary reference intakes: vitamin a, vitamin k, arsenic, boron, chromium, copper, iodine, iron, manganese, molybdenum, nickel, silicon, vanadium, and zinc”. journal of the academy of nutrition and dietetics, vol. 101, no. 3, p. 294, 2001. [22] e. d. andrews, b. j. stephenson, j. p. anderson and w. c. faithful. “the effect of length of pasture on cobalt deficiency in lambs”. new zealand journal of agricultural research, vol. 1, pp. 125-139, 1958. [23] t. d. luckey and b. venugopal. “metal toxicity in mammals”. plenum press, new york, p. 25, 1977. [24] o. m. radostits, c. c. gay, d. c. blood and k. w. hinchcliff. doenças causadas por substâncias químicas inorgâncias e produtos químicos utilizados nas fazendas. in: o. m. radostits, c. c. gay, d. c. blood and k. w. hinchcliff, editors. 
“clínica veterinária: um tratado de doenças dos bovinos, ovinos, suínos, caprinos e equinos”. guanabara koogan, rio de janeiro, pp. 1417-1471, 2002. [25] e. manno, d. varrrica and g. dongarra. “metal distribution in road dust samples collected in an urban area close to a petrochemical plant at gela, sicily”. atmospheric environment, vol. 40, pp. 59295941, 2006. [26] p. ponka, m. tenenbein and j. w. eaton. iron. in: g. f. nordberg, b. a. fowler, m. nordberg and l. t. friberg, editors. “handbook on the toxicology of metals”. academic press, san diego, vol. 30, pp. 577-598, 2007. [27] f. zhang, x. yan, c. zeng, m. zhang, s. shrestha, l.p. devkota and t. yao. “influence of traffic activity on heavy metal concentration of roadside farmland soil in mountainous areas”. international journal of environmental research and public health, vol. 9, pp. 1715-1731, 2012a. [28] l. johrem, b. sundstrom, c. astrand and g. haegglund. “the levels of zinc, copper, manganese, selenium, chromium, nickel, cobalt and aluminium in the meat, liver and kidney of swedisch pigs and cattle”. zeitschrift für le0bensmittel-untersuchung und -forschung, vol. 188, pp. 39-44, 1989. [29] j. falandysz. “some toxic and essential trace metals in cattle from the northern part of poland”. science of the total environment, vol. 136, pp. 177-191, 1993. [30] world health organization. “guidelines for drinking-water quality: maamun qadir salih, et al.: shewashok oil field impacts on environment 8 uhd journal of science and technology | jan 2019 | vol 3 | issue 1 incorporating first addendum”. world health organization, geneva, 2017. [31] world health organization. “guidelines for drinking-water quality: recommendations”. world health organization, geneva, 2004. [32] g. nordberg, b. a. fowler and m. nordberg. “handbook on the toxicology of metals”. academic press is an imprint of elsevier, london, 2014. [33] k. m. rice, e. m. walker jr., m. wu, c, gillette and e. r. blough. “environmental mercury and its toxic effects”. journal of preventive medicine and public health, vol. 47, no. 2, pp. 74, 2014. . 24 uhd journal of science and technology | may 2018 | vol 2 | issue 2 1. introduction electronic and smart health-care systems have changed the way we receive care and have improved quality and reduced cost [1]. in electronic health-care systems, many stakeholders collaborate with the aim to provide the right care at the right time within the right cost. achieving the aim is not without obstacles, and there are challenges many of which are yet to be addressed by researchers and system developers. vir tual health-care systems where patients receive care without face-to-face meetings are increasingly becoming the norm due to advances in communication technologies. we have previously suggested the use of virtual breeding environment (vbe) and virtual organization (vo) concepts for health care by mahmud and lu [2], and we have explained the benefits of using such concepts in providing virtual health care. in section 3.1, we introduce vbe and vo concepts briefly. 
one of the challenges of any virtual collaboration system a blockchain-based service provider validation and verification framework for health-care virtual organization hoger mahmud1,2, joan lu2, qiang xu3 1department of computer science, college of science and technology, university of human development, kurdistan region, iraq, 2department of computing science, school of computing and engineering, university of huddersfield, huddersfield, uk, 3department of engineering, school of computing and engineering, university of huddersfield, huddersfield, uk a b s t r a c t virtual organization (vo) and blockchain are two newly emerging technologies that researchers are exploring their potentials to solve many information and communication technology unaddressed problems and challenges. health care is one of the sectors that are very dynamic, and it is in need of constant improvement in the quest to better the quality of cares and reduce cost. one of the hotlines of research in the sector is the use of information and communication technology to provide health care, and this is where the concept of virtual health care is relevant. in virtual health care, patients and care providers are collaborating in virtual settings where two of the most difficult challenges are verifying and validating the identity of the communicating parties and the information exchanged. in this paper, we propose a conceptual framework using blockchain technology to address the health-care provider and record verification and validation issue. the framework is specific to health-care systems developed based on virtual breeding environment and vo. we outline and explain each step in the the framework and demonstrate its applicability in a simple health-care scenario. this paper contributes toward the continuing effort to address user identity and information verification and validation issues in virtual settings in general and in health care in specific. index terms: blockchain, conceptual framework, validation and verification, virtual health care, virtual organization corresponding author’s e-mail: hoger mahmud, department of computer science, college of science and technology, university of human development, kurdistan region, iraq, department of computing science, school of computing and engineering, university of huddersfield, huddersfield, uk received: 26-07-2018 accepted: 14-08-2018 published: 24-08-2018 access this article online doi: 10.21928/uhdjst.v2n2y2018.pp24-31 e-issn: 2521-4217 p-issn: 2521-4209 copyright © 2018 mahmud, et al. this is an open access article distributed under the creative commons attribution non-commercial no derivatives license 4.0 (cc by-nc-nd 4.0) r e v i e w a r t i c l e uhd journal of science and technology hoger mahmud, et al.: a blockchain-based service provider validation and verification framework for health-care virtual organization uhd journal of science and technology | may 2018 | vol 2 | issue 2 25 is user verification and validation. to ensure the quality and integrity of health-care services provided through such virtual systems as well as preventing information falsification and identity thefts, user verification and validation are essential. validation is necessary to ensure that the right provider with the right attribute as specified by the requester is selected and verification is necessary to ensure that the information provided is correct. 
in this paper, we propose a framework that uses blockchain technology to verify and validate health-care providers in vbe-based health-care systems. in general, speaking blockchain records and stores transaction in a package called “block” and blocks are linked together in a distributed system. blockchain technology is gaining interest to be used in various fields due to its flexibility in modifying the basic concept to be applied in various forms. currently, well-known companies such as ibm, the tierion/philips partnership (netherlands), brontech (australia), gem (u.s.), and guardtime (europe) are applying and adapting the technology for their own particular needs [3]. for further clarification, we briefly introduce the blockchain concept in section 3.2. a recent study by deloitte has found that health-care providers are planning to use blockchain technologies in a wide scale as the technology gaining momentum both theoretically and practically. zyskind et al. [4] suggested that blockchain technology can be a solution to the user identity verification problem that current authentication systems have a password, and dual-factor verification and validation mechanisms have not been successfully. the framework can also be used for record verification and validation which falls within user verification and validation issue. statistics point to big health-care record keeping security issues, for example, in 2015 there were 112 million health-care record hacks [5]. medical records are sensitive, and any alteration to its content may result in serious consequences. to ensure record integrity, blockchain can act as a distributed database that is secure and safeguard medical records against tempering [4]. the proposed framework does not present the technical aspects of implementing blockchain technology nor does it specify the blockchain mechanism to be used. as a first step, we have outlined the main information flow steps and have identified the required parties that should be members in a chain to verify and validate a health-care provider in vbe-based virtual health-care systems. we have also demonstrated the applicability of the framework in a simple but non-trivial virtual health-care scenario. this paper contributes toward the use of blockchain technology in health care in general and vbe-based health-care systems in specific for user verification and validation. the rest of this paper is organized as follows: section 2 provides some related research. in section 3, we provide brief background information about vbe, vo, and blockchain as well as outlining and explaining the proposed framework. we demonstrate the use of the framework in section 4 and discuss the result in section 5. we finally conclude in section 6. 2. related work blockchain concept was first introduced in 2008 [6], and later in 2009, the concept was implemented in creating the first cryptocurrency (bitcoin) [7]. the technology is considered for use in health care and is already in use to provide a number of health-care services, for example, a system called “prescrypt” is developed by deloitte netherlands in partnership with sns bank and radboud3. the system enables patients to have full control over their data including allowing or revoking providers to access their data [8]. some companies use blockchain in health care, for example, gem (in collaboration with philips healthcare blockchain lab), pokitdok, healthcoin, hashed health, and many others [9]. 
other researchers have considered the use of blockchain technology for patient identification which allows a single person identification [10], and the use of blockchain in health care is considered by alhadhrami et al. [1] for sharing health records between all relevant health-care stakeholders safely. as for data verification and validation in health care, the technology is used in various implemented and proposed systems, for example, the technology is used in developing a decentralized patient record database where data can be shared among many different parties with no concern for the integrity of the data [8]-[11]. health bank (www.healthbank. coo) which is a swiss company is planning to use blockchain to give full control of data usage to users through the use of the blockchain technology for transaction verification and validation. in a virtual healthcare setting the reputation of a care provider in terms of academic achievements and practical experience is one of the key selection attribute to provide a particular care. this is because care providers with high reputation presumed to provide better quality of care however the challenge here is how to verify and validate a reputation claim made by a care provider. blockchain technology is proposed as a possible verification and validation technology for health-care provider reputations, for example, the authors of sharples and domingue [12] propose to use the technology in a system that can verify hoger mahmud, et al.: a blockchain-based service provider validation and verification framework for health-care virtual organization 26 uhd journal of science and technology | may 2018 | vol 2 | issue 2 and validate educational records of health-care providers. the authors of carboni [13] have developed a reputation model based on blockchain where customers can provide feedback after receiving a service from a provider and calculate the providers reputation based on feedbacks they receive. gem health network launched by gem a us startup uses blockchain technology to provide an infrastructure for health-care specialists to share information [14]. the technology is researched for fighting drug counterfeiting by hyperledger in collaboration with accenture, cisco, intel, ibm, block stream, and bloomberg [15]. the technology is also considered and used in other fields, for example, it has been used in financial services such as online payment [16] and has been considered in other services such as smart contracts [17] and public services [18]. the dutch ministry of agriculture is currently running a project called blockchain for agrifood that aims to explore the potential of blockchain technology in agriculture [19]. blockchain is also used by the social enterprise “provenance” (www.provenance.org) in the uk to record and validate certificates in agriculture supply chains. the technology is also used in music industry, for example, startups such as ujo or peertraks propose to use the technology to manage music rights [20]. our prosed use of the technology differs from all the above researches as we are the first (to the best of our knowledge) to suggest the use of blockchain technology for health-care provider verification and validation in vbe-based health-care systems. 3. background and framework definition in this section, we briefly introduce vbe, vo, and blockchain technology and we also define the proposed framework and explain its main steps. 3.1. 
vbe and vo internet and telecommunication technologies have paved the way for a new type of collaboration known as “virtual collaboration [21], [22]. the fact that virtual collaboration occurs between unknown participants has given rise to the challenge of collaboration management and regulation in a virtual world. to address the challenge, researchers have proposed vbe and vo [23], [24]. the concepts are researched for collaboration management and regulation in education, e-commerce, and teleworking [25], [26]. the framework proposed in this paper is specific to health-care services provided through systems which are developed based on vbe and vo. vo is a short-lived temporarily consortium where a number of parties collaborating and working together to provide a particular service. vo is described as “a loosely bound consortium of organizations that together address a specific demand that none of them can (at the given time) address alone and once the demand has been satisfied the vo might disband” [27]. vbe, on the other hand, is a permanent consortium of parties that provide the environment and support for vo creation and management. participants in both vbe and vo can be human or machines or both, but they all have to collaborate through communication technologies [28]. 3.2. blockchain blockchain concept was developed from bitcoin paper published by nakamoto in 2008. it is a peer-to-peer network where all participants (peers) serve as a node and all nodes hold the same information (hash value in this case). blockchain uses cr yptographic techniques to record transactions between peers in a peer-to-peer network and store the transaction in a digital ledger as a block. blocks are linked together for validation and verification purposes. each block is comprised of three main parts which are block headers, a hash value of the previous transaction, and merkle root as illustrated in fig. 1. each block contains a unique hash value that is the transaction recorded and distributed to all nodes in the chain after its creation and all have to agree before a change in the block can happen. the uniqueness of a hash value comes from the fact that any combination of data produces a unique hash value and this value changes if there is any alteration to the data; this mechanism ensures data validity. the use of cryptographic techniques in blockchain enhances the security of the data within a transaction which is an essential requirement of any health-care system. blockchain uses the public key cryptographic technique to encrypt transactions, and it is visible to all participants in a blockchain; however, to decrypt fig. 1. blocks linked in a chain. hoger mahmud, et al.: a blockchain-based service provider validation and verification framework for health-care virtual organization uhd journal of science and technology | may 2018 | vol 2 | issue 2 27 a transaction, a participant must have a private key which is not publically available [29]. in general, there are two types of blockchain which care for permissionless and permissioned blockchain. in a permissioned type of blockchain, a central authority controls all requests for change to transaction records or any other modification and the requester will have to go through access controls such as identity verification to access transactions [30]. on the other hand in permissionless blockchain, there is no central authority and requests can be made freely to change transaction records. 
examining which type of blockchain is most suited for health-care virtual collaboration is beyond the scope of this paper as we only outline a framework without going into technical details; however, we think that it is an interesting topic to research. fig. 2 illustrates the two types of blockchain. 3.3. framework definition here, we outline the proposed framework and provide more insights to each of the framework steps. our proposed framework is conceptual rather than structural, and the purpose is to provide vbe-based health-care system developers with a step-by-step guide as to how to verify and validate health-care service providers and records using blockchain technology. for the framework to work, the following requirements should be fulfilled: 1. a virtual health-care system must use vbe and vo concept as a base for collaboration and organization of care provision which means that there must be a virtual environment where patients can send requests to and the environment creates a vo for the service requested after all requirements are fulfilled. 2. service providers are recruited either within the vbe or from a global pool of virtual health-care providers after their credentials are verified and validated. 3. a blockchain is created between a number of vbes, academic institutes, and health institutes where credentials of care providers are shared in blocks between all participants. each blockchain participant has a job to do as follows: a. vbes: it provides information about health-care providers that have provided care within their environments for reputation verification and validation purposes. vbes can also take part in health record verification and validation through sharing their records in the created blockchain. b. academic institutes: it provides information about the qualification that health-care providers claim to possess for credential verification and validation. c. health institutes: these provide information about the practices and experiences of health-care providers in real life situation and verify and validate the level of expertise and experience that providers claim to possess. after all the above requirements are fulfilled, we suggest an eight-step framework which is illustrated in fig. 3 to verify and validate providers and records as follows: 1. a health-care service request is triggered: this step serves as the trigger for the whole validation and verification process. in this step, a patient sends a request to a vbe for a virtual health-care service; for example, a patient would like to consult a doctor about a pain that he/she has developed in the neck after a minor car accident. the request can also be for a change of record that is held by a particular vbe, for example, a patient would like to make changes to the address registered in his/her record. 2. a health-care service accepted: vbes cannot provide all types of care since health-care services are many and can change on a case bases; therefore, the vbe would have to check the details of the request to see if the requested service falls within their scope of work. if the request passes the check, then it is the job of the vbe to find the right health-care service provider after which a vo is created to provide the care. to do so, the vbe searches within its resources for the right service provider if not found the vbe would have to search the global pool for the right care provider. 3. 
after a provider is found, contacted, and accepted to offer the service, the vbe would have to verify and validate the credentials of the provider before final go-ahead for the service provision and vo creation. if the request is for changes in records held by the vbe, the credentials of the requester should be verified and validated before the change can be made. 4. vbe share the credentials: after step 3, the vbe would now have to share the record or the provider details with other participants using blockchain technology for verification and validation. 5. blockchain-based verification and validation: when the information is shared, now each node in the chain would compare the information provided with the record held in blocks within their system for verification fig. 2. types of blockchain. hoger mahmud, et al.: a blockchain-based service provider validation and verification framework for health-care virtual organization 28 uhd journal of science and technology | may 2018 | vol 2 | issue 2 and validation. the comparing process is done using consensus algorithms such as proof of work. in section 3.2 we have explained that blockchain is a peer-to-peer network and each peer in the network is a node that holds copies of transactions made in the network. when a new block is created, it is distributed to all nodes. the nodes will have the responsibility to validate the content of the block through comparing it with the block that is already held by the node .in blockchain an exact copy of a transaction (block) is held by all nodes in the chain, when a block is changed the request for change has to be broadcastand all nodes would have to approve the change to the block before a new block with the requested change is added to the network. in this case, if a service provider has provided false information, or if a record content is altered, it would be detected and rejected easily. this method of validation and verification is more robust than the insystem verification and validation since a record held in a system database can be hacked or altered, whereas in blockchain, it is impossible to alter data without all participant approval. 6. new block validation and creation: sometimes, a request is send by a vbe where its content is new, for example, a care provider qualification needs to be verified and validated. in this case, the request would have to be compared with the records held by an academic institution, and once verified and validated, a new block would be created and added to the network. the step six is there for two purposes, the first is that participants would work as a peer in the network to provide verification and validation for blocks already created, and the second is to create new blocks and add it to the network as requests for information verification and validation comes into the network. 7. request result: once the result of the request is complete, it is sent back to the vbe, if the result is positive, the vbe would take steps to create a vo for the service otherwise new service provider has to be found, and the steps 2–7 have to be repeated or the whole process is stopped and the requester is informed of the reason. 8. vo creation: a vo is a short-lived entity created fig. 3. the proposed framework. 
hoger mahmud, et al.: a blockchain-based service provider validation and verification framework for health-care virtual organization uhd journal of science and technology | may 2018 | vol 2 | issue 2 29 to provide a specific service, and once the goal is achieved, the vo is dismantled and the service ends. if the result of step 7 is positive, then a vo is created where both service requesters and service providers can communicate and collaborate. the process of vo creation mechanism is beyond the scope of this paper as we are currently researching on actively. 4. case study one of the requirements of a service requested is that the service has to be feasible virtually, i.e. the service requested has to be achievable through an online system. healthcare services are complex with some requiring face-to-face meetings between care requesters and providers, and others can be achieved in a virtual system. one of the most common virtual heath-care service requests is for consultation. this where a patient would like to receive guidance about a particular medical needs or addresses a concern he/she may have. to show simply and effectively the contribution of the framework in verifying and validating care providers and records, we consider the scenario below: mr. adam has recently been involved in a car accident and has developed a neck pain after the accident. despite visiting a hospital a couple of times, the pain is still present and he would like to consult with a bone specialist that was not available in his local hospital. a vbe called “virtual hospital system” is introduced to him by a friend, and now he would like to contact the vbe for a service. he fills in the virtual care request form for a consultation service with a bone specialist. he specifies in his request that the bone specialist should have a good reputation and minimum 5 years of care provision experience. the specialist should be an eu graduate and speak very good english. it is now the job of the vbe to find the right specialist for mr. adam. to ensure the right specialist if put into contact with mr. adam in a vo, the vbe uses the proposed framework and take the following steps: 1. the vbe searches though its database for a specialist that fulfills the requirements, but we assume that it fails to find one. the vbe then broadcasts the request and search for the right specialist in the global pool of care providers. 2. in the search process, the details of a specialist who lives in different countries than that of mr. adam match the requirements specified in the request form. the vbe contacts the specialist and offers to recruit him to provide the service and he accepts. in his profile, he claims that he is a uk-based university qualified with 7 years of experience in a german-based hospital. however, since the specialist is unknown to the vbe, the claims have to be verified and validated before the final go ahead. 3. the vbe create a block using the information provided by the specialist and broadcast it for verification and validation in the created blockchain. now using blockchain mechanisms, the claims can easily be verified and validated by comparing the information in the block with those held by the network participants. 4. the result is sent back to the vbe and if positive put both mr. adam and the specialist into contact by creating a vo for them, and otherwise, the vbe withdraws the recruitment offer made to the specialist and search for another one or terminate the process. 
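step 3 of the scenario, where the vbe packages the specialist's claimed credentials into a block and broadcasts it for comparison with the blocks already held by the academic and health institutes, can be illustrated with the hash-linking idea described in section 3.2; the sketch below is a deliberate simplification with hypothetical field names, not the consensus-based implementation that the framework leaves open:

```python
# minimal illustration of hash-linked credential blocks (section 3.2) applied to
# step 3 of the scenario; field names and data are hypothetical.
import hashlib
import json

def block_hash(previous_hash, payload):
    """hash the previous block's hash together with this block's payload."""
    record = json.dumps({"prev": previous_hash, "payload": payload}, sort_keys=True)
    return hashlib.sha256(record.encode()).hexdigest()

# credentials as recorded earlier by an academic institute and a health institute
genesis = block_hash("0" * 64, {"issuer": "uk_university", "degree": "orthopaedics"})
experience = block_hash(genesis, {"issuer": "german_hospital", "years": 7})

# the vbe rebuilds the chain from the specialist's claim and compares hashes;
# any altered claim (e.g. 10 years instead of 7) produces a different hash.
claim = block_hash(genesis, {"issuer": "german_hospital", "years": 7})
forged = block_hash(genesis, {"issuer": "german_hospital", "years": 10})
print("claim verified:", claim == experience)          # True
print("forged claim verified:", forged == experience)  # False
```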
the above scenario can simply demonstrate the applicability and the contribution of the framework in a clear manner; however, it must be said that the framework is conceptual and yet to be implemented for a real test which is something we still working on alongside other concepts to create the first vbe-based health-care system. the purpose of this paper is to share the principal concept and build on it in later works. 5. discussion ever since the alma-ata world leaders meeting that declared health care as a fundamental human right, many efforts and investments have been channeled through different healthcare systems around the world to ensure the delivery of this right. however, the main goal which was every human is entitle to receive quality care which is yet to be realized and this led to the world health organization to call for universal health-care coverage [31]. it is a known fact that most health-care systems are failing care receivers due to lack of stakeholder data safety, unacceptable quality of care, and limited care availability which all point to the need for change in health care. in the search for new ways to provide health care blockchain technology is seen by many researchers as a revolution with the potential to change the way heath care is provided currently [32], [33]. in this paper, we have outlined a framework that uses blockchain to address one of the most known issues in health care that can be provided virtually which is user verification and validation. despite the invention of many techniques such as username and password authentication to ensure the identity and validity of the claims that are made by virtual care providers and verify their suitability to provide a requested care, the issue remains at large. the proposed framework is developed to hoger mahmud, et al.: a blockchain-based service provider validation and verification framework for health-care virtual organization 30 uhd journal of science and technology | may 2018 | vol 2 | issue 2 contribute to the ongoing work to address the issue. the framework is simple and feasible as all technologies required to apply the framework are available. however, the framework is conceptual and required to be implemented and tested to show its full potential in contributing to the issue of data verification and validation in virtual health-care systems. the main contribution of this paper is the consideration of using blockchain in vbe-based health-care systems for service provider and records verification and validation as a concept, and the aim is to pave the way for further research and provide a basic validation and verification guide to system developers. as we have presented in this paper, blockchain technology is being considered for use in health care to address various issues in the field. however, despite the apparent theoretical applications of blockchain technologies in health care, the technology is yet to be applied fully due to its infancy and lack of technical implementation knowledge. one of the downsides of blockchain technology is the fact that operation costs are difficult to estimate as the computing power required to run it changes continuously as the number of hot nodes changes in the chain [5]. however, blockchain technology has the potential to be used in health-care areas such as medicine authenticity identification and patient record sharing. swan [8] identified a number of opportunities that blockchain technology can provide in health care such as: 1. 
removal of third party between health-care providers and receivers as well as various health-care providers 2. minimizing transaction costs as all transactions are transparent, direct, and happen real time 3. ensuring the data shared between healthcare stakeholders is the last updated version as changes to stakeholder records are made real-time and updates are distributed to all nodes in the chain. 4. creating one single and secure patient record access mechanism 6. conclusion health-care provision is changing as different techniques are proposed to make health care more available and accessible with better quality and less cost. one of the techniques that are becoming familiar is receiving care through online without face-to-face meetings which is known as e-health or virtual health care. the technique has a number of challenges which are yet to be addressed fully, and one of which is record and service provider verification and validation. in this paper, we have outlined an eight-step framework that uses blockchain to address the issue in vbe-based health-care systems. the framework is conceptual and yet to be implemented, but we have demonstrated its applicability through applying it to a simple scenario that results in verifying and validating a care provider. this paper contributes toward tackling the challenge of verifying and validating users and records in health care and considers the use of blockchain for the first time in vbe-based healthcare systems. we plan to research further the possibility of implementing and testing the framework to uncover its full potential for virtual health-care systems. references [1] z. alhadhrami, s. alghfeli, m. alghfeli, j. a. abedlla and k. shuaib. “introducing blockchains for healthcare.” electrical and computing technologies and applications (icecta), 2017 international conference, pp. 1-4, 2017. [2] h. mahmud and j. lu. “a generic vobe framework to manage home healthcare collaboration,” journal of theoretical and applied information technology, vol. 80, no. 2, p. 362, 2015. [3] c. stagnaro. “white paper: innovative blockchain uses in health care.” [4] g. zyskind, o. nathan and others. “decentralizing privacy: using blockchain to protect personal data.” security and privacy workshops (spw), 2015 ieee, pp. 180-184, 2015. [5] c. p. transaction. “blockchain: opportunities for health care.” cp transaction, 2016. [6] s. nakamoto, “bitcoin: a peer-to-peer electronic cash system.” cp transaction, 2008. [7] z. zheng, s. xie, h. dai, x. chen and h. wang. “an overview of blockchain technology: architecture, consensus, and future trends.,” big data (bigdata congress), 2017 ieee international congress on, pp. 557-564, 2017. [8] m. swan. “blockchain: blueprint for a new economy.” california: o’reilly media, inc, 2015. [9] “how blockchain can solve real problems in healthcare.” available: https://www.linkedin.com/pulse/how-blockchain-cansolve-real-problems-healthcare-tamara-stclaire. [jun. 5, 2018]. [10] l. a. linn and m. b. koo. “blockchain for health data and its potential use in health it and health care related research.” onc/ nist use of blockchain for healthcare and research workshop. gaithersburg, maryland, united states: onc/nist, 2016. [11] “the blockchain for healthcare: gem launches gem health network with philips blockchain lab.” available: https://www. bitcoinmagazine.com/articles/the-blockchain-for-heathcaregem-launches-gem-health-network-with-philips-blockchainlab-1461674938. [may 02, 2018]. [12] m. sharples and j. domingue. 
“the blockchain and kudos: a distributed system for educational record, reputation and reward.” european conference on technology enhanced learning, pp. 490-496, 2016. [13] d. carboni. “feedback based reputation on top of the bitcoin blockchain.,” arxiv preprint arxiv:1502.01504, 2015. [14] “gemhealth.” available: https://www.bitcoinmagazine.com/articles/ the-blockchain-for-heathcare-gem-launches-gem-health-networkwith-philips-blockchain-lab-1461674938/. [apr. 2, 2018]. [15] “applying blockchain technology to medicine traceabilit.” hoger mahmud, et al.: a blockchain-based service provider validation and verification framework for health-care virtual organization uhd journal of science and technology | may 2018 | vol 2 | issue 2 31 available: https://www.securingindustry.com/pharmaceuticals/ applying-blockchain-technology-to-medicine-traceability/s40/ a2766/#.w1htsnizbiu.[last accessed on 2018 mar 27]. [16] g. peters, e. panayi and a. chapelle. “trends in crypto-currencies and blockchain technologies: a monetary theory and regulation perspective.” journal of financial perspectives, pp. 38-69, 2015. [17] a. kosba, a. miller, e. shi, z. wen and c. papamanthou. “hawk: the blockchain model of cryptography and privacy-preserving smart contracts.” 2016 ieee symposium on security and privacy (sp), pp. 839-858, 2016. [18] b. w. akins, j. l. chapman and j. m. gordon. “a whole new world: income tax considerations of the bitcoin economy.” pittsburgh tax review, vol. 12, p. 25, 2014. [19] l. ge, c. brewster, j. spek, a. smeenk, j. top, f. van diepen, b. klaase, c. graumans and m. de r. de wildt. “blockchain for agriculture and food.” wageningen economic research, p. 112, 2017. [20] m. mettler. “blockchain technology in healthcare: the revolution starts here.” e-health networking, applications and services (healthcom), 2016 ieee 18th international conference on, pp. 1-3, 2016 [21] r. p. biuk-aghai and s. simoff. “patterns of virtual collaboration in online collaboration systems.” proceedings of the iasted international conference on knowledge sharing and collaborative engineering, st. thomas, usvi, pp. 22-24, nov, 2004. [22] l. wainfan and p. k. davis. “challenges in virtual collaboration: videoconferencing, audioconferencing, and computer-mediated communications”. rand corporation, 2004. [23] c. zirpins and w. emmerich. “virtual organisation by service virtualisation: conceptual model and e-science application.” research notes rn/07/07, university college london, dept. of computer science, 2007. [24] e. ermilova and h. afsarmanesh. “modeling and management of profiles and competencies in vbes.” journal of intelligent manufacturing, vol. 18, no. 5, pp. 561-586, 2007. [25] p. r. messinger, e. stroulia and k. lyons. “a typology of virtual worlds: historical overview and future directions.” journal for virtual worlds research, vol. 1, no. 1, 2008. [26] j. m. balkin and b. s. noveck. “state of play: law, games, and virtual worlds: law, games, and virtual worlds (ex machina: law, technology, and society)”. new york: nyu press, 2006. [27] s. reiff-marganiec and n. j. rajper. “modelling virtual organisations: structure and reconfigurations.” adaptation and value creating collaborative networks, pp. 297-305, 2014. [28] h. afsarmanesh and l. m. camarinha-matos. “a framework for management of virtual organization breeding environments.” collaborative networks and their breeding environments, pp. 3548, 2005. [29] a. salomaa. “public-key cryptography”. new york: springer science and business media, 2013. [30] a. 
collomb and k. sok. “blockchain/distributed ledger technology (dlt): what impact on the financial sector?” communications and strategies, no. 103, p. 93, 2016. [31] b. m. till, a. w. peters, s. afshar and j. g. meara. “from blockchain technology to global health equity: can cryptocurrencies finance universal health coverage?” bmj global health, vol. 2, no. 4, p. e000570, 2017. [32] i. c. ellul. “blockchain and healthcare: will there be offspring?” sweden: palestinian, 2017. [33] d. randall, p. goel and r. abujamra, “blockchain applications and use cases in health information technology.” journal of health and medical informatics, vol. 8, no. 3, 2017. . uhd journal of science and technology | jan 2019 | vol 3 | issue 1 39 1. introduction groundwater is a valuable freshwater resource and constitutes about two-third of the fresh water reserves of the world [1]. buchanan (1983) estimated that the groundwater volume is 2000 times higher than the volume of waters in all the world’s rivers and 30 times more than the volume contained in all the fresh water of the world lakes. the almost is 5.0 l × 1024 l in the world of groundwater reservoir [2]. groundwater is used in many fields for industrial, domestic, and agricultural purposes. however, due to the population growth and economic development, the groundwater environment is becoming more and more important and extensive [3], and the heavy groundwater extraction has caused many problems such as groundwater level drop, saltwater intrusion, and ground surface depression, which need to be improved. therefore, the identification, assessment, and remediation using regression kriging to analyze groundwater according to depth and capacity of wells aras jalal mhamad1,2 1department of statistic and informatics, college of administration and economics, sulaimani university, sulaimani city, kurdistan region – iraq, 2department of accounting, college of administration and economics, human development university, sulaimani city, kurdistan region – iraq a b s t r a c t groundwater is valuable because it is needed as fresh water for agricultural, domestic, ecological, and industrial purposes. however, due to population growth and economic development, the groundwater environment is becoming more and more important and extensive. the study contributes to current knowledge on the groundwater wells prediction by statistical analysis under-researched. such as, it seems that the preponderance of empirical research does not use map prediction with groundwater wells in the relevant literature, especially in our region. instead, such studies focus on several simple statistical analysis such as statistical modeling package. accordingly, the researcher tried to use the modern mechanism such as regression kriging (rk), which is predicted the groundwater wells through maps of sulaimani governorate. hence, the objective of the study is to analyze and predicting groundwater for the year 2018 based on the depth and capacity of wells using the modern style of analyzing and predicting, which is rk method. rk is a geostatistical approach that exploits both the spatial variation in the sampled variable itself and environmental information collected from covariate maps for the target predictor. it is possible to predict groundwater quality maps for areas at sulaimani governorate in kurdistan regions iraq. sample data concerning the depth and capacity of groundwater wells were collected on groundwater directorate in sulaimani city. 
the most important result of the study in the rk was the depth and capacity prediction map. the samples from the high depth of wells are located in the south of sulaimani governorate, while the north and middle areas of sulaimani governorate have got low depths of wells. although the samples from the high capacity are located in the south of sulaimani governorate, in the north and middle the capacity of wells have decreased. the classes (230–482 m) of depth are the more area, while the classes (29–158 g/s) of capacity are the almost area in the study. index terms: groundwater analysis, interpolation, regression kriging corresponding author’s e-mail: aras jalal mhamad, department of statistic and informatics, college of administration and economics, sulaimani university, sulaimani city, kurdistan region – iraq, department of accounting, college of administration and economics, human development university, sulaimani city, kurdistan region – iraq. e-mail: aras.mhamad@univsul.edu.iq received: 20-04-2019 accepted: 22-05-2019 published: 29-05-2019 access this article online doi: 10.21928/uhdjst.v3n1y2019.pp39-47 e-issn: 2521-4217 p-issn: 2521-4209 copyright © 2019 mhamad. this is an open access article distributed under the creative commons attribution non-commercial no derivatives license 4.0 (cc by-nc-nd 4.0) o r i g i n a l re se a rc h a rt i c l e uhd journal of science and technology aras jalal mhamad: using r.k. to analyze groundwater 40 uhd journal of science and technology | jan 2019 | vol 3 | issue 1 of groundwater problems have become quite a crucial and useful topic in the current time. for the above reasons, the analysis of groundwater requires implementing scientific and academic methods, from which one of the verified models is the rk that is used for this purpose [2]. regressionkriging (rk) is a spatial prediction technique that combines regression of the dependent variable on auxiliary variables with kriging of the regression residuals. it is mathematically a consideration of interpolation method variously called universal kriging and kriging with external drift, where auxiliary predictors are used to solve the kriging weights directly [4]. rk is an application of the best linear unbiased predictor for spatial data, which is the best linear interpolator assuming the universal model of spatial variation [5]. rk is used in many fields, such as soil mapping, geological mapping, climatology, meteorology, species distribution modeling, and some other similar fields [6]. regression kriging (rk) is one of the most widely used methods, which uses hybrid techniques and combines ordinary kriging with regression using ancillary information. since the correlation between primary and secondary variables is significant [7], so, the aim of this study is to analyze and predicting groundwater depending on depth and capacity of wells in sulaimani governorate; using rk. 1.1. objective of the study the main objective of this research is to analyze and predict groundwater wells at the un-sampled locations in sulaimani governorate according to depth and capacity of existing groundwater wells using rk and to assess the accuracy of these predictions. 2. materials and methods 2.1. interpolation spatial interpolation deals with predicting values of the locations that have got unknown values. measured values can be used to interpolate, or predict the values at locations which were not sampled. in general, there are two accepted approaches to spatial interpolation. 
the first method uses deterministic techniques in which only information from the observation points is used. examples of such direct interpolation techniques are inverse distance weighting and trend surface estimation. the other method depends on regression on additional information, or covariates, gathered about the target variable (such as regression analysis combined with kriging). these are geostatistical interpolation techniques, better suited to account for spatial variation and capable of quantifying the interpolation errors. hengl et al. (2007) advocate the combination of these two into so-called hybrid interpolation. this is known as rk [8]. in another paper, hengl et al. (2004) explain a structure for rk, which forms the basis for the research in this study [7]. a limitation of rk is its greater complexity compared with more straightforward techniques like ordinary kriging, which in some cases might lead to worse results [9]. 2.2. rk the most basic form of kriging is called ordinary kriging. when we add the relationship between the target and covariate variables at the sampled locations and apply this to predicting values using kriging at unsampled locations, we get rk. in this way, the spatial process is decomposed into a mean and a residual process. thus, the first step of rk analysis is to build a regression model using the explanatory grid maps [8]. the kriged residuals are found using the residuals of the regression model as input for the kriging process. adding up the mean and residual components finally results in the rk prediction [8]. rk is a combination of traditional multiple linear regression (mlr) and kriging, which means that an unvisited location s_0 is estimated by summing the predicted drift and residuals. this procedure has been found preferable for solving the linear model coefficients [10] and has been applied in several studies. the residuals generated from mlr are kriged and then added to the predicted drift, obtaining the rk prediction. the models are expressed as:

$\hat{z}_{mlr}(s_0) = \sum_{k=0}^{p} \hat{\beta}_k \cdot x_k(s_0)$ (1)

$\hat{z}_{rk}(s_0) = \hat{z}_{mlr}(s_0) + \sum_{i=1}^{n} w_i(s_0) \cdot e(s_i); \quad x_0(s_0) = 1$ (2)

where $\hat{z}_{mlr}(s_0)$ is the predicted value of the target variable z at location s_0 using the mlr model, $\hat{z}_{rk}(s_0)$ is the predicted value of the target variable at location s_0 using the rk model, $\hat{\beta}_k$ is the regression coefficient for the kth explanatory variable x_k, p is the total number of explanatory variables, w_i(s_0) are weights determined by the covariance function, and e(s_i) are the regression residuals. in a simple form, this can be written as:

$z(s) = m(s) + \varepsilon'(s)$ (3)

where z(s) is the value of a phenomenon at location s, m(s) is the mean component at s, and ε′(s) stands for the residual component including the spatial noise. the mean component is also known as the regression component. the process of refining the prediction in two steps (trend estimation and kriging) is shown in fig. 1, where the result of the mean component, only regression $\hat{m}(s)$, is visible as a dashed line, and the sum of trend + kriging is the curving thick line $\hat{z}(s)$. this should approach the actual distribution better than either just a trend surface or a simple interpolation. the linear modeling of the relationship between the dependent and explanatory variables is quite empirical.
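as a concrete illustration of equations (1)-(3), the following is a minimal sketch in python of how an rk prediction could be assembled; it is not the author's original implementation. the function name rk_predict, the use of scikit-learn for the mlr drift, and the simple gaussian-kernel weighting of the residuals (a stand-in for variogram-based kriging weights) are assumptions made here for illustration only.

# minimal regression-kriging sketch: mlr drift plus weighted residual correction
# assumption: residual weights come from a simple distance kernel, not a fitted variogram
import numpy as np
from sklearn.linear_model import LinearRegression

def rk_predict(X_obs, z_obs, coords_obs, X_new, coords_new, length_scale=1.0):
    # step 1: fit the drift m(s) by multiple linear regression on the covariates (eq. 1)
    mlr = LinearRegression().fit(X_obs, z_obs)
    drift_new = mlr.predict(X_new)

    # step 2: residuals e(s_i) of the regression at the observation locations
    resid = z_obs - mlr.predict(X_obs)

    # step 3: weights w_i(s_0); a gaussian kernel on distance stands in for the
    # covariance- or variogram-derived kriging weights of eq. (2)
    d = np.linalg.norm(coords_new[:, None, :] - coords_obs[None, :, :], axis=2)
    w = np.exp(-(d / length_scale) ** 2)
    w /= w.sum(axis=1, keepdims=True)

    # step 4: rk prediction = drift + weighted residuals (eq. 2)
    return drift_new + w @ resid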
the model selection determines which covariable is important and which one is not. it is not necessary to know all these relations, as long as there is a significant correlation. once the covariates have been selected, their explanatory strength is determined using (stepwise) mlr analysis. for each covariate, this leads to a coefficient value describing its predictive strength and whether the relationship is positive or negative. with the combination of values for all covariate maps, a trend surface is constructed. this regression prediction is, in fact, the calculation for each target cell from each input cell from all covariates times the coefficient value. the amount of correlation is expressed by r2 in the regression equation. to enable this, the covariate data first need to be processed by overlaying the sample locations with the covariate data layers. in this way, a matrix of covariate values for each sample point is constructed. this matrix may still hold several "na" or missing values due to the fact that some maps have coverage while others do not. an example of this is the absence of information on organic matter in urban areas. since the linear models cannot be constructed properly when some covariate data are missing, these sample points are discarded altogether. the resulting data matrix is therefore complete for all remaining measurement data points. the second step in which the covariate data are needed is the model prediction phase of the mean surface values. first, a prediction mask is made, which is the selection of grid cells for which covariate data are available and only contains the coordinates of valid cells. next, the regression mean values are calculated by predicting the regression model for every grid cell in the prediction mask. in the residual kriging phase, this prediction grid is used again as a mask for the kriging prediction [7]. 2.3. variogram and semivariogram semivariogram analysis is used for the descriptive analysis. the spatial structure of the data is investigated using the semivariogram. this structure is also used for predictive applications, in which the semivariogram is used to fit a theoretical model, parameterized, and then used to predict a regionalized variable at other unmeasured points. estimating the mean function $x(s)^{T}\beta$ and the covariance structure of ε(s) for each s in the area of interest is the first step in both the analysis of the spatial variation and the prediction. the semivariogram is commonly used as a measure of spatial dependency. the estimated semivariogram gives a description of how the data are correlated with distance. because of the factor 1/2, γ(h) is called a semivariogram, and 2γ(h) is the variogram. thus, the semivariogram function measures half the average squared difference between pairs of data values separated by a given distance, h, which is known as the lag [11], [5]. the experimental variogram is a plot of the semivariance against the distance between sampling points. the variogram is the fitted line that best describes the function connecting the dots from the experimental variogram [12]. assuming that the process is stationary, the semivariogram is defined in equation (4):

$\gamma(h) = \frac{1}{2 N_h} \sum_{N(h)} \left[ z(s_i) - z(s_j) \right]^2$ (4)

here, n(h) is the set of all pairwise euclidean distances i − j = h, n_h is the number of distinct pairs in n(h), z(s_i) and z(s_j) are the values at spatial locations i and j, respectively, and γ(h) is the estimated semivariogram value at distance h.
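a small illustrative sketch of equation (4) follows: the empirical semivariogram is computed by binning pairwise distances into lags and averaging half the squared differences per bin. the binning scheme and the function name empirical_semivariogram are assumptions introduced here for illustration, not part of the original analysis.

# empirical semivariogram (eq. 4): half the mean squared difference per distance lag
import numpy as np

def empirical_semivariogram(coords, values, n_lags=15):
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    iu = np.triu_indices(len(values), k=1)            # count each pair once
    dists = d[iu]
    sqdiff = (values[:, None] - values[None, :])[iu] ** 2
    edges = np.linspace(0.0, dists.max(), n_lags + 1)
    lags, gamma = [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (dists >= lo) & (dists < hi)
        if mask.any():
            lags.append(dists[mask].mean())
            gamma.append(0.5 * sqdiff[mask].mean())   # the 1/2 factor of eq. (4)
    return np.array(lags), np.array(gamma)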
the semivariogram has three important parameters: the nugget, sill, and range. the nugget is a measure of sub-grid variation or measurement error, and graphically it is indicated by the intercept. the sill is the semivariance value as the lag (h) goes to infinity, and it is equal to the total variance of the data set. the range is a scalar which controls the degree of correlation between data points (i.e., the distance at which the semivariogram reaches its sill). as shown in fig. 2, it is then necessary to select a type of theoretical semivariogram model based on that estimate. fig. 1. a schematic representation of regression kriging using a cross-section [8]. commonly used theoretical semivariogram shapes increase monotonically as a function of distance, and the semivariogram model can be chosen by comparing the plot of the empirical semivariogram with various theoretical models. there are several parametric semivariogram models to test, such as the exponential, gaussian, and spherical models. these models are given by the following equations:

exponential: $\gamma(h) = \theta_0 + \theta_1 \left[ 1 - \exp\left( -\frac{3h}{\theta_2} \right) \right]$ (5)

gaussian: $\gamma(h) = \theta_0 + \theta_1 \left[ 1 - \exp\left( -\frac{3h^2}{\theta_2^2} \right) \right]$ and (6)

spherical: $\gamma(h) = \theta_0 + \theta_1 \left[ \frac{3h}{2\theta_2} - \frac{1}{2}\left( \frac{h}{\theta_2} \right)^3 \right]$ for $0 \le h \le \theta_2$; $\gamma(h) = \theta_0 + \theta_1$ for $h > \theta_2$ (7)

where h is a spatial lag, θ_0 is the nugget, θ_1 is the spatial variance (also referred to as the sill), and θ_2 is the spatial range. the nugget, sill, and range parameters of the theoretical semivariogram model are fitted to the empirical semivariogram γ(h) by minimizing a nonlinear function. when fitting a semivariogram model, if we consider the empirical semivariogram values and try to fit a model to them as a function of the lag distance h, the ordinary least squares function is given by $\sum_{h} \left[ \hat{\gamma}(h) - \gamma(h;\theta) \right]^2$, where γ(h; θ) denotes the theoretical semivariogram model and θ = (θ_0, θ_1, θ_2) is a vector of parameters. rk computes the parameters θ and β separately. the parameters β in the mean function are estimated by the least squares method. then, it computes the residuals, and their parameters in the semivariogram are estimated by various estimation methods, such as least squares or a likelihood function. prediction of rk at a new location s_0 can be performed separately using a regression model to predict the mean function and a kriging model of the prediction residuals, and then adding them back together as in equation (8):

$\hat{z}(s_0) = \sum_{k=0}^{n} \beta_k x_k(s_0) + \sum_{i=0}^{n} \lambda_i \varepsilon(s_i)$ (8)

here, s_i = (x_i, y_i) is the known location of the ith sample, x_i and y_i are the coordinates, β_k is the estimated regression model coefficient, λ_i represents the weight applied to the ith sample (determined by the variogram analysis), ε(s_i) represents the regression residuals, and x_1(s_0) … x_n(s_0) are the values of the explanatory variables at the new location s_0. the weight λ_i is chosen such that the prediction error variance is minimized, yielding weights that depend on the semivariogram [13]. more details about the kriging weights λ_i follow immediately [14]. the main objective is to predict z(s) at a location known as s_0, given the observations {z(s_1), z(s_2), …, z(s_n)}′. for simplicity, we assume e{z(s)} = 0 for all s. we briefly outline the derivation of the widely used kriging predictor.
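before turning to the kriging predictor, a short sketch of equations (5)-(7) and the least-squares fit is given below. it assumes scipy is available and that the empirical lags and semivariances have already been computed (for example, with the sketch after equation (4)); the starting values and the function names are illustrative choices, not values from the paper.

# theoretical semivariogram models (eqs. 5-7) and an ols fit of (nugget, sill, range)
import numpy as np
from scipy.optimize import curve_fit

def exponential(h, t0, t1, t2):
    return t0 + t1 * (1.0 - np.exp(-3.0 * h / t2))

def gaussian(h, t0, t1, t2):
    return t0 + t1 * (1.0 - np.exp(-3.0 * h**2 / t2**2))

def spherical(h, t0, t1, t2):
    h = np.asarray(h, dtype=float)
    inside = t0 + t1 * (1.5 * h / t2 - 0.5 * (h / t2) ** 3)
    return np.where(h <= t2, inside, t0 + t1)

def fit_semivariogram(model, lags, gamma):
    # minimizes sum_h [gamma_hat(h) - gamma(h; theta)]^2, cf. the ols function in the text
    p0 = [gamma.min(), gamma.max() - gamma.min(), lags.max() / 2.0]
    theta, _ = curve_fit(model, lags, gamma, p0=p0, maxfev=10000)
    return theta   # (nugget, partial sill, range)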
let the predictor be in the form of ( ) '0 ( )ẑ s z s= , where λ = {λ 1 , λ 2 ,…, λ n }′. the objective is to find weights λ, which is a minimum. ( ) [ ]20 0  ( ) ( )q s e z s z s= −′ (9) by minimizing q(s 0 ) with respect to λ, it can be shown that; ( ) ( ) ( )1'0 0 ,ẑ s s s z s − = ∑ (10) when σ′(s 0 , s) = e(z(s 0 ) z(s)), and ∑= e[z(s) z (s)] are the covariance matrix. the minimum of q(s 0 ) is min ( ) 12 '0 0( , ) ( )q s s s z s  − = − ∑ . note that, q(s0) can be rewritten in terms of the variogram by applying; ( ) ( )20 0 1 ,  1 ,  2 s s s s = − γ (11) when γ(s 0 , s) is the corresponding matrix of variograms. we can thus rewrite q(s 0 ) given in equation (9) as; fig. 2. illustration of semivariogram parameters. aras jalal mhamad: using r.k. to analyze groundwater uhd journal of science and technology | jan 2019 | vol 3 | issue 1 43 ( ) ( )0 0 1       ,  2 q s s s  ′ ′= − γ + γ (12) q(s 0 ) is now minimized with respect to λ, subject to the constraint λ′ 1 = 1 (accounting for the unbiasedness of the predictor ( )0ẑ s ) [11]. 2.4. advantages of rk geostatistical techniques such as multiple regression, inverse distance weight, simple kriging, and ordinary kriging uses either the concept of regression analysis with auxiliary variables or kriging for prediction of target variable, whereas rk is a mixed interpolation technique; it uses both the concepts of regression analysis with auxiliary variables and kriging (variogram analysis of the residuals) in the prediction of target variable. it considers both the situations, i.e., long-term variation (trend) as well as local variations. this property of rk makes it superior (more accurate prediction) over the above-mentioned techniques [15]. among the hybrid interpolation techniques, rk has an advantage that there is no danger of instability as in the kriging with the external drift [9]. moreover, the rk procedure explicitly separates the estimated trend from the residuals and easily combined with the general additive modeling and regression trees [16,17]. 2.5. cross-validation of rk results to assess which spatial prediction method provides the most accurate interpolation method, cross-validation is used to compare the estimated values with their true values. cross-validation is accomplished by removing each data point and then using the remaining measurements to estimate the data value. this procedure is repeated for all observations in the dataset. the true values are subtracted from the estimated values. the residuals resulting from this procedure are then evaluated to assess the performance of the methods. one particular method is called k-fold crossvalidation, where “k” stands for the number of folds one wants to apply. each fold is a set of data kept apart from the analysis, repeated for the number of folds. a special type of k-fold cross validation is where the repetition of analyses (k) is equal to the number of data. this is called “leave one out” cross-validation, for the analysis is repeated once for every sample in the dataset, omitting the sample value itself. resulting is a prediction for every observation, made using the same variogram model settings as for the normal rk prediction. the degree in which the cross-validation predictions resemble the observations is then a measure for the goodness of the prediction method. this can be calculated using the mean squared normalized error or “z score” [18]. 
to aid further in the assessment of prediction results, additional parameters can be calculated from the cross-validation output, such as the mean prediction error (mpe), root mean square prediction error (rmspe), and average kriging standard error (akse).

$MPE = \frac{1}{n} \sum_{x=1}^{n} \left[ z(x) - z'(x) \right]$ (13)

$RMSPE = \sqrt{ \frac{1}{n} \sum_{x=1}^{n} \left[ z(x) - z'(x) \right]^2 }$ (14)

where n stands for the number of pairs of observed and predicted values, z(x) is the observed value at location x, and z′(x) is the predicted value by ordinary kriging at location x.

$AKSE = \sqrt{ \frac{1}{n} \sum_{x=1}^{n} \sigma^2(x) }$ (15)

here, x is a location, and σ(x) is the prediction standard error for location x. mpe indicates whether a prediction is biased and should be close to zero. rmspe and akse are measures of precision and have to be more or less equal. the cv procedure only accounts for the kriging part, since the input is the residuals from the linear modeling phase [4]. 3. data analysis and results 3.1. data description data were obtained from the groundwater directorate/well license department in sulaimani, kurdistan region. 451 observations (wells) were used in the study; only records containing valid x, y locations are used in the statistical modeling process. one check is to print all measurement locations to check whether they are located within the defined regions; if not, they are removed. the rk method is suitable for predicting the groundwater wells due to the nature of the data (there are coordinates for each well). for kriging purposes, duplicate x, y locations need to be checked to prevent singularity issues, as shown by yang et al. [4]. duplicated locations share the same coordinates (based on one decimal digit), making it impossible to apply interpolation. therefore, the choice is made to delete each second record that has duplicated coordinates. the research area is limited to the sulaimani governorate of the kurdistan region, and only depth and capacity of wells are available at the individual point locations. therefore, this research is targeted at depth and capacity of wells. the data are presented in fig. 3. the dataset used for the analysis contains six variables and deals with properties of wells for the year 2018, which are depth, capacity, static well level, dynamic well level, latitude-x, and longitude-y. 3.2. experimental variogram once the regression prediction has been performed, the variogram for the resulting residuals from the sample data can be modeled, as shown in fig. 4. the model of depth with partial sill c_0 = 5328, nugget = 371.95, and range = 0.078 was used for the residual variogram. the result indicated that with an increase in distance, the semivariance value increases. semivariograms are used to fit the residuals of the recharge estimates to enable the residuals then to be spatially interpolated by kriging. fig. 4 shows simple kriging of the modeled residuals using the same locations from the first prediction surface; the kriging is provided over the surface to obtain results that are not interpolated over geological boundaries, which do not necessarily have any spatial correlation with the residuals. these semivariograms show that the nugget is high in each group.
when the nugget value is high, it indicates low spatial correlation in the residuals and has an effect that interpolation is not trying to match each point value of the residuals. although the range shows the extent of the spatial correlation of recharge residuals, it ensures that the residuals’ spatial surface is only using the local information. the model for capacity of wells has partial sill c 0 = 3805.4, nugget = 11222.3, and range = 0.429. the result has indicated that with an increase in distance, the semivariance value increases. fig. 5 shows simple kriging of the modeled residuals using the same locations from the first prediction surface for the capacity of wells. from the semivariograms, the nugget also is high in each group, which indicates low spatial correlation in the residuals. the range shows the extent of the spatial correlation of recharge residuals and ensures that the residuals’ spatial surface is only using the local information. from fig. 6, the variance has some artifacts. it can be expected that values close to the locations, where point samples were taken, have lower variances. however, the blue colored regions in the depth of wells appear very strange here, especially when other sparsely sampled fig. 3. sample distribution. fig. 4. variogram for the depth of the wells. fig. 5. variogram for the capacity of wells. aras jalal mhamad: using r.k. to analyze groundwater uhd journal of science and technology | jan 2019 | vol 3 | issue 1 45 regions do not have this blue color, but yellow and orange colors, indicating a lower variance value. the kriging variance is produced together with the kriging operation and is shown in the left part of fig. 6. although in the prediction map, the blue areas correspond somewhat to higher predictions for depth wells (red, 432–482 m fig. 6 left), this is not reversely so for the other regions. there are some points located at the blue area, each having a high depth of wells (ranged between 432 and 482 m). in the blue area, this phenomenon is enlarged, showing the scale at which the variance is increasing in the depth of wells, just around a cluster of sample points at the blue area. it is observed from the predicted depth of wells that the values are higher in the kalar, kifri, and khanaqin (lower portion of the study area), followed by sulaimani governorate, while low values are found in the upper of sulaimani governorate (upper portion of the study area). this fact can be seen from the rk variance (fig. 6 left). higher variance values (482) for the depth of wells are found in the plain areas whereas the mountainous areas have relatively lower values (29.7). fig. 7 explains the variance. it can be expected that values have lower variances, close to the locations, where point samples were taken. although the blue colored regions in the capacity of wells’ appear have very high variance value, yellow and orange colors are indicating a lower variance value. the kriging variance is produced together with the kriging operation and is shown in the left part of fig. 7. although in the prediction map, the blue areas correspond with higher predictions for capacity wells (372–415 g/m fig. 7 right). there are some points located at the blue area, each having high capacity of wells (range 372–415 g/m). in the blue area, this phenomenon is enlarged, showing the scale at which the variance is increasing in the capacity of wells. 
it is observed from the predicted capacity of wells that the values are higher in kalar, kifri, and khanaqin, followed by sulaimani governorate, while low values are found in the upper of sulaimani governorate. this fact can be seen from the rk variance (fig. 7 left). higher variance values (415) are found in the plain areas, whereas the mountainous areas have relatively lower values (29.8). 3.3. cross-validation of kriging cross-validation was used to obtain the goodness of fit for the model. in addition, for each cross-validation result, the mpe, or mean prediction error, was calculated. the mpevalue should be close to zero. rmspe (root mean square fig. 6. regression kriging results (left) and variance (right) of the residuals from the depth of wells’ model. aras jalal mhamad: using r.k. to analyze groundwater 46 uhd journal of science and technology | jan 2019 | vol 3 | issue 1 prediction error) and akse are given as well. the latter two error values should be close to each other, indicating prediction stability. the validation points were collected from all data in the study area so as to have an unbiased estimated accuracy. in this study, mpe, rmspe, and akse are the three statistical parameters used for validation. the smaller the rmspe, means the closer the predicted values to the observed values. similarly, the mpe gives the mean of residuals and the unbiased prediction gives a value of zero. the results of the validation analysis are summarized in table 1. the mpe is quite low in both depth and capacity of wells and is a low bias value of 0.019 and 0.021, respectively. the value of mpe is a result of a slight over-estimation of predicted depth and capacity of wells in the model. the rmspe value is only 0.844 and 1.31, indicating the closeness of predicted value with the observed value. the results indicate the utility of rk in spatially predicting depth and capacity of wells even in the varying landscape. 4. results and conclusions the results of the study show that the cross-validation measurement of the models was achieved. looking at the quantitative results from the cross-validation, there are no obvious indications that the kriging model prediction is worse in the models of depth and capacity of wells. one important result of the study is the region model predictions in the dataset with sample values. the samples from the high depth of wells are almost absent in north and middle of sulaimani governorate, while in the south they are present, although the capacity of wells gave the same result depth of the wells. the samples from the high capacity of wells are almost absent in the north and middle of sulaimani governorate, while in the south they are present. in the map results after the kriging in figs. 4 and 5, the areas within class (230–482m) of depth are almost, this result was close to the master thesis from iraq – sulaimani university by renas abubaker table 1: cross‑validation results measurements depth of wells capacity of wells mpe 0.019 0.021 rmspe 0.844 0.720 akse 0.661 1.316 mpe: mean prediction error, rmspe: root mean square prediction error, akse: average kriging standard error fig. 7. regression kriging results (left) and variance (right) of the residuals from the capacity of wells’ model. aras jalal mhamad: using r.k. 
to analyze groundwater uhd journal of science and technology | jan 2019 | vol 3 | issue 1 47 ahmed, 2014, in which resulted that the depth of wells was between 20 m and more the 170 m for the same areas, which was used multivariate adaptive regression spline model to predicting groundwater wells [19], while the areas within class (29–158 g/s) of capacity are almost in the study, also it was close to miss. rena’s results, which is reported that the capacity of wells between 10 and 140 gallon [19]. references 1. chilton, j. “women and water”. waterlines journal, vol. 2, no. 110, pp. 2-4, 1992. 2. buchanan. “ground water quality and quantity assessment”. journal ground water, vol. 7, no. 7, pp. 193-200, 1983. 3. han, z. s. “groundwater resources protection and aquifer recovery in china”. environmental geology, vol. 44, pp. 106-111, 2003. 4. yang, s. h., f. liu, x. d. song, y. y. lu, d. c. li, y. g. zhao and g. l. zhang. “mapping topsoil electrical conductivity by a mixed geographically weighted regression kriging: a case study in the heihe river basin, northwest china”. ecological indicators, vol. 102, pp. 252-264, 2019. 5. georges, m. “part 1 of cahiers du centre de morphologie mathématique de fontainebleau”. le krigeage universel, école nationale supérieure des mines de paris, 1969. 6. tomislav, h., b. branislav, b. dragan, r. i. hannes. “geostatistical modeling of topography using auxiliary maps”. computers and geosciences, vol. 34, no. (12), pp. 1886-1899, 2008. 7. ye, h., w. huang, s. huang, y. huang, s. zhang, y. dong and p. chen. “effects of different sampling densities on geographically weighted regression kriging for predicting soil organic carbon”. spatial statistics, vol. 20, pp. 76-91, 2017. 8. hengl, t., g. b. m. heuvelink and d. g. rossiter. “about regression-kriging: from equations to case studies”. computers and geosciences, vol. 33, no. (10), pp. 1301-1315, 2007. 9. goovaerts, p. “geostatistics for natural resource evaluation”. oxford university press, new york, 1997. 10. lark, r.m and b. r. cullis. “model-based analysis using reml for inference from systematically sampled data on soil”. the european journal of soil science, vol. 55, pp. 799-813, 2004. 11. seheon, k., p. dongjoo, h. tae-young, k. hyunseung and h. dahee. “estimating vehicle miles traveled (vmt) in urban areas using regression kriging”. journal of advanced transportation, vol. 50, pp. 769-785, 2016. 12. webster, r and m. a. oliver. “geostatistics for environmental scientists”. 2nd ed. wiley, chichester, 2007. 13. keskin, h and s. grunwald. “regression kriging as a workhorse in the digital soil mapper’s toolbox”. geoderma, vol. 326, pp. 22-41, 2018. 14. cressie, n. “statistics for spatial data”. john wiley and sons, hoboken, nj, 1993. 15. lloyd, c.d. “assessing the effect of integrating elevation data into the estimation of monthly precipitation in great britain”. journal of hydrology, vol. 308, no. 1-4, pp. 128-150, 2005. 16. huang, c. l., h. w. wang and j. l. hou. “estimating spatial distribution of daily snow depth with kriging methods: combination of modis snow cover area data and groundbased observations”. the cryosphere discussion paper. vol. 9, pp. 4997-5020, 2015. 17. mcbratney, a., i. odeh, t. bishop, m. dunbar and t. shatar. “an overview of pedometric techniques of use in soil survey”. geoderma, vol. 97, pp. 293-327, 2000. 18. bivand, r. s., pebesma, e. j. and gómez-rubio, v. “applied spatial data analysis with r”. springer, new york, 2008. 19. ahmed, r. a. 
“multivariate adaptive regression spline model for predicting new wells groundwater in sulaimani governorate”. master thesis of statistic department, college of administration and economic. university of sulaimani, kurdistan region, iraq, 2014. tx_1~abs:at/tx_2:abs~at uhd journal of science and technology | july 2022 | vol 6 | issue 2 85 1. introduction changes in human lifestyle and the deterioration of the environment have left a negative impact on human health. for that reason, human health has always been the subject of research with the aim to improve it. diabetes is a group of metabolic diseases which result in high blood sugar levels for a prolonged period. as stated by international diabetes federation, 537 million adults (20–79 years) are living with diabetes which is 1 in 10 of adult population. this number is predicted to rise to 643 million by 2030 and 783 million by 2045 [1]. diabetes has been the subject of research for some times by multidisciplinary scientists with the aim to find and improve methods that lead to effective prevention, diagnosis, and treatment of the disease. for instance, in a similar approach, in 2013, anouncia et al. proposed a diagnosis system for diabetes. the system is implemented to diagnose the type of diabetes based on symptoms provided by patients. they have used rough set-based knowledge representation in developing their system and the results showed improvements in terms of accuracy of diabetes type diagnosis and the time it takes for the diagnosis [2]. despite all the efforts invested into researching diagnostic techniques for diabetes, research rough set-based feature selection for predicting diabetes using logistic regression with stochastic gradient decent algorithm kanaan m. kaka-khan1, hoger mahmud2, aras ahmed ali3 1department of information technology, university of human development, iraq, 2department of information technology, the american university of iraq, sulaimani, 3university college of goizha, sulaymaniyah a b s t r a c t disease prediction and decision-making plays an important role in medical diagnosis. research has shown that cost of disease prediction and diagnosis can be reduced by applying interdisciplinary approaches. machine learning and data mining techniques in computer science are proven to have high potentials by interdisciplinary researchers in the field of disease prediction and diagnosis. in this research, a new approach is proposed to predict diabetes in patients. the approach utilizes stochastic gradient descent which is a machine learning technique to perform logistic regression on a dataset. the dataset is populated with eight original variables (features) collected from patients before being diagnosed with diabetes. the features are used as input values in the proposed approach to predict diabetes in the patients. to examine the effect of having the right variable in the process of making predictions, five variables are selected from the dataset based on rough set theory (rst). the proposed approach is applied again but this time on the selected features to predict diabetes in the patients. the results obtained from both applications have been documented and compared as part of the approach evaluations. the results show that the proposed approach improves the accuracy of predicting diabetes when rst is used to select variables for making the prediction. this paper contributes toward the ongoing efforts to find innovative ways to improve the prediction of diabetes in patients. 
index terms: logistic regression, stochastic gradient descent, rough set theory, k-fold cross-validation, diabetes prediction corresponding author’s e-mail:  kanaan m. kaka-khan, department of information technology, university of human development, iraq. e-mail: kanaan.mikael@uhd.edu.iq received: 21-08-2022 accepted: 02-10-2022 published:  18-10-2022 access this article online doi: 10.21928/uhdjst.v6n2y2022.pp85-93 e-issn: 2521-4217 p-issn: 2521-4209 copyright © 2022 kanaan m. kaka-khan, et al. this is an open access article distributed under the creative commons attribution noncommercial no derivatives license 4.0 (cc by-nc-nd 4.0) o r i g i n a l re se a rc h a rt i c l e uhd journal of science and technology kanaan m. kaka-khan, et al.: rough set-based feature selection for predicting diabetes 86 uhd journal of science and technology | july 2022 | vol 6 | issue 2 shows that there is still room for improvement, especially in areas related to the level of accurately in predicting the disease in a patient. rough set theory (rst) has been used by researchers to predict a wide array of topics such as time series prediction [3], crop prediction [4], currency crisis prediction [5], and stock market trends prediction [6]. in this research, we use rst to select variables in a dataset with the aim to improve the level of accuracy in predicting diabetes in a patient. stochastic gradient descent algorithm is used to process the variables selected to make diabetes prediction based on computed logistic regression values from the dataset. the dataset used for all experiments in this study is made available by the pima indian diabetes [7]. this paper contributes toward the ongoing efforts to find innovative ways to improve the prediction of diabetes in patients by proposing a new approach to predict diabetes in patients using machine learning techniques. the results presented in sections 5.1 and 5.2 show that the approach improves accuracy in making diabetes predictions compared to other available approaches. the rest of this paper is organized as follows: section 2 provides the theoretical background needed to understand the selected techniques and section 3 provides a survey of related literatures. section 4 provides the description of the methodology used in this study. experimental results and discussion are provided in section 5. finally, conclusions are drawn in section 6. 2. background this section provides a basic background on the theories used in the study. 2.1. rst rough set [8] is proposed by pawlak to deal with uncertainty and incompleteness. it offers mathematical tools to discover patterns hidden in datasets and identifies partial or total dependencies in a dataset based on indiscernibility relation. the technique calculates a selection of features to determine the relevant feature. 
the general procedures in rough set are as follows: the lower approximation of a set x is the set of objects in an information table which certainly belong to the class x:

$\underline{A}X = \{ x_i \in U \mid [x_i]_{IND(A)} \subset X \},\; X \in Att$ (1)

the upper approximation of a set x includes all objects in an information table which possibly belong to the class x:

$\overline{A}X = \{ x_i \in U \mid [x_i]_{IND(A)} \cap X \neq \phi \}$ (2)

the boundary region is the difference between the upper approximation set and the lower approximation set and is referred to as bnd(x):

$BND(X) = \overline{A}X - \underline{A}X$ (3)

the positive region is the set of all objects that belong to the lower approximation, that is, the union of all the lower approximation sets:

$\rho = \bigcup \underline{A}X$ (union of all lower approximation sets) (4)

the indiscernibility relation of the positive region for any g ⊆ att is the associated equivalence relation:

$IND(G) = \{ (x, y) \in U \times U : \forall a \in G,\; a(x) = a(y) \}$ (5)

reducts are the minimum representation of the original data without loss of information:

reducts $\delta = \min IND$ (6)

2.2. stochastic gradient descent according to [9], stochastic gradient descent is a function-minimizing process that follows the slope or gradient of that function. in general, in machine learning, stochastic gradient descent can be considered as a technique to evaluate and update the weights at every iteration, which minimizes the error in training data models. while training, this optimization technique shows each and every training sample to the model one by one. for each training sample, the model produces an output (prediction), calculates the error, and updates the weights to minimize the error for the next output, and this process is repeated for a fixed number of epochs or iterations. equation (7) describes the way of finding and updating the set of weights (coefficients) in a model from the training data.

$b = b - \text{learning rate} \times \text{error} \times x$ (7)

here, b is the coefficient (weight) being estimated, the learning rate is a value that can be configured (between 0.01 and 10), error is the model's prediction error, and x is the input value. the accuracy of the prediction can be calculated simply by dividing the number of correct predictions by the number of actual values, as in equation (8).

$\text{accuracy} = \frac{\sum \text{correct predictions}}{\sum \text{actual values}}$ (8)

2.3. logistic regression logistic regression [10] is a linear classification algorithm for two-class problems. equation (9) represents the logistic regression algorithm. in this algorithm, to make a prediction (y), the input values (x) are combined in a linear form using coefficient (weight) values. logistic regression produces an output of binary value (0 or 1).

$\hat{y} = \frac{1.0}{1.0 + e^{-(b_0 + b_1 \times x_1)}}$ (9)

the foundation of the logistic regression algorithm is euler's number, the estimated output is represented as yhat ($\hat{y}$), the algorithm's bias is b_0, and the coefficient (weight) for the single input value (x_1) is represented as b_1. logistic regression produces a real value as an output (yhat) which is between 0 and 1. to be mapped to an estimated class value, the output needs to be converted (rounded) to an integer value. each column (attribute) of the dataset has an associated value (b) that should be estimated from the training data, and it is the actual model's representation that can be saved for further use.
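a minimal sketch of equations (7)-(9) is shown below: a logistic regression prediction and stochastic gradient descent updates over the coefficients. the function names predict and coefficients_sgd echo the functions named later in section 4, but the exact gradient form (error × prediction × (1 − prediction) × x) and the code body are assumptions made here for illustration, not a copy of the authors' python script.

# logistic prediction (eq. 9) and stochastic gradient descent coefficient updates (eq. 7)
import math

def predict(row, coefficients):
    # coefficients[0] is the bias b0; the rest pair with the input values
    yhat = coefficients[0]
    for i, x in enumerate(row):
        yhat += coefficients[i + 1] * x
    return 1.0 / (1.0 + math.exp(-yhat))

def coefficients_sgd(train, learning_rate=0.1, n_epochs=100):
    # train: list of rows, last element of each row is the class label (0 or 1)
    coef = [0.0] * len(train[0])
    for _ in range(n_epochs):
        for row in train:
            yhat = predict(row[:-1], coef)
            error = row[-1] - yhat
            coef[0] += learning_rate * error * yhat * (1.0 - yhat)
            for i, x in enumerate(row[:-1]):
                coef[i + 1] += learning_rate * error * yhat * (1.0 - yhat) * x
    return coef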
3. related work prediction is a widely used approach in many fields of science, including healthcare, to foresee possible outcomes of a cause. disease prediction is certainly an area where researchers have been working by applying a number of different theories, including machine learning theories, with the aim of finding methods to make the most accurate prediction possible. rst is one of the theories used to classify and predict diseases. for instance, the authors of [11] have used the theory to classify medical diagnoses, and the authors of [12] and [13] have modified and used the theory to improve disease prediction. type 1 and 2 diabetes were the focus of the authors of [14], who developed a hybrid reasoning model to address prediction accuracy issues. based on their results, they claim that their approach raises diabetes prediction accuracy to 95% compared to other existing approaches. in 2017, rst was used by the authors of [15] to develop a model for patient clustering in a dataset. the authors considered average values calculated from diabetes indicators in a dataset to cluster the patients in it. in the same year, deep learning was utilized by the authors of [16] to establish an intelligent diabetes prediction model, in which patients' risk factors collected in a dataset were considered to make the prediction. in 2018, fuzzy rst was applied first to select specific features in a dataset; later in the process, the optimized genetic algorithm (oga) was applied to improve prediction performance, save processing time, and achieve better diagnosis accuracy. the results obtained from the study show that the approach achieved the objectives of the study [17]. in 2020, vamsidhar talasila and kotakonda madhubabu proposed the use of the rst technique to select the most relevant features to be inputted to the recurrent neural network (rnn) technique for disease prediction. they claimed that the rst-rnn method achieved an accuracy of 98.57% [18]. in the same year, gao and cheng proposed an improved neighborhood rough set attribute reduction algorithm (inrs) to increase the dependence of conditional attributes based on considering the importance of individual features for diabetes prediction [14]. in 2021, gadekallu and gao proposed a model using an approach based on rough sets to reduce the attributes needed in heart disease and diabetes prediction [19]. the main limitation of these studies is the fact that none has considered the quantity and quality of the variables used to make diagnostic predictions. the approach used in this study is similar to the ones used in the surveyed literature but differs in objectives. we use rst to select the best features in a dataset and use the stochastic gradient descent algorithm to compute the logistic regression values from the selected features in the dataset with the aim of improving the prediction accuracy of diabetes in a patient. 4. methodology this section provides insights on the methodology used to achieve the objectives of the study. the methodology comprises six major steps: 4.1. step 1 a dataset is selected, examined for suitability and reliability based on a number of characteristics, and uploaded to be analyzed. the dataset selected and uploaded for the purpose of this research is provided by pima indians diabetes [7]. the selected dataset involves predicting diabetes within 5 years in pima indians given medical details. the dataset is a 2-class classification problem and consists of 76 samples with 8 input and 1 output variable.
the variable names are as follows: number of times pregnant, plasma glucose concentration at 2 h in an oral glucose tolerance test, diastolic blood pressure (mm hg), triceps skinfold thickness (mm), 2-h serum insulin (mu u/ml), body mass index (weight in kg/[height in m]2), diabetes pedigree function, age, and class variable (0 or 1). before implementing the model, it is highly preferred to do preprocessing due to some deficiencies. usually, the dataset contains features highly varying in magnitudes, units, and range, which may result in inaccurate output [20]. in this work, due to the use of the stochastic gradient descent algorithm, the dataset has been normalized using min-max scaling to bring all values to between 0 and 1. table 1 shows a sample of the selected dataset. 4.2. step 2 the selected diabetes dataset is preprocessed and normalized. to increase the efficiency and accuracy of the model, the dataset needs to be pre-processed before applying the proposed model, since the data may contain null values and incorrect and redundant information. in general, data processing involves two major steps: data cleaning and data normalization. data cleaning means removing incorrect information or filling out missing values to increase the validity and quality of a dataset through applying a number of different methods [21]. in this study, in case of any tuple containing missing values, the missing attribute value is assumed to be 0 (this is achieved using the fill_mising_values () function from the python script developed for the implementation phase of this study). redundant or unnecessary columns are deleted to have a high-quality dataset (this is achieved using the remove_duplicate_columns () function from the python script). to let all features have equal weight and contribution to the model, the range of each feature needs to be scaled; for this purpose, the dataset is normalized to a range of [0,1] by the following processes: string columns converting: the string columns are converted to float through the str column using the float() function. min max finding: the min and max values of each column of the dataset are found using the dataset minmax() function. finally, the dataset is normalized by the min-max normalization method using the following equation, adapted from [22].

$x' = \frac{x - \min(x)}{\max(x) - \min(x)}$ (10)

4.3. step 3 in this step, rst is applied to select the features which might produce a better prediction. there are nine variables in total in the dataset, as shown in table 1. the class variable is considered as a dependent variable, and the other eight variables are assumed to be predictors or independent variables. table 2 presents the regression calculation summary for diabetes classification of the dataset. the result of the calculation clearly shows that the accuracy of diabetes prediction is 30.32% if all variables in the dataset are considered in the calculation. the low accuracy result is an indication that there might be one or more variables which are not fit to be used for prediction. the regression calculation also shows that the un-standardized regression coefficient (b) is 0.06 for pregnancies, which indicates that if all other predictors are controlled then an increment of one unit in pregnancies increases the accuracy by 0.06. the same statement can be made for the other variables.
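to make equation (10) and the regression summary of table 2 concrete, the short sketch below normalizes each feature to [0,1] and fits an ordinary least-squares model of the class variable on all eight predictors to obtain r-squared and the unstandardized coefficients. the use of numpy and scikit-learn here, and the function names, are assumptions for illustration and not the authors' own script.

# min-max normalization (eq. 10) and a linear regression summary similar to table 2
import numpy as np
from sklearn.linear_model import LinearRegression

def min_max_normalize(X):
    X = np.asarray(X, dtype=float)
    col_min, col_max = X.min(axis=0), X.max(axis=0)
    return (X - col_min) / (col_max - col_min)

def regression_summary(X, y):
    model = LinearRegression().fit(X, y)
    r2 = model.score(X, y)                       # corresponds to "r square" in table 2
    return r2, model.intercept_, model.coef_     # unstandardized coefficients (b)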
to filter the features that might produce a better diabetes prediction, the dataset is grouped into nine elementary sets based on the indiscernibility relation between the data elements. table 3 shows the details of the groups and the indiscernibility level of the relation between the patients.

table 3: elementary sets
samples | pregnancies | plasma glucose | blood pressure | skinfold thickness | insulin | bmi | dpf | age
group 1 | 0–1 | 0–22 | 0–13 | 0–10 | 0–94 | 0–6 | 0–0.25 | 21–26
group 2 | 2–3 | 23–46 | 14–28 | 11–22 | 95–190 | 7–14 | 0.26–0.51 | 27–33
group 3 | 4–5 | 47–70 | 29–43 | 23–34 | 191–286 | 15–22 | 0.52–0.77 | 34–41
group 4 | 6–7 | 71–94 | 44–58 | 35–46 | 287–382 | 23–30 | 0.78–1.03 | 42–49
group 5 | 8–9 | 95–118 | 59–73 | 47–58 | 383–478 | 31–38 | 1.04–1.29 | 50–57
group 6 | 10–11 | 119–142 | 74–88 | 59–70 | 479–574 | 39–46 | 1.3–1.55 | 58–63
group 7 | 12–13 | 143–166 | 89–103 | 71–82 | 575–670 | 47–54 | 1.56–1.81 | 64–69
group 8 | 14–15 | 167–190 | 104–118 | 83–94 | 671–766 | 55–62 | 1.82–2.03 | 70–75
group 9 | 16–17 | 191–199 | 119–122 | 95–99 | 767–846 | 63–67 | 2.04–2.42 | 76–81

to further process the groups, the discernibility matrix has been developed for the elementary sets; the result is shown in table 4.

table 4: discernibility matrix (each entry lists the attributes that discern the two groups; only the lower triangle is filled)
samples | group 1 | group 2 | group 3 | group 4 | group 5 | group 6 | group 7 | group 8
group 2 | a1a2a4a7a8
group 3 | a2a3a4a8 | a1a3a4a8
group 4 | a1a2a4a6a7 | a2a3a4a7a8 | a1a2a7a8
group 5 | a2a3a5a7a8 | a1a3a4a8 | a1a2a4a6a7 | a2a3a5a7a8
group 6 | a1a3a5a6a8 | a3a4a6a8 | a1a3a5a7a8 | a2a4a5a7a8 | a2a3a5a7a8
group 7 | a1a2a4a6a8 | a2a4a5a7 | a1a2a4a6 | a2a3a5a7a8 | a2a3a5a8 | a5a6a7
group 8 | a1a2a4a6a7 | a1a3a4a8 | a1a2a7a8 | a2a3a5a7a8 | a2a4a5a7 | a3a4a5 | a2a4a5
group 9 | a2a4a5a7 | a1a2a4a7 | a3a5a8 | a2a5a7a8 | a3a4a6 | a2a3a8 | a2a4a5 | a3a4a5

from the discernibility matrix, a discernibility function has been developed, as shown in equation 11:
f(a) = f(a1) × f(a2) × … × f(an) (11)
as the result of the discernibility function of all elementary sets for the entire dataset, we found that f(a) = a1∨a2∨a5∨a6∨a8, where a1 is the pregnancies attribute, a2 is plasma glucose, a5 is insulin, a6 is dpf, and a8 is age. table 5 shows the reduct matrix for the elementary sets.

table 5: reducts matrix (each row lists the entries against the other groups in order; the diagonal is blank)
samples | group 1 | group 2 | group 3 | group 4 | group 5 | group 6 | group 7 | group 8 | group 9
group 1 | - | a1a2a4a7a8 | a2a3a4a8 | a1a2a4a6a7 | a2a3a5a7a8 | a1a3a5a6a8 | a1a2a4a6a8 | a1a2a4a6a7 | a2a4a5a7
group 2 | a1a2a4a7a8 | - | a1a3a4a8 | a2a3a4a7a8 | a1a3a4a8 | a3a4a6a8 | a2a4a5a7 | a1a3a4a8 | a1a2a4a7
group 3 | a2a3a4a8 | a1a3a4a8 | - | a1a2a7a8 | a1a2a4a6a7 | a1a3a5a7a8 | a1a2a4a6 | a1a2a7a8 | a3a5a8
group 4 | a1a2a4a6a7 | a2a3a4a7a8 | a1a2a7a8 | - | a2a3a5a7a8 | a2a4a5a7a8 | a2a3a5a7a8 | a2a3a5a7a8 | a2a5a7a8
group 5 | a2a3a5a7a8 | a1a3a4a8 | a1a2a4a6a7 | a2a3a5a7a8 | - | a2a3a5a7a8 | a2a3a5a8 | a2a4a5a7 | a3a4a6
group 6 | a1a3a5a6a8 | a3a4a6a8 | a1a3a5a7a8 | a2a4a5a7a8 | a2a3a5a7a8 | - | a5a6a7 | a3a4a5 | a2a3a8
group 7 | a1a2a4a6a8 | a2a4a5a7 | a1a2a4a6 | a2a3a5a7a8 | a2a3a5a8 | a5a6a7 | - | a2a4a5 | a2a4a5
group 8 | a1a2a4a6a7 | a1a3a4a8 | a1a2a7a8 | a2a3a5a7a8 | a2a4a5a7 | a3a4a5 | a2a4a5 | - | a3a4a5
group 9 | a2a4a5a7 | a1a2a4a7 | a3a5a8 | a2a5a7a8 | a3a4a6 | a2a3a8 | a2a4a5 | a3a4a5 | -

from the reduct matrix, all reducts and core attributes have been found:
f(r1) = a1∨a2∨a6; f(r2) = a1∨a2∨a5∨a8; f(r3) = a2∨a5∨a8; f(r4) = a1∨a2∨a8; f(r5) = a2∨a6∨a8; f(r6) = a1∨a2∨a6∨a8; f(r7) = a2∨a5∨a6; f(r8) = a1∨a2∨a5; f(r9) = a2∨a5∨a6∨a8.
finally, table 6 shows the features that are selected to be used for making diabetes prediction. table 6 represents the last step of the rst process, in which the data are simplified and the indiscernibility relations are stated. the * symbol means that a certain variable has no impact in a certain case; for example, if the patient's pregnancies value is (0–1), plasma glucose is (0–22), and dpf is (0–0.25), then the patient has diabetes regardless of the values of the other attributes, and so on.

table 6: indiscernibility table
samples | pregnancies | plasma glucose | insulin | dpf | age
group 1 | 0–1 | 0–22 | * | 0–0.25 | *
group 2 | 2–3 | 23–46 | 95–190 | * | 27–33
group 3 | * | 47–70 | 191–286 | 0.52–0.77 | 34–41
group 4 | * | 71–94 | 287–382 | * | 42–49
group 5 | 8–9 | 95–118 | * | * | 50–57
group 6 | * | 119–142 | * | 1.3–1.55 | 58–63
group 7 | 12–13 | 143–166 | * | 1.56–1.81 | 64–69
group 8 | 14–15 | 167–190 | 671–766 | * | *
group 9 | * | 191–199 | 767–846 | 2.04–2.42 | 76–81
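to make the rough-set step more concrete, the sketch below bins each attribute into nine equal-width intervals to form elementary sets (as in table 3) and records which attributes discern each pair of groups (as in table 4). the binning rule, the toy data, and the matrix construction shown here are assumptions for illustration only, not the authors' implementation.

```python
# illustrative rough-set step (details are assumptions): bin attributes into
# nine intervals to form elementary sets, then record which attributes
# distinguish each pair of groups, as in tables 3 and 4.

def make_bins(values, n_bins=9):
    # map raw attribute values to interval indices 0..n_bins-1
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def discernibility_matrix(profiles):
    # profiles[g] is the tuple of bin indices (one per attribute) for group g
    matrix = {}
    for i in range(len(profiles)):
        for j in range(i):
            matrix[(i + 1, j + 1)] = {f"a{k + 1}"
                                      for k, (x, y) in enumerate(zip(profiles[i], profiles[j]))
                                      if x != y}
    return matrix

# toy columns for three attributes (e.g. pregnancies, plasma glucose, age)
columns = [[0, 4, 9, 16], [10, 60, 120, 195], [22, 35, 55, 78]]
binned = [make_bins(col) for col in columns]   # one bin list per attribute
profiles = list(zip(*binned))                  # one profile per record/group
print(discernibility_matrix(profiles)[(2, 1)]) # attributes discerning groups 2 and 1
```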
4.4. step 4
in this step, the logistic regression algorithm with the stochastic gradient descent technique is applied to the features selected in the previous step. the major steps of the application are as follows:
4.4.1. dataset loading
the dataset is loaded into the model through the load_dataset() function.
4.4.2. dataset preprocessing
the dataset is preprocessed through the str_column_to_float(), dataset_minmax(), and normalize_dataset() functions accordingly.
4.4.3. dataset splitting into k folds
the dataset is split into k folds, and the train and test sets for training the model are created through the cross_validation_split() function.
4.4.4. coefficient estimation
coefficients, or weights, are the values that determine the model accuracy and can be estimated from the training data using stochastic gradient descent. the algorithm uses two parameters to estimate the weights (coefficients): the first is the learning rate, which specifies the amount by which each weight is corrected every time it is updated; the second is the number of epochs, which is the number of loops through the training process while updating the coefficients. coefficient estimation is achieved through the coefficients_sgd() function.
4.4.5. coefficient updating
for each instance in the training data, each coefficient is updated throughout all epochs. the error that the model makes is the criterion for updating the coefficients. a simple equation can be used to calculate the error (equation 12):
error = (expected output value) - (prediction made with the candidate coefficients) (12)
4.5. step 5
predictions are generated; equation 7 describes the prediction process, which is the most important part of the model. the prediction process is needed twice: first in stochastic gradient descent to evaluate candidate coefficient values, and second when the model is finalized to produce outputs (predictions) on the test data. the prediction process is achieved through the predict() function. fig. 1 shows the execution flow of the proposed approach.
4.6. step 6
finally, the results obtained are compared. fig. 1 shows the proposed diabetes prediction method.
4.7. model performance evaluation
in this research, the k-fold cross-validation technique has been used to evaluate the learned model's performance on unseen data. cross-validation is a resampling procedure used to validate machine learning models on a limited data sample. using k-fold cross-validation means that k models will be constructed and evaluated, and the model's performance is estimated through the mean model error. after rounding the predicted value of each row, which is a float number between 0 and 1, it is compared to its actual value; if they are equal, the prediction is considered a correct result. a simple error equation (equation 13) is used to evaluate each model:
accuracy = (no. of correct results / total no. of samples) × 100 (13)
the general procedure is as follows: (1) shuffle the dataset randomly; (2) split the dataset into k groups; (3) take one group as the test set and the remaining groups as the training set, repeating the same procedure for each and every group; (4) as usual, fit the model on the training set and evaluate it on the test set; and (5) retain the result (evaluation score); the model can then be discarded [17], [23]. for this work, the learning rate, training epochs, and k value are (0.1, 100, 5), respectively. after implementing the model twice, first on the dataset with all features and second with the features selected by applying rst, the results can be discussed as follows:
4.8. making predictions on the dataset with all features
the aim of using logistic regression is to predict the dependent variable (output variable) based on equation 7, and the aim of using the stochastic gradient descent technique is to minimize the error of the predicted coefficient values while training the model on the dataset. for model training, the k-fold cross-validation technique is used to split the dataset into 5 folds (groups); one fold is used as the test set and the others as training sets, for example:
• model 1: fold1 for test and fold2, fold3, fold4, and fold5 for training
• model 2: fold2 for test and fold1, fold3, fold4, and fold5 for training
• model 3: fold3 for test and fold1, fold2, fold4, and fold5 for training
• model 4: fold4 for test and fold1, fold2, fold3, and fold5 for training
• model 5: fold5 for test and fold1, fold2, fold3, and fold4 for training.
for each model, after training for 100 epochs (iterations), minimizing the errors to the desired level, and calculating the accuracy using equation 13, the overall score can be calculated using equation 14:
score = (sum of all model accuracy results) / (total no. of models) (14)
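a minimal sketch of the training and evaluation pipeline described in steps 4-6 is given below. the function names (cross_validation_split, coefficients_sgd, predict) follow the paper's description of its script, but the bodies, the toy data, and the update-rule details are illustrative assumptions rather than the authors' code.

```python
# minimal sketch of the sgd logistic regression + k-fold evaluation pipeline
# (assumed implementations; only the function names follow the paper).
from math import exp
from random import randrange, seed

def cross_validation_split(dataset, n_folds=5):
    folds, pool = [], list(dataset)
    fold_size = len(dataset) // n_folds
    for _ in range(n_folds):
        folds.append([pool.pop(randrange(len(pool))) for _ in range(fold_size)])
    return folds

def predict(row, coefficients):
    # equation 7: logistic function of a linear combination of the features
    yhat = coefficients[0] + sum(c * x for c, x in zip(coefficients[1:], row[:-1]))
    return 1.0 / (1.0 + exp(-yhat))

def coefficients_sgd(train, l_rate=0.1, n_epoch=100):
    coef = [0.0] * len(train[0])
    for _ in range(n_epoch):
        for row in train:
            yhat = predict(row, coef)
            error = row[-1] - yhat                      # equation 12
            coef[0] += l_rate * error * yhat * (1.0 - yhat)
            for i, x in enumerate(row[:-1]):
                coef[i + 1] += l_rate * error * yhat * (1.0 - yhat) * x
    return coef

def accuracy(actual, predicted):
    correct = sum(1 for a, p in zip(actual, predicted) if a == round(p))
    return correct / len(actual) * 100.0                # equation 13

seed(1)
data = [[0.2, 0.7, 0.1, 0], [0.9, 0.3, 0.8, 1], [0.4, 0.6, 0.2, 0], [0.8, 0.5, 0.9, 1]] * 10
folds = cross_validation_split(data, n_folds=5)
scores = []
for fold in folds:
    train = [r for f in folds if f is not fold for r in f]
    coef = coefficients_sgd(train)
    scores.append(accuracy([r[-1] for r in fold], [predict(r, coef) for r in fold]))
print(sum(scores) / len(scores))                        # equation 14: mean fold accuracy
```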
table 7: accuracy score of each model used
model no. | accuracy
model 1 | 73.857
model 2 | 78.431
model 3 | 81.699
model 4 | 75.816
model 5 | 75.816
score | 77.124%
fig. 1. proposed diabetes prediction method.
the total number of models used is five. table 7 summarizes the models' results and the overall score. the overall score is 77.12% for the model on the dataset with all features.
4.9. making predictions on the dataset with rst-based selected features
the same process was applied to the dataset with the features selected based on rst; the result is presented in table 8.
table 8: accuracy and score for all five models for selected features
model no. | accuracy
model 1 | 77.342
model 2 | 81.013
model 3 | 83.874
model 4 | 78.394
model 5 | 79.628
score | 80.215%
table 9 shows the comparison between the results obtained from both implementations: implementing the model on the dataset with all features and on the rst-based selected features. the results show that the rst-based selected features give more accurate predictions than the dataset with all features. the baseline score for the selected dataset is 65%; our experimental results indicate that the proposed approach increased the prediction accuracy from 65% to 77% for the diabetes dataset with all features and to 80% for the rst-based features dataset, as shown in table 10. finally, it can be summarized that implementing the logistic regression algorithm with the stochastic gradient descent technique is a suitable choice for diabetes prediction on the basis of these results. at the same time, rather than using all features, more precise predictions can be made by feature selection based on rough sets. table 11 summarizes a comparison between our work and some of the most recently published works.
5. conclusion and future work
in the health-care sector, predicting the presence or non-presence of diseases is important to help people know their health status so that they can take the necessary steps to control the disease. this paper explores the use of the stochastic gradient descent algorithm to apply logistic regression on datasets to make predictions on the presence of diabetes. the pima indian diabetes dataset is used to produce results using the proposed technique. the experimental results show that diabetes can be predicted more accurately using logistic regression with the stochastic gradient descent algorithm when rst is used to select the important features on a normalized dataset. this paper makes a real contribution in the use of interdisciplinary techniques to improve prediction mechanisms in the health-care sector in general and diabetes prediction in particular. the main purpose of this work is to show the significance of using rst with machine learning algorithms; hence, in the future, the same theory can be applied with other algorithms to obtain better results.
table 9: accuracy and score for all five models using all features and rst-based selected features
model no. | all features (accuracy) | rst-based selected features (accuracy)
model 1 | 73.856 | 77.342
model 2 | 78.431 | 81.013
model 3 | 81.699 | 83.874
model 4 | 75.816 | 78.394
model 5 | 75.816 | 79.628
score | 77.124% | 80.215%
rst: rough set theory
table 10: accuracy summary of baseline and proposed algorithms for diabetes
model name | prediction accuracy (%)
baseline score | 65
logistic regression with sgd algorithm | 77.124
rst-based logistic regression with sgd algorithm | 80.215
table 11: dataset classification comparison
works | data size | methods | accuracy (%)
[24] | 768 samples with 9 attributes | logistic regression | 77
[25] | 768 samples with 9 attributes | modified pso naïve bayes | 78.6
[26] | 768 samples with 9 attributes | modified weighted knn (sdknn) | 83.76
[27] | 768 samples with 9 attributes | random forest classifier | 79.57
our proposed method | 768 samples with 9 attributes | logistic regression with sgd algorithm | 77
our proposed method | 768 samples with 6 attributes | rst-based logistic regression with sgd algorithm | 80.215
rst: rough set theory
references
[1] "diabetesatlas". available from: https://www.diabetesatlas.org [last accessed on 2022 aug 08].
[2] m. anouncia, c. maddona, p. jeevitha and r. nandhini. "design of a diabetic diagnosis system using rough sets". cybernetics and information technologies, vol. 13, no. 3, pp. 124-169, 2013.
[3] f. e. gmati, s. chakhar, w. l. chaari and h. chen. "a rough set approach to events prediction in multiple time series". in: international conference on industrial, engineering and other applications of applied intelligent systems, vol. 10868, pp. 796-807, 2018.
[4] h. patel and d. patel. "crop prediction framework using rough set theory". international journal of engineering and technology, vol. 9, pp. 2505-2513, 2017.
[5] s. k. manga. "currency crisis prediction by using rough set theory". international journal of computer applications, vol. 32, pp. 48-52, 2011.
[6] b. b. nair, v. mohandas and n. sakthivel. "a decision tree-rough set hybrid system for stock market trend prediction". international journal of computer applications, vol. 6, no. 9, pp. 1-6, 2010.
[7] "pima-indians-diabetes-dataset". available from: https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database [last accessed on 2022 may 04].
[8] z. pawlak. "rough set theory and its applications to data analysis". cybernetics and systems, vol. 29, no. 7, pp. 661-688, 1998.
[9] p. achlioptas. "stochastic gradient descent in theory and practice". stanford university, stanford, ca, 2019.
[10] j. brownlee. machine learning algorithms from scratch with python. machine learning mastery, 151 calle de san francisco, us, 2016.
[11] h. h. inbarani and s. u. kumar. "a novel neighborhood rough set based classification approach for medical diagnosis". procedia computer science, vol. 47, pp. 351-359, 2015.
[12] e. s. al-shamery and a. a. r. al-obaidi. "disease prediction improvement based on modified rough set and most common decision tree". journal of engineering and applied sciences, vol. 13, no. special issue 5, pp. 4609-4615, 2018.
[13] r. ghorbani and r. ghousi. "predictive data mining approaches in medical diagnosis: a review of some diseases prediction". international journal of data and network science, vol. 3, no. 2, pp. 47-70, 2019.
[14] r. ali, j. hussain, m. h. siddiqi, m. hussain and s. lee. "h2rm: a hybrid rough set reasoning model for prediction and management of diabetes mellitus".
sensors, vol. 15, no. 7, pp. 15921-15951, 2015.
[15] s. sawa, r. d. caytiles and n. c. s. iyengar. "a rough set theory approach to diabetes". in: conference: next generation computer and information technology, 2017.
[16] s. ramesh, h. balaji, n. iyengar and r. d. caytiles. "optimal predictive analytics of pima diabetics using deep learning". international journal of database theory and application, vol. 10, no. 9, pp. 47-62, 2017.
[17] k. thangadurai and n. nandhini. "integration of rough set theory and genetic algorithm for optimal feature subset selection on diabetic diagnosis". ictact journal on soft computing, vol. 8, no. 2, 2018.
[18] v. talasila, k. madhubabu, k. madhubabu, m. mahadasyam, n. atchala and l. kande. "the prediction of diseases using rough set theory with recurrent neural network in big data analytics". international journal of intelligent engineering and systems, vol. 13, no. 5, pp. 10-18, 2020.
[19] t. r. gadekallu and x. z. gao. "an efficient attribute reduction and fuzzy logic classifier for heart disease and diabetes prediction". recent advances in computer science and communications (formerly: recent patents on computer science), vol. 14, no. 1, pp. 158-165, 2021.
[20] "medium". available from: https://www.medium.com/greyatom/why-how-and-when-to-scale-your-features-4b30ab09db5e [last accessed on 2022 jun 05].
[21] e. rahm and h. h. do. "data cleaning: problems and current approaches". ieee data engineering bulletin, vol. 23, no. 4, pp. 3-13, 2000.
[22] d. borkin, a. némethová, g. michal'conok and k. maiorov. "impact of data normalization on classification model accuracy". research papers faculty of materials science and technology slovak university of technology, vol. 27, no. 45, pp. 79-84, 2019.
[23] "machine learning mastery". available from: https://www.machinelearningmastery.com/k-fold-cross-validation [last accessed on 2022 aug 06].
[24] g. battineni, g. g. sagaro, c. nalini, f. amenta and s. k. tayebati. "comparative machine-learning approach: a follow-up study on type 2 diabetes predictions by cross-validation methods". machines, vol. 7, no. 4, p. 74, 2019.
[25] d. k. choubey, p. kumar, s. tripathi and s. kumar. "performance evaluation of classification methods with pca and pso for diabetes". network modeling analysis in health informatics and bioinformatics, vol. 9, no. 1, p. 5, 2020.
[26] r. patra and b. khuntia. "analysis and prediction of pima indian diabetes dataset using sdknn classifier technique". iop conference series: materials science and engineering, vol. 1070, no. 1, p. 012059, 2021.
[27] v. chang, j. bailey, q. a. xu and z. sun. "pima indians diabetes mellitus classification based on machine learning (ml) algorithms". neural computing and applications, vol. 34, no. 10, pp. 1-7, 2022.
missing value imputation techniques: a survey
wafaa mustafa hameed1,2*, nzar a. ali2,3
1technical college of informatics, sulaimani polytechnic university, sulaimani, 46001, kurdistan region, iraq, 2department of computer science, cihan university sulaimaniya, sulaimaniya, 46001, kurdistan region, iraq, 3department of statistics and informatics, university of sulaimani, sulaimani, 46001, kurdistan region, iraq
a b s t r a c t
large quantities of data are being accumulated and stored every day. a big number of missing entries in a dataset can be a serious problem faced by analysts because it can cause numerous issues in quantitative research. to handle such missing values, numerous methods have been proposed. this paper offers a review of different techniques available for the imputation of unknown data, such as median imputation, hot (cold) deck imputation, regression imputation, expectation maximization, support vector machine imputation, multivariate imputation by chained equations, the sice method, reinforcement programming, non-parametric iterative imputation algorithms, and multilayer perceptrons. this paper also explores some good choices of methods to estimate missing values for use by other researchers in this field of study. furthermore, it aims to help them figure out which approaches are commonly used now; the overview also provides a view of every technique alongside its benefits and limitations for consideration in future studies in this area. it can be taken as a baseline for answering the question of which techniques have been used and which is the most popular.
index terms: data preprocessing, imputation, mean, mode, categorical data, numerical data
corresponding author's e-mail: wafaa.mustafa@spu.edu.iq (technical college of informatics, sulaimani polytechnic university; department of computer science, cihan university sulaimaniya, sulaimaniya, 46001, kurdistan region, iraq)
received: 09-11-2022 accepted: 05-03-2023 published: 28-03-2023
access this article online
doi: 10.21928/uhdjst.v7n1y2023.pp72-81
e-issn: 2521-4217 p-issn: 2521-4209
copyright © 2023 hameed and ali. this is an open access article distributed under the creative commons attribution non-commercial no derivatives license 4.0 (cc by-nc-nd 4.0)
survey
uhd journal of science and technology

1. introduction
data mining has made impressive progress in the past years; however, a major problem is missing data or values. data mining is the field in which experimental data sets are analyzed to discover interesting and potentially useful relationships [1]. missing records or values in a dataset can affect the performance of a classifier, which results in difficulty in extracting useful information from the dataset. plenty of data is being gathered and stored every day, and those data can be used to extract interesting patterns, but the information that we collect is normally incomplete [2]. therefore, anyone wishing to apply statistical data analysis or data cleaning of any kind will have problems with missing data. we still encounter missing attribute values in feature datasets: people typically tend to leave the income field empty in surveys, for instance, and participants sometimes do not have any information available or cannot answer the question. a lot of data can also be lost in the process of collecting information from multiple sources [1]. using that information to compute statistics can then yield misleading results. hence, to eliminate the abnormalities, we need to pre-process the data before using it. such instances may be ignored when the percentage of missing values is small, but in the case of large quantities, ignoring them will not yield the desired outcome. a large number of missing entries in a dataset is a serious problem. therefore, some pre-processing of the data should be done before performing any data mining technique, to extract valuable information from a dataset while avoiding such mistakes and, as a result, improving data quality. properly dealing with missing values is a crucial and difficult task, since it requires careful examination of all occurrences of data to recognize the pattern of missingness in the data.
numerous strategies have been proposed to address such missing values since 1980 [2]. this paper illustrates the distinct varieties of missing values and the techniques used to address them. it is very important to note that there is a distinction between an empty value and a missing value: an empty value implies that no value can be assigned, whereas a missing value implies that an actual value for that variable exists but is not reachable or captured in the dataset for some reason. the data miner should distinguish between empty values and missing values, although occasionally both can be treated as missing values. missing records may be due to equipment malfunction, deletion because of inconsistency with other data, data not entered because of misunderstanding, or certain data not being considered important at the time of data collection. some data mining algorithms do not require substitution of missing values because they are designed to handle them; however, other data mining algorithms cannot deal with missing values. in any case, before applying any strategy for dealing with missing values, it is important to understand why the data are missing [2], [3].

2. missing value patterns
2.1. missing completely at random (mcar)
mcar is the strongest degree of randomness: it indicates that the pattern of missing values is completely arbitrary and does not depend on any variable, whether or not that variable is included in the analysis [3]. it refers to data whose missingness does not depend on the variable of interest or any other parameter observed in the dataset [4]. when missing values are distributed uniformly across all measurements, we consider the records to be completely randomly missing. for this reason, a quick test is to compare two pieces of data, one with missing observations and the other without; on a t-test, if there is no mean difference between the two data units, we can expect the data to be mcar [5]. this form of missing data is not often observed, and sometimes the best option is to ignore these instances, for example when water damage to paper forms due to flooding occurs before data entry [1], [2], or in a survey where 5% of responses are missing at random [6], [7]. this type is described by the equation
p_l = p(l | x, y_{o,l}, y_{m,l}) = f(l, x)
where f is a function; that is, the missing data patterns are determined only by the covariate variables x. note here that marx is equivalent to mcar if there are no covariates in the model [7], [8].
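the quick mcar check mentioned above can be sketched as follows; the variable names, toy values, and the 0.05 threshold are assumptions, and scipy's independent-samples t-test is used only as one possible tool, not as the survey's prescribed method.

```python
# rough illustration of the quick mcar check: compare the mean of an observed
# variable between rows where another variable is missing and rows where it
# is present (toy data; names and threshold are assumptions).
import numpy as np
from scipy.stats import ttest_ind

age     = np.array([21, 35, 50, 29, 41, 33, 55, 24, 38, 47], dtype=float)
insulin = np.array([94, np.nan, 168, np.nan, 88, 543, np.nan, 110, np.nan, 130])

missing = np.isnan(insulin)
t_stat, p_value = ttest_ind(age[missing], age[~missing], equal_var=False)
print("no evidence against mcar" if p_value > 0.05 else "missingness may depend on age")
```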
2.2. missing at random (mar)
a value is mar when its missingness does not depend on the unobserved values themselves [8]. often information may not be deliberately missing; however, it can still be labeled "missing at random". if the data meet the requirement that missingness does not depend on the value of x after accounting for another parameter, we may consider an entry of x to be missing at random. depressed people tend to have lower income, for instance, so reported earnings then depend on the factor depression: the percentage of missing records among depressed people will be high because depressed people have lower incomes [1]. if we get 10% of responses missing for males in a survey and 5% missing for females, then it is mar [6]. this kind is defined by the equation
p_l = p(l | x, y_{o,l}, y_{m,l}) = f(l, x, y_{o,l})
where f is a function; that is, only the covariate variables x and the dependent variables that have been observed have an impact on the patterns of missing data. keep in mind that if there is only one dependent variable y, then there is only one missing series, which does not include any observed dependent variables; for models with one dependent variable, mar is therefore equal to marx [7].

2.3. not missing at random (nmar)
if the data are not missing at random, or are missing informatively, the pattern is labeled "not missing at random." this situation happens when the mechanism of missingness depends on the actual value of the missing data [4]. this type is defined by the equation
p_l = p(l | x, y_{o,l}, y_{m,l}) = f(l, x, y_{o,l}, y_{m,l})
where f is a function; that is, all three types of variables have an effect on the missing data patterns. it is well known how full information maximum likelihood (fiml) estimation performs under all of these conditions [7].

2.4. missing in cluster (mic)
data are often more missing in some attributes than in others, and the missing values in those attributes can be correlated. it is extremely difficult to use statistical techniques to show multi-attribute correlations of missing values. with this pattern of missing values, the quality of the data is less homogeneous than with mar. the results of any analysis based on the complete data set have to be treated cautiously, because the sample data are biased in the attributes with a large number of missing values [7], [8].

2.5. systematic irregular missing (sim)
data can be missing quite irregularly but systematically. there may be missing-value correlations among the attributes, but those correlations are extremely tedious to analyze. an implication of sim is that the records with complete entries are unpredictably under-representative [7]. the quality of data with this pattern of missing values is less homogeneous than in mar and also less controllable than with mic. any analytical results based on the complete data set are highly questionable [9].

3. strategies of handling missing data
handling missing data can be carried out with two different strategies: the first is simply ignoring missing values, and the second is imputation of missing values.

3.1. ignoring missing values
the missing-data-ignoring strategy simply discards the cases that contain missing data. such methods are widely used for handling missing data. the serious problem with this strategy is that it decreases the dataset size, so it is convenient only when the dataset has a small amount of missing values. there are two common methods for ignoring missing data:
3.1.1.
listwise deletion complete case analysis approach excludes all observations with missing values for any variable of interest. this approach thus limits the analysis to those observations for which all values are observed. this techniques is simple to use but cause loss of huge data, loss of precision, high effect on variability, and induce bias. 3.1.2. pairwise deletion for all the instances, we perform analysis with in which the variables of interest are present. it does no longer exclude complete unit but uses as lots data as feasible from every unit. this method is straightforward, keeping all available values, that is, best missing values are deleted but motive the loss of data, no longer a higher solution compared to other techniques. the pattern size for every individual evaluation is better than the entire case analysis [2], [10]. 3.2. single imputation single imputation procedures produce a precise value for a dataset’s missing real value. this method necessitates a lower computing cost. researchers have proposed a variety of single imputation strategies. the typical strategy is to analyze other responses and select the greatest possible response. the value can be calculated using the mean, median, or mode of the variable’s available values. single imputation can also be done using other methods, such as machine learning-based techniques. imputed values are considered actual values in single imputation. single imputation ignores the reality that no imputation method can guarantee the true value. single imputation approaches ignore the imputed values’ uncertainty. instead, in future analysis, they recognize the imputed values as actual values [11], [12]. 3.3. multiple imputations the use of distinct simulation models, multiple imputation methods yield several values for the imputation of single missing records. those strategies use imputed data’s variability to generate a diffusion of credible responses. multiple imputation strategies are sophisticated in nature, but in contrast to single imputation, they do no longer suffer from bias values. in multiple imputations, every missing facts point is replaced with m values obtained through m iterations (wherein m > 1 and m generally sits between 3 and 10) [6]. in this technique, a statistical approach used for coping with missing values, it performs through three stages: • imputation: generate m imputed data sets from a distribution which results in m complete data sets. the distribution can be different for each missing entry. • analysis: in this stage each m imputed data sets the analysis is performed, it is known as complete data analysis. • pooling: use simple rules the output obtained after data analysis is pooled to get final result. the resulting inferences form this stage is statistically valid if the methods to create imputations are “decent.” for substituting missing values with possible solutions, the multiple imputation method is used. the missing data set is transformed into complete data set using suitable imputation methods that can then be analyzed by any standard analysis method. therefore, multiple imputations have become popular in the handling of missing data. in this method, the process is repeated multiple times for all variables having missing values as the name indicates and then analyzed to combine hameed and ali: imputation techniques uhd journal of science and technology | jan 2023 | vol 7 | issue 1 75 m number of imputed data set into one imputed data set [7], [11]. 4. missing value imputation technique 4.1. 
mean imputation
in this technique, the missing value is replaced by the mean of the known values of the corresponding attribute. this technique is easy to apply; it is built into most statistical packages and is quicker compared with other techniques. it gives reasonable results when the amount of data is small, but it does not give proper results for large data; this technique is appropriate only for mar and is not beneficial for mcar [8], [13]. assuming the value x_ij of the k-th class c_k is missing, it is replaced by
x̂_ij = (1 / n_k) Σ_{i: x_i ∈ c_k} x_ij
where n_k represents the number of non-missing values in the j-th feature of the k-th class c_k [7], [8]. (a brief code illustration of mean, knn, and mice-style imputation is given after the review tables in section 5.)

4.2. hot (cold) deck imputation
the concept in this case is to use some criterion of similarity to cluster the data before carrying out the imputation. this is one of the most used strategies. hot deck strategies impute missing values inside a data matrix by using available values from the same matrix. the item from which these available values are taken for imputation into another is referred to as the donor. the replication of values leads to the problem that a single donor might be selected to accommodate multiple recipients; the inherent risk is that too many, or even all, missing values may be imputed with the values from a single donor. to mitigate this risk, some hot deck variants restrict the number of times any one donor may be selected for donating its values. a similar technique to hot deck is the cold deck imputation method, which takes its values from a data source other than the current dataset. using hot deck, the missing values are imputed by realistically obtained values, which avoids distortion in the distribution, but there is little empirical work on accuracy estimation, and problems arise if no other sample is closely related to the record with the missing value [8], [10], [11].

4.3. median imputation (mdi)
because the mean is affected by the presence of outliers, it seems better to use the median instead, simply to ensure robustness. in this situation, the missing data are replaced by the median of all known values of that attribute within the class where the instance with the missing attribute belongs. this method is also considered a good choice when the distribution of the values is skewed. assume that the value x_ij of the k-th class c_k is missing; it will be replaced by (singh and prasad [7])
x̂_ij = median{ x_ij : x_i ∈ c_k }

4.4. regression imputation
this approach is applied by using the known values to construct a model, calculating the regression between variables, and then applying that model to estimate the missing values, for example
y = α0 + α1x
the outcomes from applying this technique are more accurate than mean imputation. the imputed data preserve deviations from the mean and the distribution shape, but the degrees of freedom get distorted and relationships can be artificially strengthened [10].

4.5. expectation maximization imputation (emi)
there are two forms of clustering algorithms, soft clustering and hard clustering:
• soft clustering: clusters may overlap; that is, with different degrees of belief, the elements belong to multiple clusters at the same time
• hard clustering: clusters do not overlap, which means an element either belongs to a cluster or not
• mixture models: a probabilistic way of doing soft clustering.
every cluster corresponds to a generative model this is usually gaussian or multinomial, mvs are imputed by realistically obtained values which avoids distortion in distribution, in this technique, bit empirical work for accuracy estimation creates problem if any other sample has no close relation in entire manner of the dataset [2]. 4.6. k-nearest neighbor imputation (knn) specifying the similarity between two values and replace the missing value with similar one using euclidean distance. the advantage of this technique that for the datasets which having both qualitative and quantitative attributes values knn is suitable. there is no need for creating a predictive model for each attribute of missing data and helpful for multiple missing values. the knn looks for the most similar instances, the algorithm searches through all of the data set and that consider as an obstacle for that approach [12]. hameed and ali: imputation techniques 76 uhd journal of science and technology | jan 2023 | vol 7 | issue 1 4.7. fuzzy k-means clustering imputation (fkmi) in this method, the membership characteristic plays an important position. it is assigned with every data item that describes in what degree the data object is belonging to the precise cluster. data items might not get assign to concrete cluster which is stated using centroid of cluster (i.e., the case of k means), that is due to the various membership degrees of every data with entire k clusters. unreferenced attributes for every uncompleted data are changing by fkmi on the premise of membership degrees and cluster centroid values. the pros of this approach is that it offers quality outcome for overlapping data, higher than k manner imputation and records objects may be a part of multiple cluster middle but the high computation time and noise sensitive, that is, low or no membership degree for noisy objects considered as a cones for the usage of this technique [10]. 4.8. support vector machine imputation (svmi) its regression primarily based technique to impute the missing values. it takes condition attributes (output) and decision attributes. then, the svmi would be carried out for prediction of values of missed condition features. advantages of this approach are the efficient in massive dimensional areas and efficient memory consumption; however, additionally, there may be a cons for using this technique which it is the bad performance if number of samples are plenty lesser than number of features [10], [14]. 4.9. most common imputation (mci) on this imputation method, clustered are first shaped by applying k-means clustering method. like in k-nn, on this method, the nearest neighbors are found using clusters. all the instances in every cluster are referred as nearest neighbor of each other. then, the missing value is imputed the usage of the same technique as is employed through knni imputation approach. this procedure is fast and therefore is ideal for applying in big datasets. this algorithm reduces the intra cluster variance to minimum. here, too value of k parameter is an important factor and is difficult to predict its value. in addition, this algorithm does no longer assure global minimal variance [2], [15], [16]. 4.10. multivariate imputation by chained equation (mice) mice expect that data are lost arbitrarily (damage). it imagines the likelihood of a missing variable depends on the watched facts. 
mice offers numerous values in the put of one lost esteem through making an arrangement of relapse (or other reasonable) models, tallying on its “method” parameter. in mice, every lost variable is treated as a variable, and other information inside the record is treated as an independent variable. at to begin with, mice foresee missing values utilizing the winning information of other factors. at that point, it replaces missing values utilizing the predicted values and makes a dataset known as ascribed dataset. by cycle, it makes numerous ascribed datasets. every dataset is at that factor analyzed utilizing standard measurable investigation techniques, and numerous investigation comes about are given [17], [18]. 4.11. sice technique it pretends the probability of a missing variable depends on the determined data. it gives multiple values within the place of one missing value through creating a sequence of regression models, each missing variable is treated as a dependent variable, and different data in the record are treated as an independent variable, it predicts missing data using the existing data of other variables. then, it replaces missing values using the predicted values and creates a dataset known as imputed dataset. it achieves 20% higher f-measure for binary data imputation and 11% less errors for numeric data imputations than its competitors with similar execution time. it imputes binary, ordinal and numeric data. it performed well for the imputation of binary and numeric data and fantastic preference for missing data imputation, especially for massive datasets where mice is impractical to use because of its complexity but it could not show better overall performance than mice for the case of ordinal data [6]. 4.12. reinforcement programming impute missing data using learning a policy to impute data thru an action-reward-based totally experience imputes missing values in a column by operating best on the identical column (similar to univarite single imputation) however imputes the missing values within the column with different values thus keeping the variance in the imputed values. it is usually used for dynamic approach for the calculation of missing values using machine learning procedures. it has functionality of convergence and to solving imputation problem through using exploration and exploitation [19], [20]. 4.13. utilizing uncertainty aware predictors and adversarial learning mlp ua-adv. impute the missing values so that the adversarial neural network cannot distinguish real values from imputed ones. in addition, to account for the uncertainty of imputed values, the usage of confidence scores acquired from the adversarial module. the adversarial module objectives to discriminate imputed values from real ones the resulting imputer in addition to estimating a missing entry with high accuracy, it hameed and ali: imputation techniques uhd journal of science and technology | jan 2023 | vol 7 | issue 1 77 table 1: short review with mentioning to the advantage and disadvantage of different techniques to handle missing value techniques note advantages limitations references leastwise deletion technique deletion of cases containing missing values (complete row is deleted) high missing information because of deletion of entire row high impact on variability loss of precision and induce bias. simple to use. 
loss of precision, loss of enormous data ‑ high effect on variability, induce bias [2], [10] pairwise deletion technique deletion of records best from column containing missing values much less lack of information by using keeping all available values less impact on variability less loss of precision and induce bias. keeping all available values only missing values are deleted. simple to use. not a better solution as compared to other methods. loss of data, [2], [10] mean imputation technique calculate the mean of missing value through using the corresponding attribute value. replace mvs with the mean of facts resultant may be better than that of original. it is built in maximum of the statistical bundle and quicker comparing with other techniques. it introduce precise result when facts is small it provides not proper result for large facts this technique is appropriate for only mar but no longer beneficial for mcar ‑ affected by the presence of outliers. [3], [8] median imputation (mdi) technique missing data replaced by the median of all observed values of that attribute in the class where the features belongs. good choice when the distribution of the values is skewed. ‑ not affect by the presence of outlier [7] hot (cold) deck imputation technique cluster the data earlier than executing the data imputation. impute missing values inside a data matrix by way of the usage of available values from the equal matrix avoid distortion in distribution. empirical for accuracy estimation. creates problem if any other sample has no close relation in entire manner of the dataset. [8], [10], [11] regression imputation technique use the known values for the construction between variables then applying the technique to calculate the missing values very easy and simple technique. calculated data saves deviations from mean and distribution shape only applicable if data is linearly separable that is there is linear relation between attributes. degree of freedom gets distort and may raises relationship. [10] expectation maximization (em) technique ‑ iterative method, finds maximum likelihood two steps: expectation (e step), maximization (m step) using three models soft, hard and mixture clustering iteration goes on until algorithm converges mvs are imputed by realistically obtained values which avoids distortion in distribution bit empirical work for accuracy estimation, creates problem if any other sample has no close relation in entire manner of the dataset [2] fuzzy kmeans clustering imputation (fkmi) technique it is assigned with every data item that describes in what degree the data object is belonging to the precise cluster. unreferenced attributes for every uncompleted data are substituted by fkmi on the basis of membership degrees and cluster centroid values. best outcome for overlapping data, better than k means imputation. data objects may be part of more than one cluster center high computation time. noise sensitive, that is, low or no membership degree for noisy objects [10] support vector machine imputation (svmi) technique takes condition attributes (here, decision attribute i.e., output) and decision attributes (here, conditional attributes) svmi then would be applied for prediction of values of missed condition attribute ‑ efficient in large dimensional spaces. ‑ efficient memory consumption poor performance if number of samples are much less than number of feature [10], [14] (contd...) 
hameed and ali: imputation techniques 78 uhd journal of science and technology | jan 2023 | vol 7 | issue 1 table 1: (continued) techniques note advantages limitations references k nearest neighbour imputation (knn) technique determining the similarity between two values and replace the missing data with similar one using euclidean. avoids distortion in distribution as missing values are imputed by realistically obtained values no need for creating a predictive model. helpful for multiple missing value obstacle approach since the algorithm search all of the data set prediction of value of k is quite a difficult task. [12] most common imputation (mci) technique it replaces the missing value by the most common attribute or by the mode. while the numerical attribute missing value replaced by the average of the mean corresponding attribute fast and good for applying in big dataset. reduce the intra cluster variance to minimum. ‑ difficult to predict the value if the number of elements too big. dose not guarantee global minimum variance. [2], [15], [16] multivariate imputation by chained equation (mice) technique it pretends the probability of a missing variable depends on the observed data. it provides multiple values in the place of one missing value by creating a series of regression models, each missing variable is treated as a dependent variable, and other data in the record are treated as an independent variable predict missing data using the existing data of other variables. then it replaces missing values using the predicted values and creates a dataset called imputed dataset flexibility: each variable can be modeled using a model tailored to its distribution. can manage imputation of variables defined only on a subset of the data, can also incorporate variables that are functions of other variables, it does not require monotone missingdata patterns. lacking a theoretical rationale ‑ difficulties encountered when specifying the different imputation models [17], [18] sice technique: it is an extension of the popular mice algorithm. two variants of sice presented: sicecategorical and sicenumeric to impute binary, ordinal, and numeric data. twelve existing performance of algorithms implemented to predict house prices imputation methods and compare their performance with sice. achieves 20% higher fmeasure for binary data imputation and 11% less error for numeric data imputations than its competitors with similar execution time. impute binary, ordinal, and numeric data. performed better for the imputation of binary and numeric data. excellent choice for missing data imputation, especially for massive datasets where mice is impractical to use because of its complexity it could not show better performance than mice for the case of ordinal data. [6] reinforcement programming technique impute data through an action rewardbased experience imputes missing values in a column by working only on the same column but imputes the missing values in the column with different values thus keeping the variance in the imputed values. it is generally used for dynamic approach for the calculation of missing values by using machine learning approaches. performs well compared to other univarite single imputation and mlbased imputation approaches. use of numeric data variables only [19], [20] (contd...) 
hameed and ali: imputation techniques uhd journal of science and technology | jan 2023 | vol 7 | issue 1 79 be able to confuse the adversarial module, it neural network based totally architecture that can train properly with small and large datasets and to estimate the uncertainty of imputed data [19], [21]. 5. review on missing value imputation methods table 1. table 1: (continued) techniques note advantages limitations references utilizing uncertainty aware predictors and adversarial learning mlp uaadv imputer train well with small and large datasets and utilizes a novel adversarial strategy to estimate the uncertainty of imputed data proposed a novel hybrid loss function that enforces the imputers to generate values for missing data that on the one hand, obey the underlying data distribution so that it can confuse the welltrained adversarial module, and on the other hand, predict existing nonmissing values accurately the run time of the methods shows that they are efficient and have less execution time in comparison with that of peer imputer models. plays an important role in the overall performance less runtime compared to other imputers has a very simple structure, can work with any feature type and small and large data set it did not consider the imbalanced nature of the imputation task. [19], [21] table 2: comparing different techniques according to the dataset used in the application datasets techniques notes references iris mean regression imputation; reinforcement programming technique a comparison of different approaches of mice methods on iris datasets. efficiency gain with multiple imputations combined with regression is that it can better use the available information by accommodating nonlinarites [3], [8], [10], [18], [19], [20] iris credits adults mean/mode; hot deck; expectation maximization; knearest neighbor in this paper, the authors compare c5.0 with this newly developed technique known as iitmv and show its performance on different data sets [3], [8], [10], [11], [12], [22] cleveland heart zoo buhl1300 glass ionosphere iris pima sonar waveform2 wine hayesroth led7 solar soybean mean/mode; regression; hot deck; mlp uaadv the result shows that multilayer perceptions (mlp) with different learning rules show better results with quantitative datasets than classical imputation methods. in this paper, the type of missing value is missing completely at random (mcar) [3], [8], [10], [11], [19], [21], [22] iris escherichia coli breast cancer 1 breast cancer 2 mean knearest neighbors (knn) fuzzy kmeans (fkm) multiple imputations by chained equations (mice) mlp uaadv the results show that different techniques are best for different datasets and sizes. mice are useful for small datasets, but, for big ones and fkm are better, the mlp uaadv is better for both small and big datasets [3], [8], [10], [12], [17], [18], [19], [21], [23] hameed and ali: imputation techniques 80 uhd journal of science and technology | jan 2023 | vol 7 | issue 1 6. conclusion the finding of this article summarized in tables 1 and 2, the article shows that the most popular techniques (mean, knn, and mice) are not necessarily the most efficient. it isn’t always surprising for mean in regards to the simplicity of the method: the technique does not make use of the underlying correlation structure of the information and for that reason plays poorly. knn represents a natural improvement of mean that exploits the observed facts structure. 
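purely as an illustration that is not part of the survey, the snippet below shows how the three most discussed families above (mean imputation, knn imputation, and a mice-style iterative imputer) can be tried side by side on a toy matrix; the scikit-learn classes and parameters are assumptions and do not correspond to any implementation evaluated in the cited works.

```python
# compact, illustrative comparison of mean, knn, and mice-style imputation
# on a toy matrix (library choice and parameters are assumptions).
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0, np.nan],
              [3.0, np.nan, 6.0],
              [5.0, 8.0, 9.0],
              [np.nan, 4.0, 3.0]])

imputers = {
    "mean":      SimpleImputer(strategy="mean"),
    "knn":       KNNImputer(n_neighbors=2),
    "mice-like": IterativeImputer(max_iter=10, random_state=0),
}
for name, imputer in imputers.items():
    print(name, imputer.fit_transform(X).round(2))
```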
mice are complex algorithm and its behavior seems to be related to the size of the dataset: rapid and efficient on the small datasets, its overall performance decreases and it becomes time-intensive when carried out to the massive datasets. the more than one imputation combined with bayesian regression gives better performance than other strategies, which includes mean, knn. however, they only taken into consideration the great of imputation based totally on category strategies without worrying of the execution time that may be an exclude criterion. consequently, fkm may additionally represent the technique of choice but its execution time may be a drag to its use and we take into account bpca as a more adapted solution to high-dimensional data, the article also shows that the mlp ua-adv consider a good choice for large and small data set also with different data type. table 2 shows comparison between the techniques according applications and the dataset used in each one. the strength of this paper that its cover most of the missing value imputation techniques that can be taken into consideration as a reference for other researcher to pick out the most appropriate techniques or make combination from a couple of for imputing the missing values. references [1] b. doshi. handling missing values in data mining. rochester institute of technology, rochester, new york, u s a, 2010. available from: https://www.pdfs.semanticscholar.org/3817/ b208fe1f40891cc661ea0db80c8fccc56b70.pdf [last accessed on 2023 mar 27]. [2] s. gupta and m. k. gupta. “a survey on different techniques for handling missing values in dataset”. international journal of scientific research in computer science, engineering and information technology, vol. 4, no. 1, pp. 2456-3307, 2018. [3] a. jadhav, d. pramod and k. ramanathan. “comparison of performance of data imputation methods for numeric dataset”. applied artificial intelligence, vol. 33, no. 10, pp. 913-933, 2019. [4] j. scheffer. “dealing with missing data”. research letters in the information and mathematical sciences, vol. 3, pp. 153-160, 2002. [5] d. v. patil. “multiple imputation of missing data with genetic algorithm based techniques”. ijca special issue on evolutionary computation, vol. 2, pp. 74-78, 2010. [6] s. i. khan and a. s. hoque. “sice: an improved missing data imputation technique.” journal of big data, vol. 7, no. 1, p. 37, 2020. [7] s. singh and j. prasad. “estimation of missing values in the data mining and comparison of imputation methods.” mathematical journal of interdisciplinary sciences, vol. 1, no. 2, pp. 75-90, 2013. [8] i. pratama, a. e. permanasari, i. ardiyanto and r. indrayani. a review of missing values handling methods on time series data, in: international conference on information technology systems and innovation (icitsi). bandung, bali, ieee, 2016, p. 6. [9] s. wang and h. wang. mining data quality in completeness. university of massachusetts dartmouth, united states of america, 2007. available from: https://www.pdfs.semanticscholar.org/347c/ f73908217751c8d5c617ae964fdcb87674c3.pdf [last accessed on 2023 mar 27]. [10] r. l. vaishnav and k. m. patel. “analysis of various techniques to handling missing value in dataset”. international journal of innovative and emerging research in engineering, vol. 2, no. 2, pp.  191‑195, 2015. [11] a. raghunath. survey sampling theory and applications. academic press, cambridge, 2017. [12] holman and c. a. glas. “modelling non-ignorable missing-data mechanisms with item response theory models”. 
british journal of mathematical and statistical psychology, vol. 58, no. 1, pp. 1-17, 2005. [13] a. puri and m. gupta. “review on missing value imputation techniques in data mining. international journal of scientific research in computer science, engineering and information technology, vol. 2, no. 7, pp. 35-40, 2017. [14] s. van buuren and k. groothuis-oudshoorn. “mice: multivariate imputation by chained equations in r”. journal of statistical software, vol. 45, no. 3, pp. 1-67, 2010. [15] a. s. kumar and g. v. akrishna. “internet of things based clinical decision support system using data mining techniques”. journal of advanced research in dynamical and control systems, vol. 10, no. 4, pp. 132-139, 2018. [16] j. w. grzymala-busse, l. k. goodwin, w. j. grzymala-busse and x. zheng. handling missing attribute values in preterm birth data sets. vol. 3642. united nations academic impact, new york, 2005, pp. 342-351. [17] j. han, m. kamber and j. pei. data mining: concepts and techniques. 3rd ed. morgan kaufmann publishers, san francisco, ca, usa, 2012. [18] g. chhabra, v. vashisht and j. ranjan. “a comparison of multiple imputation methods for data with missing values”. indian journal of science and technology, vol. 10, no. 19, pp. 1-7, 2017. [19] s. e. awan, m. bennamoun, f. sohel, f. sanfilippo and g. dwivedi. “a reinforcement learning-based approach for imputing missing data”. neural computing and applications, vol. 34, pp. 9701-9716, 2022. [20] i. e. w. rachmawan and a. r. barakbah. optimization of missing value imputation using reinforcement programming, in: hameed and ali: imputation techniques uhd journal of science and technology | jan 2023 | vol 7 | issue 1 81 international electronics symposium (ies). institute of electrical and electronics engineers, piscataway, new jersey, 2015, pp. 128-133. [21] w. m. hameed and n. a. ali. “enhancing imputation techniques performance utilizing uncertainty aware predictors and adversarial learning”. periodicals of engineering and natural sciences, vol. 10, no. 3, pp. 350-367, 2022. [22] t. aljuaid and s. sasi. intelligent imputation technique for missing values, in: conference on advances in computing, communications and informatics (icacci). jaipur, india, pp. 24412445, 2016. [23] p. schmitt, j. mandel and m. guedj. “a comparison of six methods for missing data imputation”. journal of biometrics and biostatistics, vol. 6, no. 1, pp. 1, 2015. tx_1~abs:at/tx_2:abs~at uhd journal of science and technology | jan 2023 | vol 7 | issue 1 43 1. introduction the topic of text processing has drawn the interest of numerous scholars as a result of the rising prevalence of digital texts in modern life. the amount of research in the domain of kurdish text processing seems to be rather minor, despite significant efforts with some of the most popular languages, such as english, persian, and arabic. commonly, the language experts divided the used languages of the world over families which are by ascending: indoeuropean, sino-tibetan, niger-congo, austronesian, and some other families. the indo-european family is the biggest family which speaks by the majority of europe, the lands where the europeans migrated, as well as a large portion of south-west and south asia. this family divided into sub-families [1]. kurdish language dialects are part of the north-western branch of the indo-iranic language family. the kurdish language is an independent language that has its own linguistic continuum, historical origins, grammar rules, and extensive live linguistic skills. 
the “median” or “proto-kurdish” language is where the kurdish language originated. approximately 30 million people in high land of middle east, kurdistan, talk numerous dialects of kurdish [1]. kurdish is referred to be a dialectical continuity, which means that it has a variety of dialects, it actually has four primary dialects (groups) and sub dialects, including (kurmanjí or kurmanji zhwrw and badínaní) in the north of kurdistan and sorani or kurmanji khwarw in the center kurdish kurmanji lemmatization and spell-checker with spell-correction hanar hoshyar mustafa, rebwar m. nabi technical college of informatics, sulaimani polytechnic university, sulaimani, kurdistan region, iraq a b s t r a c t there are many studies about using lemmatization and spell-checker with spell-correction regarding english, arabic, and persian languages but only few studies found regarding low-resource languages such as kurdish language and more specifically for kurmanji dialect, which increased the need of creating such systems. lemmatization is the process of determining a base or dictionary form (lemma) for a specific surface pattern, whereas spell-checkers and spell-correctors determine whether a word is correctly spelled also correct a range of spelling errors, respectively. this research aims to present a lemmatization and a word-level error correction system for kurdish kurmanji dialect, which are the first tools for this dialect based on our knowledge. the proposed approach for lemmatization is built on morphological rules, and a hybrid approach that relies on the n-gram language model and the jaccard coefficient similarity algorithm was applied to the spell-checker and spell-correction. the process results for lemmatization, as detailed in this article, rates of 97.7% and 99.3% accuracy for noun and verb lemmatization, correspondingly. furthermore, for spell-checker and spell-correction, accordingly, accuracy rates of 100% and 90.77% are attained. index terms: kurdish language, kurmanji dialect, kurdish lemmatizer, kurdish spell-checker and spell-correction, kurdish dataset corresponding author’s e-mail:  hanar hoshyar mustafa, technical college of informatics, sulaimani polytechnic university, sulaimani 46001, kurdistan region, iraq. e-mail: hanar.hoshyar.m@spu.edu.iq received: 23-10-2022 accepted: 05-01-2023 published: 22-02-2023 o r i g i n a l re se a rc h a rt i c l e uhd journal of science and technology access this article online doi: 10.21928/uhdjst.v7n1y2023.pp43-52 e-issn: 2521-4217 p-issn: 2521-4209 copyright © 2023 mustafa and nabi. this is an open access article distributed under the creative commons attribution non-commercial no derivatives license 4.0 (cc by-nc-nd 4.0) 44 uhd journal of science and technology | jan 2023 | vol 7 | issue 1 mustafa and nabi: kurdish lemmatizer and spell corrector of kurdistan (sulaimani and mukrayani). kurmanji and sorani are indeed the two main dialects [2]. additionally, the other two important divisions of kurdish language are goraní (hawrami, zazayee and shabak) and luri (mamasani, kurmanshani and kalhuri). furthermore, these are categorized into dozens of dialects and subdialects [3]. this paper focuses on the northern kurdish dialect which is (kurmanji or kurmanji zhwrw) dialect which has the biggest number of speakers in comparison to other kurdish languages dialects [4]. several studies have been done related to common languages such english [5], [6], arabic [7]-[9], and persian [10]-[12]. 
moreover, there are few studies which are consummated regarding kurdish language [13], [14], despite it, a huge gap can be seen in the case of kurdish kurmanji dialect; therefore, this study has been aimed to serve this gap due to kurmanji dialect in the case of creating lemmatization and spell-checker with spell-correction system. hence, in the future, this study can be used in several applications that include data translation, sentence retrieval, document retrieval, and also can be extend and upgrade to more powerful similar systems. this study presented a toolkit, which consists of a lemmatization system and a spell-checker with spellcor rection for kurdish kur manji. t he aim of the lemmatization is to find a root or dictionary form (calls a lemma) for a specific surface form. it is crucial to be able to normalize words into their most basic forms, particularly for languages with rich morphology such as kurdish language, to better assist processes such as search engines and linguistic case studies. spell-checking algorithms are one of the lemmatizer’s most commonly used applications. with using a spell checker, the system suggests a rating of suggested corrections for each possibly incorrect word. this study presented a combination algorithm which are n-gram language model together with jaccard similarity coefficient for the spell-checker and spell-correction system. furthermore, a rule-based method on the kurdish kurmanji morphological rules is used in creating the lemmatization system. based on the literature and to the best of our knowledge, no study has been conducted regarding the spell-checking and lemmatization systems in kurdish kurmanji dialect. therefore, our study can be the base for further studies for kurdish kurmanji dialect. 2. related work there has been a huge amount of research that has been conducted regarding the word lemmatization, spell-checker, and spell-correction in several common languages, such as english, persian, and arabic. however, when it comes to kurdish language, a large absence can be observed, especially in lemmatization and spell-checking with spell-correction system in kurdish kurmanji dialect. in the case of lemmatizer in english language lemma chase which is a lemmatizer is created [5] address the problems of the most widely used lemmatizers currently available, this research presents a lemmatization model. this model accounts for the nominalized/derived terms for which no lemmatizer currently in use is able to produce the proper lemmas. identifying the morphological structure of any input english word, and in particular understanding the structure of the derivational word, is the main issue in developing a lemmatizer. finding the derivational suffix from morphing words and then extracting the dictionary base word from that derived word is another crucially difficult problem for a lemmatizer. some derivative terms are not handled by well-known and well-liked lemmatizers to retrieve their basis words. lemma chase, the mentioned lemmatizer, accurately retrieves the base word while taking into account the word’s part of speech, several classes of suffix rules, and effectively executing the recoding rules utilizing the wordnet dictionary. all of the derivational and nominalized word forms that are present in any standard english dictionary are successfully used by lemma chase to construct the base word form. in addition, there have been numerous studies on spell checkers in arabic. 
for instance, build fast and accurate lemmatization for arabic [7] which is a study that covers the need for a quick and precise lammatization to improve arabic information retrieval (ir) outcomes and the difficulty of developing a lemmatizer for arabic, since it has a rich and complex derivational morphology. introduces a new data set that can be used to verify lemmatization accuracy as well as a powerful lemmatization algorithm that works more accurately and quickly than current arabic lemmatization techniques. numerous studies have been published on the use of spell checkers and spell correction in persian as well. for example, automated misspelling detection and correction in persian clinical text [10] is an article that explains the creation of an automatic method for identifying and fixing misspellings in persian free texts related to radiology and ultrasound. uhd journal of science and technology | jan 2023 | vol 7 | issue 1 45 mustafa and nabi: kurdish lemmatizer and spell corrector three distinct forms of free texts associated to abdominal and pelvic ultrasound, head-and-neck ultrasound, and breast ultrasound reports are utilized using n-gram language model to accomplish their aim. for free texts in radiology and ultrasound, the system obtained detection performance of up to 90.29% with correction accuracy of 88.56%. the findings suggested that clinical reports can benefit from high-quality spelling correction. significant cost reductions were also made by the system throughout the documentation and final approval of the reports in the imaging department. kurdish stemmer pre-processing for improving information retrieval conducted by researcher in [13]. this article introduces the kurdish stemming-step method. it is a method that links search phrases and indexing terms in kurdish texts that are connected by morphology. in actuality, the occurrence of words demonstrates a supportive role for the classification process. even though it was planned to produce more or fewer errors to demonstrate the complexity and difficulty of words in the kurdish sorani dialect, the handling of similarity changes was implemented, which helped to boost matching among words and decrease the storage requirements. however, the stemmer used in this work was capable of resolving most of these issues. there are many stop words with added affixes in kurdish sorani writings. therefore, by combining these commonly occurring stop words, it can be stemmed. in addition, it was determined that employing partial words during the pre-processing stage was preferable. likewise building a lemmatizer and a spell-checker for kurdish sorani presented by [14]. this study also presented a lemmatization and word-level error correction system for kurdish sorani. it suggested a hybrid strategy focused on n-gram language modeling and morphological principles. systems for lemmatization and error detection are referred to as peyv and renus, respectively. the peyv lemmatizer is created based on the morphological rules, and for renus, it corrects words both with using a lexicon and without using a lexicon. it indicates that these two basic text processing methods can lead the way for more study on additional natural language processing applications for kurdish sorani. last but not least, intensive literature search has been conducted but no studies have been found considering the kurdish kurmanji dialect. 
therefore, this article’s primary goal is to propose a lemmatization and word-level spell checker with correction method for a kurdish language dialect known as kurmanji. the benchmark of this paper is [14] which is useful for the research study, despite the different algorithms used in spell-correction tool, the lemmatization tools are nearly similar in using the methods and approaches, both studies suggest a hybrid strategy based on n-gram language model and morphological principles. this study employs the python programming language to process data as well as to create a word processing system that performs lemmatization and spell checking with spell correction at the word level. 3. methods and data this section describes dataset collection, data preparation, and algorithms as well as approaches which have been used in lemmatization and spell checker. 3.1. dataset collection a model dataset was produced in order to carry out this study. the dataset was created by reading books and articles written in the kurdish kurmanji dialect, which were then manually recorded and added to the dataset. kurdish kurmanji dialect words include verbs, nouns, conjunctions, stop words, pronouns, imperative words, superlative words, and question words. there are around 1200 words in the dataset. fig. 1 depicts the dataset’s data amounts in a pie chart. this split results from the differing morphological rules for nouns and verbs, which affect how nouns and verbs are lemmatized. the third dataset has a large number of words that do not accept any affixes. furthermore, it contains a few special terms with only one or two letters. some of the conjunction words, for instance, are written with only one or two letters. 3.2. data preparation the most important features that indicated that the dataset was ready for analysis were its unity and quality. furthermore, fig. 1. dataset quantity pie chart. 46 uhd journal of science and technology | jan 2023 | vol 7 | issue 1 mustafa and nabi: kurdish lemmatizer and spell corrector because the dataset is the primary first-hand collected dataset, it can ensure that the dataset is clean and has no duplicates. the dataset is then divided into three subsets. the first subset includes nouns. the second subset includes verbs, while the third subset contains pronouns, stop words, conjunctions, imperative words, superlative words, and question words. all of the subsets were stored in separate excel files, each with two columns: the id column and the data (word) column. except for the third subset, which contains the verbs, it has four columns: id, chawg, qad, and rag. the id column contains a unique id for each row; the chawg column contains the verb’s base; the qad column contains the verb’s past root; and the rag column contains the verb’s present root. table 1 presents the structure of the third (verb) excel sheet. 3.3. implementation this section describes the approaches and methods used according to noun lemmatization, verb lemmatization and spell-checker. 3.3.1. lemmatization lemmatizations for nouns and verbs are developed separately, after obtaining the fundamental morphological rules in kurdish kurmanji. each of noun and verb lemmatization use different approaches based on the morphological rules. for the lemmatizations a pruning method is used to find out the root of the input word. 
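as an illustration of how this three-part dataset could be consumed by the tools described next, the following minimal sketch loads the subsets with pandas; the file names and the "data" column label are our assumptions for illustration (the verb columns chawg, qad, and rag follow table 1), not artifacts released with this paper.

```python
# minimal sketch, assuming the three subsets are stored as excel files;
# file names and the "data" column label are hypothetical.
import pandas as pd

nouns = pd.read_excel("nouns.xlsx")    # columns: id, data (the noun itself)
others = pd.read_excel("others.xlsx")  # pronouns, stop words, conjunctions, ...
verbs = pd.read_excel("verbs.xlsx")    # columns: id, chawg (base), qad (past root), rag (present root)

# simple lookup structures used by the lemmatization sketches below
noun_roots = set(nouns["data"].astype(str))
other_words = set(others["data"].astype(str))
verb_table = verbs[["chawg", "qad", "rag"]].astype(str).to_dict("records")
```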
in the background of the system, each process is contained in a module inside the system, as a result to eliminate complexity and increase simplicity, also to made the system more readable and understandable. the following subsections clarify each of noun and verb lemmatizations in detail. 3.3.1.1. noun lemmatization according to the noun lemmatization, the noun lemmatization was created after clarifying and writing down all the rules in accordance with nouns in kurdish kurmanji dialect. a pruning method is used in this study. the input word to the system went through multiple stages and processes until the system found the proper root for the input noun, which is called a lemma in lemmatization process. during the process of noun lemmatization, predefined affixes and nouns in the dataset are used to find a proper lemma for an input noun. the only condition is to enter the word with the correct spelling. when a noun was entered, a search algorithm was used to look for it in the dataset. if the entered noun was a root without any affixes, the system determined that the input was correct and that no further processing was necessary. the output word in the outcome would be the base of the entered word. fig. 2 shows the flowchart diagram of this process. in other cases when the entered noun is with or attached to some affixes, in this study in the noun lemmatization module, three sets of affixes were defined. first set included prefixes that write before the noun without attaching to the noun directly, in kurdish kurmanji, there are some prefixes that write with a space separated with the noun. second set included the prefixes which are write and attached directly to the beginning of the noun without any space. moreover, the last set included suffixes which are directly attached to the end of the noun. the entered noun went through multiple processes to find the root out. the system first removes any prefixes which are table 1: structure of verb‑dataset verb dataset column include id data id chawg base of verb qad past root of verb rag present root of verb fig. 2. noun lemmatization first process flowchart. uhd journal of science and technology | jan 2023 | vol 7 | issue 1 47 mustafa and nabi: kurdish lemmatizer and spell corrector attached or not attached prefixes to the word, then search in the dataset, if there was no matching for the entered noun, the system decided that it might attached to some suffixes too, then the word went through another process which removed the possible suffixes attached to the noun, after that a search process look to find out if there was any matching word in the dataset, if any matching word found in the dataset, it would be the return root as the result. this process showed in a flowchart diagram in fig. 3. although there were no words that matched, the system made an effort and forwarded the entered noun to a procedure designed to remove prefixes and suffixes one at a time. in another sense, it took away the first prefix attached and looked for a matching root; if no matching root was discovered, it took away the first suffix attached and looked once more. it continued the process until the root was discovered if a matching root had not yet been discovered and there were further prefixes and suffixes linked to the word. at the end, when there were no more affixes, the entered noun was well spelled and the noun root existed in the dataset, the system gave the correct output lemma (root) for the entered noun. 
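a minimal sketch of the pruning strategy just described is given below; the affix lists are short placeholders rather than the full kurmanji inventory, and the function is our own illustration of the approach, not the published implementation.

```python
# minimal sketch of the noun pruning strategy described above; noun_roots is a
# set of dictionary nouns, and the affix lists are placeholders, not the full
# kurmanji affix inventory used in the paper.
ATTACHED_PREFIXES = ["ne"]            # example only
SUFFIXES = ["eka", "an", "ek", "a"]   # examples only

def lemmatize_noun(word: str, noun_roots: set) -> str:
    candidate = word
    while True:
        if candidate in noun_roots:   # root (lemma) found
            return candidate
        # try removing one attached prefix, then search again
        for p in ATTACHED_PREFIXES:
            if candidate.startswith(p) and len(candidate) > len(p):
                candidate = candidate[len(p):]
                break
        else:
            # no prefix removed: try removing one suffix
            for s in SUFFIXES:
                if candidate.endswith(s) and len(candidate) > len(s):
                    candidate = candidate[: -len(s)]
                    break
            else:
                # no affix left and no match in the dataset
                return "input word is not in the dataset"
```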
however, the system would replay with the message “input word is not in the dataset” if there was no match between the entered noun and nouns in the dataset. fig. 4. shows the process’ flowchart diagram. following these steps, the user sees the procedures’ output, as depicted in figs. 5 and 6. in fig. 5, the true word (کچ) (kiç) which means (girl) with two kurdish kurmanji prefixes (ەک) (ek) and (ا) (a) in the form of means (the (kiçeka) (کچەکا) girl who) entered. the system replayed with (“found”, “کچ”); “found” denotes that the entered word is correct and already exists in the dataset, and “کچ” is the base root of word (کچەکا). however, in fig. 6, the user inputted the incorrect term (کجان) (kican) in the meaning of but with (girls) (kiçan) (کچان) incorrect ending of followed by a ,(ç) (چ) rather than (c) (ج) correct prefix (ان) (an). due to the incorrect spelling of the word, which confounded the system and prevented it from locating the specific base root of the word, the system replayed with the message “input word is not in the dataset.” 3.3.1.2. verb lemmatization verb lemmatization also implemented in a pruning method as the noun lemmatization. after kurdish kurmanji dialect verb morphological rules are defined, the verb lemmatization is applied. the input verb went across several procedures until the tool selected and found the proper root. due to the kurdish verb’s morphology, the addition of prefixes and suffixes to the verb roots, and their ability to alter meaning, finding the root of the verb during the lemmatization process is more difficult and different than finding the root of a noun. therefore, simply omitting the suffix is worthless. fig. 3. noun lemmatization second process flowchart, phase 1. fig. 5. noun lemmatization of a legitimate noun. fig. 4. noun lemmatization second process flowchart, phase 2. fig. 6. noun lemmatization of an incorrect spelled noun. 48 uhd journal of science and technology | jan 2023 | vol 7 | issue 1 mustafa and nabi: kurdish lemmatizer and spell corrector in kurdish language morphology, each verb has three states includes its critical state, which is called (chawg) in kurdish morphology; in this state, every verb ends with an (n) (ن) letter at the end of the word; the (n) (ن) is called (the n of chawg) that determines the critical state of the verb. another state is when the verb turns into its past state, which is called the “past root,” and this is done by removing the (n of chawg) at the end of the verb. whenever the verb is in the past state, it can be used in the past tense. the final state is present, and it has several rules to modify a verb critical state and turn it into its present root. when the verb is changed to its present root, it can be used in the present tense [2]. when the input was processed by the system, any affix containing the verb had to be removed. as a result, three sets of affixes are defined, which include suffixes, prefixes that do not attach to the verb, and prefixes that attach to the verb directly. after removing affixes, the remaining verb had to be compared with the verb dataset in the system. as it is clarified in the verb dataset excel file, there were four columns included (id, chawg, qad, and rag), in which chawg referred to the critical state of the verb, qad referred to the past state, and rag referred to the present state. after the state of the verb was recognized and found, the system returned the critical state, which is the chawg of the verb as the base root of the entered verb. fig. 
7 shows the process of finding the root of a verb if the entered verb is already a root; no matter in which tense it appears, the system returns the base root of it. moreover, fig. 8 depicts the processes for locating a verb root if the entered verb is attached to some affixes; the processes are identical to those for locating a root of a noun attached to affixes in noun lemmatization. after completing these stages, the user sees the output of the procedures, as shown in figs. 9-11. in fig. 9, the true word denotes the present tense of (dixom) (دخۆم) the verb (خارن) (xarin) (eat), while the prefix (د) (d) indicates the present term of the verb and the suffix (م) (m) is the pronoun that denotes (i). the system repeated (“found”, “خارن”) in the output, where “found” implies that the word is correctly spelled and that its present root, which is (خۆ) (xo), is available in the dataset, and is the base root for the entered word. in addition, in ”خارن“ fig. 10, the past tense of the same word (خارن) (eat) is entered fig. 7. verb lemmatization first process pseudo code. fig. 9. verb lemmatization of correct present tense of verb (خارن) (xarin) (eat). fig. 8. verb lemmatization second process pseudo code. fig. 10. verb lemmatization of true past tense of verb (خارن) (xarin) (eat). fig. 12. query term bi-gram frequency calculation pseudo code. fig. 11. verb lemmatization of a wrong spelled negative imperative of verb (خارن) (xarin) (eat). uhd journal of science and technology | jan 2023 | vol 7 | issue 1 49 mustafa and nabi: kurdish lemmatizer and spell corrector as (خارمەڤە) (xarmeve), which means (i ate). this time, there are two suffixes: (م) (m), which is the pronoun associated to (i), and (ەڤە) (eve), which indicates that the event occurred and ended completely in the past. once more, the system verified that the word root was correctly spelled that it was included in the dataset; it also displayed the base root of the term. furthermore, in fig. 11, entered the wrong negative imperative phrase (مەخر) (mexir) instead of (mexo) (مەخۆ) or (mexu) (مەخو) which means (don’t eat), but with the improper ending of (ر) (r), rather than (ۆ) (o) or (و) (u). the system displayed the message “input word is not in the dataset” according to the word’s incorrect spelling, which confused the system and prohibited it from finding the precise base root of the word. 3.3.2. spell checker and spell correction the spell checker and spell cor rection mechanisms collaborated in two stages in this study: first, the spell checker indicated whether the word was correct or incorrect, and second, the spell correction process corrected the word by suggesting some correct words by providing the most likely correct word forms. after the word entered the system, it was detected if it was true or not by the spell checker’s check for word frequency in the dataset (including the whole of the three files). the step of finding that the word is true or detecting the word as wrong was done based on using n-grams. the input word, which is called the query term in this paper, is fragmented into bi-grams (two grammatical units). a bi-gram is an n-gram for n = 2. in this study, a 2-g (or bi-gram) is a two-letter sequence of letters. the bi-grams sequences “ha,” “ap,” “pp,” and “py,” for instance, are two-letter grammatical sequences extracted from the word (happy). 
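a possible helper for this bi-gram fragmentation is sketched below (the frequency comparison and the correction step are described in the following paragraphs); the function name is ours.

```python
# minimal sketch: fragment a query term into bi-grams, as in the "happy" example above.
def bigrams(term: str) -> list:
    return [term[i:i + 2] for i in range(len(term) - 1)]

print(bigrams("happy"))  # ['ha', 'ap', 'pp', 'py']
```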
after the bi-grams of the query term are produced, the system calculates their frequencies against the bi-grams of each word in the dataset, which is called a dictionary term in this paper. fig. 12 shows the process of calculating the frequency of the query-term bi-grams in comparison to the dictionary terms. after this calculation, the system looks up the frequencies of the query-term bi-grams; if one of the frequencies is equal to zero, the word is detected as wrong, since one of its bi-grams has no occurrence among the dictionary terms, and if none of the frequencies is zero, the word is detected as true. hence, if a query term equals one of the index terms in the dataset, the word is accepted as true, and the system presents "the word is true spelled" as the result. after the query term is detected as wrong, its bi-grams are passed to the spell correction procedure. the wrong word is then corrected based on the jaccard similarity coefficient, which is popularly used to compare how close two terms are to one another. here, the similarity measurement is used to find the most comparable terms recorded in the dataset when a query does not match any index in the dataset. using the jaccard similarity coefficient [15], equation (1) gives its definition:

jaccard sim(a, b) = |a ∩ b| / |a ∪ b|   (1)

that is, the jaccard similarity coefficient between two sets is the number of features common to both divided by the number of features in their union [15]. the mechanism works on the query term, while the dictionary terms include all three files of the dataset. the spell correction takes the query term and looks for a matching dictionary term in the first file; if it does not exist there, the term is sent to the lemmatization files in turn, because it may be the root of a noun or a verb, or a noun or verb with affixes, in which case the affixes must be removed before checking whether it is spelled correctly. if the word is found after this check, the system marks it as a true word. otherwise, the spell checker predicts candidate words from the dataset's three files, ranks them by the matching degree computed with the jaccard coefficient algorithm, keeps the candidates whose matching degree is greater than the spell checker's threshold, and finally returns the five candidates with the highest matching degrees. the threshold in this study is 0.15. it was chosen based on the accuracy of the guesses for the correct word, that is, the highest matching words in the dataset for a wrong query term. in kurdish kurmanji, there are words with only three letters; if such a word is written incorrectly by missing a letter, only two letters remain. hence, the threshold should be as small as possible to obtain accurate results. 4. results and discussion this section presents the results of the algorithms in both the lemmatization and spell checker tools and discusses the benchmarking against the benchmark study of this research. 4.1.
noun lemmatization to improve the efficiency and accuracy of the noun lemmatization tool, two random words were chosen with 50 uhd journal of science and technology | jan 2023 | vol 7 | issue 1 mustafa and nabi: kurdish lemmatizer and spell corrector their derivatives which were nine derivatives of (چیا) (mountain) word and 12 derivatives of word. the (boy) (کوڕ) results of lemmatization process of both words were successfully giving correct root in both nine derivatives of first word and 12 derivatives in second word. to ensure the accuracy of noun lemmatization, another 66 random words with possible derivatives were chose and entered into the system; therefore, the noun lemmatization gave correct result in 63 cases out of 66, which means that the noun lemmatization algorithm had an accuracy of approximately 95.45% in lemmatizing words. overall, the accuracy of the noun lemmatization process was approximately about 97.7%. table 2 presents accuracy in noun lemmatization tool. 4.2. verb lemmatization to evaluate the efficiency and accuracy of the verb lemmatization tool, two sets of random verb forms were tested with the tool. the test sets included different verb forms such as present and past tense, imperative and negative imperative, passive, and negative. regarding the verb’s existence in the dataset dictionary, the verb lemmatization tool found the correct root of the input verb. each verb in the test set was entered with all possible derivations made with specific prefixes and suffixes. the first set included 171 different forms of different verbs. the lemmatization tool lemmatized 169 of them correctly; the wrongly lemmatized ones were due to the ordering of the dataset; in the case of imperative and negative imperative of a verb, the lemmatized verb rag was coming before the purposed verb rag, so the system took the first verb rag before it reached the purposed one. for example, the kurdish verb “send” has two forms: and both have the same (nartin) (نارتن) and (nardin) (ناردن) rag (نێر) (“nêr”). if a user entered the imperative tense of this verb, which is (بنێرە) (binêre), and expected to see the base root of in the result, the system replays (nardin) (ناردن) with the base root of because it was recorded (nartin) (نارتن) before the other form (ناردن) (nardin) in the dataset excel file. moreover, it is due to the system that, when it finds a result, it stops without going to the other verbs in the dataset. moreover, it is due to the system, when it finds a result, the system stops and the result appears without going to the other data in the dataset. as a result, the accuracy of lemmatizing the first set was 98.83 percent. in the order of the other set, there were 131 different forms of different verbs with different tenses. due to this set, the lemmatization tool lemmatized all of them, which means it gave the correct root for each of the forms. it can be said that with the two test sets, the verb lemmatization tool overall gave approximately 99.4 percent accuracy. table 3 shows the accuracy of the verb lemmatization tool. 4.3. spell checker and spell correction according to calculate and analyze the accuracy of the spellchecker and spell-correction tool, the process of analyzation is more complex, due to connecting the spell-checker and spell-correction tool with the lemmatization tools. as described in the above section, there was three datasets, so the spell-checker and spell-correction accuracy should be calculated according to all the datasets. 
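for clarity, the correction mechanism whose accuracy is evaluated in this section (section 3.3.2: jaccard similarity over bi-gram sets, a 0.15 threshold, and the five best candidates) can be sketched as follows; the function names are ours, dictionary_terms stands for the union of the three dataset files, and bigrams() is the helper sketched earlier.

```python
# minimal sketch of the correction step from section 3.3.2: rank dictionary
# terms by jaccard similarity over bi-gram sets, keep those above the 0.15
# threshold, and return the five best candidates.
def jaccard_sim(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if (a | b) else 0.0

def suggest(query: str, dictionary_terms: set, threshold: float = 0.15, k: int = 5) -> list:
    q = set(bigrams(query))
    scored = [(jaccard_sim(q, set(bigrams(t))), t) for t in dictionary_terms]
    scored = [st for st in scored if st[0] > threshold]
    return [t for _, t in sorted(scored, reverse=True)[:k]]
```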
as described, the mechanism first checks whether the input word is correct or not; the spell-checker tool was tested with three groups of data drawn from the three dataset files. these three groups included 100 words from the first dataset file, 100 nouns from the second dataset file, and 100 verbs from the third dataset file, respectively. the result always returned true, meaning the input word was spelled correctly, whenever the word existed in the dataset. hence, the spell-checker tool returned the correct answer in all cases. table 4 shows the accuracy of the spell-checker tool. for the spell-correction tool, a set of random words including nouns, verbs, and other word types was tested; the nouns and verbs included all forms with prefixes and suffixes as well as simple nouns and verbs without affixes. the results show that whenever a bi-gram of the original correct word appeared in the input word, there was a higher chance of obtaining the correct or most similar word as a result; the more bi-grams of the intended word that appear in the input word, the higher the similarity degree and the more accurate the outcome.

table 2: accuracy in noun lemmatization tool
sets      true lemmatization   false lemmatization   total   accuracy (%)
1st set   21                   0                     21      100
2nd set   63                   3                     66      95.45
total     84                   3                     87      97.7

table 3: accuracy in verb lemmatization tool
sets      true lemmatization   false lemmatization   total   accuracy (%)
1st set   169                  2                     171     98.83
2nd set   131                  0                     131     100
total     300                  2                     302     99.3

table 4: accuracy in spell checker tool
sets      true spell checking  false spell checking  total   accuracy (%)
1st set   100                  0                     100     100
2nd set   100                  0                     100     100
3rd set   100                  0                     100     100
total     300                  0                     300     100

on several occasions, incorrect lemmatization occurred because of an incorrect input word, and this led to incorrect spell-correction, which in the end lowered the accuracy of the outcome. to evaluate the efficiency of the spell-correction, sets of misspelled random words with different forms were tested manually. the first set included 61 misspelled nouns; the spell-corrector, with the help of the noun lemmatization, achieved an accuracy of 90.16% in the correction process. the second set contained 80 misspelled verbs; here the spell-corrector, using the verb lemmatization, gave a correction accuracy of 88.75%. the third set consisted of misspelled pronouns, stop words, conjunctions, imperative words, superlative words, and question words; out of 107 words, the spell-correction system corrected 100 successfully, an accuracy rate of 93.4%. table 5 displays the accuracy of the spell-correction tool. as shown in table 5, the third set had the highest accuracy rate among the three sets, and, as previously stated, some false correction cases occurred due to false lemmatization. it must therefore be stated that if a dataset is created with all the forms of the words in all three datasets, then more accurate results can be obtained, because the spell-corrector can directly look for the right form of the input misspelled word and find it with a high degree of certainty. 5. conclusion and future works information retrieval and text classification can benefit greatly from an effective lemmatizer. in addition, incorrect words are detected and corrected by spell-checkers and spell-correction.
this paper introduced the kurdish kurmanji lemmatizer and word-level spell-checker with spell-correction methodologies. it is the first attempt that tools of this kind have been made for kurdish kurmanji. a hybrid technique has been utilized for the spell-checker and spell-correction that depends on the n-gram language model and the jaccard coefficient similarity algorithm, also the proposed approach for lemmatization, is based on morphological principles. the outcome demonstrated that, while applying the suggested approach, the accuracy of lemmatization for each noun and verb lemmatization was assessed, respectively, at 97.7% and 99.3%. in addition, the spell-checker and spell-correction accuracy rates were 100% and 90.77%, respectively. the experimental findings show that several false correction cases were caused by incorrect lemmatization led by misspelled input words. furthermore, according to experimental findings, more accurate results may be obtained if a dataset is established with all the word forms in the datasets since the spell-checker will directly search for the correct form of the input misspelled word and discover it with a high level of equality. in the future, this work can be expanded to apply to a bigger dataset of kurdish kurmanji and utilize these approaches for nlp applications like text mining for kurdish kurmanji. as a contrast between this study and its benchmark. actually, this study is done for the kurdish kurmanji dialect, while the benchmark was done for the kurdish sorani dialect, which has completely different morphological rules in so many phases to study and implement in the system. the datasets that were used were different, while this research’s dataset is primary, first-hand, and organized in three subsets. in addition, there are some variances between them in terms of accuracy and the algorithms that have been used. this study achieved 97.7% and 99.3% accuracy for noun and verb lemmatization, respectively, while the benchmark achieved 95% and 89.4% accuracy of two test sets for noun lemmatization and an average of 86.7% accuracy for verb lemmatization. in addition, according to the spellcorrection, this study used the jaccard coefficient similarity algorithm and rated 90.77% accuracy, while the other study, as mentioned, used an edit distance algorithm and obtained 96.4% accuracy with a lexicon while, without a lexicon, the correction system had 87% of accuracy. at the end, it has to be said that the similarities can be seen in the theoretical parts and ideas, but for the practical part, a huge difference can be seen from using different programming languages; this study used the python programming language, while the other used the java programming language, up to and including recreating the system from the beginning to the end. 6. acknowledgment the authors would like to thank spu for providing the opportunity, support, and funding for this study. sulaimani, the kurdish journalist syndicate, is also thanked. table 5: accuracy in spell correction tool sets true correction false correction total accuracy (%) 1st set 55 6 61 90.16 2nd set 71 9 80 88.75 3rd set 100 7 107 93.4 total 226 22 254 90.77 52 uhd journal of science and technology | jan 2023 | vol 7 | issue 1 mustafa and nabi: kurdish lemmatizer and spell corrector references [1] z. kurdî, m.û. zarên wî and h.s. khalid. “kurdish language, its family and dialects”. 2020. available from: https://www.dergipark. org.tr/en/pub/kurdiname/issue/50233/637080 [last accessed on 2022 aug 15]. [2] d.n. 
mackenzie. “kurdish dialect studies”. oxford university press, london, 1961. available from: https://www.books. g o o g l e . i q / b o o k s / a b o u t / k u r d i s h _ d i a l e c t _ s t u d i e s _ 2 _ 1 9 6 2 . html?id=eaf2zaeacaaj&redir_esc=y [last accessed on 2022 may 31] [3] “kurdish academy of language enables the kurdish language in new horizon”. available from: https://www.kurdishacademy. org/?q=node/41 [last accessed on 2022 jun 04]. [4] n.a. khoshnaw, z.u.z. sulaimaniyah. “awer station”, 2011. available from: https://rezmanikurde.blogspot.com/2018/01/blogpost_26.html?m=1 [last accessed on 2022 jun 09]. [5] r. gupta and a.g. jivani. “lemmachase: a lemmatizer”. international journal on emerging technologies, vol. 11, no. 2, pp. 817-824, 2020. [6] d. hládek, j. staš, s. ondáš, j. juhár and l. kovács. “learning string distance with smoothing for ocr spelling correction”. multimedia tools and applications, vol. 76, no. 22, pp. 24549-24567, 2017. [7] h. mubarak. “build fast and accurate lemmatization for arabic”. vol. proceedings of the european language resources association (elra). miyazaki, japan, 2018. available from: https:// www.aclanthology.org/l18-118 [last accessed on 2022 jun 08]. [8] n. zukarnain, b.s. abbas, s. wayan, a. trisetyarso and c.h. kang. “spelling checker algorithm methods for many languages”, in proceedings of 2019 international conference on information management and technology, (icimtech), 2019, pp. 198-201. [9] a.a. freihat, m. abbas, g. bella and f. giunchiglia. “towards an optimal solution to lemmatization in arabic”. procedia computer science, vol. 142, pp. 132-140, 2018. [10] a. yazdani, m. ghazisaeedi, n. ahmadinejad, m. giti, h. amjadi and a. nahvijou. “automated misspelling detection and correction in persian clinical text”. journal of digital imaging, vol. 33, no. 3, pp. 555-562. 2019. [11] s. mohtaj, b. roshanfekr, a. zafarian and h. asghari, “parsivar: a language processing toolkit for persian,” in proceedings of the eleventh international conference on language resources and evaluation (lrec 2018), 2018. available from: https://www. aclanthology.org/l18-1179 [last accessed on 2022 aug 20]. [12] a. rashidi and m.z. lighvan. hps: a hierarchical persian stemming method. international journal on natural language computing, vol. 3, no. 1, pp. 11-20, 2014. [13] a.m. mustafa and t.a. rashid. kurdish stemmer pre-processing steps for improving information retrieval. journal of information science, vol. 44, no. 1, pp. 15-27, 2018. [14] s. salavati and s. ahmadi. “building a lemmatizer and a spellchecker for sorani kurdish”. corr, vol. abs/1809.10763, 2018. available from: https://www.arxiv.org/abs/1809.10763 [last accessed on 2021 aug 15]. [15] s. niwattanakul, j. singthongcha, e. naenudorn, and s. wanapu. “using of jaccard coefficient for keywords similarity”, in proceedings of the international multi conference of engineers and computer scientists. vol. 1, 2013. available from: https://www. data.mendeley.com/v1/datasets/s9wyvvbj9j/draft?preview=1 [last accessed on 2022 apr 08]. . uhd journal of science and technology | april 2017 | vol 1 | issue 1 1 intelligent techniques in cryptanalysis: review and future directions sufyan t. al-janabi1,2, belal al-khateeb3 and ahmed j. 
abd3 1department of information systems, college of cs and it, university of anbar, ramadi, anbar iraq 2department of computer science, college of sceince and technology, university of human development, sulaimaniya, kurdistan region iraq 3department of computer science, college of cs and it, university of anbar, ramadi, anbar iraq o r i g i n a l re se a rc h a rt i c l e uhd journal of science and technology 1. introduction the basic aim of cryptography is to transmit messages from one place to another in a secure manner. to satisfy this, the original message called “plaintext” is encrypted and sent to the receiver as “ciphertext.” the receiver decrypts the ciphertext to get the plaintext. this can be done using a cipher which is a tool that hides the plaintext and converts it to the ciphertext (and also can return back the plaintext from the ciphertext). ciphers make use of (cryptographic) keys that determine the relationship between the plaintext and the ciphertext. cryptography can be considered as assemble from security and mathematics. it is used to protect important information and ensure that this information arrives to its destination in peace without violations. ciphers gradually evolved from simple ones which are currently considered to be easily breakable such as caesar cipher through more complex cipher algorithms such as the data encryption standard (des) and the advanced encryption standard (aes) [1], [2]. on the other hand, cryptanalysis means trying to break any security system (or cipher) using unauthorized ways to access the information in that system. thus, cryptanalysis works against cryptography. the cryptanalyst tries to find any weakness in the cryptographic system to get either the source of information (plaintext) or the key used in the encryption algorithm. this process is called an attack. if this attack is successfully applied, then the cryptographic system is said to be broken. cryptography and cryptanalysis together form the field of cryptology [3], [4]. in the recent decades, cryptography developed quickly because of the development in computational resources which increased the speed and decreased the time of encryption and decryption processes. this moved cryptography from solving by hand to more and more complex computer programs that need considerably long time and sophisticated attack a b s t r a c t in this paper, we consider the use of some intelligent techniques such as artificial neural networks (anns) and genetic algorithms (gas) in solving various cryptanalysis problems. we review various applications of these techniques in different cryptanalysis areas. an emphasis is given to the use of gas in cryptanalysis of classical ciphers. another important cryptanalysis issue to be considered is cipher type detection or identification. this can be a real obstacle to cryptanalysts, and it is a basic step for any automated cryptanalysis system. we specifically report on the possible future research direction of using spiking anns for cipher type identification and some other cryptanalysis tasks. index terms: artificial neural networks, cipher identification, classical ciphers, cryptanalysis, genetic algorithms corresponding author’s e-mail: saljanabi@fulbrightmail.org received: 10-03-2017 accepted: 25-03-2017 published: 12-04-2017 access this article online doi: 10.21928/uhdjst.v1n1y2017.pp1-10 e-issn: 2521-4217 p-issn: 2521-4209 copyright © 2017 al-janabi, et al. 
this is an open access article distributed under the creative commons attribution non-commercial no derivatives license 4.0 (cc by-nc-nd 4.0) sufyan t. al-janabi et al.: intelligent techniques in cryptanalysis: review and future directions 2 uhd journal of science and technology | april 2017 | vol 1 | issue 1 techniques to solve. hence, instead of using the simple caesar cipher which needs no more than few minutes (or seconds) to be broken using brute force attack (trying every possible solution), we are using now more complex ciphers (aes, triple des, etc.) that might need hundreds (or thousands) years to break using brute force attack with the current technology. one important issue to mention is that despite the technological and mathematical complexity, the modern versions of cryptosystems still follow the same classical concepts. thus, it is still prudent to apply certain attacks on classical ciphers and study their evolution aspects before using them with more complex modern ciphers. this is quite justifiable considering the nature of intelligent techniques such as gas, artificial neural networks (anns), and evolutionary algorithms (ea). although several survey works can be found in earlier literature [5]-[7], more work is needed in this direction to shed the light on various aspects of this kind of interdisciplinary research. the aim of this paper is to review various applications of intelligent techniques in cryptanalysis problems and to investigate some possible future research directions. the remaining of this paper is organized as follows: section 2 summarizes various types of ciphers and cryptanalysis attacks in a generic way. the intelligent techniques of anns, gas, and evolutionary computation are reviewed and compared to each other in section 3. then, section 4 reviews the application of gas in cryptanalysis of classical ciphers. the issue of classification or identification of cipher type is considered in section 5. next, we present some insights regarding the future direction of using spiking anns in cipher classification in section 6. finally, the paper is concluded in section 7. 2. classification of ciphers and attacks cryptosystems can be classified in multiple approaches depending on various criteria. this can simplify the study of cryptography science and make it easier to understand and implement. at first, if we take in consideration the amount of data that can be encrypted at a time, we can then classify cryptosystems in two classes:[3] 1. block ciphers, which encrypt block of data at time like des 2. stream ciphers, which encrypt single datum (symbol, byte, or bit) at a time like caesar cipher. second, it is also possible to classify cryptosystems according to the key used in encryption and decryption processes. in this case, we can put a cryptosystem under one of the following: 1. symmetric key ciphers, where the same key is used for encryption and decryption, for example, vigenere cipher. 2. public key ciphers, where one key is used for encryption and another one for decryption, for example, rivestshamir-adleman system. third, we can classify cryptosystems depending on the history and time of invention. thus, we can put cryptosystems under one of the following: 1. classical ciphers, which are those ciphers used in the past and can be solved by hand. they became now breakable, for example, caesar cipher 2. modern ciphers, which are those complex (computerized) ciphers widely used currently and cannot be solved by hand, for example, aes. 
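as a concrete reminder of why classical ciphers such as caesar are considered breakable by exhaustive search, a toy brute-force sketch is given below; it is our own illustration and is not taken from the surveyed works.

```python
# minimal sketch: brute-force attack on the caesar cipher by trying every shift.
import string

ALPHABET = string.ascii_lowercase

def caesar_encrypt(plaintext: str, key: int) -> str:
    return "".join(ALPHABET[(ALPHABET.index(c) + key) % 26] if c in ALPHABET else c
                   for c in plaintext.lower())

def caesar_brute_force(ciphertext: str) -> list:
    # returns all 26 candidate plaintexts; a human (or a fitness function)
    # then picks the one that reads as valid language.
    return [caesar_encrypt(ciphertext, -key) for key in range(26)]

candidates = caesar_brute_force(caesar_encrypt("attack at dawn", 3))
```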
finally, another classification approach is to classify ciphers according to their building blocks. this approach is typically applied for classical ciphers to divide it into:[3] 1. substitution systems, where every character is replaced by another one, for example, monoalphabetic ciphers 2. transposition systems, where characters are rearranged rather than replaced, for example, columnar cipher. it is also possible to further classify both of the main two categories of classical ciphers: substitution and transposition ciphers. transposition ciphers can be classified into sub classes:[3], [8] • single transposition: this type transposes one letter at a time, for example, the columnar transposition, route transposition, and grille transposition ciphers • double transposition: this type transposes more than one letter at a time. substitution ciphers can be classified into sub classes as follows:[3], [9] • monoalphabetic substitution ciphers: in this type of encryption techniques, one letter of plaintext is represented by one letter in ciphertext, and one ciphertext letter represents one and only one plaintext letter, so it is the simplest for m of substitution techniques. monoalphabetic substitution includes direct monoalphabetic, reversed monoalphabetic, decimated monoalphabetic, and mixed monoalphabetic ciphers • polyalphabetic substitution ciphers: in this type of sufyan t. al-janabi et al.: intelligent techniques in cryptanalysis: review and future directions uhd journal of science and technology | april 2017 | vol 1 | issue 1 3 encryption, one letter of plaintext is represented by multiple ciphertext letters, and one ciphertext letter represents multiple plaintext letters. there are two types of polyalphabetic substitution ciphers: periodic (where there is a keyword repeating along plaintext like the vigenere cipher) and non-periodic (where there is no repeating key, e.g., the running key cipher) • polygraphic substitution ciphers: in this type of substitution, more than one plaintext letters are encrypted at a time by more than one ciphertext letters. this includes digraphic, trigraphic, and tetragraphic ciphers. examples of these ciphers are the playfair cipher and hill cipher • homophonic substitution ciphers: in this type of substitution, one plaintext letter is represented by multiple ciphertext letters or characters, and every ciphertext letters or characters can only represent one plaintext letter, for example, the nomenclator cipher. furthermore, it is possible to define combinations of transposition and substitution ciphers to produce more secure systems. such combinations are used to avoid the weaknesses in pure transposition and pure substitution systems. a classical example of such combined ciphers is when we combine simple substitution with a columnar transposition. in modern cryptography, ciphers are designed around substitution and transposition principles simultaneously. fig. 1 depicts various types of classical systems. similarly, we can also classify cryptanalysis attacks. actually, there are many types of such attacks. some of them can be considered as general types, while others are specific for certain ciphers, protocols, or implementations. here, we are not going to try to list all attack types rather we are only interested in some generic ways for classifying attacks. it is possible to generically classify attacks based on the amount of information available to the attacker. 
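to make the transposition building block above concrete before turning to attacks, a toy columnar transposition can be sketched as follows (again our own illustration, not from the surveyed literature).

```python
# toy columnar transposition: write the plaintext row-wise under the key and
# read the columns in the alphabetical order of the key letters.
def columnar_encrypt(plaintext: str, key: str) -> str:
    cols = len(key)
    rows = [plaintext[i:i + cols] for i in range(0, len(plaintext), cols)]
    order = sorted(range(cols), key=lambda i: key[i])
    return "".join("".join(row[i] for row in rows if i < len(row)) for i in order)

print(columnar_encrypt("wearediscovered", "zebra"))
```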
the amount of information that attacker have is important to make any attack so the cryptanalyst should determine what is available in his hand. accordingly, we are going to have cipher text only, known plaintext, chosen ciphertext, chosen plaintext, adaptive chosen plaintext, adaptive chosen ciphertext, and related key attacks. alternatively, we might generically classify attack according to the computational resources (time, memory, and data) required by these attacks [3], [10]. 3. intelligent techniques in this section, we review the relevant intelligent techniques of anns, genetic algorithms (gas), and evolutionary computation. we also give a brief comparison on their characteristics an application scope. a. anns anns are numerical models that use a gathering of basic computational units called neurons that connect with each other to build a network. there are many types of anns; each type is suitable for one or more problems depending on the problems itself. hence, the important thing in anns is how to design the topology of ann that can better describe the problem then solving it using very simple principles to obtain very complex behavior [5], [11]. anns can model human brains and use nervous system to solve the problems by learning it with true examples and giving a chance to generalize all solutions. since the nature of anns that simulate the brain and use parallel processing rather than serial computation, we can put anns in multiple fields according to the huge capabilities that anns can introduce. these fields include classification, approximation, prediction, control, pattern recognition, estimation, optimization, and others. when using ann for solving a problem, the following steps should be chosen carefully to make ann works in an effective way: design of ann topology, choosing suitable learning way, and setting the inputs. there are many ann topologies such as:[12] • feed-forward anns • recurrent anns • hopfield annfig. 1. most important classical cipher types sufyan t. al-janabi et al.: intelligent techniques in cryptanalysis: review and future directions 4 uhd journal of science and technology | april 2017 | vol 1 | issue 1 • elman and jordan anns • long short-term memory • bi-directional anns • self-organizing map • stochastic ann • physical ann. there are three generations of neuron models [13]. the first generation of anns also called perceptrons, which are composed each of two sections: sum and threshold. the sum part receives input from a set of weighted synapses. then, it performs a threshold function on the result of the sum. the input and the output have values that may be equal to either 0 or 1, as shown in fig. 2. the second generation of anns is composed by two stages: • sum of values that are received through weighted synapses • sigmoid function evaluator whose input is the result of the sum previously computed. in this generation, the inputs can be any real-valued number, and the output is defined by the transfer function. for example, the sigmoid unit limits outputs to [0; 1], whereas the hyperbolic function produces outputs in the range [1; 1], as shown in fig. 3. the third generation of anns is composed by spiking neurons: neurons which communicate through short signals called spikes. this generation has two main differences when compared with the previous two generation. at first, this generation introduces the concept of time in the simulation, while earlier, the neural networks were based on abstract steps of simulation. 
second, such neurons present similarities to biological neurons, as they both communicate using short signals, which in biology are electric pulses (spikes), also known as action potentials, as shown in fig. 4. the spike train generation can be gaussian receptive fields [14], poisson distribution [15], or directed spike generation [16]. indeed, the applied training algorithm for anns is usually the backpropagation [17], while spiking anns use spikeprop [18]. b. gas gas are considered to be one of the best ways to solve a problem, for which there is only a little knowledge. hence, they work well in any search space. all that is required know is what the solution is needed to be able to do well, and a ga will be able to create a high-quality solution. gas apply the both principles of selection and evolution to produce several solutions to a given problem [19]. gas are better applied in an environment in which there is a very large set of candidate solutions and in which the search space is uneven and has many hills and valleys. although gas will do well in any environment, they will be greatly outclassed by more situation-specific algorithms in the simpler search spaces. therefore, gas are not always the best choice. sometimes, they can take quite a while to run and are therefore not always feasible for real-time use. however, they are considered to be among the most powerful methods with which to (relatively) quickly create high-quality solutions to a problem. the proper selection of appropriate mutation operators and fitness functions is necessary for implementing a successful attack [19], [20]. in fact, gas are adaptive heuristic search algorithms based on the evolutionary ideas of natural selection and genetics. fig. 2. the first generation of artificial neural networks[13] fig. 3. the second generation of artificial neural networks[13] fig. 4. the third generation of artificial neural networks[13] sufyan t. al-janabi et al.: intelligent techniques in cryptanalysis: review and future directions uhd journal of science and technology | april 2017 | vol 1 | issue 1 5 thus, they represent an intelligent exploitation of a random search used to solve optimization problems. they exploit historical information to direct the search into the region of better performance within the search space. the basic techniques of the gas are designed to simulate processes in natural systems necessary for evolution, especially those follow the principle of “survival of the fittest.” this is based on our understanding of nature where competition among individuals for scanty resources results in the fittest individuals dominating over the weaker ones [19]. c. evolutionary computation simply, evolutionary computation simulates evolution on a computer. the result of such a simulation is a series of optimization algorithms. these are usually based on a simple set of characteristics. optimization iteratively can improve the quality of solutions to some problem until an optimal (or at least feasible) solution is found. evolutionary computation is an umbrella term that includes gas, evolution strategies, and genetic programing [21]. d. differences between anns, gas, and evolutionary computation an ann is a function approximator. to approximate a function, you needs an optimization algorithm to adjust the weights. an ann can be used for supervised learning (classification and regression) or reinforcement learning and some can even be used for unsupervised learning. 
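for reference, the first two neuron generations described in section 3.a can be reduced to the following sketch of a single unit; it is our own illustration of the weighted-sum-plus-transfer-function structure, not code from the reviewed systems.

```python
# minimal sketch of a single neuron from the first two ann generations:
# a weighted sum followed by either a hard threshold (generation 1) or a
# sigmoid transfer function (generation 2).
import math

def perceptron_unit(inputs, weights, bias=0.0):
    s = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1 if s >= 0 else 0            # first generation: output in {0, 1}

def sigmoid_unit(inputs, weights, bias=0.0):
    s = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-s))    # second generation: real-valued output in [0, 1]
```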
gas are optimization algorithms. in supervised learning, a derivative-free optimization algorithm such as a ga is slower than most optimization algorithms that use gradient information. thus, it only makes sense to evolve neural networks with gas in reinforcement learning; this is known as "neuroevolution." the advantage of neural networks such as multilayer perceptrons in this setup is that they can approximate any function with arbitrary precision when they have a sufficient number of hidden nodes. an evolutionary algorithm (ea) performs a randomized beam search, which means its evolutionary operators generate candidates to be tested and compared by their fitness. those operators are usually nondeterministic, and they can be designed so that they find both candidates in close proximity and candidates that are farther away in the parameter space, to overcome the problem of getting stuck in local optima. eas are slow because they rely on unsupervised learning: eas are told that some solutions are better than others but not how to improve them. neural networks are generally faster, being an instance of supervised learning: they know how to make a solution better using gradient descent within a function space over certain parameters, which allows them to reach a valid solution faster. neural networks are often used when there is not enough knowledge about the problem for other methods to work. 4. cryptanalysis of classical ciphers using gas there are many approaches and tools used in the field of cryptanalysis. one of the successful approaches that has achieved promising results is based on gas. this is mainly due to the nature of gas, which allows the large space of candidate solutions to be narrowed down, leading to an optimal or near-best solution from this group of solutions. gas use a fitness function to evaluate each solution, then select the best solution or best group of solutions to generate further child solutions, and so on until the cipher is broken. in this section, we report on some interesting aspects of applying gas in cryptanalyzing classical ciphers. a. cryptanalysis of monoalphabetic substitution ciphers the ga attack on such ciphers can be implemented by generating initial keys consisting of permutations of the set of letters. these keys are generated randomly, and after decrypting the ciphertext with each generated key, the fitness value of each key can be measured. then, pairs of keys with high fitness values are selected, and the crossover operation is applied between the selected keys to produce new, improved child keys. after the crossover operation is completed, some keys are selected for mutation to enhance their attributes, by choosing a random position in a selected key and replacing it with another. after the two operations are completed, the loop is repeated until a suitable stopping condition is met [22]. b. cryptanalysis of playfair cipher to attack the playfair cipher using gas, the individuals must be defined: each individual contains one possible key of the cipher, and each individual has its own fitness value. one individual is represented as a 5*5 matrix whose positions contain the alphabet characters distributed randomly. after the generation of the individuals is completed, the selection operation begins according to each individual's fitness, so that the individual with the highest fitness value is placed at the top of the ranking. after the selection process is completed, the reproduction or crossover operation begins, producing new child keys that may have attributes better than their parents.
the crossover operation is implemented by filling sufyan t. al-janabi et al.: intelligent techniques in cryptanalysis: review and future directions 6 uhd journal of science and technology | april 2017 | vol 1 | issue 1 the positions of the child with character of the parents or mutating the child by replacing characters positions locally. the loop continues until meeting the stopping condition. however, the recovery of the plaintext is not easy to implement usually, for several reasons. one is that words that have double letters may not be counted correctly, due to the fact that the double letters might be split up. second, because i and j share a position in the key (typically), all the words that have is and js in them have to be checked using both letters, if the dictionary is fully implemented. third, the plaintext has no white space to delimit words so being able to tell where words end and begin can be difficult [23]. c. cryptanalysis of vernam cipher gas can be used for attacking the vernam cipher by building a dictionary of words that consist of words that are frequently used in english (e.g., they, the, and when). then, the fitness value is calculated according to the following steps:[24] 1. initialize the parameters of the ga and maximum number of iteration 2. generate random keys which are the population of chromosomes as the 0th generation; each key is a vector with size equal to ciphertext size 3. decrypt the ciphertext by all generated keys 4. calculate the fitness function for each chromosome by adding the square value of repeated three letters and four letters which are available in built dictionary. the calculation of fitness function deals with the probability of existing of the three and four letter words in normal english 5. sort the keys based on decreased fitness values 6. apply the crossover operator to the parent keys and produce a new generation. here, a simple two-point crossover can be perfor med. further more, apply mutation operation by generating two random positions and replace the two letters in these positions by others letters randomly 7. the best key is used to decrypt ciphertext to get the best-decrypted text. d. cryptanalysis of vigenere cipher to attack vigenere cipher using gas, we should determine the number of attributes that the ga takes as parameters or inputs such as population size, number of individuals tenured per generation, number of random immigrants per generation, number of generations, key length, maximum key length, ciphertext length, known text length, and number of runs per mutation operator combination. these parameters may be used together or some of them might be ignored. the key length parameter is very important, so it must be firstly identified [25]. e. cryptanalysis of transposition ciphers gas are very useful to break classical transposition ciphers by finding the sequence of characters that the transposition cipher used. this particular class of algorithms can be used because the automated breaking of such ciphers is very difficult. in spite of that, a number of statistical tools aiding automated breaking have been developed for substitution ciphers, cryptanalysis of transpositions is usually considered to be highly interventionist and demands some knowledge of the likely contents of the ciphertext to give an insight into the order of rearrangement used. thus, genetic cryptanalyst enables a known plaintext attack to be successfully made, based on only small portion of some plaintext/ciphertext [26]. 5. 
identification of classical cipher type the typical sequence of steps needs to be followed by cryptanalyst to break any cryptosystems is:[27] 1. the cryptanalyst should determine if the text encrypted by any cipher or it is compressed or generated randomly 2. the cryptanalyst should determine the language of the text 3. the cryptanalyst should determine the type of cipher used in encryption process 4. the cryptanalyst should determine the key used in encryption process 5. the cryptanalyst then uses the key with encrypted data to extract the original data. when the cryptanalyst wants to identify the cipher type (having just a ciphertext), he/she should extract some features that can lead to estimating the type of cipher. the list below shows a group of features that may help the cryptanalyst in the estimation process: 1. frequency analysis: every language has frequency characteristics for its characters such that each character has repeating ratio recognizing it from other characters in normal texts. in english, for example, the letter “e” has the greatest frequency ratio (12.70), but the letter “x” has the lowest (0.15) [8]. frequency analysis can be done based on single letter frequency and/or multiple letter frequency (double, triple, etc.). fig. 5 depicts the typical frequency distribution of single letters in normal english text. frequency analysis is very useful in differentiating between transposition ciphers and sufyan t. al-janabi et al.: intelligent techniques in cryptanalysis: review and future directions uhd journal of science and technology | april 2017 | vol 1 | issue 1 7 substitution ciphers. frequency analysis can be used in three main directions:[28] • the first one is to compute the frequency of ciphertext letters and compare it with the frequency of the original data such that compare the frequency of the letter in ciphertext and natural text and compute the changing in two texts • the second direction is to compute the frequency of ciphertext letters and find which letters in normal text have the same repeating ratio such that if the letter “j” in ciphertext has the same repeating ratio of the letter “a” in the original text, we can say the letter “a” is encrypted by the letter “j.” • third one is to use frequency analysis to compute if there is any shifting occurs in ciphertext characters such that when the letter “x” gives the same ratio of letter “a,” this indicates that possibly the caesar cipher which encrypts “a” by “x” has been used 2. ciphertext length: the length of ciphertext plays an important role in identification of cipher type where some ciphertext length is exactly divisible by 2 like the playfair cipher case. other ciphers (e.g., hill cipher) can produce ciphertext length divisible by 3, etc. 3. ciphertext characters number: some ciphers employee few number of characters such the baconian cipher which uses just two letters “a” and “b” in encryption process and the playfair cipher that uses 25 letters 4. repeating sections: periodic polyalphabetic substitution ciphertext has repeating sections with a constant period. this feature can help to identify this type of ciphers [29], [30] 5. ab-ba feature: ciphertext may contain double sections with its reverse such as “xy” and “yx.” this feature appears in ciphertext produced from playfair cipher [31] 6. ciphertext characters type: some ciphers employee just letters in encryption process another cipher employee letters and numbers [9] 7. 
adjacent characters: it can be useful to check if there are any adjacent characters have the same value [28]. 6. future research directions this work lies within a larger team project aiming to design and implement a general cryptanalysis platform for pedagogical purposes. considering the architectural design of the proposed general cryptanalysis platform, the platform has a number of components or modules including the supervisory module, the crypto-classifier, parallel cryptanalysis modules, feedback and reporting module, graphical analyzer, and the steganography module. here, we are mainly interested in the crypto-classifier module that is responsible for the identification and classification of the ciphertext type. at least, two levels of classification need to be implemented:[32] 1. level 1 crypto-classifier: in this module, a first level classification of the considered ciphertext needs to be done so as to decide the general cryptographic category (e.g., classical cipher, block cipher, and public-key cipher) of it. information obtained from various resource need to be used, and some intelligent classification techniques (such as artificial intelligence, genetics, and neural networks) have to be developed 2. level 2 crypto-classifier: in the second level of classification, specific algorithm(s) or cipher(s) should fig. 5. frequency distribution of single letters in normal english text[28] sufyan t. al-janabi et al.: intelligent techniques in cryptanalysis: review and future directions 8 uhd journal of science and technology | april 2017 | vol 1 | issue 1 be assigned for the ciphertext in accordance with the classification done at the first level. for example, if the classifier of level 1 deduced that the ciphertext belongs to the category of block ciphers; level 2 classifier job is to decide which specific block cipher has been used (e.g., des, aes, and twofish). besides the information deduced by different means, some distinguishing characteristics for different ciphers must be known. concerning the future research, we are specifically interested in using the estimation capabilities of anns to identify the ciphers type. as mentioned previously, anns use parallel processing rather than serial computation. this behavior may enable us to move from typical statistical techniques of analyzing any cipher to more powerful generations that provide many solutions at a time. thus, the analyzing process will depend on how to model ann fig. 6. data flow of the proposed artificial neural network-based cipher identification process sufyan t. al-janabi et al.: intelligent techniques in cryptanalysis: review and future directions uhd journal of science and technology | april 2017 | vol 1 | issue 1 9 in the correct way and manage the training processes rather than spend the time in mathematical computation of the cipher. fig. 6 shows the data flow of the proposed estimation process. the ann box will have two types of inputs; the first one is a group of training data and the second is a group of testing data. these two groups are managed by anns to correct errors produced from estimation process. anns would use supervised learning to estimate the cipher type. the number of neurons in the input, hidden, and output layers depend on the number of ciphers used and how much the analyst can extract features from ciphertext. several previous works on using anns and other techniques for cipher type classification can be found [33]-[37]. 
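as a purely illustrative sketch of this estimation pipeline (and not the architecture or features proposed here), the following python example trains a small supervised mlp to separate two toy cipher families using only single-letter frequencies as input; the corpus, the labels, and the layer sizes are placeholder assumptions:

```python
# toy cipher-type identification: frequency features -> supervised mlp.
import random
import string
from collections import Counter

import numpy as np
from sklearn.neural_network import MLPClassifier

ALPHABET = string.ascii_lowercase
random.seed(0)

def caesar(text, k):
    # shift every letter by k positions (a toy monoalphabetic substitution)
    return "".join(ALPHABET[(ALPHABET.index(c) + k) % 26] for c in text)

def features(text):
    # single-letter frequency vector, the simplest feature listed in section 5
    counts = Counter(text)
    return [counts[c] / len(text) for c in ALPHABET]

# toy corpus; a real classifier would be trained on large language samples
plaintexts = ["attackatdawn", "meetmeatnoon", "thequickbrownfox",
              "sendmoretroops", "defendtheeastwall", "retreatatonce"] * 20

x, y = [], []
for p in plaintexts:
    x.append(features(caesar(p, random.randrange(1, 26))))
    y.append("substitution")
    x.append(features(p[::-1]))            # reversal as a trivial transposition
    y.append("transposition")

clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000, random_state=0)
clf.fit(np.array(x), y)

print(clf.predict([features(caesar("breakthisciphertext", 3))]))
# usually reported as "substitution" on this toy setup
```

in the envisioned system, the input features would be the richer set discussed in section 5, and the output classes would cover the classical (and, later, modern) cipher types.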
however, to the best of authors’ knowledge, we could not see specific previous work on using spiking anns for this task. hence, our focus will be directed to this specific application of spiking anns. in the first stage, classification of classical ciphers will be considered. in the next stages, other modern cipher types will be taken into consideration also. 7. conclusion this work is mainly concerned in building automatic tools for various cryptanalysis tasks. this definitely requires the use of suitable intelligent techniques such as gas and anns. the focus here has been on using gas for cryptanalysis of classical ciphers and adoption of anns for cipher type identification. more specific results of cipher classification based on spiking anns are going to be presented in a subsequent paper. references [1] b. carter, and t. magoc. “introduction to classical ciphers and cryptanalysis.” a technical report, 11 sep. 2007. available: http://www. citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.125.8165. [feb. 5, 2017]. [2] r. j. anderson. security engineering: a guide to building dependable distributed systems, usa: john wiley & sons, 2010. [3] w. stallings. cryptography and network security principles and practice, 6th ed, upper saddle: pearson education, inc., 2014. [4] m. j. banks. “a search-based tool for the automated cryptanalysis of classical cipher.” meng. thesis, department of computer science, the university of york, 2008. [5] s. ibrahim, and m. a. maarof. “a review on biological inspired computation in cryptology.” journal teknologi maklumat, vol. 17, no. 1, pp. 90-98, 2005. [6] s. r. baragada, and s. reddy. “a survey of cryptanalytic works based on genetic algorithms.” international journal of emerging trends and technology in computer science (ijettcs), vol. 2, no. 5, pp. 18-22, sep. oct. 2013. [7] h. bhasin, and a. h. khan. “cryptanalysis using soft computing techniques.” journal of computer sciences and applications, vol. 3, no. 2, pp. 52-55, 2015. [8] department of the army. basic cryptanalysis: field manual no. 34-40-2, headquarters, washington, dc: department of the army, 1990. [9] f. a. stahl. “a homophonic cipher for computational cryptography.” afips ‘73 proceedings of the national computer conference and exposition, new york, pp. 565-568, 4-8 jun. 1973. [10] a. k. kendhe, and h. agrawal. “a survey report on various cryptanalysis techniques.” international journal of soft computing and engineering (ijsce), vol. 3, no. 2, may. 2013. [11] s. haykin. neural networks and learning machines, 3rd ed, upper saddle river, new jersey: pearson education, inc., 2009. [12] k. suzuki. artificial neural networks-methodological advances and biomedical applications, rijeka, croatia: intech, 2014. [13] s. davies. “learning in spiking neural networks.” ph.d. thesis, school of computer science, university of manchester, uk, 2012. [14] s. m. bohte, h. la poutré, and j. n. kok. “unsupervised clustering with spiking neurons by sparse temporal coding and multilayer rbf networks.” ieee transactions on neural networks, vol. 13, no. 2, pp. 426-435, mar. 2002. [15] m. fatahi, m. ahmadi, m. shahsavari, a. ahmadi, and p. devienne. “evt_mnist: a spike based version of traditional mnist.” the 1st international conference on new research achievements in electrical and computer engineering, 2016. [16] a. tavanaei, and a. s. maida. 
“a minimal spiking neural network to rapidly train and classify handwritten digits in binary and 10-digit tasks.” (ijarai) international journal of advanced research in artificial intelligence, vol. 4, no.7, pp. 1-8, 2015. [17] r. rojas. neural networks, berlin: springer-verlag, 1996. [18] s. m. bohtea, j. n. koka, and h. la poutre. “error-backpropagation in temporally encoded networks of spiking neurons.” neurocomputing, vol. 48, no. 1, pp. 17-37, 2002. [19] d. goldberg. genetic algorithms, new delhi: pearson education, 2006. [20] k. p. bergmann, r. scheidler, and c. jacob. “cryptanalysis using genetic algorithms.” genetic and evolutionary computation conference gecco’08, acm, atlanta, georgia, usa, pp. 10991100, 12-16 jul. 2008. [21] d. b. fogel. evolutionary computation: toward a new philosophy of machine intelligence, 3rd ed, new york: john wily & sons, inc., publication, 2006. [22] s. s. omran, a. s. al-khalid, and d. m. al-saady. “using genetic algorithm to break a mono-alphabetic substitution cipher.” ieee conference on open systems, malaysia, pp. 63-68, 5-7 dec. 2010. [23] b. rhew. “cryptanalyzing the playfair cipher using evolutionary algorithms.” 9 dec. 2003. available: http://www.citeseerx.ist.psu. edu/viewdoc/summary?doi=10.1.1.129.4325. [jul. 15, 2016]. [24] f. t. lin, and c. y. kao. “a genetic algorithm for ciphertext-only attack in cryptanalysis.” ieee international conference on systems, man and cybernetics, vol. 1, pp. 650-654, 1995. [25] k. p. bergmann. “cryptanalysis using nature-inspired optimization algorithms.” m.sc. thesis, department of computer science, the university of calgary, alberta, 2007. [26] r. a. muhajjar. “use of genetic algorithm in the cryptanalysis of sufyan t. al-janabi et al.: intelligent techniques in cryptanalysis: review and future directions 10 uhd journal of science and technology | april 2017 | vol 1 | issue 1 transposition ciphers.” basrah journal of scienec a, vol. 28, no.1, pp. 49-57, 2010. [27] k. n. haizel. “development of an automated cryptanalysis emulator (ace) for classical cryptogram.” m.sc. thesis, faculty of computer science, university of new brunswick, new brunswick, 1996. [28] p. maheshwari. “classification of ciphers.” master of technology thesis, department of computer science and engineering, indian institute of technology, kanpur, 2001. [29] m. nuhn, and k. knight. “cipher type detection.” information sciences institute, university of southern california, emnlp, 2014. available: https://www.semanticscholar.org/paper/ciphertype-detection-nuhn-knight/81e5e15afba9301558a7aaca1400b 69e0ddaa027#paperdetail. [jun. 10, 2016]. [30] k. pommerening. “polyalphabetic substitutions.” fachbereich physik, mathematik, informatik der johannes-gutenberg-universit at saarstraße, mainz, 25 aug. 2014. available: http://www.staff. uni-mainz.de/pommeren/cryptology/classic/2_polyalph/polyalph. pdf. [jul. 5, 2016]. [31] g. sivagurunathan, v. rajendran, and t. purusothaman. “classification of substitution ciphers using neural networks.” ijcsns international journal of computer science and network security, vol. 10, no. 3, pp. 274-279. mar. 2010. [32] s. al-janabi, and w. a. hussien. “architectural design of general cryptanalysis platform for pedagogical purposes, i-manager’s.” journal on software engineering, vol. 11, no. 1, pp. 1-12, jul. sep. 2016. [33] a. d. dileep, and c. c. sekhar. “identification of block ciphers using support vector machines.” international joint conference on neural networks, vancouver, bc, canada, pp. 2696-2701, 1621 jul. 
2006. [34] j.g. dunham, m. t. sun, and j. c. r. tseng. “classifying file type of stream ciphers in depth using neural networks.” the 3rd acs/ieee international conference on computer systems and applications, pp. 97, 2005. [35] s. o. sharif, l. i. kuncheva, and s. p. mansoor. “classifying encryption algorithms using pattern recognition techniques.” ieee international conference on information theory and information security (icitis), pp. 1196-1172, 17-19 dec. 2010. [36] c. tan, and q. ji. “an approach to identifying cryptographic algorithm from ciphertext.” 8th ieee international conference on communication software and networks, pp. 19-23, 2016. [37] w. a. r. de souza, and a. tomlinson. “a distinguishing attack with a neural network.” ieee 13th international conference on data mining workshops, pp. 154-161, 2013. tx_1~abs:at/tx_2:abs~at uhd journal of science and technology | july 2022 | vol 6 | issue 2 105 1. introduction the internet expands at an unprecedented rate. most of the time, malicious software is spread via the internet. malicious websites can be referred to as any website that has been designed to cause harm. it is similar to a legitimate url for regular users but hosts unsolicited content. the attacker usually builds a website identical to the target or embeds the exploit code of browser vulnerabilities on the webpage. then, it tricks the victim into clicking on these links to obtain the victim’s information or control the victim’s computer [1]. in many circumstances, people do not check the complete website url, and the attacker can obtain essential and personal information once they visit a malicious website [2]. malicious url detection always comes at the top in the research area. however, having protection against these attacks is not an option anymore. according to google’s transparency report, 2.195 million websites made their list of “sites deemed dangerous by safe browsing” category as of january 17, 2021. the vast majority of those (over 2.1 million) were phishing sites. only 27,000 of google’s removed websites were delisted because of malware [3]. several forms of a malicious url proceed with the attack and deliver unsolicited content, mainly named spam, phishing, and drive-by download. spam is a web page with many links to unwanted websites for other purposes; the malicious url detection using decision tree-based lexical features selection and multilayer perceptron model warmn faiq ahmed, noor ghazi m. jameel technical college of informatics, sulaimani polytechnic university, sulaimani 46001, kurdistan region, iraq a b s t r a c t network information security risks multiply and become more dangerous. hackers today generally target end-to-end technology and take advantage of human weaknesses. furthermore, hackers take advantage of technology weaknesses by applying various methods to attack. nowadays, one of the greatest dangers to the modern digital world is malicious urls, and stopping them is one of the biggest challenges in the field of cyber security. detecting harmful urls using machine learning and deep learning algorithms have been the subject of various academic papers. however, time and accuracy are the two biggest challenges of these tools. this paper proposes a multilayer perceptron (mlp) model that utilizes two significant aspects to make it more practical, lightweight, and fast: using only lexical features and a decision tree (dt) algorithm to select the best relevant subset of features. 
the effectiveness of the experimental outcomes is evaluated in terms of time, accuracy, and error reduction. the results show that a mlp model using 35 features could achieve an accuracy of 94.51% utilizing only url lexical features. furthermore, the model is improved in time after applying the dt as feature selection with a slight improvement in accuracy and loss. index terms: multilayer perceptron, lexical feature, feature selection, malicious url, synthetic minority oversampling technique corresponding author’s e-mail:  warmn.faiq.a@spu.edu.iq received: 20-08-2022 accepted: 01-10-2022 published: 13-11-2022 access this article online doi: 10.21928/uhdjst.v6n2y2022.pp105-116 e-issn: 2521-4217 p-issn: 2521-4209 copyright © 2022 ahmed and jameel. this is an open access article distributed under the creative commons attribution non-commercial no derivatives license 4.0 (cc by-nc-nd 4.0) o r i g i n a l r e s e a r c h a r t i c l e uhd journal of science and technology warmn faiq and noor ghazi: malicious url detection using dt-based lexical features selection and mlp model 106 uhd journal of science and technology | july 2022 | vol 6 | issue 2 pages may pretend to provide assistance or facts about a subject. phishing is a type of social engineering attack used to steal sensitive data. finally, drive-by downloads refer to the unintentional download of malicious code to the device, leaving it open to a cyber-attack [4]. there are currently several approaches to detect dangerous websites on the internet. nowadays, a malicious url is mainly detected by black and white list-based and machine learning-based url detection methods. according to the first technique, a website cannot be viewed until the url is checked against the blacklist database to ensure it is not on the list. blacklist is essentially a listing of urls that were previously identified as malicious. its advantage is that it is fast, easy, and has a meager false-positive (fp) rate. however, the main problem with this method is that it has a high false-negative (fn) rate and fails to detect newly generated urls [1], [5], [6]. nevertheless, it has been widely utilized in several major browsers, including mozilla firefox, safari, and chrome, among others, due to its simplicity and efficiency [5]. in addition, the blacklisting approach is also utilized by many antivirus systems and internet businesses. however, due to some limitations, the blacklisting strategy is insufficient to identify non-blacklisted threats [7]. whitelist is another aspect that provides security when accessing a website. it is similar to the blacklist method technique. the difference is that in the whitelist, only those websites are allowed to access that is in the list. the limitation of this method is denying access to many newly generated websites that are legal and safe to visit [5]. on the other hand, machine learning techniques use a collection of urls specified as a set of attributes and train a prediction model based on them to categorize a url as good or bad, enabling them to recognize new, possibly harmful urls [1]. in this paper, the multilayer perceptron (mlp) model is used to detect malicious urls based on the features of the urls. since a lightweight method is challenging for time efficiency, lexical features are utilized and extracted from the dataset to train the model. the model is tested first without and then with feature selection (fs) to see the result and the differences. 
the main contribution of this paper is the development of a malicious url detection system that utilizes only lexical features to construct a light model and selects only high-ranked features to reduce feature extraction (fe) time. moreover, using decision tree (dt) as a fs algorithm is an advantage to select the best relevant features based on features importance score to improve the model performance and decrease the fe time during the detection process. the paper is organized as follows. section 2 is related works. the proposed malicious url detection system with its phases including dataset collection, features extraction, features selection using dt algorithm, model development, and evaluation is presented in section 3. all the experimental results and discussions are provided in section 4. finally, section 5 illustrates the conclusion of the paper. 2. related works many kinds of research in the area of detecting malicious websites with various techniques, algorithms, and methods exist. the machine learning technique is one of the approaches used to solve the problem of malicious url detection. multiple studies have been done in the era. xuan et al. proposed support vector machine (svm) and random forest (rf) as machine learning algorithms to classify benign and malicious urls by extracting features and behaviors of the urls. the researchers created an extensive set of features to improve the model’s ability and use it as a free tool to detect malicious urls [8]. subha et al. tested various machine learning algorithms to detect malicious urls. according to the results, rf scored better than all svm, naïve base, and artificial neural network (ann) with an accuracy of 97.98 and the f1 score of 92.88 [9]. furthermore, islam et al. used three machine learning algorithms to detect malicious urls: nn, k-nearest neighbor (knn), dt, and rf. the results showed that the neural network (nn) scored the worst, whereas dt and rf achieved the best scores. the study mentioned that the lack of ability to detect malicious urls by nn is due to the small size of the dataset, while nn is suitable for large datasets [10]. besides, some of the researches used nns as a solution for classifying malicious urls from benign ones. liu and lee proposed a detection method using a convolutional neural network (cnn). the research adopted the end user’s perspective and used cnn to learn and recognize screenshot images of the websites. the results showed that although the training period is lengthy, it is tolerable, especially with powerful graphics processing units. the testing is efficient once the training is completed; therefore, time is often not an issue with this procedure [11]. balamurugan et al. proposed a nn to classify the websites as good and bad urls with optimizing network parameters using genetic algorithms. the article showed a good improvement when optimizers were applied to the nn model in both classification and convergence [12]. furthermore, chen et al. used cnn for malicious url detection. the study showed that the warmn faiq and noor ghazi: malicious url detection using dt-based lexical features selection and mlp model uhd journal of science and technology | july 2022 | vol 6 | issue 2 107 proposed method achieved satisfying detection accuracy with an accuracy of 81.18% [13]. moreover, hybrid systems are also proposed by some recent studies as a solution to the problem. naresh et al. 
proposed a machine learning-based system that combines a svm with logistic regression using a combination of url lexical options, payload size, and python supply options as features to recognize the malicious urls. as a result, an accuracy of 98% was achieved, which is an improvement compared to a conventional method. according to some recent articles, using nns as a hybrid system can achieve satisfying performance [14]. yang et al. proposed a system to detect malicious websites based on integrated cnns and rf system. the results showed that the proposed integrated system achieved better results than traditional machine learning algorithms due to their shallow design, which cannot examine the complicated link between safe and malicious urls [2]. another research is by das et al. who tested three nn algorithms, rnn, lstm, and cnn-lstm, to see the effectiveness of these algorithms in classifying benign and malicious urls. the results showed that with an accuracy of 93.59%, the cnn-lstm architecture exceeds the other two [15]. furthermore, peng et al. proposed attention-based cnn-lstm for malicious url detection. the results showed that the proposed method achieved better than shallow nns and single deep nns such as cnn and lstm individuals with an accuracy of 96.74 [16]. 3. the proposed malicious url detection system the proposed system is constructed using a lightweight method. only lexical features are utilized to build the model. python is used for programming the phases of the proposed system with famously fast and reliable libraries such as pandas, numpy, scikit-learn, imblearn, pyplot, tensorflow, and keras. the architecture of the proposed system starts with loading the dataset and then preprocessing stages to prepare the data for training. the training stage starts after the data are prepared. then the testing stage; the trained model classifies whether the url is malicious or benign. finally, evaluation metrics are applied to compute the performance of the model. the system architecture is shown in fig. 1. 3.1. dataset collection in this work, a proposed model was trained and tested on a dataset conducted from malicious and benign websites that were utilized to create the suggested model and evaluate its predictions [17]. the dataset initially consisted of 420,464 urls, 344,821 benign (good), and the rest of 75,643 websites are malicious (bad), as shown in table 1. therefore, the number of urls in each class is imbalance, as shown in fig. 2. a sample of the instances is shown in fig. 3. 3.2. data preprocessing 3.2.1. data cleaning one of the most critical preprocessing stages in machine learning is data cleaning. having clean, accurate noiseless data give precise models and results. starting with cleaning the data, 9216 duplicated urls were found and removed. the dataset was then checked for missing values, and there were no missing values in the dataset. 3.2.2. url lexical feature extraction several characteristics separate a safe url and its webpage from a malicious url. in certain instances, attackers employ direct ip linkages rather than domain names. another tactic use by attackers is short names or abbreviations for websites unrelated to legitimate brand names. algorithms for the detection method involve a wide variety of characteristics. to detect malicious websites using machine learning techniques, several distinct characteristics were retrieved from various academic research, such as lexical, host-based, and content-based features. 
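as a small illustration of what lexical extraction looks like in practice (assuming python's standard urllib parser; the handful of features below are examples and not the full 60-feature set of table 2):

```python
# computing a few url lexical features of the kind listed in table 2:
# character counts, length-based features, and simple boolean flags.
from urllib.parse import urlparse

def lexical_features(url):
    parsed = urlparse(url if "://" in url else "http://" + url)
    digits = sum(c.isdigit() for c in url)
    return {
        "url_length": len(url),
        "count_dots": url.count("."),
        "count_hyphen": url.count("-"),
        "has_at_symbol": int("@" in url),
        "count_digits": digits,
        "digit_ratio": digits / len(url),
        "https_in_url": int(parsed.scheme == "https"),
        "url_depth": len([p for p in parsed.path.split("/") if p]),
        "length_of_hostname": len(parsed.netloc),
    }

print(lexical_features("http://paypal-login.example.com/verify/account?id=123"))
```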
since lexical features are fast to extract, they are also more applicable due to facing some casual problems when using content-based and host-based features. most of the time, content-based features cannot be extracted from malicious urls since most are blacklisted and cannot be accessed to get the contents such as html, javascript, and visual features. besides, the security risks when accessing such websites need precautions such as using special sandbox services to reduce the risk. host-based fe also faces problems such as a very long time taking due to the vast number of online requests from the database servers such as whois that sometimes lead to another problem: closing sockets for some of the websites and not getting the required information. in this study, lexical features are utilized to recognize malicious websites and distinguish them from legitimate ones. these characteristics are derived from the url address’s elements like a string. it should be able to identify malicious urls because it bases its decision on how the url appears. by replicating the names and making minor modifications, many attackers may make dangerous urls seem normal. however, from the perspective of machine learning, it is not feasible to take the actual name of the url. instead, the url’s string must be handled to obtain valuable properties. sixty lexical warmn faiq and noor ghazi: malicious url detection using dt-based lexical features selection and mlp model 108 uhd journal of science and technology | july 2022 | vol 6 | issue 2 features were collected from literature, then extracted from the web links as listed in table 2. 3.2.4. feature scaling feature scaling or normalization is often advised and sometimes crucial. normalization is vital for nns since unnormalized inputs to activation functions might cause trapping in a relatively flat domain region. feature scaling helps optimize nn algorithms by accelerating training and preventing optimization from being trapped in local optima. models of nns establish a mapping between input and output variables. as a result, each variable’s size and distribution of the data extracted from the domain may change. input variables can have distinct scales because of fig. 2. dataset class distribution. data cleaning url lexical feature extraction feature selection using decision tree algorithm data collection data sampling feature scaling data preprocessing mlp model development (training phase) training data testing data trained model (classifier) malicious benign fig. 1. the proposed system architecture. table 1: dataset description type no. of urls benign 344,821 malicious 75,643 total urls 420,464 warmn faiq and noor ghazi: malicious url detection using dt-based lexical features selection and mlp model uhd journal of science and technology | july 2022 | vol 6 | issue 2 109 fig. 3. sample of the dataset instances. table 2: list of url lexical features feature no. 
feature names data type description references f0 count dots integer number of character “.” in url [7], [8], [18]-[21] f1 url depth integer the depth of the url [8] f2 url length integer the length of the url [7], [8], [14], [16], [18]-[20], [22]-[26] f3 hyphen integer number of the dash character “-” (hyphen) [8], [20], [22], [23] f4 at symbol boolean there exists a character “@” in url [8], [22], [23], [27] f5 tide symbol boolean there exists a character “~” in url [8] f6 numunderscore integer number of the underscore character [8], [22] f7 numpercent integer number of the character “%” [8], [20] f8 numampersand integer number of the character “&” [8], [20], [22] f9 numhash integer number of the character “#” [8], [22] f10 countquestionmark integer count the number of “?” in url [20] f11 countsemicolon integer count the number of “;” in url [22] f12 httpsinurl boolean check if there exists a https in website url [8], [19], [22], [28] f13 ipaddress boolean check if the ip address is used in the hostname of the website url [7], [8], [16], [22], [23], [25] f14 urlredirection boolean there exists a slash “//” in the link path [8], [19], [22], [23], [27] f15 count alpha integer number of the alphabetic character [20], [22] f16 alpha ratio floating point the proportion of alphabetic characters in the url to the total length of the url [22] f17 count digit integer number of the numeric character [8], [20], [22], [29] f18 digit ratio floating point the proportion of numeric characters in the url to the total length of the url [22] f19 count special chars integer number of any special characters like”,' %”,”$”,”,’ =”, etc. [4], [7], [8], [14], [16], [18], [19], [22], [24]-[26] f20 special chars ratio floating point the proportion of special characters in the url to the total length of the url [16], [22] f21 count lowercase integer the number of lowercase english letters in the url [16], [22] f22 lowercase ratio floating point the proportion of lowercase english letters in the url to the total length of the url [16], [22] f23 count uppercase integer the number of uppercase english letters in the url [16], [22] f24 uppercase ratio floating point the proportion of uppercase english letters in the url to the total length of the url [16], [22] f25 count_subdomain integer number of subdomains in the url [8], [18] f26 short url boolean using tiny url/short url service [14], [23], [25] f27 length_of_ hostname integer length of hostname [8], [19] (contd...) warmn faiq and noor ghazi: malicious url detection using dt-based lexical features selection and mlp model 110 uhd journal of science and technology | july 2022 | vol 6 | issue 2 table 2: (continued) feature no. feature names data type description references f28 length_of_path integer length of the link path [8], [19], [20] f29 length_of_query integer length of the query [8], [20] f30 length_of_scheme integer length of the url scheme [20] f31 presence_sus_ file_ext boolean checking the url string for the presence of the following file extensions·exe,·scr,·vbs,·js,.xml, .docm,·xps, .iso, .img, doc, .rtf,·xls, pdf, .pub, .arj, .lzh, .r01, .r14, .r18, .r25, .tar, .ace, .zip, .jar, .bat, .cmd, .moz, .vb, .vbs, .js, .wsc, .wsh, .ps1, .ps1×ml, .ps2, .ps2×ml, .psc1 and .psc2. [25] f32 count_ar_num integer the number of arabic numerals in the url [16] f33 is_tld_in_top5 boolean whether the top-level domain is the top five domains (com, cn, net, org, cc) [16] f34 paypal_in_path boolean if “paypal” is contained in the path section. 
[30] f35 ali_in_path boolean if “ali” is contained in the path section. [30] f36 jd_in_path boolean if “jd” is contained in the path section. [30] f37 safety_in_path boolean if “safety” is contained in the path section. [30] f38 verify_in_path boolean if “verify” is contained in the path section. [30] f39 google_in_path boolean if “google” is contained in the path section. [30] f40 apple_in_path boolean if “apple” is contained in the path section. if_facebook_u [30] f41 facebook_in_path boolean if “facebook” is contained in the path section. [30] f42 amazon_in_path boolean if “amazon” is contained in the path section. [30] f43 porn_in_path boolean if “porn”-related words are contained in the path section. [30] f44 gamble_in_path boolean if “gamble” related words are contained in the path section. [30] f45 paypal_in_domain boolean if “paypal” is contained in the domain section. [30] f46 ali_in_domain boolean if “ali” is contained in the domain section. [30] f47 jd_in_domain boolean if “jd” is contained in the domain section. [30] f48 safety_in_domain boolean if “safety” is contained in the domain section. [30] f49 verify_in_domain boolean if “verify” is contained in the domain section. [30] f50 google_in_domain boolean if “google” is contained in the domain section. [30] f51 apple_in_domain boolean if “apple” is contained in the domain section. [30] f52 facebook_in_ domain boolean if “facebook” is contained in the domain section. [30] f53 amazon_in_domain boolean if “amazon” is contained in the domain section. [30] f54 porn_in_domain boolean if “porn” related words are contained in the domain section. [30] f55 gamble_in_domain boolean if “gamble” related words are contained in the domain section. [30] f56 has keyword “client” boolean if the word “client” is contained in the url [31] f57 has keyword “admin” boolean if the word “admin” is contained in the url [31] f58 has keyword “server” boolean if the word “server” is contained in the url [31] f59 has keyword “login” boolean if the word “login” is contained in the url [31] their varied. the difficulty of the problem being modeled could be exacerbated by differences in the scales across the input variables. a model may learn tremendous weight values due to large input values, such as a spread of thousands of units, makes the result to be biased toward the bigger units. when features are of comparable size and nearly normally distributed, several machine learning methods work better or converge more quickly. min-max algorithm is used to scale all the features between 0 and 1. equation (1) uses for minmax feature scaling which helps the model to understand and learn better and faster without biasing to the more significant values [20]. x x x x xscaled min max min � � � � (1) where, xmax and xmin are the maximum and the minimum values of the feature (x), respectively. 3.2.5. data sampling initial examination of the dataset revealed that there were 5.18 times fewer occurrences of harmful websites than benign ones. therefore, due to the stark disparity in the number of malicious and benign website instances, the model affect to be biased due to this significant class imbalance warmn faiq and noor ghazi: malicious url detection using dt-based lexical features selection and mlp model uhd journal of science and technology | july 2022 | vol 6 | issue 2 111 as it learns from a far higher percentage of benign website occurrences. a balanced class dataset is necessary for classification issues. 
as most machine learning algorithms used for classification were developed based on the presumption that there are an equal number of instances of each class, the imbalance of types in classification presents problems for predictive modeling. therefore, a balanced classification dataset is also necessary for a classification model to produce accurate judgments. there are several ways to handle an imbalanced dataset. the synthetic minority oversampling technique (smote) was utilized to address this issue. the smote technique uses knn machine learning algorithm to produce new instances. using it, additional instances of the minority class have been created, matching the proportion of instances of each class to the majority class to balance the classes. to balance the dataset, the minority class must thus be oversampled unless both groups have almost an equal number of cases. after balancing, the minority class were oversampled, which caused the data size to grow. finally, the 344,800 occurrences of each class result in a balanced distribution, as shown in fig. 4. 3.2.6. feature selection using dt algorithm the quality of fs and importance is one of the crucial differentiators in every machine learning task. due to computational limitations and the need to remove noisy variables for more accurate prediction, fs becomes necessary when there is a large amount of data that the model may process. in this study, a dt algorithm is used to select the best and most relevant lexical features based on the feature importance score. dts apply various techniques to decide whether to divide a node into two or more sub-nodes. the homogeneity of newly formed sub-nodes is increased by sub-node formation. the threshold value of an attribute is used to divide the nodes in the dt into sub-nodes. the classification and regression tree algorithm uses the gini index criteria to find the sub-nodes with the best homogeneity. the dt divides the nodes based on all factors that are accessible before choosing the split that produces the most homogenous sub-nodes. at the same time, the target variables are considered while selecting an algorithm. it is a visual depiction of every option for making a choice based on specific criteria according to the algorithm. conditions on any characteristics are used to make judgments in both situations. the leaf nodes reflect the selection based on the conditions, whereas the inside nodes represent the conditions. finding the attribute that provides the most information is necessary for dt construction. by building the tree in this way, feature importance scores can be accessed and used to help interpret the data, ranking, and select features that are most useful to a predictive model. it aids in determining which variable is chosen to be used in producing the decisive internal node at a specific point. the steps of fs using a dt are described in an (algorithm 1). at this phase, the list of features with their importance values is calculated and selected by the dt algorithm. algorithm 1. classification and regression tree [32]. 3.3. mlp model the most practical variety of nns is mlp which is frequently used to refer to the area of anns. a perceptron is a singleneuron model that serves as the basis for more extensive nns. artificial neurons are the basic units of nns. the feed-fig. 4. dataset after data sampling using smote. 
3.3. mlp model the most practical variety of nns is the mlp, a term frequently used to refer to the field of anns in general. a perceptron is a single-neuron model that serves as the basis for more extensive nns; artificial neurons are the basic units of nns, and the mlp is a type of feed-forward nn. there are three kinds of layers: the input layer, the hidden layers, and the output layer. the proposed mlp model consists of three hidden layers besides the input and output layers. the first hidden layer has 400 neurons, the second hidden layer has 300 neurons, and the last hidden layer has 200 neurons. the output layer has one neuron, as the task is a binary classification with two outputs, 1 and 0, where 1 represents a malicious url and 0 represents a benign one. the other parameters are set as a batch size of 200, a learning rate of 0.005, a sigmoid function as the activation function, and adam as the optimizer, as shown in table 3. 3.4. model evaluation the goal is not just to create a predictive model; it involves building and choosing a model that performs well on out-of-sample data. therefore, verifying the model's correctness is essential before computing estimated values, and many indicators are considered to assess the models. a crucial phase in the machine learning pipeline is evaluating the learned model's effectiveness, since models generalize to new input with varying success. when an ml model is applied to new data without being adequately evaluated using a variety of metrics rather than accuracy alone, it may produce inaccurate predictions. therefore, accuracy, precision, recall, and f1 score have all been taken into account to judge the model's reliability and to consider the errors the model makes when classifying urls as malicious or benign. classification accuracy, perhaps the most straightforward criterion to use and apply, is the ratio of correct predictions to the total number of predictions made and is calculated using equation (2) [33]. accuracy = number of correct predictions / total number of predictions made (2). the confusion matrix summarizes the overall effectiveness of the model; for binary classification, which is the case in this work, it is a two-by-two matrix. the confusion matrix shows the numbers of correct and incorrect classifications for the actual and predicted values: true positive (tp) indicates the number of samples correctly classified as positive, true negative (tn) indicates the number of instances correctly identified as negative, false positive (fp) indicates the number of samples incorrectly identified as positive, and false negative (fn) indicates the number of instances incorrectly identified as negative. the confusion matrix for binary classification is shown in table 4. from the confusion matrix, some important metrics are calculated and taken into consideration along with the accuracy to ensure that the model performs well and is not biased because of issues such as dataset imbalance. therefore, precision, recall, and f1 score are used as model evaluation metrics. precision indicates how accurate the positive predictions are, recall is the coverage of actual positive samples, and the f1 score is the harmonic mean of precision and recall; they are calculated using equations (3), (4), and (5), respectively [22], [29], [34].
precision truepositives truepositives falsepositives � � � � � (3) recall truepositives truepositives falsenegatives � � � � � (4) f score precision recall precision recall 1 2� � � � � � � � � (5) table 3: the parameters of the proposed mlp model layer no. no. of neurons/dim optimizer activation function learning rate batch size no. of epochs layer 1 400 adam sigmoid 0.005 200 1500 layer 2 300 sigmoid layer 3 200 sigmoid table 4: confusion matrix actual values predicted values negative positive negative tn fp positive fn tp tp: true positive, tn: true negative, fp: false positive, fn: false negative table 5: list of used hardware and software specifications hardware and software specification description pc core i3 gen6 ram 20 gb storage ssd sata 256 gb operation system windows 10 pro warmn faiq and noor ghazi: malicious url detection using dt-based lexical features selection and mlp model uhd journal of science and technology | july 2022 | vol 6 | issue 2 113 4. experimental results and discussion in this section, the details of the experimental results are presented. the experiments are implemented on a malicious url dataset [19] aiming to find the set of relevant url lexical features based on their importance score using dt algorithm and evaluating the mlp model performance using the selected features. the final prepared dataset after the main steps of data preprocessing which includes data cleaning, data sampling, and fe, consists of a total of 689,600 urls with 60 lexical features and a class label that has a 0 for benign and 1 for malicious. the software and hardware specifications used for the experiments are explained in table 5. after running the dt algorithm for fs, the importance score or weight for each variable was calculated. features with lowest importance scores were deleted and features with highest scores were kept. this type of fs can simplify the problem that is being modeled, speed up the modeling process, and improve the performance of the model. the list of all lexical features’ importance scores is illustrated in table 6. after this phase, 35 features were selected and 25 features were eliminated. the selected features are the top 35 features with highest importance values which are f0, f1, f2, f3, f4, f5, f6, f7, f8, f10, f11, f15, f16, f17, f18, f19, f20, f21, f22, f23, f24, f25, f26, f27, f28, f29, f31, f33, f34, f35, f39, f41, f57, f58, and f59. as a result of eliminating 25 features, a significant decrease in fe time achieved, which is an essential factor in this problem situation, as shown in table 7 and fig. 5. for mlp model evaluation, the 35 selected features were fed to the model as input. the stratified technique was used for splitting the dataset into train and test sets to preserve the same proportions of instances in each class as in the original dataset. it is obvious that most of the data in the dataset are advised to be used for training to let the model learn well. different ratios for training and testing have been used by the researchers such as 80% for training and the other 20% for testing or 70% for training by 30% for testing. many factors are taken into consideration when train test split is done, such as the number of instances in the dataset, hyperparameters table 6: list of features with their importance score feature no. feature importance feature no. feature importance feature no. feature importance feature no. 
feature importance f0 0.11828 f16 0.05732 f32 0 f48 0 f1 0.07532 f17 0.04211 f33 0.07169 f49 0 f2 0.03691 f18 0.04414 f34 0.00132 f50 0 f3 0.01727 f19 0.01206 f35 0.00158 f51 0 f4 0.00161 f20 0.13231 f36 0.00022 f52 0 f5 0.00185 f21 0.02187 f37 0.00009 f53 0 f6 0.01472 f22 0.02058 f38 0.00041 f54 0 f7 0.00227 f23 0.00755 f39 0.00241 f55 0 f8 0.0018 f24 0.01264 f40 0.00031 f56 0.00053 f9 0.00009 f25 0.02609 f41 0.00228 f57 0.01168 f10 0.00874 f26 0.01038 f42 0.00009 f58 0.00089 f11 0.00997 f27 0.1204 f43 0.00017 f59 0.02466 f12 0.00017 f28 0.05412 f44 0 f13 0.00007 f29 0.00539 f45 0 f14 0.00058 f30 0.00068 f46 0 f15 0.01788 f31 0.0065 f47 0 fig. 5. fe time differences before and after fs. table 7: feature extraction time before and after feature selection no. of features feature extraction time in seconds 60 features, the whole dataset (before fs) 134 s 35 features, whole dataset (after fs) 92 s warmn faiq and noor ghazi: malicious url detection using dt-based lexical features selection and mlp model 114 uhd journal of science and technology | july 2022 | vol 6 | issue 2 has been tested using a learning rate of 0.005, batch size of 200, and different number of epochs and neurons. the list of scenarios is described in table 8. after executing all the 10 scenarios described in table 8, from the results shown in table 9, it is obvious that with increasing the number of epochs, the accuracy will increase along with training time, and the training loss will decrease eventually. in this system, the more important parameters for detecting malicious urls are higher values for test accuracy, precision, and recall with lower training loss. the least important parameter is the training time. training phase is a one-time process, sometimes it requires a long time to develop a welltrained model with high accuracy and less training loss. since the last scenario, 1500 epochs outperformed the best scores for the mentioned parameters, it has been chosen to train the model and used for malicious url detection. as a result, table 8: list of tested scenarios scenario no. of epochs no. of features batch size learning rate no. of neurons in hidden layers s1 100 35 200 0.005 200, 120, 80 s2 100 35 200 0.005 400, 200, 100 s3 100 35 200 0.005 400, 300, 200 s4 100 35 200 0.005 600, 400, 200 s5 100 35 200 0.005 800, 600, 400 s6 500 35 200 0.005 400, 300, 200 s7 500 35 200 0.005 600, 400, 200 s8 500 35 200 0.005 800, 600, 400 s9 1000 35 200 0.005 400, 300, 200 s10 1500 35 200 0.005 400, 300, 200 table 9: results of all the 10 scenarios scenarios train time in seconds test time in seconds train loss train accuracy (%) test accuracy (%) precision recall fscore confusion matrix s1 933.4 15.0 0.145 93.90 92.82 0.923 0.935 0.929 ([95321 8119] [6735 96705]) s2 2258.5 28.7 0.142 94.00 92.95 0.919 0.943 0.930 ([94797 8643] [5938 97502]) s3 2553.3 17.8 0.123 94.79 93.45 0.927 0.944 0.935 ([95733 7707] [5840 97600]) s4 2847.4 23.3 0.122 94.86 93.51 0.927 0.944 0.936 ([95807 7633] [5798 97642]) s5 6984.2 31.8 0.125 94.74 93.51 0.935 0.936 0.935 ([96659 6781] [6636 96804]) s6 10487.2 18.3 0.091 96.21 94.18 0.937 0.948 0.942 ([96822 6618] [5415 98025]) s7 17460.9 25.3 0.098 96.00 94.08 0.939 0.943 0.941 ([97118 6322] [5918 97522]) s8 27800.3 37.7 0.095 96.09 94.15 0.937 0.946 0.942 ([96877 6563] [5546 97894]) s9 22684.7 19.7 0.086 96.49 94.25 0.938 0.947 0.943 ([97010 6430] [5460 97980]) s10 62791.6 30.3 0.075 96.93 94.51 0.941 0.950 0.945 ([97233 6207] [5146 98294]) fig. 6. train accuracy for the 10 different scenarios. 
5. conclusion

one of the serious threats on the internet is the malicious url. hackers have several techniques and algorithms for obfuscating urls to bypass defenses. the problem of detecting malicious urls has been studied in this research, with an explanation of the types of possible attacks, the features, and the detection techniques. the study developed a lightweight malicious url detection model using url lexical features only, instead of content- or host-based features. content- and host-based features take a long time to extract: to extract content-based features, the websites must be available so that their source code can be accessed, and the host-based feature extraction process needs a connection to special servers such as whois to obtain the required information. dt has been used to obtain the importance scores of all lexical features and select the best features, in order to build a malicious url detection system with better performance and efficiency. the study shows that using only the relevant lexical features, which is more practical to apply, is enough to create a robust lightweight detection model using the mlp algorithm. experimental results have been shown and discussed to explain the differences before and after applying each technique.

references

[1] j. yuan, g. chen, s. tian and x. pei. "malicious url detection based on a parallel neural joint model," ieee access, vol. 9, pp. 9464-9472, 2021.
[2] r. yang, k. zheng, b. wu, c. wu and x. wang. "phishing website detection based on deep convolutional neural network and random forest ensemble learning," sensors, vol. 21, no. 24, pp. 8281, 2021.
[3] s. cook. "malware statistics in 2022: frequency, impact, cost and more," 2022. available from: https://www.comparitech.com/antivirus/malware-statistics-facts [last accessed on 2022 aug 18].
[4] s. kumi, c. lim and s. g. lee. "malicious url detection based on associative classification." entropy, vol. 23, no. 2, pp. 1-12, 2021.
[5] w. bo, z. b. fang, l. x. wei, z. f. cheng and z. x. hua. "malicious urls detection based on a novel optimization algorithm." ieice transactions on information and systems, vol. e104.d, no. 4, pp. 513-516, 2021.
[6] z. chen, y. liu, c. chen, m. lu and x. zhang. "malicious url detection based on improved multilayer recurrent convolutional neural network model." security and communication networks, vol. 2021, pp. 9994127, 2021.
[7] s. m. nair. "detecting malicious url using machine learning: a survey." international journal for research in applied science and engineering technology, vol. 8, no. 5, pp. 2670-2677, 2020.
[8] c. do xuan, h. dinh nguyen and t. victor nikolaevich. "malicious url detection based on machine learning." international journal of advanced computer science and applications, vol. 11, pp. 148-153, 2020.
[9] v. subha, m. s. pretha and r. manimegalai. "malicious url classification using data mining techniques." journal of analysis and computation (jac), pp. 148-153, 2018.
[10] m. maminur islam, s. poudyal and k. datta gupta. "map reduce implementation for malicious websites classification." international journal of network security and its applications, vol. 11, no. 5, pp. 27-35, 2019.
[11] d. liu and j. h. lee. "cnn based malicious website detection by invalidating multiple web spams." ieee access, vol. 8, pp. 97258-97266, 2020.
[12] p. balamurugan, t. amudha, j. satheeshkumar and m. somam. "optimizing neural network parameters for effective classification of benign and malicious websites." journal of physics conference series, vol. 1998, no. 1, 2021.
[13] y. chen, y. zhou, q. dong and q. li. "a malicious url detection method based on cnn." in: 2020 ieee conference on telecommunications, optics and computer science, tocs 2020. ieee, piscataway, 2020, pp. 23-28.
[14] n. khan, r. naresh, a. gupta and s. giri. "malicious url detection system using combined svm and logistic regression model." international journal of advanced research in science, engineering and technology, vol. 11, no. 4, pp. 63-73, 2020.
[15] a. das, a. das, a. datta, s. si and s. barman. "deep approaches on malicious url classification." in: 2020 11th international conference on computer networks and communication technologies, icccnt 2020, ieee, piscataway, 2020.
[16] y. peng, s. tian, l. yu, y. lv and r. wang. "malicious url recognition and detection using attention-based cnn-lstm." ksii transactions on internet and information systems, vol. 13, no. 11, pp. 5580-5593, 2019.
[17] adamyong. "github-adamyong-zbf/url_detection: data set." 2020. available from: https://github.com/adamyong-zbf/url_detection [last accessed on 2022 aug 18].
[18] l. m. camarinha-matos, n. farhadi, f. lopes and h. pereira, editors. technological innovation for life improvement, vol. 577. springer international publishing, cham, 2020.
[19] s. singhal, u. chawla and r. shorey. "machine learning concept drift based approach for malicious website detection." in: 2020 international conference on communication systems networks, comsnets 2020, ieee, piscataway, pp. 582-585, 2020.
[20] s. maheshwari, b. janet and r. j. a. kumar. "malicious url detection: a comparative study." in: proceedings international conference on artificial intelligence and smart systems, icais 2021. ieee, piscataway, pp. 1147-1151, 2021.
[21] y. peng, s. tian, l. yu, y. lv and r. wang. "a joint approach to detect malicious url based on attention mechanism." international journal of computational intelligence and applications, vol. 18, no. 3, 2019.
[22] a. s. raja, r. vinodini and a. kavitha. "lexical features based malicious url detection using machine learning techniques." materials today proceedings, vol. 47, pp. 163-166, 2021.
[23] s. d. vara prasad and k. r. rao. "a novel framework for malicious url detection using hybrid model." turkish journal of computer and mathematics education, vol. 12, pp. 2542, 2021.
[24] s. ahmad and a. tamimi. "detecting malicious websites using machine learning," m.s. thesis, department of graduate programs & research, rochester institute of technology, rit dubai, april 2020. [online]. available from: https://scholarworks.rit.edu/theses
[25] t. manyumwa, p. f. chapita, h. wu and s. ji.
"towards fighting cybercrime: malicious url attack type detection using multiclass classification." in: proceedings 2020 ieee international conference on big data, big data 2020, ieee, piscataway, pp. 1813-1822, 2020.
[26] f. alkhudair, m. alassaf, r. ullah khan and s. alfarraj. "detecting malicious url." ieee, piscataway, 2020.
[27] r. r. rout, g. lingam and d. v. l. somayajulu. "detection of malicious social bots using learning automata with url features in twitter network." ieee transactions on computational social systems, vol. 7, no. 4, pp. 1004-1018, 2020.
[28] y. c. chen, y. w. ma and j. l. chen. "intelligent malicious url detection with feature analysis." in: proceedings second ieee symposium on computer and communications. vol. 2020. ieee, piscataway, 2020.
[29] s. he, j. xin, h. peng and e. zhang. "research on malicious url detection based on feature contribution tendency." in: 2021 ieee 6th international conference on cloud computing and big data analytics, icccbda 2021, pp. 576-581, 2021.
[30] t. li, g. kou and y. peng. "improving malicious urls detection via feature engineering: linear and nonlinear space transformation methods." information systems, vol. 91, pp. 101494, 2020.
[31] r. ikwu. in: r. e. ikwu, editor. "extracting feature vectors from url strings for malicious url detection." towards data science, canada, 2021. available from: https://towardsdatascience.com/extracting-feature-vectors-from-url-strings-for-malicious-urldetection-cbafc24737a [last accessed on 2022 aug 16].
[32] g. s. kori and d. m. s. kakkasageri. "classification and regression tree (cart) based resource allocation scheme for wireless sensor networks." social science research network, rochester, ny, 2022.
[33] n. hosseini, f. fakhar, b. kiani and s. eslami. "enhancing the security of patients' portals and websites by detecting malicious web crawlers using machine learning techniques." international journal of medical informatics, vol. 132, pp. 103976, 2019.
[34] m. chatterjee and a. s. namin. "deep reinforcement learning for detecting malicious websites." computer science, vol. 15, pp. 55, 2019.

a review of computer vision-based traffic controlling and monitoring

kamaran h. manguri1,2, aree a. mohammed3

1department of technical information systems engineering, erbil technical engineering college, erbil polytechnic university, erbil, iraq, 2department of computer science, college of basic education, university of raparin, ranya 46012, iraq, 3department of computer science, college of science, university of sulaimani, sulaymaniyah, iraq

abstract

due to the rapid increase of the world's population, traffic signal controlling and monitoring has become an important issue to be solved, given the direct relation between the size of the population and car usage. in this regard, intelligent traffic signaling under rapid urbanization is required to prevent traffic congestion, reduce cost, minimize travel time, and cut co2 emissions to the atmosphere. this paper provides a comprehensive review of computer vision techniques for autonomic traffic control and monitoring. moreover, recently published articles in four related topics, including density estimation, traffic sign detection and recognition, accident detection, and emergency vehicle detection, are investigated. the conducted survey shows that there is no fair comparison and performance evaluation due to the large number of parameters involved in the abovementioned four topics that affect a traffic signal controlling system (such as computation time, dataset availability, and accuracy).

index terms: traffic signaling system, intelligent traffic, computer vision, traffic congestion, traffic monitoring, review

corresponding author's e-mail: kamaran@uor.edu.krd
received: 20-12-2022 accepted: 26-04-2023 published: 10-08-2023
access this article online
doi: 10.21928/uhdjst.v7n2y2023.pp6-15 e-issn: 2521-4217 p-issn: 2521-4209
copyright © 2023 kamaran h. manguri and aree a. mohammed. this is an open access article distributed under the creative commons attribution noncommercial no derivatives license 4.0 (cc by-nc-nd 4.0)
review article
uhd journal of science and technology

1. introduction

the number of cars has increased significantly nowadays [1], [2]; consequently, the traffic congestion problem has arisen around the world [3]. subsequently, vehicle collisions and crashes and the dramatic increase of co2 emissions per year [4] are threatening the sustainable mobility of the future [5]. furthermore, traffic control needs manpower [6]. traffic control devices are time dependent and designed to flow the traffic in all directions. on top of that, turning the lights from green to red sometimes causes a traffic deadlock in one direction without there being a noticeable flow in the other direction [7]. congestion caused by traffic signals can negatively impact the economy in terms of transportation, due to fuel [8], time expenditure [9], and air pollution [10]; moreover, injuries and sometimes deaths are caused by accidents that happen in deadlocked traffic [8]. on the other hand, reducing congestion may have economic, environmental, and social benefits. in general, to make the optimization problem manageable, several assumptions have to be made. the main problem that arises is that these assumptions deviate, sometimes significantly, from the real world.
meanwhile, many factors affect drivers in real-world traffic, such as the driver's preferences, interactions with vulnerable road users (e.g., pedestrians, cyclists, etc.), and weather and road conditions [11]. on the other hand, computer vision has an important role in managing and controlling traffic signals with great success [6], [12]. the best way to control traffic flow in big and busy cities is to utilize intelligent traffic signals [6]; such a system has the ability to approximately evaluate density estimation, traffic sign detection and recognition, emergency and police car detection, and accident detection, even though a better infrastructure can also improve the traffic flow [13]. usually, in quiet intersections, the traffic is controlled by humans or by system controls [6]. in most congested areas, cameras have been installed for purposes other than traffic control, such as security, vehicle detection, and arrangement [14]. these cameras can be utilized for analyzing traffic scenes simply by employing specific hardware; the main advantage is that there is no need to replace the cctvs. the main objective of this survey is to fill the research gap that exists in the field of traffic signal controlling and monitoring.
the importance of this survey is to propose some technique based on computer vision for reducing the road congestion and keeping the environment green and public health. in this study, different approaches based on computer vision for traffic signaling controls are reviewed. for this purpose, the literatures over the period january 2015–january 2022 are surveyed. the structure of this review is as follows: section i provides an introduction to the traffic and its problems. background of traffic management addressed in section ii. in section iii, a literature review is provided for the existing solution of the intelligent traffic signaling. section iv provides a discussion of review of the existing solutions. finally, conclusion remarks are presented in last section. 2. review strategy this review is aimed in analyzing the recent literature for the vision-based methods for traffic controlling and managing, which have been published from january 2015 to january 2022 in terms of journal papers and conference proceedings. the reviewed papers were chosen after an extensive manual search of databases including ieee xplore, springer, elsevier, and google scholar. keywords used to explore the databases are shown in table 1. in addition, vision or image processing keywords are selected as the main keywords in the title of papers. moreover, one of the sub search keywords has been used with main keyword to find the studies in the above mentioned period. 3. urban traffic light system usually, each traffic light contains three color lights precisely, green, yellow, and red lights. they are put in the four parallel and perpendicular directions [15]. fig. 1 shows a common intersection that formed by two perpendicular and parallel lanes. globally, the meaning of the lights for the drivers is as follows, green light means that the current lane has right to move forward meanwhile all other three directions are red which means they are not allowed to flow [11]. besides, models of controlling traffic signaling and monitoring using computer vision required cctv camera to acquisition images from the live traffic intersection. the simulation of traffic controlling in the cross road is shown in fig. 2 [16]. 4. literature review to improve traffic signaling control and monitoring, scientists and researchers proposed many methods based on machine vision. computer vision-based architecture of traffic signaling controlling and monitoring includes image acquisition, preprocessing and applying advance computer vision techniques density estimation, traffic sign detection and recognition, accident detection, and emergency vehicle detection. in this review, papers are randomly selected according to proposed methods in the recent years (between january 2015 and january 2022) for controlling and monitoring traffic signals. 4.1. density estimation density estimation is a key aspect for automatic traffic signaling control and reducing congestion in the intersection areas. 
different approaches proposed to estimate traffic density are detailed below.

table 1: search parameters of the literature review
date range | database | main keywords (or) | and/sub search keywords
january 2015-january 2022 | ieee xplore, springer, elsevier, google scholar | vision, image processing | machine learning, deep learning, traffic controlling, traffic density, traffic congestion, crowd detection, accident, accident detection, accident identification, emergency vehicle, traffic sign

garg et al. [17] presented a vision-based approach for estimating traffic density, which forms the fundamental building block of traffic monitoring systems. existing vehicle counting and tracking techniques suffer from low accuracy owing to their sensitivity to light changes, occlusions, congestion, etc. moreover, the authors addressed another problem of existing holistic methods, namely the difficulty of real-time implementation because of their high computational complexity. to handle this issue, density is calculated using a block-processing approach for busy road segments. the proposed method involves two steps: the first step consists of marking the region of interest (roi), generating blocks of interest, and constructing the background; a recurring process is applied in the second step, which involves background update, occupied block detection, shadow block elimination, and traffic density estimation. finally, the proposed method is evaluated and tested using the trafficdb dataset.

in biswas et al. [1], density is estimated by counting cars; the background subtraction (bs) method and the overfeat framework are implemented. the accuracy of the proposed system is evaluated against manual counting of cars. furthermore, a comparative study was conducted before and after applying the overfeat framework: the average accuracy reached 96.55% with the overfeat framework, compared with 67.69% average accuracy for placemeter and 63.14% average accuracy for bs, respectively. this study also confirmed that the overfeat framework has other application areas. six individually obtained traffic videos were used to analyze the advantages and shortcomings of bs and the overfeat framework with regard to different perspectives such as camera angles, weather conditions, and time of day.

biswas et al. [3] implemented single shot detection (ssd) and mobilenet-ssd for estimating traffic density. for this purpose, 59 individual traffic cameras were used to analyze the advantages and shortcomings of the ssd and mobilenet-ssd frameworks. moreover, the two algorithms were compared with manually estimated density. the ssd framework demonstrates significant potential in the field of traffic density estimation; according to their experiments, significant detection accuracy was achieved, with precisions of 92.97% and 79.30% for ssd and mobilenet-ssd, respectively.

bui et al. [18] developed a method for analyzing traffic flow in which advanced computer vision technologies are used to extract traffic information. to find the traffic density at intersections, data are acquired from video surveillance. moreover, the yolo and deepsort techniques, tuned for the detection, tracking, and counting of vehicles, are employed to estimate the road traffic density. to evaluate the proposed method, data were collected from real-world traffic through cctv during one day.
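several of the reviewed density estimators (e.g., [1], [20]) start from background subtraction over a fixed camera view and then count the resulting vehicle blobs or measure how much of the road region they occupy. the following python/opencv snippet is only a minimal illustration of that idea, not code from any of the cited papers; the video file name, the roi rectangle, and the blob-area threshold are made-up placeholders that would have to be tuned for a real camera.

# minimal sketch of background-subtraction-based density estimation;
# file name, roi, and the 2,000-pixel blob threshold are illustrative only
import cv2
import numpy as np

cap = cv2.VideoCapture("intersection.mp4")         # hypothetical cctv clip
subtractor = cv2.createBackgroundSubtractorMOG2(history=500, detectShadows=True)
roi = (100, 200, 600, 300)                          # x, y, width, height of the road region
kernel = np.ones((3, 3), np.uint8)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    x, y, w, h = roi
    road = frame[y:y + h, x:x + w]

    # foreground mask: moving vehicles become white blobs; shadows (value 127)
    # are removed by the threshold
    mask = subtractor.apply(road)
    mask = cv2.threshold(mask, 200, 255, cv2.THRESH_BINARY)[1]
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel, iterations=2)

    # count sufficiently large blobs as vehicles and use the occupied fraction
    # of the roi as a crude density estimate
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    vehicles = [c for c in contours if cv2.contourArea(c) > 2000]
    density = cv2.countNonZero(mask) / float(w * h)
    print(f"vehicles: {len(vehicles)}  occupancy: {density:.2f}")

cap.release()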
a new technique for estimating traffic density utilizing a macroscopic approach has been developed by kurniawan et al. [19]. the proposed method contains two parts including background construction and a traffic density estimation algorithm. the background construction obtained from detected non-moved vehicles in the front or behind vehicles. moreover, background of the image founded using the edge detection technique. density estimated by founding the ration between the number of roi containing object and the total number of roi. eamthanakul et al. [20] proposed a method-based image processing techniques for congestion detection. the fig. 1. four road lanes intersection. fig. 2. vision-based crossroad model. manguri and mohammed: traffic controlling and monitoring: a review uhd journal of science and technology | jan 2023 | vol 7 | issue 2 9 technique contains three parts: (1) image background substation used for separating vehicles from the background, (2) morphological techniques applied for removing the image noises, and (3) traffic density calculated from the obtained image from cctv. finally, the results of the process are sent to transport plan database. 4.2. traffic sign detection and recognition traffic sign recognition plays a key role in driver assistance systems and intelligent autonomous vehicles. furthermore, it can be helpful for automatic traffic signals which leads to prevent pass across the intersections in the case of read signals. novel approaches proposed in berkaya et al. [21] for traffic sign detection and recognition. a new method developed to detect traffic sign under the name of circle detection algorithm. in addition, rgb-based color thresholding technique was proposed by berkaya et al. [21]. moreover, three algorithms have been used to recognize traffic signs including histogram of oriented gradients (hog), local binary patterns and gabor features are employed within a support vector machine (svm) classification framework. the performance of the proposed methods for both detection and recognition evaluated on german traffic sign detection benchmark (gtsdb) dataset. based on the obtained results from experiments, the proposed system better than the reported literatures and can be used in a real-time operation. yang et al. [22] presented a method for traffic sign detection and recognition, the method includes three steps. thresholding of hsi color space components used to segment image in the first step. applying the blobs extracted to the first step for detecting traffic signs in the second step. the contribution of their method in the first step, machine learning algorithms not used classify shapes instead of this invariant geometric moments have been used. second, inspired by the existing features, new method has been proposed for the recognition. the hog features have been extended to the hsi color space and combined with the local self-similarity (lss) features to get the descriptor. as a classifier, random forest and svm classifiers have been tested together with the new descriptor. gtsdb and the swedish traffic signs (sts) data sets have been used to test the proposed system. finally, the results of the presented technique compared with existing techniques. salti et al. [23] combined solid image analysis and pattern recognition techniques for detecting traffic sign in mobile mapping data. the system designed base on interest regions extracting which makes a significant with other existing systems that sliding window detection have been used. 
furthermore, with having challenging conditions such as varying illumination, partial occlusions, and large scale variations, the proposed system good performance demonstrated. three variant category traffic signs aimed to detect including mandatory, prohibitory and danger traffic signs, according to the experimental setup of the recent gtsdb competition. with having a very good performance of the proposed method in the online competition, the proposed method challenging dataset mobile mapping of italian signs the pipeline has been evaluated and showed its successfully be deployed in real-world mobile mapping data. in du et al. [24] designed the robust and fast performance classifier-based detector. they addressed two algorithms for detection and classification. first, aggregate channel features based on three types of features, which including the color feature, the gradient magnitude, and gradient histograms proposed. second, boosted trees classifier multiscale and multiphase detector have been proposed based on real adaboost algorithm. the obtained results from experiments of this study show high average-recall and speed which is evaluated on daimler, lisa, and lara datasets. real-time traffic signs’ detection and recognition are necessary for smart vehicles to make them more intelligent. to deal with this this issue. shao et al. [25] are proposed a new approach that includes two steps; in the first one acquitted images from the road scene converted to grayscale images. then simplified gabor wavelets (sgw) filter has been applied to the optimized parameters of grayscale images. furthermore, traffic sings bounded by edge detection which helps preparing the obtained result to the next process. in the second, the roi extracted using the maximally stable extremal regions algorithm and the superclass of traffic signs are classified by svm. to classify their subclasses, the traffic signs convolution neural networks (cnn) with input by simplified gabor feature maps, where the parameters were the same as the detection stage is used. finally, the proposed method tested on gtsdb and ctsd datasets and the results obtained from the experiments show that the method is fast and accurate by 6.3 frames per second and 99.43%, respectively. berkaya et al. [21] presented new ideas to provide colorful graphics to improve traffic in terms of object recognition and problem detection. two digital image processing methods, namely, circle detection algorithm and rgb which based on the simplest image segmenting method have been improved to develop the ability of traffic sign manguri and mohammed: traffic controlling and monitoring: a review 10 uhd journal of science and technology | jan 2023 | vol 7 | issue 2 detection. the classification framework, namely, svm has been formed through assembling three main attributes including gabor features hog, and local binary patterns in the smart system. the presented technique is validated by german traffic sign detection and recognition benchmark datasets, correspondingly. according to the practical results, their technique is by far more efficient than the quoted approaches in this paper; the results are also aligned with the real time operation. a new approach for detecting and recognizing traffic signs proposed in ellahyani et al. [26] which includes three main steps. thresholding of his has been used to segment the image based on components of color spaces in the first step. it followed by applying blobs by the result of extracted from the former step. 
then, the traffic signs recognition performed for the detected signs in the last step. moreover, in their study, two different approaches used to classify signs. instead of machine learning algorithms, invariant geometric moments used to classify shapes in the first step. second, inspired by the existing features, new ones have been proposed for the recognition. hsi color space taken from the hog features and combined with the lss features to get the descriptor while used in the proposed algorithm. then, last test has been done based machine learning algorithms which are random forest and svm classifier. finally, the performance of proposed method evaluated and tested on german traffic sign recognition benchmark (gtsrb), gtsdb, and sts datasets. convolutional neural networks (cnn) machine learning algorithm is applicable for object recognition by having power full recognition rate and less time required for execution. in shustanov and yakimov [27] implemented traffic sign recognition using cnn. furthermore, several architectures of cnn compared together. meanwhile, tensor flow library is used for training and massively parallel architecture for multithreaded programming cuda. the entire procedure for traffic sign detection and recognition is executed in real time on a mobile gpu. finally, their method efficiency evaluated on gtsrb dataset and it is obtained very good result by 99.94% for classification images. 4.3. accident detection a main aspect of traffic monitoring is the identification and tracking of vehicles. monitoring vehicles helps to report and detect in the situation of the traffic junctions. one of the main aspects of traffic monitoring is the identification and tracking of vehicles. in this section, accident prediction and detection approaches are faced. tian et al. [28] developed a cooperative vehicle infrastructure systems (cvis) and proposed machine based-vision that can be used to detect car accident automatically. the study includes two phases; cad-cvis database has been created to improve the accuracy of accident detection in the first phase. cad-cvis dataset with regarding different traffic situations consists of various types of accidents, weather conditions and accident location. in the second phase, to detect accident deep neural network model yolo-ca based on cad-cvis and deep learning algorithms developed. moreover, to improve the performance of the model for detection small objects multiscale feature fusion and loss function with dynamic weights utilized. the results showed the proposed method faster than the previous methods, it can detect car accident in milliseconds with a very good average precision by 90.02%. finally, the proposed methods compared with existing methods, and the results determined accuracy improved and real-time over other models. a neoteric framework proposed for detecting accident in ijjina et al. [29]. for accurate object detection, mask r-cnn capitalized in the proposed framework by an efficient centroid-based object tracking algorithm for surveillance footage. the basic idea is to determine an accident after overlapping vehicles together are speed and trajectory anomalies in a vehicle after an overlap with other vehicles. this framework was found to be dominant and paves the way to the development of general-purpose vehicular accident detection algorithms in real-time. the framework tested and evaluated by the proposed dataset with the different weather condition. in saini et al. 
[30], a vehicle tracking technique based on image processing is developed without applying background subtraction for extracting the roi. instead, a hybrid of feature detection and region matching approach is suggested in their study, which is helpful for estimating the trajectory of vehicles over consequent frames. later, as the vehicle path through an intersection, the tracked direction is monitored for the occurrence of any specific event. it is found that the proposed method has capability to detect an accident between two vehicles. wenqi et al. [31] proposed the tap-cnn model for predicting accident based on cnn in the highways. traffic state and cnn model are described by some accident factors such as traffic flow, weather, and light to build a state matrix. in addition, the way of increasing tap-cnn model accuracy for predicting traffic accident different iterations are analyzed. accident data collected for inflected learning and evaluation of the model. finally, the experimental results show that the proposed model manguri and mohammed: traffic controlling and monitoring: a review uhd journal of science and technology | jan 2023 | vol 7 | issue 2 11 named tap-cnn is more effective than the traditional neural network model for producing traffic accidents. dogru and subasi [32] presented an intelligent system for accident detection in which vehicles exchange their microscopic vehicle variables with each other. based on the vehicle speed and coordinates, data collected from vehicular ad-hoc networks (vanets) simulated model in the proposed system and then, it sends traffic alerts to the drivers. furthermore, it shows how to use machine learning methods to detect accidents on freeways in its. two parameterizes help to analyze and detect accident easy which are position and velocity values of every vehicle. in addition, oob data set has been used to test the proposed method. finally, the results show that the rf is better than ann and svm algorithms by with 91.56%, 88.71, and 90.02% accuracy, respectively. vision-based algorithms have been used in yu et al. [33] to detect traffic accident including an st-iht algorithm for improving the robustness and sparsity of spatiotemporal features and weighted extreme learning machine detector for distinguishing between traffic accident and normal traffic. furthermore, a two-point search technique is proposed to find a candidate value adaptively for lipschitz coefficients to improve the tuning precision. for testing and evaluating the proposed method 30 traffic videos collected from youtube website. finally, the results show that the proposed method performance for detecting a traffic accident outperforms other existing methods. accelerometer is a widely employed method that used to detect a crash. in this research work [34], after calibration of accelerometer value of acceleration is use to detect an accident. due to the limitation of accelerometer accuracy and providing the efficient accident detection, cnn machine learning algorithm is tuned. for detecting an accident, image classification technique is used; however, cnn takes a lot of time, data, and computing power to train. transfer learning methods have been innovatively applied to alleviate these problems and for the accident detection application, which involves retraining the already trained network. for this purpose, inception-v3 classifier that developed by google for image was incorporated. 
finally, the proposed method efficiency compared with the traditional accelerometer-based techniques for detecting accident by 84.5% of accuracy for transfer learning algorithm. 4.4. emergency vehicles detection the success of law enforcement and public safety relies directly on the time needed for first responders to arrive at the emergency scene. emergency cars include ambulance, firefighter, and police car. many methods are proposed to detect emergency cars and some of them as example are reviewed in this section. in borisagar et al. [35], two methods of computer vision are used to detect and localize the emergency vehicle. the used methods including object detection and instance segmentation. the proposed method implementation includes faster rcnn for object detection and mask rcnn for instance segmentation. the results show that the proposed method is accurate, most importantly, and suitable for emergency vehicle detection in disordered traffic conditions are deliberated. in addition, a custom dataset used for detecting emergency vehicles which contains 400 images and labeled using the label me tool. roy and rahman [36] are proposed a model for detecting emergency cars from cctv footage such as ambulance and fire-fighter on a heavy traffic road. in this model, priority given to these cars and clearing the emergency lane to pass the traffic intersection. for traffic police, sometimes deciding opening the specific lane for emergency vehicles is difficult or even impossible. deep cnn and coco dataset have been used for automated emergency car detection. the result of presented method for detecting and identifying all kinds’ emergency vehicles generally is reasonable. e. p. jonnadula and p. m. khilar [37] are presented a hybrid architecture for detecting emergency vehicles by combining features of image processing and computer vision. also, search space decreased by using region of interest. prediction of ambulance helps to decrease the number casualty on the real traffic in case of having emergency situation on the road. to cover this problem, lin et al. [38] presented a novel approach-based machine learning techniques and features are extracted using the multi-nature which can extract ambulance characteristics on demand. furthermore, they experimentally evaluate the performance of next-day demand prediction across several state-of-theart machine learning techniques and ambulance demand prediction methods, using real-world ambulatory and demographical datasets obtained from singapore. finally, various machine learning techniques used for different natures and scdf-engineered-socio dataset have been used to show the proposed method accuracy. the existing traffic light system is a lack of information in emergencies such as ambulances, firefighters, and police when a car is in. suhaimy et al. 
[39] developed an embedded manguri and mohammed: traffic controlling and monitoring: a review 12 uhd journal of science and technology | jan 2023 | vol 7 | issue 2 table 2: proposed methods in literature reviews type of traffic management reference (s) algorithm (s) dataset (s) accuracy % contribution (s) density estimation [17] block variance the trafficdb 93.70 traffic density estimation with the low computational cost [1] background subtraction, over feat framework, and place meter imagenet 96.55 defining roi by over feat framework [3] detection (ssd) and mobilenet-ssd data collected from cameras with different places 92.97 (ssd), 79.30 (mobilenet-ssd) new path opened for real time traffic density estimation [18] yolo and deepsort collected data from cctv 87,88 (day, congestion) 93,88 (day, normal) 82,1 (night, normal) detecting, tracking and counting vehicles [19] roi and edge detection n/a new technique developed for estimating traffic density [20] background subtraction and morphological techniques n/a traffic density estimated [40] cnn ucsd 99.01 traffic density estimation model proposed based cnn and computer vision traffic sign detection and recognition approaches [22] hsi, hog, lss, and svm gtsdb, ctsd 98.24 (gtsdb), 98.77 (ctsd) developed circle detection algorithm and an rgb-based color thresholding technique [21] hog, lss, random forest, and svm gtsdb 97.04 in the first step, machine learning algorithms not used classify shapes instead of this invariant geometric moments have been used. second, method has been proposed for the recognition [23] roi, hog, svm, and context aware filter gtsdb 99.43 (prohibitory) 95.01 (mandatory) 97.22 (danger) online detecting mandatory, prohibitory and danger traffic signs [24] aggregate channel features and boosted trees classifier daimler, lisa and lara 84.314 (daimler), 90.33 (lisa), 92.048 (lara) proposed the high average-recall and speed method [26] hog, lss, and svm gtsrb, gtsdb and tst 97.43 shapes classified byusing invariant geometric moments [25] sgw and svm gtsdb and ctsd 99.43 speed of detection and classification improved which is more than 6 frames per second [27] cnn gtsrb 99.94 cnn process described [41] proposed model named capsnet tl_dataset the proposed capsnet is employed for traffic sign recognition. accident detection [28] deep neural network model yolo-ca cad-cvis 90.02 cad-cvis dataset created and the proposed method more fast and accurate [29] mask r-cnn proposed 71 developing vehicular accident detection algorithms in real-time [30] hybrid of feature detection and region matching real world dataset n/a accident detection between two vehicles [31] cnn accident data collected 78.5 accident predicted using cnn [32] ann, svm, and random forests (rf) oob data set 91.56 (rf), 88.71 (ann), 90.02 (svm) the proposed method can provide estimated geographical location of the possible accident (contd...) manguri and mohammed: traffic controlling and monitoring: a review uhd journal of science and technology | jan 2023 | vol 7 | issue 2 13 machine learning application, including acquisition of data, features extraction, different algorithms exploration, tuning, and deploying the model to a good output model in a simulation application. specifically, a classifier of ambulance siren sound into “ambulance arrive” and “no ambulance arrive” has been developed, which is the traffic light system could be used to track an ambulance’s arrival in an emergency. 
this paper suggests an approach based on mel-frequency spectral coefficients-svm (mfcc-svm) on matlab r2017b tools. 5. discussion according to the results of this review, several attempts have been made to develop intelligent traffic controlling visionbased methods. some challenges can be seen when researchers try to develop vision based automatic traffic signals. one of the challenges is that there is no available framework to cover all traffic problems because huge amount of data and computational time are required. another challenge is the power consumption to get real traffic data to testing their proposed methods in the different weather conditions. the table 2: (continued) type of traffic management reference (s) algorithm (s) dataset (s) accuracy % contribution (s) [33] st-iht, spatio-temporal features and w-elm collected dataset 87.4±0.3 (svm), 94.3±0.2 (elm), 95.5±0.3 (w-elm) (i) robust fractures extraction proposed based on of-dsift and st-iht (ii) detect imbalance between traffic accident and normal traffic [42] yolov4 video sequences collected from youtube n/a presents a new efficient framework for accident detection emergency vehicles detection [35] faster rcnn and mask rcnn custom dataset 81 (object detection), 92 (instance segmentation) the computational and accuracy for emergency vehicle detection are suitable [36] deep convolutional neural network coco 97.97 detecting and identifying all kinds emergency cars [37] yolo + resnet coco n/a hybrid architecture presented for detection of emergency vehicles in a real time [38] svr, mlp, rbfn, and lightgbm scdf engineered-socio n/a varying degrees to the model training in lightgbm [39] mfcc-svm 97 effectively distinguish audio events from audio signals ssd: single shot detection, cnn: convolution neural networks, hog: histogram of oriented gradients, lss: local self‑similarity, svm: support vector machine, gtsdb: german traffic sign detection benchmark, and gtsrb: german traffic sign recognition benchmark table 3: used datasets type of traffic management used in references dataset (s) type no. of images density estimation [17] the trafficdb image [1] imagenet image 3.2 million [3], [18] data collected from cameras with different places [40] ucsd video traffic sign detection and recognition approaches [21], [22], [23], [25], [26] gtsdb image 50, 000 [24] daimler image 5,000 [24] lisa video [24] lara video [26], [27] gtsrb image 900 [41] tl_dataset image 46,000 accident detection [28] cad-cvis video [29], [30], [31], [33], [42] proposed, real, and collected data [42] yolov4 emergency vehicles detection [35], [43] custom dataset [36], [37] coco 328,000 gtsdb: german traffic sign detection benchmark, gtsrb: german traffic sign recognition benchmark manguri and mohammed: traffic controlling and monitoring: a review 14 uhd journal of science and technology | jan 2023 | vol 7 | issue 2 third challenge is the lack of standardized dataset for testing and training methods. the results show that there is no any comprehensive dataset for traffic controlling and monitoring. for example, in density estimation, most of the researchers have created their own datasets while in traffic sign detection and recognition; they have used some publicly available datasets such as (gtsdb and gtsrb). on the otherhand, both the accident and emergency vehicle detection methods have only collected and prepared (customized) real data which captured by cctv cameras. 
finally, current systems while studied in the literature provide a low-cost solution for traffic applications in the expense of the system accuracy and they are applicable. 5.1. survey of technique’s summary according to the literature sur vey, researchers have proposed and developed many approaches for controlling and monitoring signal system based on computer vision algorithms. in table 2, presenting methods for each survey topics associated with a summary of their (methods, datasets, and contributions). the reason that why the performance of the reviewed methods is not evaluated is a non-availability of a common datasets. 5.2. datasets for testing and evaluating the proposed methods, researchers worked on the public datasets and on their collected datasets. the used datasets are described in table 3. 6. conclusion in recent years, reducing road congestion has become a key challenge because of the threatening rise in the number of vehicles on the roads. in this review, the existing studies on autonomic traffic controlling and monitoring are reviewed in the computer vision community. furthermore, computer vision is considered as the areas with the most studies are for the future technologies. the intelligent traffic systems perceive the density estimation investigate, traffic sign detection and recognition, accident detection, and emergency vehicle detection. furthermore, name of the used datasets in the reviewed papers are presented. the main gap that founded in this review is a non-availability of dataset for traffic controlling and monitoring. finally, intelligent traffic systems can play a key role in reducing congestion in the intersection areas and traffic flow management. the conducted survey indicates the accuracy finding of each method as described in table 2. this research work could be having a potential impact for further researches in the same field of study. various challenges such as (weather conditions, lighting, and traffic patterns) can be considered with all techniques based on computer vision and machine learning methods. consequently, these conditions will improve our survey in the future work. references [1] d. biswas, h. su, c. wang, j. blankenship and a. j. s. stevanovic. “an automatic car counting system using overfeat framework,” sensors, vol. 17, no. 7, p. 1535, 2017. [2] n. k. jain, r. k. saini and p. mittal. “a review on traffic monitoring system techniques,” in soft computing: theories and applications, singapore, springer singapore, pp. 569-577, 2019. [3] d. biswas, h. su, c. wang, a. stevanovic and w. wang. “an automatic traffic density estimation using single shot detection (ssd) and mobilenet-ssd,” physics and chemistry of the earth, parts a/b/c, vol. 110, pp. 176-184, 2019. [4] m. c. coelho, t. l. farias and n. m. rouphail. “impact of speed control traffic signals on pollutant emissions,” transportation research part d: transport and environment, vol. 10, no. 4, pp. 323-340, 2005. [5] q. guo, l. li and x. ban. “urban traffic signal control with connected and automated vehicles: a survey,” transportation research part c: emerging technologies, vol. 101, pp. 313-334, 2019. [6] s. k. kumaran, s. mohapatra, d. p. dogra, p. p. roy and b. g. kim. “computer vision-guided intelligent traffic signaling for isolated intersections,” expert systems with applications, vol. 134, pp. 267-278, 2019. [7] m. h. malhi, m. h. aslam, f. saeed, o. javed and m. fraz. “vision based intelligent traffic management system,” in 2011 frontiers of information technology. 
ieee, new jersey, pp. 137-141, 2011. [8] c. j. lakshmi and s. kalpana. “intelligent traffic signaling system,” in 2017 international conference on inventive communication and computational technologies (icicct), pp. 247-251, 2017. [9] p. jing, h. huang and l. j. i. chen. “an adaptive traffic signal control in a connected vehicle environment: a systematic review,” information, vol. 8, no. 3, p. 101, 2017. [10] s. s. s. m. qadri, m. a. gökçe and e. öner. “state-of-art review of traffic signal control methods: challenges and opportunities,” european transport research review, vol. 12, no. 1, p. 55, 2020. [11] b. ghazal, k. elkhatib, k. chahine and m. kherfan. “smart traffic light control system,” in: 2016 3rd international conference on electrical, electronics, computer engineering and their applications (eecea), ieee, united states, 2016, pp. 140-145. [12] h. jeon, j. lee and k. j. sohn. “artificial intelligence for traffic signal control based solely on video images,” ieee acces, vol. 22, no. 5, pp. 433-445, 2018. [13] y. wang, x. yang, h. liang and y. liu. “a review of the self-adaptive traffic signal control system based on future traffic environment,” journal of advanced transportation, vol. 2018, 1096123, 2018. [14] s. d. khan and h. ullah. “a survey of advances in vision-based vehicle re-identification,” computer vision and image understanding, vol. 182, pp. 50-63, 2019. [15] y. s. huang and t. h. chung. “modeling and analysis of urban traffic light control systems,” journal of the chinese institute of engineers, vol. 32, no. 1, pp. 85-95, 2009. [16] k. h. k. manguri. “traffıc sıgnalıng control at hıghway manguri and mohammed: traffic controlling and monitoring: a review uhd journal of science and technology | jan 2023 | vol 7 | issue 2 15 intersectıons usıng morphologıcal image processıng technıque,” türkiye, hasan kalyoncu üniversitesi, 2016. [17] k. garg, s. k. lam, t. srikanthan and v. agarwal. “real-time road traffic density estimation using block variance,” in 2016 ieee winter conference on applications of computer vision (wacv), new jersey, ieee pp. 1-9, 2016. [18] k. h. n. bui, h. yi, h. jung and j. cho. “video-based traffic flow analysis for turning volume estimation at signalized intersections,” in intelligent information and database systems, cham, springer international publishing, pp. 152-162, 2020. [19] f. kurniawan, h. sajati, o. dinaryanto. “image processing technique for traffic density estimation,” international journal of engineering and technology, vol. 9, no. 2, pp. 1496-1503, 2017. [20] b. eamthanakul, m. ketcham and n. chumuang. “the traffic congestion investigating system by image processing from cctv camera,” in 2017 international conference on digital arts, media and technology (icdamt), new jersey, ieee, pp. 240-245, 2017. [21] s. k. berkaya, h. gunduz, o. ozsen, c. akinlar and s. gunal. “on circular traffic sign detection and recognition,” expert systems with applications, vol. 48, pp. 67-75, 2016. [22] y. yang, h. luo, h. xu and f. wu. “towards real-time traffic sign detection and classification,” ieee transactions on intelligent transportation systems, vol. 17, no. 7, pp. 2022-2031, 2016. [23] s. salti, a. petrelli, f. tombari, n. fioraio and l. di stefano. “traffic sign detection via interest region extraction,” pattern recognition, vol. 48, no. 4, pp. 1039-1049, 2015. [24] x. du, y. li, y. guo and h. xiong. 
“vision-based traffic light detection for intelligent vehicles,” in 2017 4th international conference on information science and control engineering (icisce), pp. 1323-1326, 2017. [25] f. shao, x. wang, f. meng, t. rui, d. wang and j. j. s. tang. “real-time traffic sign detection and recognition method based on simplified gabor wavelets and cnns,” sensors, vol. 18, no. 10, p. 3192, 2018. [26] a. ellahyani, m. e. ansari and i. e. jaafari. “traffic sign detection and recognition based on random forests,” applied soft computing, vol. 46, pp. 805-815, 2016. [27] a. shustanov and p. yakimov. “cnn design for real-time traffic sign recognition,” procedia engineering, vol. 201, pp. 718-725, 2017. [28] d. tian, c. zhang, x. duan and x. wang. “an automatic car accident detection method based on cooperative vehicle infrastructure systems,” ieee access, vol. 7, pp. 127453-127463, 2019. [29] e. p. ijjina, d. chand, s. gupta and k. goutham. “computer vision-based accident detection in traffic surveillance,” in 2019 10th international conference on computing, communication and networking technologies (icccnt), pp. 1-6, 2019. [30] a. saini, s. suregaonkar, n. gupta, v. karar and s. poddar. “region and feature matching based vehicle tracking for accident detection,” in 2017 tenth international conference on contemporary computing (ic3), pp. 1-6, 2017. [31] l. wenqi, l. dongyu and y. menghua. “a model of traffic accident prediction based on convolutional neural network,” in 2017 2nd ieee international conference on intelligent transportation engineering (icite), pp. 198-202, 2017. [32] n. dogru and a. subasi. “traffic accident detection using random forest classifier,” in 2018 15th learning and technology conference (l&t), pp. 40-45, 2018. [33] y. yu, m. xu and j. gu, “vision-based traffic accident detection using sparse spatio-temporal features and weighted extreme learning machine,” iet intelligent transport systems, vol. 13, no. 9, pp. 1417-1428, 2019. [34] p. borisagar, y. agrawal and r. parekh. “efficient vehicle accident detection system using tensorflow and transfer learning,” in 2018 international conference on networking, embedded and wireless systems (icnews), pp. 1-6, 2018. [35] s. kaushik, a. raman and k. v. s. r. rao. “leveraging computer vision for emergency vehicle detection-implementation and analysis,” in 2020 11th international conference on computing, communication and networking technologies (icccnt), pp. 1-6, 2020. [36] s. roy and m. s. rahman. “emergency vehicle detection on heavy traffic road from cctv footage using deep convolutional neural network,” in 2019 international conference on electrical, computer and communication engineering (ecce), pp. 1-6, 2019. [37] e. p. jonnadula and p. m. khilar. “a new hybrid architecture for real-time detection of emergency vehicles.” in: computer vision and image processing. springer, singapore, 2020, pp. 413-422. [38] a. x. lin, a. f. w. ho, k. h. cheong, z. li, w. cai, m. l. chee, y. y. ng, x. xiao and m. e. h. ong. “leveraging machine learning techniques and engineering of multi-nature features for national daily regional ambulance demand prediction,” international journal of environmental research and public health, vol. 17, no. 11, p. 4179, 2020. [39] m. a. suhaimy, i. s. a. halim, s. l. m. hassan and a. saparon. “classification of ambulance siren sound with mfcc-svm,” in aip conference proceedings, united states, aip publishing llc vol. 2306, no. 1, p. 020032, 2020. [40] l. a. t. nguyen and t. x. ha. 
“a novel approach of traffic density estimation using cnns and computer vision,” european journal of electrical engineering and computer science, vol. 5, no. 4, pp. 8084, 2021. [41] x. liu and w. q. yan. “traffic-light sign recognition using capsule network,” multimedia tools and applications, vol. 80, no. 10, pp. 15161-15171, 2021. [42] h. ghahremannezhad, h. shi and c. liu. “real-time accident detection in traffic surveillance using deep learning,” in 2022 ieee international conference on imaging systems and techniques (ist), pp. 1-6, 2022. . 24 uhd journal of science and technology | jul 2019 | vol 3 | issue 2 1. introduction biometric identification is a new technology to recognize a person based on a physiological or behavioral characteristic that attracting a lot of attention recently [1-3]. as the level of counterfeit and deceptive transactions increases rapidly, so this causes the need for highly secure identification technologies and personal verification [4-6]. the existing methods of shared secrets such as pins or passwords, key devices, and smart cards, these are not sufficient in many applications [7-9]. biometric characteristic can realize this issue that is unique and realize the characteristic of a human [10-12]. the use of biometrics for personal authentication becomes practical and considerably more accurate than the current methods [13-15]. the biometric characteristics are classified into two main categories [16,17]: physiological characteristics related to the shape or part of the body, such as iris, fingerprint, face, dna, retina, and the geometry of the hand [18-20]. the behavior characteristics are related to the human behavior, such as gait, voice, signature, and keystroke dynamics [21-23]. biometrics can be applied in companies, governments, military, border control, hospitals, banks …, etc. [24-26]. these characteristics are used to verify the identity of a person for allowing access to certain information [27-29]. the most important characteristics of the iris do not change the texture of the iris through a person life [30,31]. this stability of iris features over a long time, leading to guarantees the long period of validity of the data and it does not need to update; in addition, iris characteristics are well protected from the environment [32-34]. this advantage allows iris identification as the most accurate and reliable biometric identification [35-37]. in the entire human population, there is no similarity two irises in their mathematical details, even between identical twins [38-40]. the probability of finding efficient biometric iris recognition based on iris localization approach muzhir shaban al-ani1, salwa mohammed nejrs2 1department of information technology, university of human development, college of science and technology, sulaymaniyah, krg, iraq, 2department of physics, university of misan, college of science, iraq a b s t r a c t biometric recognition is an emerging technology that has attracted more attention in recent years. biometric is referred to physiological and behavioral characteristics to identify individuals. iris characteristic is related to physiological biometric characteristics. iris recognition approaches are among the most accurate biometric technologies with immense potential for use in global security applications. 
the aim of this research is to implement an efficient approach to process the diseased eye images to verify the second iris examination for the same person by inserting an improvement step to the iris recognition system. the improvement step makes a correction of boundary around the pupil and removes the corrupted regions. this approach demonstrates its ability to detect the inner limit of the iris. the obtained results show that 90% success in the detection of diseased eye images, which make the iris recognition system more accurate and safe. index terms: biometric recognition, iris localization, iris recognition, template matching corresponding author’s e-mail:  muzhir shaban al-ani, department of information technology, university of human development, college of science and technology, sulaymaniyah, krg, iraq. e-mail: muzhir.al-ani@uhd.edu.iq received: 16-05-2019 accepted: 23-07-2019 published: 31-07-2019 access this article online doi: 10.21928/uhdjst.v3n2y2019.pp24-32 e-issn: 2521-4217 p-issn: 2521-4209 copyright © 2019 al-ani and nejrs. this is an open access article distributed under the creative commons attribution non-commercial no derivatives license 4.0 (cc by-nc-nd 4.0) re v i e w a r t i c l e uhd journal of science and technology muzhir shaban al-ani and salwa mohammed nejrs: iris localization approach uhd journal of science and technology | jul 2019 | vol 3 | issue 2 25 two people with an identical iris is almost approach zero, and the probability that two irises are similar; it is approximately 1 in 1010 [41-43]. iris recognition is an effective aspect of human identification for its dissimilarity between iris characteristics. these research aims are to introduce an efficient biometric iris recognition approach based on iris localization method. this approach tries to improve the identification process through certain processes on iris image. 2. iris recognition the recognition of the iris is an automatic method of biometric identification that uses mathematical techniques of pattern recognition in video images of one or both irises of an individual’s eyes [44,45]. the complex iris patterns are unique, stable, and visible from a distance [46,47]. the iris recognition technology determines the identity of an individual through many steps, as shown in fig. 1 [48,49]. these steps of iris recognition are as follows: • iris image acquisition: this step deals with using of electronic devices that converting the object into digital images such as digital camera and digital scanner [50,51]. • image preprocessing: the iris image is preprocessed to obtain a useful region iris image such as to illustrate the detection of the inner and outer boundaries of the iris. this step detects and removes the eyelids and eyelashes that may cover the eye image [52]. the iris image has low contrast and uneven illumination caused by the position of the light source, so preprocessing try to recover these aspects. all of these factors can be compensated in the image preprocessing step [53]. • feature extraction: this step deals with generating of features applying the texture analysis method to extract features from the normalized iris image [54]. important features of the iris are extracted for precise identification purposes [55]. • template matching: this step deals with comparing the user model with the database models using a corresponding matching statistic [56]. the corresponding metric will give a measure of similarity between two iris models or template. 
it provides a range of values when comparing models of the same iris and another range of values when comparing different iris models [57]. finally, a high confidence decision is made to identify whether the user is authenticated or not [58]. 3. literature review many literature reviews are published related to iris recognition. this section introduces some of the updated researches related to iris recognition subject. rai and yadav (2014) considered a new method for recognition of iris patterns using a combination of hamming’s distance and support vector machine. the zone of the zigzag collar of the iris is selected for the extraction of iris characteristics because it captures the most important areas of the complex iris pattern and a higher recognition rate is achieved. the proposed approach also used the detection of parabolas and the cut medium filter for the detection and removal of eyelids and eyelashes, respectively. the proposed method is efficient from a computer and reliable point of view, with a recognition rate of 99.91% and 99.88% based on the image data of cassia and check, respectively [59]. hamouchene et al. (2014) implemented a new iris recognition system using a new feature extraction method. the proposed method, neighborhood-based binary pattern, compares each neighbor of the center pixel with the next neighbor to code it for 1 if it is greater than the center pixel or 0 if it is smaller than the center pixel. the resulting binary code is converted into a decimal number to build the nbp image. to deal with the problem of rotation, we propose a coding process to obtain an invariant image by rotation. this image is subdivided into several blocks and the average of each block is calculated; then, the variations of the averages are encoded by a binary code [60]. santos et al. (2015) focused on the biometric recognition in mobile environments using iris and periocular information template maching feature extraction image preprocessing iris image acquisition fig. 1. iris recognition system. muzhir shaban al-ani and salwa mohammed nejrs: iris localization approach 26 uhd journal of science and technology | jul 2019 | vol 3 | issue 2 as main characteristics. this study makes three main contributions: first demonstrated the utility of an iris and a set of periocular data, which contains images acquired with 10 different mobile configurations and the corresponding data of iris segmentation. this data set allows us to evaluate iris segmentation and recognition methods, as well as periocular recognition techniques; second reported the results of device-specific calibration techniques that compensate for the different color perceptions inherent in each configuration; and third proposed the application of well-known iris and periocular recognition strategies based on classic coding and matching techniques, as well as the demonstration of how they can be combined to overcome the problems associated with mobile environments [61]. umer et al. (2015) proposed a new set of characteristics for personal verification and identification based on iris images. the method has three main components: image preprocessing, feature extraction, and classification. during image preprocessing, iris segmentation is performed using the hough restricted circular transformation. then, only two disjoint quarters of the segmented iris pattern are normalized, which allow the extraction of characteristics for classification purposes. 
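several of the surveyed methods, including the one just described, first locate the pupil and the iris as (near-)circular boundaries before any features are extracted. the sketch below is not taken from any of the cited papers; it only illustrates one common way to run such a circular hough detection with opencv. the radius bounds reuse the ranges reported for this study's dataset in section 4.3, and every other parameter value is an illustrative assumption.

```python
import cv2
import numpy as np

# "eye.png" is a placeholder path for an 8-bit grayscale eye image
eye = cv2.imread("eye.png", cv2.IMREAD_GRAYSCALE)
assert eye is not None, "provide an eye image"
eye = cv2.medianBlur(eye, 5)  # suppress eyelash/reflection noise before edge detection

# radius bounds follow section 4.3 (pupil 28-75 px, iris 90-150 px); other values are guesses
pupil = cv2.HoughCircles(eye, cv2.HOUGH_GRADIENT, dp=1, minDist=eye.shape[0],
                         param1=100, param2=30, minRadius=28, maxRadius=75)
iris = cv2.HoughCircles(eye, cv2.HOUGH_GRADIENT, dp=1, minDist=eye.shape[0],
                        param1=100, param2=30, minRadius=90, maxRadius=150)

if pupil is not None and iris is not None:
    px, py, pr = np.round(pupil[0, 0]).astype(int)
    ix, iy, ir = np.round(iris[0, 0]).astype(int)
    # the pupil circle should lie entirely inside the iris circle for a normal eye
    inside = (px - ix) ** 2 + (py - iy) ** 2 <= (ir - pr) ** 2
    print("pupil", (px, py, pr), "iris", (ix, iy, ir), "pupil inside iris:", inside)
```

the inside-the-iris check mirrors the criterion this paper later uses in section 4.3 to flag diseased eyes whose pupil boundary falls away from the iris region.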
here, the method of extracting characteristics of an iris model is based on a morphological operator of multiple scales. then, the characteristics of the iris are represented by the sum of the dissimilarity residues obtained by the application of a morphological top-hat transform [62]. thomas et al. (2016) in this work, our system introduces a more accurate method called random sample consensus to adjust the ellipse around the non-circular iris boundaries. you can locate the iris boundaries more accurately than the methods based on the hough transformation. we also use the daugman rubber sheet model for iris normalization and elliptical unpacking, and correspondence based on the correlation filter for in-class and interclass evaluation. peak side lobe ratio is the measure of similarity used for the corresponding models. through these, the recognition process improves with the daugman method. the wvu database is used to perform experiments and promising results are obtained [63]. hajari et al. (2016) showed that iris recognition is a difficult problem in a noisy environment. their main objective is to develop a reliable iris recognition system that can operate in a noisy image environment and increase the rate of iris recognition in the casia and mmu iris datasets. they proposed two algorithms: first, a new method to eliminate the noise of the iris image and, second, a method to extract the characteristics of the texture through a combined approach of the local binary model and the gray level cooccurrence matrix. the proposed approach provided the highest recognition rate of 96.5% and low error rate and required less uptime [64]. soliman et al. (2017) introduced a rough algorithm to solve the computational cost problem while achieving an acceptable precision. the gray image of the iris is transformed into a binary image using an adaptive threshold obtained from the analysis of the intensity histogram of the image. the morphological treatment is used to extract an initial central point, which is considered the initial center of the iris and pupil boundaries. finally, a refinement step is performed using an integrodifferential operator to obtain the centers and the final rays of the iris and the pupil. this system is robust against occlusions and intensity variations [65]. naseem et al. (2017) proposed an algorithm to compare the vanguard spatial representation classification with bayesian fusion for several sectors. the proposed approach has shown that it overall performs the implemented algorithm in standard databases. the complexity analysis of the proposed algorithm shows a decisive superiority of the proposed approach. in this research, the concept of class-specific dictionaries for iris recognition is proposed. essentially, the query image is represented as a linear combination of learning images of each class. the well-conditioned inverse problem is solved using the least squares regression and the decision is judged in favor of the class with the most accurate estimate [66]. llano et al. (2018) presented a robust and optimized multisensor scheme with a strategy that combines the evaluation of video frame quality with robust segmentation fusion methods for image recognition and simultaneous image iris recognition. as part of the proposed scheme, they presented a fusion method based on the modified laplacian pyramid in the segmentation stage. 
the experimental results on the casia-v3-interval, casia-v4-mile, ubiris-v1, and mbgc-v2 databases show that the robust optimized scheme increases recognition accuracy and is robust to different types of iris sensors [67]. zhang et al. (2018) implemented a generalized boosting framework to solve several problems of practical iris recognition at a distance, namely, iris detection, detection of poorly localized irises, and iris recognition. this solution takes advantage of a set of carefully designed features and well-tuned boosting algorithms. basically, there are two main contributions. the first is an exploration of the intrinsic properties of remote iris recognition, as well as robust features carefully designed for specific problems. the second important contribution is the methodology on how to adapt adaboost learning to specific problems [68].

4. research methodology

4.1. iris image dataset
the construction of an iris image dataset is a difficult job for many reasons, such as distance, lighting, and the resolution of the capture device. this research needed to collect iris images of real patients who have diseases affecting their iris. the captured iris images are 8-bit gray images with a resolution of 480*640. in general, the iris approximately forms a circular shape. the diameter of the iris in the captured images of this dataset is about 200 pixels. twenty eye images of 10 patients infected with anterior uveitis are used in this research.

4.2. implemented system
a brief review of this field shows that many systems and algorithms have been implemented for biometric recognition, including iris recognition. the implemented approach for iris recognition contains the following components (fig. 2):
• eye image acquisition: in this step, the eye is captured using a sensitive device that converts the real iris object into a digital image consisting of a number of effective pixels.
• eye image preprocessing: in this step, the acquired digital image is converted into a standard image that can be adapted for the next step of processing. this step passes through many processes such as converting the image into a gray-scale image, image filtering, and image resizing.
• iris image localization: in this step, the diseased eye image is enhanced to track the iris region and to detect and localize it.
• iris image normalization: in this step, the iris image is normalized and then converted into gray scale to generate a standard iris image that is adequate for the next processing step.
• feature extraction: this step deals with the generation of features or characteristics of the indicated iris image. features are extracted using the two-dimensional discrete wavelet transform (2d dwt). the 2d dwt is performed by passing a low-pass filter and a high-pass filter over both the rows and the columns of the image, as shown in the following two equations:

$x_{\mathrm{low}}[n]=\sum_{k=-N}^{N} x[k]\, g[2n-k]$ (1)

$x_{\mathrm{high}}[n]=\sum_{k=-N}^{N} x[k]\, h[2n-k]$ (2)

where x represents the input array and g and h represent the low-pass and high-pass filters, respectively.
• template matching: in this step, the template matching score is generated, which is used to decide the personal authentication based on the selected threshold.
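as a concrete illustration of the last two steps above, the sketch below applies one level of the 2d dwt of eqs. (1) and (2) to a normalized iris image and then compares two binary codes with the hamming-distance criterion described later in section 5. the haar filter pair, the sign-based binarization of the detail bands, and the synthetic images are assumptions made only for this example; the paper does not specify these choices.

```python
import numpy as np

# haar analysis filters (assumed; the paper does not name its wavelet)
g = np.array([1.0, 1.0]) / np.sqrt(2.0)   # low-pass filter of eq. (1)
h = np.array([1.0, -1.0]) / np.sqrt(2.0)  # high-pass filter of eq. (2)

def analyse(x, filt):
    # np.convolve gives sum_k x[k] * filt[m - k]; keeping m = 2n yields the decimated
    # outputs of eqs. (1) and (2) (border-handling conventions vary between implementations)
    return np.convolve(x, filt, mode="full")[::2]

def dwt2(img):
    # one 2-d dwt level: filter the rows, then the columns, of the normalized iris image
    lo = np.apply_along_axis(analyse, 1, img, g)
    hi = np.apply_along_axis(analyse, 1, img, h)
    ll = np.apply_along_axis(analyse, 0, lo, g)
    lh = np.apply_along_axis(analyse, 0, lo, h)
    hl = np.apply_along_axis(analyse, 0, hi, g)
    hh = np.apply_along_axis(analyse, 0, hi, h)
    return ll, lh, hl, hh

def iris_code(img):
    # the detail bands are binarized by sign to obtain a bit string (an assumed coding)
    _, lh, hl, hh = dwt2(img)
    return (np.concatenate([lh.ravel(), hl.ravel(), hh.ravel()]) >= 0).astype(np.uint8)

def hamming_distance(a, b):
    # fraction of disagreeing bits between two iris codes (section 5)
    return float(np.mean(a != b))

rng = np.random.default_rng(0)
enrolled = rng.integers(0, 256, size=(64, 256)).astype(float)  # stand-in normalized iris image
probe = enrolled + rng.normal(0.0, 8.0, size=enrolled.shape)   # noisy second capture
hd = hamming_distance(iris_code(enrolled), iris_code(probe))
print(f"hamming distance = {hd:.3f} -> {'match' if hd <= 0.40 else 'no match'}")
```

with the 0.40 threshold reported in section 5, two codes are declared to belong to the same iris only when fewer than 40% of their bits disagree.

4.3.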
detection of diseased eye there are many differences between diseased eye and normal eye as shown in fig. 3. one important issue is to identify the diseased eye, in which the pupil of the diseased eye with anterior uveitis is not circular and may cause changes in iris architecture or atrophy. in this study, two factors are considered to separate between diseased and normal eyes: • to localize the pupil boundary or iris boundary as a circle, its radius must fall within the specific range. in the specified database, the range of iris radius value is within 90–150 pixels, while the pupil radius ranges are within 28–75 pixels. • the pupil is always within the iris region; hence, the pupil boundary must be within the iris boundary for normal eye image acquisition eye image preprocessing iris image localization iris image normalization feature extraction template matching fig. 2. implemented approach of iris recognition. muzhir shaban al-ani and salwa mohammed nejrs: iris localization approach 28 uhd journal of science and technology | jul 2019 | vol 3 | issue 2 eyes, while in the diseased eyes, the pupil boundary is localized away from iris region. this gives an evidence that the eye is diseased and an enhancement must be introduced before iris localization step. 4.4. enhancement of iris image when decision is taken that the eye is infected or diseased, then the procedure goes directly to the enhancement process. the enhancement process helps to localize the pupil boundary. the enhancement process is implemented through the following steps: 1. determine the upper boundary of iris, which leads to cover the area within the pupil (fig. 4). at this process, three parameters are stored: the upper radius (rupper), xupper boundary of the iris (xupper), and yupper boundary of the iris (yupper). 2. resizing the eye image to isolate the iris image as shown in fig. 5. the pupil boundary will be localized within the iris region instead of the whole eye region. hence, the new iris image will be determined as below: p 1 = (xupper−rupper, yupper−rupper) (3) p 2 = (xupper+rupper, yupper+rupper) (4) 3. adjusting the intensity of iris image according to the incident light as shown in fig. 6. 4. adjusting the threshold value to create the binary image as shown in fig. 7. 5. discriminate the irregular pupil by determining the minimum and maximum points on each axis as shown in fig. 8. four points must be calculated in this step: px min refers to the minimum point on x-axis found in pupil pixels (5) py min refers to the minimum point on y-axis found in pupil pixels (6) fig. 3. diseased and normal eyes. fig. 4. iris boundary localization. fig. 5. iris resizing. fig. 6. adjusting iris image. muzhir shaban al-ani and salwa mohammed nejrs: iris localization approach uhd journal of science and technology | jul 2019 | vol 3 | issue 2 29 px max refers to the maximum point on x-axis found in pupil pixels (7) py max refers to the maximum point on y-axis found in pupil pixels (8) 6. round the irregular pupil area by a rectangular shape as shown in fig. 9. this rectangular shape contains all the pixels of the irregular pupil. two points must be calculated in this step: p 1 min = (xpx min, ypy min) (9) p 1 max = (xpx max, ypy max) (10) 7. calculate the center of the rectangular according to the previous step: pcenter = (xcenter, ycenter) (11) 8. draw a circle around the pupil to complete the circular form of the pupil as shown in fig. 10. 9. update the iris image to the same position on the original image. 10. 
compare the processed image with the images stored in the database to identify the person. 5. results and discussion hamming distance measures the fraction of disagreeing bits resulting from bit-by-bit comparison of the two regions of interest. the obtained result indicated that the criterion is chosen to be 0.40, which means that a matching decision is never declared between two iris codes if it is exceed 40% of the disagreed bits. fig. 11 illustrates that the change in hamming distance before and after applying the proposed approach; in addition, it is clear that applying this approach causes significant decreasing in the hamming distance value. fig. 12 illustrates a comparison between the hamming distance of the diseased eye images before and after treatment fig. 8. determine rectangular dimensions. fig. 9. round the irregular pupil area by a rectangular shape. fig. 10. drawing a circle around the pupil. fig. 7. thresholding of iris image. muzhir shaban al-ani and salwa mohammed nejrs: iris localization approach 30 uhd journal of science and technology | jul 2019 | vol 3 | issue 2 and the hamming distance for the images of the same diseased eye after treatment. this figure indicated that the hamming distance values for the two diseased eye images after treatment are less than the hamming distance values for the images of the diseased eye before and after treatment. these results are caused by the pupil of the treated eyes seem normal unlike diseased eyes whose pupil becomes a little larger after applying the proposed enhancement method. fig. 13 indicated the differences in pupil radius values between the images of the diseased eyes after applying the enhancement and the images of the same diseased eyes after receiving the processed image. according to pupil distortion in the diseased eyes, the size of the pupil will enlarge affecting the size of iris region which should be considered when calculating hamming distance in iris recognition algorithm; this caused increasing the hamming distance values. the implemented approach is evaluated through performance evaluation in terms of false rejection rate (frr) and false acceptance rate (far). far and frr are shown in fig. 14 based on the hamming distance. perfect recognition is not possible due to the overlapping distributions. an accurate recognition rate is achieved through threshold of 0.40, in which a false accept rate and false reject rate of 0.000% and 0.100%, respectively, are obtained. 6. conclusions iris recognition is an effective method for biometric human identification. the implemented iris recognition approach is passed into many steps to achieve good system performance. this research studied the effects of infected eye on the recognition process through introducing different types of eye images. in addition, treating and enhancing processes are inserted in the overall approach to prepare an adequate iris image for processing. the obtained results indicated that the recognition performance of the implemented approach is 90%. the experimental results show that the proposed method is an effective approach in iris recognition. fig. 11. change in hamming distance before and after applying the proposed approach. fig. 12. the hamming distance outcomes of the diseased eye images before and after treatment and the hamming distance outcomes for the images of the same diseased eye after treatment. fig. 13. pupil radius values of diseased and treated eyes. fig. 14. 
performance evaluation in terms of false rejection rate and false acceptance rate. muzhir shaban al-ani and salwa mohammed nejrs: iris localization approach uhd journal of science and technology | jul 2019 | vol 3 | issue 2 31 references [1] m. s. al-ani. biometric security, source title: handbook of research on threat detection and countermeasures in network security. igi global, pennsylvania (usa), 2015. [2] a. e. osborn-gustavson, t. mcmahon, m. josserand and b. j. spamer. the utilization of databases for the identification of human remains. in: new perspectives in forensic human skeletal identification. ch. 12. academic press, san diego, 2018, pp. 129139. [3] m. s. al-ani. “happiness measurement via classroom based on face tracking. uhd journal of science and technology, vol 3, no. 1, pp. 9-17, 2019. [4] m. viner. overview of advances in forensic radiological methods of human identification. in: new perspectives in forensic human skeletal identification. ch. 19. academic press, san diego, 2018, pp. 217-226. [5] m. s. al-ani. biometrics: identification and security, source title. in: multidisciplinary perspectives in cryptology and information security. igi global, pennsylvania (usa), 2014. [6] s. s. muhamed and m. s. al-ani. “signature recognition based on discrete wavelet transform”. uhd journal of science and technology, vol. 3, no. 1, pp. 19-29, 2019. [7] a. m. christensen and g. m. hatch. advances in the use of frontal sinuses for human identification. in: new perspectives in forensic human skeletal identification. ch. 20, academic press, san diego, 2018, pp. 227-240. [8] m. s. al-ani and k. m. ali alheeti. precision statistical analysis of images based on brightness distribution. advances in science, technology and engineering systems journal, vol. 2, no. 4, pp. 99-104, 2017. [9] m. s. al-ani. efficient architecture for digital image processing based on epld. iosr journal of electrical and electronics engineering, vol. 12, no. 6, pp. 1-7, 2017. [10] m. s. al-ani, t. n. muhamad, h. a. muhamad and a. a. nuri. effective fingerprint recognition approach based on double fingerprint thumb, 2017 ieee, 2017 international conference on current research in computer science and information technology (iccit). ieee, slemani-iraq, 2017. [11] j. l cambier. adaptive iris capture in the field. biometric technology today, vol. 2014, no. 2, pp. 5-7, 2014. [12] a. rodriguez and b. v. k. vijaya kumar. segmentation-free biometric recognition using correlation filters. academic press library in signal processing. ch. 15. vol 4.  carnegie mellon university, pittsburgh, pa, usa, 2014, pp. 403-460. [13] p. tome, r. vera-rodriguez, j. fierrez and j. ortega-garcia. facial soft biometric features for forensic face recognition. forensic science international, vol. 257,, pp. 271-284, 2015. [14] m. s. nixon, p. l. correia, k. nasrollahi, t. b. moeslund and m. tistarelli. on soft biometrics. pattern recognition letters, vol. 68, pp. 218-230, 2015. [15] i. rigas and o. v. komogortsev. eye movement-driven defense against iris print-attacks. pattern recognition letters, vol. 68, pp. 316-326, 2015. [16] f. davoodi, h. hassanzadeh, s. a. zolfaghari, g. havenith and m. maerefat. a new individualized thermoregulatory bio-heat model for evaluating the effects of personal characteristics on human body thermal response. building and environment, vol. 136, pp. 62-76, 2018. [17] p. connor and a. ross. biometric recognition by gait: a survey of modalities and features. computer vision and image understanding, vol. 167, pp. 
1-27, 2018. [18] k. nguyen, c. fookes, s. sridharan, m. tistarelli and m. nixon. super-resolution for biometrics: a comprehensive survey. pattern recognition, vol. 78, pp. 23-42, 2018. [19] g. batchuluun, j. h. kim, h. g. hong, j. k. kang and k. r. park. fuzzy system based human behavior recognition by combining behavior prediction and recognition. expert systems with applications, vol. 81, pp. 108-133, 2017. [20] s. gold. iris biometrics: a legal invasion of privacy? biometric technology today, vol. 2013, no. 3, pp. 5-8, 2013. [21] m. gomez-barrero, j. galbally and j. fierrez. efficient software attack to multimodal biometric systems and its application to face and iris fusion. pattern recognition letters, vol. 36, pp. 243-253, 2014. [22] k. aloui, a. nait-ali and m. s. naceur. using brain prints as new biometric feature for human recognition. pattern recognition letters, vol. 113, in press, 2017. [23] s. kumar and s. k. singh. monitoring of pet animal in smart cities using animal biometrics. future generation computer systems, vol. 83, pp. 553-63, 2018. [24] s. crihalmeanu and a. ross. multispectral scleral patterns for ocular biometric recognition. pattern recognition letters, vol. 33, no. 14, pp. 1860-1869, 2012. [25] k. c. reshmi, p. i. muhammed, v. v. priya and v. a. akhila. a novel approach to brain biometric user recognition. procedia technology, vol. 25, pp. 240-247, 2016. [26] h. wechsler and f. li. biometrics and robust face recognition. in: conformal prediction for reliable machine learning. ch. 10. morgan kaufmann publishers inc., san francisco, 2014, pp. 189215. [27] m. s. al-ani and q. al-shayea. speaker identification: a novel fusion samples approach. international journal of computer science and information security, vol. 14, no. 7, pp. 423-427, 2016. [28] r. s. prasad, m. s. al-ani and s. m. nejres. hybrid fusion of two human biometric features. international journal of business and ict, vol. 2, pp. 1-2, 2016. [29] q. al-shayea and m. s. al-ani. biometric face recognition based on enhanced histogram approach. international journal of communication networks and information security, vol. 10, no. 1, pp. 148-154, 2018. [30] t. bergmüller, e. christopoulos, k. fehrenbach, m. schnöll and a. uhl. recompression effects in iris recognition. image and vision computing, vol. 58, pp. 142-157, 2017. [31] m. trokielewicz, a. czajka and p. maciejewicz. implications of ocular pathologies for iris recognition reliability. image and vision computing, vol. 58, pp. 158-167, 2017. [32] y. alvarez-betancourt and m. garcia-silvente. a keypoints-based feature extraction method for iris recognition under variable image quality conditions. knowledge-based systems, vol. 92, pp. 169182, 2016. [33] a. k bhateja, s. sharma, s. chaudhury and n. agrawal. iris recognition based on sparse representation and k-nearest subspace with genetic algorithm. pattern recognition letters, vol. 73, pp. 13-18, 2016. [34] r. pasula, s. crihalmeanu and a. ross. a multiscale sequential fusion approach for handling pupil dilation in iris recognition. muzhir shaban al-ani and salwa mohammed nejrs: iris localization approach 32 uhd journal of science and technology | jul 2019 | vol 3 | issue 2 in: human recognition in unconstrained environments. ch. 4. academic press, chicago, il, 2017, pp. 77-102. [35] f. jan. segmentation and localization schemes for non-ideal iris biometric systems. signal processing, vol. 133, pp. 192-212, 2017. [36] d. gragnaniello, g. poggi, c. sansone and l. verdoliva. 
using iris and sclera for detection and classification of contact lenses. pattern recognition letters, vol. 82, pp. 251-257, 2016. [37] y. hu, k. sirlantzis and g. howells. iris liveness detection using regional features. pattern recognition letters, vol. 82, pp. 242250, 2016. [38] m. de marsico, c. galdi, m. nappi and d. riccio. firme: face and iris recognition for mobile engagement. image and vision computing, vol. 32, no. 12, pp. 1161-1172, 2014. [39] j. liu, z. sun and t. tan. distance metric learning for recognizing low-resolution iris images. neurocomputing, vol. 144, pp. 484-492, 2014. [40] r. s. prasad, m. s. al-ani and s. m. nejres. human identification via face recognition: comparative study. iosr journal of computer engineering, vol. 19, no. 3, pp. 17-22, 2017. [41] g. i. raho, m. s. al-ani, a. a. k. al-alosi and l. a. mohammed. signature recognition using discrete fourier transform. international journal of business and ict, vol. 1, pp. 1-2, 2015. [42] r. s. prasad, m. s. al-ani and s. m. nejres. an efficient approach for human face recognition. international journal of advanced research in computer science and software engineering, vol. 5, no. 9, pp. 133-136, 2015. [43] r. s. prasad, m. s. al-ani and s. m. nejres. an efficient approach for fingerprint recognition. international journal of engineering innovation and research, vol. 4, no. 2, pp. 303-313, 2015. [44] k. nguyen, c. fookes, r. jillela, s. sridharan and a. ross. long range iris recognition: a survey. pattern recognition, vol. 72, pp. 123-143, 2017. [45] m. karakaya. a study of how gaze angle affects the performance of iris recognition. pattern recognition letters, vol. 82, pp. 132-143, 2016. [46] k. b. raja, r. raghavendra, v. k. vemuri and c. busch. smartphone based visible iris recognition using deep sparse filtering. pattern recognition letters, vol. 57, pp. 33-42, 2016. [47] k. w. bowyer, e. ortiz and a. sgroi. iris recognition technology evaluated for voter registration in somaliland. biometric technology today, vol. 2015, no. 2, pp. 5-8, 2015. [48] a. f. m. raffei, h. asmuni, r. hassan and r. m. othman. a low lighting or contrast ratio visible iris recognition using iso-contrast limited adaptive histogram equalization. knowledge-based systems, vol. 74, pp. 40-48, 2015. [49] swathi s. dhage, s. s. hegde, k. manikantan and s. ramachandran. dwt-based feature extraction and radon transform based contrast enhancement for improved iris recognition. procedia computer science, vol. 45, pp. 256-265, 2015. [50] s. umer, b. c. dhara and b. chanda. a novel cancelable iris recognition system based on feature learning techniques. information sciences, vol. 406-407, pp. 102-118, 2015. [51] y. jung, d. kim, b. son and j. kim. an eye detection method robust to eyeglasses for mobile iris recognition. expert systems with applications, vol. 67, pp. 178-188, 2017. [52] i. tomeo-reyes and v. chandran. part based bit error analysis of iris codes. pattern recognition, vol. 60, pp. 306-317, 2016. [53] haiqing li, q. zhang and z. sun. iris recognition on mobile devices using near-infrared images. in: human recognition in unconstrained environments. ch. 5.  institute of automation, chinese academy of sciences, beijing, pr china, 2017, pp. 103117. [54] s. s. barpanda, b. majhi, p. k. sa, a. k. sangaiah and s. bakshi. iris feature extraction through wavelet mel-frequency cepstrum coefficients. optics and laser technology, vol. 110, pp. 13-23, 2019. [55] m. sardar, s. mitra and b. u. shankar. 
iris localization using rough entropy and csa: a soft computing approach. applied soft computing, vol. 67, pp. 61-69, 2018. [56] s. zhang and y. zhou. template matching using grey wolf optimizer with lateral inhibition. optik-international journal for light and electron optics, vol. 130, pp. 1229-1243, 2017. [57] p. samant and r. agarwal. machine learning techniques for medical diagnosis of diabetes using iris images. computer methods and programs in biomedicine, vol. 157, pp. 121-128, 2018. [58] z. lin, d. ma, j. meng and l. chen. relative ordering learning in spiking neural network for pattern recognition. neurocomputing, vol. 275, pp. 94-106, 2018. [59] h. rai and a. yadav. iris recognition using combined support vector machine and hamming distance approach. expert systems with applications, vol. 41, no. 2, pp. 588-593, 2014. [60] i. hamouchene and s. aouat. a new texture analysis approach for iris recognition. aasri procedia, vol. 9, pp. 2-7, 2014. [61] g. santos, e. grancho, m. v. bernardo and p. t. fiadeiro. fusing iris and periocular information for cross-sensor recognition. pattern recognition letters, vol. 57, pp. 52-59, 2015. [62] s. umer, b. c. dhara and b. chanda. iris recognition using multiscale morphologic features. pattern recognition letters, vol. 65, pp. 67-74, 2015. [63] t. thomas, a. george and k. p. i. devi. effective iris recognition system. procedia technology, vol. 25, pp. 464-472, 2016. [64] k. hajari, u. gawande and y. golhar. neural network approach to iris recognition in noisy environment. procedia computer science, vol. 78, pp. 675-682, 2016. [65] n. f. soliman, e. mohamed, f. magdi, f. e. a. el-samie and a. m. elnaby. efficient iris localization and recognition. optik-international journal for light and electron optics, vol. 140, pp. 469-475, 2017. [66] i. naseem, a. aleem, r. togneri and m. bennamoun. iris recognition using class-specific dictionaries. computers and electrical engineering, vol. 62, pp. 178-193, 2017. [67] e. g. llano, m. s. g. vázquez, j. m. c. vargas, l. m. z. fuentes and a. a. r. acosta. optimized robust multi-sensor scheme for simultaneous video and image iris recognition. pattern recognition letters, vol. 101, pp. 44-51, 2018. [68] m. zhang, z. he, h. zhang, t. tan and z. sun. towards practical remote iris recognition: a boosting based framework. neurocomputing, vol. 330, in press, 2018. tx_1~abs:at/tx_2:abs~at 82 uhd journal of science and technology | jan 2023 | vol 7 | issue 1 1. introduction diabetes is a major public health issue, projected to be the seventh leading cause of death by 2030 [3]. patients with t2dm patients with suboptimal glycemic control and hba1c levels are more likely to develop microvascular problems and cardiovascular disease [1,13]. hba1c levels have been shown to be affected by modifiable psychosocial variables such self-care habits and attitude [2,5]. without good self-care practices, it might be difficult to keep hba1c levels in check [3,6] frequency, population distribution. the authors express concern that diabetes might develop into a regional public health problem and suggest measures to combat the disease [4,5]. 
developing healthy self-care habits is essential for managing hba1c levels, which can increase without proper exploring the relationship between attitudes and blood glucose control among patients with type 2 diabetes mellitus in chamchamal town, kurdistan, iraq hawar mardan mohammed1*, samir yonis lafi1 1department of nursing, university of raparin, kurdistan regional government, iraq, 2department of nursing, college of nursing, university of human development, kurdistan regional government, iraq a b s t r a c t background: diabetes mellitus type 2 is an endocrine disorder characterized by a progressive elevation in blood glucose levels. it is a persistent and incapacitating illness that may result in mortality if not properly managed. objectives: the objective of this research is to explore the relationship between the attitudes of individuals with type 2 diabetes mellitus and their ability to regulate blood glucose levels. in particular, the study aims to investigate the potential correlation between participants’ attitudes and their capacity to manage blood glucose levels following their participation in an educational program. moreover, the research seeks to analyze the association between individuals’ attitudes and diabetes control. ultimately, the study intends to evaluate the levels of participants’ attitudes through appropriate measures. materials and methods: the study is designed as a cross-sectional investigation and utilizes data from a diabetic outpatient center in chamchamal. the study population consists of outpatients from the evening public clinic and chronic disease control center. participants are required to complete questionnaires on their diabetes attitude. the study was conducted between august 11, 2019, and january 5, 2022. to explore the efficacy of the attitude with diabetes control, we used a correlation coefficient test and a t-test with p-value of 0.05 as our alpha level of significance. results and conclusion: the study found that the majority of patients with type 2 diabetes mellitus had low levels of educational attainment, were married and had insufficient monthly income. in addition, 85% of the patients reported not smoking, and 48.3% were classified as overweight. these findings highlight the need for health-care providers to consider sociodemographic factors in the management of diabetes mellitus. index terms: attitude, type 2 diabetes mellitus, blood glucose control, diabetic complications, self-care management corresponding author’s e-mail: hawar mardan mohammed, department of nursing, university of raparin, kurdistan regional government, iraq. e-mail: hawar.mardan84@gmail.com received: 28-11-2022 accepted: 11-03-2023 published: 12-04-2023 access this article online doi: 10.21928/uhdjst.v7n1y2023.pp82-91 e-issn: 2521-4217 p-issn: 2521-4209 copyright © 2023 mohammed and lafi. this is an open access article distributed under the creative commons attribution noncommercial no derivatives license 4.0 (cc by-nc-nd 4.0) o r i g i n a l r e s e a r c h a r t i c l e uhd journal of science and technology mohammed and lafi: attitude and glucose control uhd journal of science and technology | jan 2023 | vol 7 | issue 1 83 self-care [6]. attitude is the degree to which a person believes he or she is capable of doing a job, and attitudes precede actions [7,8]. the amount of self-assurance that individual possesses regarding their capacity to carry out a task [15]. 
is referred to as their attitude, and it is normal for an individual’s attitudes to come before their actions. patients with diabetes need to make lifestyle changes to manage their blood glucose levels [9]. this study aims to investigate the potential correlation between attitudes of individuals with type 2 diabetes mellitus and their ability to regulate blood glucose levels [10]. patients with diabetes can also improve their health and prevent further complications by losing weight and lowering their body mass index (bmi) [11]. diabetes attitude is a patient’s attitude toward managing the disease, controlling blood sugar, reducing complications, and preventing short-term problems [12]. effective patient attitude management strategies can reduce the risk of chronic complications and prevent acute complications in type 2 diabetes [10]. individuals who maintain optimal glycemic control are at a reduced risk of developing microvascular complications, such as those that affect the kidneys, nerves, and eyes. these complications can manifest in the form of cataracts, glaucoma, renal failure, and lower limb amputations. conversely, when blood glucose levels are maintained at appropriate levels, macrovascular complications, including heart attacks and strokes, appear to be averted [14]. this study aims to investigate the relationship between attitude and ability to manage blood glucose. 2. methodology 2.1. study design sixty patients were studied in this cross-sectional study from the diabetes and chronic disease control center in the chamchamal district of sulaimaniyah, iraq, between july 7, 2020, and november 7, 2020. 2.2. sample size raosoft’s sample size calculator was used to determine the appropriate sample size. only 60 patients out of a possible 2000 at the diabetes and chronic disease control center were included in this research. 2.3. inclusion criteria the research study exclusively included adult patients who had been diagnosed with type 2 diabetes and met the rigorous eligibility criteria set forth by the trial. to be included in the study, participants were required to provide informed consent and meet all the necessary prerequisites for research participation. the eligible individuals who met the inclusion criteria are described in detail below. 2.4. exclusion criteria patients with t1dm, pregnant women with t2dm, liver failure, impairments or special requirements, and gestational diabetes were excluded from the study. 2.5. ethical approval the university approved the moral viewpoints expressed by the ethics committee of the college of nursing at raparin. in addition, participants were informed of the purpose and nature of the research. 2.6. patient informed consent before data were collected, participants were asked to sign informed consent forms and give their verbal and written informed consent in kurdish. they were also what might come out of the study. furthermore, a lot of thought goes into patients’ rights, privacy, and the safety of their information. 2.7. questionnaire a questionnaire to evaluate a patient’s attitude and behavior was designed and composed of 3 parts that covered sociodemographic factors, clinical parameters, and attitude behaviors evaluation. each section uses a likert scale to rate the respondent’s degree of agreement with each statement. the participants’ total replies were computed on a scale from 1 to 30, with always = 1, sometimes = 2, and never = 3. 
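a minimal numpy sketch of this scoring step is given below. the responses are fabricated; the per-participant score is taken as the mean of the 30 items, which is an assumption since the text does not state whether the score is a sum or a mean (tables 2a, 2b, and 4 report means on the 1-3 scale); and the cronbach's alpha computation corresponds to the reliability figure reported in the next paragraph.

```python
import numpy as np

rng = np.random.default_rng(1)
n_participants, n_items = 60, 30

# fabricated likert responses coded 1-3 as described above
# (tables 2a-2b use the reverse orientation, always = 3 and never = 1)
responses = rng.integers(1, 4, size=(n_participants, n_items))

# per-participant attitude score, taken here as the mean over the 30 items
attitude_score = responses.mean(axis=1)

# cronbach's alpha for the 30-item scale (the study reports 0.92; random data will not reproduce it)
item_variances = responses.var(axis=0, ddof=1).sum()
total_variance = responses.sum(axis=1).var(ddof=1)
alpha = n_items / (n_items - 1) * (1.0 - item_variances / total_variance)

print(f"mean attitude score = {attitude_score.mean():.2f}, cronbach's alpha = {alpha:.2f}")
```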
the attitude score was then determined for each participant based on their responses to the set of 30 questions. the likert questionnaire had a reliability of 0.92 based on the cronbach's alpha test; the items were presented to all patients in the same order. after taking the patient's height and weight, the body mass index (bmi; kg/m2) was determined. bmi is a measure of a person's body fat based on their height and weight. it is calculated by dividing a person's weight in kilograms by their height in meters squared (kg/m²). bmi is a commonly used metric for determining whether an individual's weight is within a healthy range or whether they are overweight or obese. it is often used in both clinical and research settings as a quick and easy screening tool for assessing a person's weight status and associated health risks. however, it should be noted that bmi is not a perfect measure and has certain limitations, such as not taking into account body composition or the distribution of body fat. a researcher used a targeted sampling technique to obtain the data.

2.8. measure of the clinical parameter
the bmi was classified according to the who criteria, in which <18.5 kg/m2 = underweight, 18.5–24.9 kg/m2 = normal weight, 25.0–29.9 kg/m2 = pre-obesity, 30.0–34.9 kg/m2 = obesity class i, 35.0–39.9 kg/m2 = obesity class ii, and ≥40 kg/m2 = obesity class iii (world health organization, 2021).

2.9. statistical analysis
spss version 25 was utilized for conducting the data analysis.

3. results
3.1. participants' demographic and clinical characteristics
patients in this research had a mean age of 58.07 ± 0.309 years, a median age of 57.5 years, and an age range of 39–81 years. regarding educational attainment, the majority of patients (55%) were illiterate, followed by elementary school graduates (28.3%), and only 1.7% were college graduates or postgraduates. in contrast, 90% of patients were married, compared to 1.7% who were single or separated (not living together). regarding the patients' employment, the majority (40%) were housewives, whereas the minority (10%) were retirees. the majority of patients have insufficient monthly income (65%), reside in urban areas (73.3%), do not smoke (85%), and are overweight (48.3%). the majority of patients had t2dm for 10 years or less and took antihyperglycemic therapy orally (98.3%) (table 1).

3.2. changes in attitudes and practices before and after the intervention
table 2 shows some of the differences between the pre- and post-test attitudes toward controlling the disease, the distribution of the mean scores of the pre- and post-test attitudes and practices toward the daily care of patients, and the associated constructs. the table also shows the attitudes and actions that have the most to do with preventing disease. for example, the highest mean score for the total number of possible points in the pre-attitude group was 2.98 (i eat or drink regularly every day), while the lowest was 1.3 (i try to learn how to control my diabetes by going to different diabetes education programs) (table 2a). the highest mean score for the total number of possible points in the post-attitude group was also 2.98. the point with the lowest mean score was 1.77, which stated that herbal medicines have fewer side effects than medical ones (table 2b).
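the group comparisons and correlations reported in the sections that follow (sections 3.3-3.8) use spearman's rank correlation, the mann-whitney u-test, and the kruskal-wallis h-test on these attitude scores; a minimal scipy sketch of those tests is given below. all of the data here are fabricated stand-ins rather than the study's records, and the 0.05 alpha level follows the methodology described earlier.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 60
gender = rng.integers(0, 2, size=n)               # 0 = female, 1 = male (fabricated)
age = rng.integers(39, 82, size=n)                # fabricated ages in the reported 39-81 range
income = rng.integers(0, 3, size=n)               # 0 = sufficient, 1 = barely, 2 = insufficient
pre = rng.normal(2.2, 0.35, size=n).clip(1, 3)    # fabricated pre-test attitude means
post = rng.normal(1.33, 0.13, size=n).clip(1, 3)  # fabricated post-test attitude means

# spearman rank correlation of pre-test attitude with age (cf. table 3)
rho, p_rho = stats.spearmanr(age, pre)

# mann-whitney u-test comparing pre-test attitude by gender (cf. table 4)
u, p_u = stats.mannwhitneyu(pre[gender == 1], pre[gender == 0], alternative="two-sided")

# kruskal-wallis h-test of pre-test attitude across the three income groups (cf. fig. 3)
h_stat, p_h = stats.kruskal(pre[income == 0], pre[income == 1], pre[income == 2])

alpha_level = 0.05
for name, p in [("spearman", p_rho), ("mann-whitney", p_u), ("kruskal-wallis", p_h)]:
    print(f"{name}: p = {p:.3f} ({'significant' if p < alpha_level else 'not significant'})")
```

3.3.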
correlation between attitudes and sociodemographic characteristics table 3 presents a correlation matrix that facilitates the examination of the relationship between attitudes (before table 1: the t2dm patients’ (no.=60) sociodemographic and clinical information variable frequency percent level of education illiterate 33 55.0 primary school graduate 17 28.3 secondary school graduate 7 11.7 institute graduate 2 3.3 collage and post graduate 1 1.7 marital status single 1 1.7 married 54 90.0 widow 2 3.3 divorced 2 3.3 separated (not living together) 1 1.7 occupation government employed 9 15.0 self employed 12 20.0 retired 6 10.0 house wife 24 40.0 jobless 9 15.0 monthly income sufficient 4 6.7 barely sufficient 17 28.3 insufficient 39 65.0 residential area urban 44 73.3 rural 16 26.7 duration of diabetes mellitus ≤10 years 45 75.0 ≤20 years 12 20.0 >20 years 3 5.0 treatment method oral antihyperglycemic agents 59 98.3 insulin 1 1.7 do you smoke? yes 9 15.0 no 51 85.0 how many cigarettes per day? 11–20 1 10 21–30 9 90 body mass index underweight 1 1.7 normal weight 10 16.7 over weight 29 48.3 obesity ι 14 23.3 obesity ii 5 8.3 obesity iii 1 1.7 for how many years have you smoked? 10 2 15.38 15 5 38.46 20 2 15.38 25 3 23.08 40 1 7.69 source of information about disease physician 40 66.7 nurse 13 21.7 books and magazines 1 1.7 television 6 10.0 mohammed and lafi: attitude and glucose control uhd journal of science and technology | jan 2023 | vol 7 | issue 1 85 table 2a: the participants pre‑attitude behaviors evaluation variable always=3 sometime=2 never=1 mean score rank % i visit hospital regularly according to doctor’s appointment for examination or treatment of diabetes. 26 25 9 2.28 2 64 i take meals or refreshment regularly every day. 14 40 6 2.13 8 56.5 i eat as well-balance diet using a list of food exchanges 15 42 3 2.2 5 60 i take foods containing dietary fiber like grain, vegetable and fruit every day. 14 41 5 2.15 7 57.5 i set a limit of taking salt and processed foods. 29 20 11 2.3 1 65 i do a self-blood sugar test according doctors’ recommendations. 14 30 16 1.97 10 48.5 i do a self-blood sugar test more frequently, when i feel symptoms of hypoglycemia such as tremor, pallor, and headache. 14 24 22 1.87 12 43.5 i try to maintain the optimal blood sugar level. 9 34 17 1.87 13 43.5 i control the size of meals or exercise according to a blood sugar level. 6 32 22 1.73 17 36.5 i am carrying food likes sweet drink, candy or chocolate just in case of hypoglycemia. 3 20 37 1.43 25 21.5 i try to maintain optimal weight by measuring my weight regularly. 5 32 23 1.7 18 35 i carry insulin, injection and blood sugar tester whenever i go to trip. 5 10 45 1.33 27 16.5 i try to get information on diabetes control by attending various diabetes educational programs. 4 10 46 1.3 30 15 i take my diabetes medication like insulin injection as prescribe observing dosage and time regularly. 6 14 40 1.43 26 21.5 i keep in touch with my physician. 16 41 3 2.22 4 61 herbal medications have less complications than medical medications 3 27 30 1.55 20 27.5 regular exercise helps me to control diabetes. 5 23 32 1.55 21 27.5 reading handouts on proper footwear is necessary for me. 3 13 44 1.32 29 16 blood pressure control helps me to control my diabetes mellitus. 8 38 14 1.9 11 45 annual eyes examination is necessary for me. 9 34 17 1.87 14 43.5 always i be relaxing and avoid stress and bad mood because its effects diabetes negatively. 10 43 7 2.05 9 52.5 i did not miss doses of my diabetic medication. 
20 31 9 2.18 6 59 i inspect my feet during and after my shower/bath. 14 21 25 1.82 15 41 i use talcum powder to keep my inter-digital spaces dry. 6 20 34 1.53 22 26.5 i check the temperature of water before use. 2 16 42 1.33 28 16.5 i examine my feet daily. 6 17 37 1.48 23 24 i used to check my blood glucose level. 5 27 28 1.62 19 31 i used to check fasting blood glucose and 2 h after meal by glucometer. 6 37 17 1.82 16 41 i take my medication according of physician recommendation. 24 27 9 2.25 3 62.5 i did not wear tide shoes. 8 13 39 1.48 24 24 and after) and sociodemographic characteristics. the matrix displays only those variables that exhibit a statistically significant correlation (p < 0.05) as determined by pearson’s r, with regard to satisfaction levels of the simulation experience. the matrix allows for an assessment of the degree of association between sociodemographic factors and participant satisfaction with the simulation, providing valuable insights into the factors that may impact user experience. the mann–whitney u-test is used in. 3.4. gender differences in attitudes before and after the intervention table 4 to compare the means of attitudes (before and after) by gender. before the test, the mean attitude score for males mohammed and lafi: attitude and glucose control 86 uhd journal of science and technology | jan 2023 | vol 7 | issue 1 table 2b: the participant’s post‑attitude behaviors evaluation variable always=3 sometime=2 never=1 mean score rank % i take foods containing dietary fiber like grain, vegetable and fruit every day. 58 2 0 2.97 4 98.5 i set a limit of taking salt and processed foods. 59 1 0 2.98 2 99 i do a self-blood sugar test according doctors’ recommendations. 55 5 0 2.92 9 96 i do a self-blood sugar test more frequently, when i feel symptoms of hypoglycemia like tremor, pallor and headache. 47 13 0 2.78 13 89 i try to maintain the optimal blood sugar level. 41 19 0 2.68 15 84 i control the size of meals or exercise according to a blood sugar level. 37 23 0 2.62 21 81 i am carrying food likes sweet drink, candy or chocolate just in case of hypoglycemia. 10 46 4 2.1 29 55 i try to maintain optimal weight by measuring my weight regularly. 29 31 0 2.48 23 74 i carry insulin, injection and blood sugar tester whenever i go to trip. 14 43 3 2.18 28 59 i try to get information on diabetes control by attending various diabetes educational programs. 48 12 0 2.8 12 90 i take my diabetes medication like insulin injection as prescribe observing dosage and time regularly. 43 15 2 2.68 16 84 i keep in touch with my physician. 56 4 0 2.93 8 96.5 herbal medications have less complications than medical medications. 12 22 26 1.77 30 38.5 regular exercise helps me to control diabetes. 47 11 2 2.75 14 87.5 reading handouts on proper footwear is necessary for me. 41 18 1 2.67 18 83.5 blood pressure control helps me to control my diabetes mellitus. 55 5 0 2.92 10 96 annual eyes examination is necessary for me. 58 2 0 2.97 5 98.5 always i be relaxing and avoid stress and bad mood because its effects diabetes negatively. 36 24 0 2.6 22 80 i did not miss doses of my diabetic medication. 38 22 0 2.63 19 81.5 i inspect my feet during and after my shower/bath. 28 32 0 2.47 24 73.5 i use talcum powder to keep my inter-digital spaces dry. 17 43 0 2.28 27 64 i check the temperature of water before use. 28 32 0 2.47 25 73.5 i examine my feet daily. 19 40 1 2.3 26 65 i used to check my blood glucose level. 
41 19 0 2.68 17 84 i used to check fasting blood glucose and 2 h after meal by glucometer. 38 22 0 2.63 20 81.5 i take my medication according of physician recommendation. 52 8 0 2.87 11 93.5 i did not wear tide shoes. 58 2 0 2.97 6 98.5 fig. 1. compare means of attitude (pre and post) by level of education using kruskal–wallis h-test. was 2.219 and for females it was 2.201. after the test, the mean attitude score for males was 1.353 and for females it was 1.308. neither score changed significantly from the pre-test. the results of the post-test showed that there wasn’t a big difference between men and women using the kruskal–wallis h-tes. 3.5. impact of education level on attitudes before and after the intervention fig. 1 shows the study compared the mean attitudes of participants before and after the intervention with respect to their level of education. the results indicated that there was no statistically significant difference between the mean pre-test and post-test attitudes of the participants. mohammed and lafi: attitude and glucose control uhd journal of science and technology | jan 2023 | vol 7 | issue 1 87 table 3: correlation matrix of attitude (pre and post) with the socio‑demographic data variable no. pre-attitude post-attitude p-value spearman rank correlation no. p-value spearman rank correlation age 60 0.003 0.378** 60 0.716 −0.048 level of education 60 0.039 −0.268* 61 0.598 0.069 family member has diabetes mellitus 60 0.598 −0.069 62 0.009 0.334** monthly income 60 0.002 0.384** 63 0.753 0.041 duration of diabetes mellitus 60 0.712 0.049 64 0.795 0.034 body mass index (bmi) 60 0.587 −0.072 65 0.739 −0.044 how many cigars per day? 10 0.415 0.291 66 0.107 0.541 for how many years have you smoked? 13 0.601 0.16 67 0.424 0.243 how long ago did you quit smoking? 5 0.581 0.335 68 0.581 0.335 *significant at 0.001 level, **significant at 0.05 level. table 4: compare means of attitude (pre and post) by gender using mann‑whitney u‑test attitude gender no. mean sd mann-whitney u p-value mean pre male 34 2.219 0.356 424.5 0.794 female 26 2.201 0.357 mean post male 34 1.353 0.143 370.0 0.280 female 26 1.308 0.112 these findings sug gest that the inter vention did not have a significant impact on the attitudes of participants toward the research topic, regardless of their level of education. 3.6. impact of income level on attitudes before and after the intervention fig. 2 contrasts the perspectives of patients who participated in the pattern-making training before and after which training. it indicates that each group’s post-test results for each attitude component differ significantly. 3.7. statistical analysis of income levels before and after intervention in fig. 3, to assess whether there was a statistically significant difference in the mean rank of income levels, the kruskal– wallis test was employed. the results indicated a highly significant difference between the pre-test and post-test means for each income level, with a p < 0.05. specifically, the pre-test mean values ranged from 2.305 to 1.63, while the post-test means ranged from 1.339 to 0.316. the highest pre-test mean was observed as 2.305, while the lowest was 1.63. these findings suggest that the intervention had a fig. 2. compare means of attitude (pre and post) by occupation using kruskal–wallis h-test. fig. 3. compare means of attitude (pre and post) by monthly income using kruskal–wallis h-test. 
mohammed and lafi: attitude and glucose control 88 uhd journal of science and technology | jan 2023 | vol 7 | issue 1 significant impact on the attitudes of participants towards the research topic across all income levels, and support the need for continued efforts to improve attitudes and perceptions. however, to increase the p-value further, it may be necessary to adjust the alpha level or consider a larger sample size. 3.8. impact of residential area on attitudes before and after the intervention in fig. 4, mann–whitney test was conducted to compare the preand post-test attitude averages of participants by their residential area. the analysis revealed no significant differences between the preand post-test attitudes. the mean attitude score before the intervention was 0.451, while the mean score after the intervention was 0.608. 3.9. relationship between source of knowledge and attitudes before and after the intervention fig. 5 shows the analysis shows that there is no correlation between the mean attitudes (preand post-intervention) and the source of knowledge about the illness. the mean score for the pre-test was 0.529, and for the posttest, it was 0.704, with p = 0.05. the average attitude score before the intervention was 0.529, while after the intervention, it was 0.704. the findings indicate that there is no significant relationship between the attitudes of participants before and after receiving information about the disease (p > 0.05). 4. discussion the median age of t2dm patients in this research was 57.50 years. consequently, the research population consisted of adults and the elderly. after 50 years of age, the prevalence of (dm) grows progressively, according to the findings of various research conducted in various nations [20]. the majority of our participants were also male. however, national and regional investigations have revealed no substantial gender disparity in the frequency of dm [21]. therefore, the fact that the majority of our patients were males might be attributed to the fact that men had easier access to hospitals and clinics and more flexible work hours than women. before the initiation of the education program, the mean attitude score was found to be similar between the case and control groups. however, after the program was implemented, significant differences were observed between the two groups on multiple attitude-related questions. these findings are consistent with earlier research that employed alternative intervention methodologies, as reported by [23]. in addition, substantial patients completed diabetes selfmanagement education in the current research (dsme). consequently, they were more likely to comply with the recommended diabetic care standards and their pharmaceutical treatment regimens. this result is consistent with earlier research demonstrating that the hba1c levels of patients fell considerably following diabetes education program treatments [16]. in addition, another study comparing the opinions of 252 health professionals and 279 individuals with diabetes revealed major disparities in their perspectives. both groups agreed on the severity of t2dm, the necessity for strict glycemic control, and the psychosocial effect of the condition, but they disagreed on the importance of patient autonomy. this study found no significant differences in the severity of the illness between t1dm and t2dm individuals. in addition, people with diabetes who had previously received diabetes education had elevated rates of disease [17]. fig. 5. 
compare means of attitude (pre and post) by source of your information about disease using kruskal–wallis h-test. fig. 4. compare means of attitude (pre and post) by residential area using kruskal-wallis h-test. mohammed and lafi: attitude and glucose control uhd journal of science and technology | jan 2023 | vol 7 | issue 1 89 there is no significant correlation between gender attitude and program intervention, according to the current study. in contrast, the majority of married couples displayed a level of self-care that fell somewhere in the center. according to iranian study, married participants in diabetes selfmanagement programs had a more optimistic outlook than their single counterparts [18]. for instance, a separate study found that t2dm patients with stronger marital connections and mutual support have better self-care attitudes and self-management [19]. in addition, there was a substantial correlation between self-care and social support in iran cross-sectional research [23]. we also demonstrated a substantial relationship between education level and attitude. patients with higher levels of education, for example, demonstrated more positive attitudes when engaging in diabetic self-management programs. this was especially the case when the programs were implemented. moreover, illiterates were shown to have a much poorer level of self-care. in addition, it was shown that those with greater levels of education engage in more beneficial kinds of self-care than those with lower levels of education. one study including 125 individuals of diverse racial origins with t2dm found a substantial favorable connection between education and diabetes management [20]. at the outset of the study, the diabetes education program (dep) evaluated the perspectives of patients with type 2 diabetes on various aspects of the disease, including its physiopathological and nutritional components, treatments, physical activity, patient education, self-monitoring, chronic complications, special situations, and family support. the initial phase of the dep’s development involved assessing the patients’ attitude needs toward their illness, followed by an evaluation of their attitudes following the program’s implementation. this approach is consistent with the previous studies that have recommended preand post-intervention data collection to accurately evaluate the effectiveness of diabetes education programs [24]. the study results suggest that there was a significant change in diabetic patients’ perceptions of their illness after participating in the investigation. however, it is difficult to make a conclusive statement about the direct impact of this newfound knowledge on the patients’ behavior and lifestyle. while the study revealed that the dep had a positive effect on the patients’ attitudes and behavioral abilities, it was found that the improvements in diet-related attitudes were less significant than those observed for general diabetes knowledge. these findings provide concrete support for the notion that patient education programs can have a positive impact on patients’ perceptions of their illness and their ability to manage it effectively. however, further research is needed to determine the specific factors that contribute to behavior change in diabetic patients. changing the attitudes of diabetes patients is impacted by a number of factors, including their knowledge of their illness, risk factors, and treatment alternatives. 
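the comparisons reported in section 3 rely on standard non-parametric tests: the mann–whitney u-test for two groups (e.g., gender), the kruskal–wallis h-test for three or more groups (e.g., income or occupation levels), and the spearman rank correlation for socio-demographic variables. as a minimal illustration of how such pre/post attitude comparisons can be computed — using hypothetical scores generated here, not the study data — the following python sketch applies these tests with scipy:

```python
# Illustrative only: hypothetical attitude scores, not the study data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical pre- and post-intervention attitude scores for 60 patients
pre = rng.normal(2.2, 0.35, 60)
post = rng.normal(1.33, 0.13, 60)
gender = rng.choice(["male", "female"], 60)          # two groups
income = rng.choice(["low", "middle", "high"], 60)   # three groups

# Mann-Whitney U: compare post-test attitudes between the two gender groups
u, p_gender = stats.mannwhitneyu(post[gender == "male"], post[gender == "female"])
print(f"Mann-Whitney U = {u:.1f}, p = {p_gender:.3f}")

# Kruskal-Wallis H: compare post-test attitudes across the three income groups
h, p_income = stats.kruskal(*(post[income == g] for g in ("low", "middle", "high")))
print(f"Kruskal-Wallis H = {h:.2f}, p = {p_income:.3f}")

# Spearman rank correlation between age and pre-test attitude
age = rng.integers(30, 75, 60)
rho, p_rho = stats.spearmanr(age, pre)
print(f"Spearman rho = {rho:.3f}, p = {p_rho:.3f}")
```

because attitude scores are ordinal and the group sizes are modest, rank-based tests of this kind are a common choice over t-tests or anova.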
the study investigated the efficiency of group education and determined that it successfully improved and altered attitudes toward selfmonitoring of capillary glucose. this was discovered by comparing the attitudes of participants prior to and following the instructional program [25]. this study found a significant association between patient attitude and glycemic control, with patients with more optimistic attitudes exhibiting better glycemic control. this result was supported by a substantial body of literature and contemporary research conducted in several cultural settings [26]. consequently, according to the american diabetes association (ada), individuals diagnosed with type 2 diabetes mellitus (t2dm) should possess a positive attitude towards their condition to effectively manage the illness and mitigate potential complications. inadequate glycemic control was associated with a low self-care score, whereas better disease management was associated with a higher self-care score [28]. in the present study, health literacy is shown to significantly influence glycemic control, while higher education levels are associated with favorable health behaviors in patients. literature supports the notion that health education and literacy can considerably influence illness outcomes, disease management, and prevention of complications. moreover, patients with higher education levels exhibit more optimistic attitudes compared to those with the lower education levels [27]. moreover, existing research has established a correlation between health literacy and the mitigation of diabetes complications through the adoption of a positive mindset (mukanoheli et al., 2020). furthermore, we observed a notable enhancement in the educational intervention concerning the identification of hypoglycemic symptoms, as inpatient care has been shown to yield better results, a finding that was reinforced during the program’s implementation. mohammed and lafi: attitude and glucose control 90 uhd journal of science and technology | jan 2023 | vol 7 | issue 1 5. conclusion the willingness of patients with type 2 diabetes to adopt a positive attitude and participate in positive behaviors is a significant factor in the effective control of their blood glucose levels in patients with type 2 diabetes. in addition to receiving medical treatment, patients have the additional responsibility of prioritizing healthy daily routines and habits. these may include monitoring their blood glucose levels, making alterations to their food, engaging in physical activity, and caring for their feet. these healthy practices have the potential to have a major influence on the patients’ general health and to enhance their capacity to keep their blood glucose levels under control. 6. acknowledgments the scientific working group of the university of raparin deserves special gratitude for their assistance. the participants and employees at the center for diabetes and chronic disease control in the chamchamal district of sulaimaniyah, iraq, also deserve our sincere gratitude. 7. conflict of interest there is no conflict of interest in this study. references [1] r. m. anderson and m. m. funnell. “patient empowerment: myths and misconceptions”. patient education and counseling, vol. 79, no. 3, pp. 277-282, 2010. [2] m. p. fransen, c. von wagner and m. l. essink-bot. “diabetes selfmanagement in patients with low health literacy: ordering findings from literature in a health literacy framework”. patient education and counseling, vol. 88, no. 1, pp. 
44-53, 2012. [3] a. g. brega, a. ang, w. vega, l. jiang, j. beals, c. m. mitchell, k. moore, s. m. manson, k. j. acton, y. roubideaux and special diabetes program for indians healthy heart demonstration project. “mechanisms underlying the relationship between health literacy and glycemic control in american indians and alaska natives”. patient education and counseling, vol. 88, no. 1, pp. 6168, 2012. [4] g. danaei, m. m. finucane, y. lu, g. m. singh, m. j. cowan, c. j. paciorek, j. k. lin, f. farzadfar, y. h. khang, g. a. stevens, m. rao, m. k. ali, l. m. riley, c. a. robinson and m. ezzati. “national, regional, and global trends in fasting plasma glucose and diabetes prevalence since 1980: systematic analysis of health examination surveys and epidemiological studies with 370 country-years and 2.7 million participants”. lancet, vol. 378, no. 9785, pp. 31-40, 2011. [5] a. bener, e. j. kim, f. mutlu, a. eliyan, h. delghan, e. nofal, l. shalabi and n. wadi. “burden of diabetes mellitus attributable to demographic levels in qatar: an emerging public health problem”. diabetes and metabolic syndrome, vol. 8, no. 4, pp. 216-220, 2014. [6] world health organization. “diabetes”. available from: https://www. who.int/news-room/fact-sheets/detail/diabetes [last accessed on 2023 feb 28]. [7] l. adam, c. o'connor and a. c. garcia, “evaluating the impact of diabetes self-management education methods on knowledge, attitudes and behaviors of adult patients with type 2 diabetes mellitus,” canadian journal of diabetes, vol. 42, no. 5, pp. 470477, 2018. [8] r. e. soccio, r. m. adams, m. j. romanowski, e. sehayek, s. k. burley and j. l. breslow. “the cholesterol-regulated stard4 gene encodes a star-related lipid transfer protein with two closely related homologues, stard5 and stard6”. proceedings of the national academy of sciences u s a, vol. 99, no. 10, pp. 69436948, 2002. [9] a. van puffelen, m. kasteleyn, l. de vries, m. rijken, m. heijmans, g. nijpels, f. schellevis and diacourse study group. “self-care of patients with type 2 diabetes mellitus over the course of illness: implications for tailoring support”. journal diabetes and metabolic disorders, vol. 19, no. 1, pp. 81-89, 2020. [10] l. mulala. “diabetes self-care behaviors and social support among african americans in san francisco”. doctoral dissertation, university of san francisco; 2017. available from: https://www. proquestdissertationspublishing [last accessed on 2023 mar 02]. [11] v. mogre, a. natalie, t. flora, h. alix and p. christine. “barriers to self-care and their association with poor adherence to selfcare behaviours in people with type 2 diabetes in ghana: a cross sectional study”. obesity medicine, vol. 18, p. 100222, 2020. [12] m. a. powers, j. bardsley, m. cypress, p. duker, m. m. funnell, a. h. fischl, m. d. maryniuk, l. siminerio and e. vivian. “diabetes self-management education and support in type 2 diabetes: a joint position statement of the american diabetes association, the american association of diabetes educators, and the academy of nutrition and dietetics”. diabetes education, vol. 43, no. 1, pp. 4053, 2017. [13] american diabetes association. “classification and diagnosis of diabetes: standards of medical care in diabetes-2020. diabetes care, vol. 43, no. suppl 1, pp. s14-s31, 2020. [14] f. moosaie, f. d. firouzabadi, k. abouhamzeh, s. esteghamati, a. meysamie, s. rabizadeh, m. nakhjavani and. a. esteghamati. 
“lp(a) and apo-lipoproteins as predictors for micro-and macrovascular complications of diabetes: a case-cohort study”. nutrition metabolism and cardiovascular diseases, vol. 30, no. 10, pp. 1723-1731, 2020. [15] m. baghianimoghadam and a. ardekani. “the effect of educational intervention on quality of life of diabetic patients type 2, referee to diabetic research centre of yazd”. the horizon of medical sciences, vol. 13, no. 4, pp. 21-28, 2008. [16] a. steinsbekk, l. ø. rygg, m. lisulo, m. b. rise and a. fretheim. “group based diabetes self-management education compared to routine treatment for people with type 2 diabetes mellitus. a systematic review with meta-analysis”. bmc health services research, vol. 12, no. 1, pp. 213, 2012. [17] j. j. gagliardino, c. gonzález and j. e. caporale. “the diabetesrelated attitudes of health care professionals and persons with diabetes in argentina”. revista panamericana de salud pública, vol. 22, no. 5, pp. 304-307, 2007. [18] m. reisi, h. fazeli, m. mahmoodi and h. javadzadeh. “application of the social cognitive theory to predict self-care behavior among type 2 diabetes patients with limited health literacy”. journal of mohammed and lafi: attitude and glucose control uhd journal of science and technology | jan 2023 | vol 7 | issue 1 91 health literacy, vol. 6, no. 2, pp. 21-32, 2021. [19] j. s. wooldridge and k. w. ranby. “influence of relationship partners on self-efficacy for self-management behaviors among adults with type 2 diabetes”. diabetes spectrum, vol. 32, no. 1, pp. 6-15, 2019. [20] s. s. bains and l. e. egede. “associations between health literacy, diabetes knowledge, self-care behaviors, and glycemic control in a low income population with type 2 diabetes”. diabetes technology and therapeutics, vol. 13, no. 3, pp. 335-341, 2011. [21] m. reisi, h. fazeli, m. mahmoodi and h. javadzadeh. “application of the social cognitive theory to predict self-care behavior among type 2 diabetes patients with limited health literacy”. journal of health literacy, vol. 6, no. 2, pp. 21-32, 2021. [22] j. s. wooldridge and k. w. ranby. “influence of relationship partners on self-efficacy for self-management behaviors among adults with type 2 diabetes”. diabetes spectrum, vol. 32, no. 1, pp. 6-15, 2019. [23] s. s. bains and l. e. egede. “associations between health literacy, diabetes knowledge, self-care behaviors, and glycemic control in a low income population with type 2 diabetes”. diabetes technology and therapeutics, vol. 13, no. 3, pp. 335-341, 2011. [24] k. mulcahy, m. maryniuk, m. peeples, m. peyrot, d. tomky, t. weaver and p. yarborough. “diabetes self-management education core outcomes measures”. diabetes educator, vol. 29, no. 5, pp. 768-803, 2003. [25] m. l. zanetti, l. m. otero, m. v. biaggi, m. a. santos, d. s. péres and f. p. de mattos guimarães. “satisfaction of diabetes patients under follow-up in a diabetes education program”. revista latino americana de enfermagem, vol. 15, pp. 583-589, 2007. [26] c. a. bukhsh, t. m. khan, m. s. nawaz, h. s. ahmed, k. g. chan, l. h. lee and b. h. goh. “association of diabetes-related self-care activities with glycemic control of patients with type 2 diabetes in pakistan”. patient preference and adherence, vol. 12, pp. 23772386, 2018. [27] c. y. osborn, s. s. bains and l. e. egede. “health literacy, diabetes self-care, and glycemic control in adults with type 2 diabetes”. diabetes technology and therapeutics, vol. 12, no. 11, pp. 913919, 2010. [28] k. hawthorne, y. robles and r. cannings-john. 
“glycemic control and self-care behaviors in hispanic patients with type 2 diabetes: a pilot intervention study”. journal of transcultural nursing, vol. 23, no. 3, pp. 289-296, 2012.

1. introduction
patients suffering from heart diseases need continuous healthcare, especially ecg monitoring and recognition, to avoid the danger of heart failure [1]. reducing heart attacks depends on the fast identification of abnormal cardiac rhythms [2]. ecg is an effective diagnostic technique which is widely used by cardiologists [3]. ecg signals are electrical signals of the heart recorded by electrodes fixed on the patient's body [4]. ecg signals provide useful information about the rhythm and the operation of the heart. heart beats extracted from ecg signals can be categorized into classes such as normal, atrial premature, and ventricular escape beats [5]. electrocardiograms are recorded by electrocardiographs, devices that are very important for healthcare [6]. these devices record the electrical signal picked up by electrodes attached to certain parts of the patient's body [7]. the signals recorded by the electrocardiograph at any moment are the sum of all the signals passing through cells throughout the heart [8]. an electrocardiogram consists of 12 leads, which represent 12 electrical views of the heart [9]. the first six leads are the frontal plane leads: i, ii, iii, v_r, v_l, and v_f. leads i, ii, and iii are the standard leads and are found by [10]:
i = v_l − v_r (1)
ii = v_f − v_r (2)
iii = v_f − v_l (3)
the other six leads are at the front of the heart: v_1, v_2, v_3, v_4, v_5, and v_6; these are recorded by six electrodes placed on the chest of the patient [11].

ecg signal recognition based on lookup table and neural networks
muzhir shaban al-ani
university of human development, college of science and technology, department of information technology, sulaymaniyah, krg, iraq

abstract
electrocardiograph (ecg) signals are a very important part of diagnosing heart diseases in healthcare. the implemented ecg signal recognition system consists of hardware devices, a software algorithm, and a network connection. an ecg is a non-invasive way to help diagnose many common heart problems. a health-care provider can use an ecg to recognize irregular heartbeats, blocked or narrowed arteries in the heart, whether you have ever had a heart attack, and the quality of certain heart disease treatments. the main part of the software algorithm includes the recognition of ecg signal parameters such as the p-qrs-t waves. since the voltages at which handheld ecg equipment operate are shrinking, signal processing has become an important challenge. the implemented ecg signal recognition approach is based on both lookup table and neural network techniques. in this approach, the extracted ecg features are compared with the stored features to recognize the heart disease corresponding to the received ecg features. the introduction of neural network technology added new benefits to the system by implementing the learning and training process.
index terms: electrocardiograph signals, p-qrs, healthcare, heart diseases.
corresponding author's e-mail: muzhir.al-ani@uhd.edu.iq
received: 22-10-2022 accepted: 08-01-2023 published: 21-01-2023
access this article online doi: 10.21928/uhdjst.v7n1y2023.pp22-31 e-issn: 2521-4217 p-issn: 2521-4209
copyright © 2023 muzhir shaban al-ani.
this is an open access article distributed under the creative commons attribution noncommercial no derivatives license 4.0 (cc by-nc-nd 4.0) o r i g i n a l re se a rc h a rt i c l e uhd journal of science and technology al-ani: ecg signal recognition uhd journal of science and technology | jan 2023 | vol 7 | issue 1 23 electrocardiograms are concentrated on all issues associated with diseases of heart attack patients that used directly with clinical [12]. recently, a reliable and automatic analysis and segmentation of ecg signals are required for health-care environments [13]. computer based methods are suitable for processing and analyzing of ecg signals [14]. artificial neural network techniques are used for analyzing different types of signals and tasks related to heart diseases [15]. most of these tasks are associated to the detection of irregular heartbeats and irregular in recording process [16]. a back propagation neural network may apply in training stage to give a powerful pattern recognition algorithm [17]. electrocardiograph (ecg) signals are very important part in diagnosis healthcare the heart diseases. the implemented ecg signals recognition system consists hardware devices, software algorithm and network connection. an ecg is a non-invasive way to help diagnose many common heart problems. a health-care provider can use an ecg to recognize irregular heartbeats, blocked or narrowed arteries in the heart, whether you have ever had a heart attack, and the quality of certain heart disease treatments. the main part of the software algorithm including the recognition of ecg signals parameters such as p-qrst. since the voltages at which handheld ecg equipment operate are shrinking, signal processing has become an important challenge. the implemented ecg signal recognition approach based on both lookup table and neural networks techniques. in this approach, the extracted ecg features are compared with the stored features to recognize the heart diseases of the received ecg features. 2. ecg signals heart diseases are the well-known disease that affects humans worldwide [18]. yearly millions of people die or suffered from heart attacks [19]. early detection and treatment of heart diseases can prevent such events [20]. this would improve the quality of life and slow the events of heart failure [21]. the main benefit of the diagnosis is to record the ecg of the patient [22]. an ecg record is a non-invasive diagnostic tool used for the assessment of a patient heart condition [23]. the extraction of ecg features and combined that with the heart rate, these can lead to a fairly accurate and fast diagnosis [24]. bioelectrical signals represent human different organs electrical activities and ecg signals are the important signals among bioelectrical signals that represent heart electrical activity [25]. deviation or distortion in any part of ecg that is called arrhythmia can illustrate a specific heart disease [26]. the investigation of the ecg has been extensively used for diagnosing many cardiac diseases [27]. the ecg is a realistic record of the direction and magnitude of the electrical commotion that is generated by depolarization and repolarization of the atria and ventricles [28]. one cardiac cycle in an ecg signal consists of the p-qrs-t waves as shown in fig. 1 [29,30]. the majority of the clinically useful information in the ecg is originated in the intervals and amplitudes defined by its features (characteristic wave peaks and time durations) [31,32]. 
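before these intervals and amplitudes can be measured, the 12 leads themselves are formed from the electrode potentials; equations (1)-(3) in the introduction define the three standard limb leads as simple differences of v_r, v_l, and v_f. a minimal numeric sketch of that arithmetic (with made-up electrode samples, not recorded data) is shown below:

```python
import numpy as np

def limb_leads(v_r: np.ndarray, v_l: np.ndarray, v_f: np.ndarray):
    """Standard limb leads from electrode potentials, per equations (1)-(3):
    I = V_L - V_R, II = V_F - V_R, III = V_F - V_L."""
    lead_i = v_l - v_r
    lead_ii = v_f - v_r
    lead_iii = v_f - v_l
    return lead_i, lead_ii, lead_iii

# Hypothetical electrode potentials (millivolts) over a short window
t = np.linspace(0, 1, 5)
v_r, v_l, v_f = 0.1 * np.sin(t), 0.2 * np.sin(t), 0.3 * np.sin(t)
i, ii, iii = limb_leads(v_r, v_l, v_f)

# Einthoven's law follows directly from the definitions: II = I + III
assert np.allclose(ii, i + iii)
```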
ecg is essentially responsible for patient monitoring and diagnosis [33]. normal rhythm produces four entities; p wave, qrs complex, t wave, and u wave in which each have a fairly unique pattern as shown in fig. 1 [34,35]: • p-wave represents the movement of an electric wave from the sino atrial (sa) node and causes depolarization of the left and right atria. • p-r segment represents the pause in electrical activity caused by a delay in conduction of electrical current in the atrioventricular (av) node to allow blood to flow from the atria to the ventricles before ventricular contraction happen. • qrs complex represents the electrical activity from the beginning of the q wave to the end of the s wave and the complete depolarization of the ventricles, resulting to ventricular contraction and ejection of blood into the aorta and pulmonary arteries. • s-t segment represents the pause in electrical activity after complete depolarization of the ventricles to allow blood to flow out of the ventricles before ventricular relaxation begins and the heart to fill the next contraction. • wave t represents the repolarization of the ventricles. fig. 1. representation of electrocardiograph signals. al-ani: ecg signal recognition 24 uhd journal of science and technology | jan 2023 | vol 7 | issue 1 • wave u represents the repolarization of the papillary muscle. 3. literature reviews mohammed et al., presented an ecg compression algorithm based on the optimal selection of wavelet filters and threshold levels in different sub bands that allow a maximum reduction of the volume of data guaranteeing the quality of the reconstruction. the proposed algorithm begins by segmenting the ecg signal into frames; where each image is decomposed into sub bands m by optimized wavelet filters. the resulting wavelet coefficients are limited and those having absolute values below the thresholds specified in all sub bands are eliminated and the remaining coefficients are properly encoded with a modified version of the run coding scheme [36]. reza et al., proposed compressed detection procedure and the collaboration detection matrix approach that used to provide a robust ultra-light energy focus for normal and abnormal ecg signals. the simulation results based on two proposed algorithms illustrate a 15% increase in signal to noise ratio and a good quality level for the degree of inconsistency between random and scatter matrices. the results of the simulation also confirmed that the toeplitz binary matrix offered the best snr performance and compression with the highest energy efficiency for the random array detection [37]. ann and andrés implemented an approach to classify multivariate ecg signals as a function of analyzing discriminant and wavelets. they used variants of multiscale wavelets and wave correlations to distinguish multivariate ecg signal models based on the variability of the individual components of each ecg signal and the relationships between each pair of these components. using the results from other ecg classification studies in the literature as references that demonstrated this approach to 12-lead ecg signals from a particular database compares favorably [38]. vafaie, et al., presented a new classification method to classify ecg signals more precisely based on the dynamic model of the ecg signal. the proposed method is constructed a diffuse classifier and its simulation results indicate that this classifier can separate the ecg with an accuracy of 93.34%. 
to further improve the performance of this classifier, the genetic algorithm is applied when the accuracy of the prediction increases to 98.67%. this method increased the precision of the ecg classification for a more accurate detection of the arrhythmia [39]. kamal and nader, realized a practical means to synthesize and filter of ecg signal in the presence of four types of interference signals: first, from electrical networks with a fundamental frequency of 50 hz, second, those resulting from breathing, with a frequency range 0.05–0.5 hz, third musical signals with a frequency of 25 hz and fourth white noise presented in the ecg signal band. this was accomplished by implementing a multiband digital filter (seven bands) of the finite impulse response multiband least square type using a programmable digital apparatus, which was placed on an education and development board [40]. farideh et al., explored combined discriminative ability of ecg/r signals in automatic staging. basically, this approach classified that the wakefulness of slow wave sleep and rem sleep was classified using a vector support machine fed with a set of functions extracted from characteristics of 34 features and characteristics of 45 features. first part has produced a reasonable discriminatory capacity, while the second part has considerably improved the rating and the best results were obtained using third approach. we then improved the support vector machine classifier with the recursive feature elimination method. the results of the classification were improved with 35 of the 45 features [41]. shirin and behbood classified a patient’s ecg cardiac beats into five types of cardiac beats as recommended by aami using an artificial neural network. this approach used block based on the neural network as a classifier. this approach created from a set of two dimensional blocks that are connected to each other. the internal structure of each block depends on the number of incoming and outgoing signals. the overall construction of the network was determined by the movement of signals through the network blocks. the network structure and weights are optimized using the particle swarm optimization approach [42]. prakash and shashwati, proposed an approach that attempts to reduce unwanted signals using a minorization-maximization method to optimize total signal variation. the unsuccessful signal is then segmented using the bottom-up approach. the obtained results show a significant improvement in the signal-to-noise ratio and the successful segmentation of the ecg signal sections. the extension of the heel depends on the smoothing parameter of lamda. as this approach was implemented for complete signal, then only 18 db of signal to noise ratio was achieved [43]. aleksandar and marjan, focused on a new algorithm for the digital filtering of an electrocardiogram signal received by stationary and al-ani: ecg signal recognition uhd journal of science and technology | jan 2023 | vol 7 | issue 1 25 non-stationary sensors. the basic idea of digital processing of the electrocardiogram signal is to extract the heartbeat frequencies that are normal in the range between 50 and 200 beats/min. the frequency of the extracted heart rate is irregular if the rate increases or decreases and serves as evidence for the diagnosis of a complex physiological state. the environment can generate a lot of noise, including the supply of electrical energy, breathing, physical movements, and muscles [44]. 
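as a rough illustration of the band-limiting step described by [40] and [44], the sketch below applies a butterworth band-pass filter spanning the 50-200 beats/min fundamental (about 0.83-3.3 hz) to a synthetic trace. it is meant only to show the arithmetic of isolating the heart-rate band and suppressing baseline wander and mains interference, not to reproduce either paper's filter design:

```python
import numpy as np
from scipy.signal import butter, filtfilt

def heart_rate_band(ecg: np.ndarray, fs: float,
                    low_bpm: float = 50.0, high_bpm: float = 200.0) -> np.ndarray:
    """Band-pass an ECG to the heart-rate fundamental (50-200 beats/min ~ 0.83-3.3 Hz),
    attenuating baseline wander below the band and mains/EMG components above it."""
    low, high = low_bpm / 60.0, high_bpm / 60.0          # convert beats/min to Hz
    b, a = butter(4, [low, high], btype="band", fs=fs)
    return filtfilt(b, a, ecg)                           # zero-phase filtering

# Hypothetical 5-second recording sampled at 360 Hz (the MIT-BIH sampling rate)
fs = 360.0
t = np.arange(0, 5, 1 / fs)
ecg = np.sin(2 * np.pi * 1.2 * t) + 0.3 * np.sin(2 * np.pi * 50 * t)  # ~72 bpm + mains
clean = heart_rate_band(ecg, fs)
```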
kumar et al., proposed an automated diagnosis of coronary artery disease using electrocardiogram signals. flexible analytical wavelet transform technology is used to break down electrocardiogram effects. the cross information potential parameter is calculated from the actual values of the flexible analytical wavelet transform decomposition detail coefficients. for diagnosis of coronary artery disease subjects, the mean value of the cross information potential parameter is higher in the comparison toner subjects. the statistical test is applied to check the discrimination capacity of the extracted functionalities. in addition, the functionality is fed to the least squares support vector machine for sorting. the classification accuracy is calculated at each decomposition level from the first decomposition level [45]. al-ani, explained that ecg waveform is an important process for determining the function of the heart, so it is useful to know the types of heart disease. the ecg chart gives a lot of information that is converted into an electrical signal containing the basic values in terms of amplitude and duration. the main problem that arises in this measurement is the confusion between normal and abnormal layout, in addition to certain cases where the p-qrs-t waveform overlaps. the purpose of this research is to provide an effective approach to measure all parts of the p-qrs-t waveform to give the right decision for heart function. the proposed approach depends on the classifier operation that based mainly on the features extracted from electrocardiograph waveform that achieved from exact baseline detection [46]. nallikuzhy and dandapat, explored an efficient technique to improve a low resolution ecg by merging fragmented coding and the learning model of the common dictionary. an enhance model is applied on low resolution ecg using previously learned model in order to obtain a high resolution full estimate of 12-lead ecg. this approach was applied based on the dictionary in which the common dictionary contains high and low resolution dictionaries regarding to the high and low resolution ecg and is learned simultaneously. similar fragmented representation for high and low resolution ecgs was generated using joint dictionary learning. mapping between the scattered coefficients of the high and low resolution ecgs was also learned [47]. han and shi presented an efficient method of detection and localization of myocardial infarction that combines a multilead residual neural network structure (ml-resnet) with three residual blocks and a function fused by 12-lead ecg recordings. a single network of characteristic branches was formed to automatically learn representative characteristics of different levels between different layers, which exploit the local characteristics of the ecg to characterize the representation of spatial information. then, all the main features are merged as global features. to evaluate the generalization of the proposed method and clinical utility, two schemes are used that include the intra-patient scheme and the inter-patient scheme. the obtained results indicated a high performance of accuracy and sensitivity [48]. abdulla and al-ani, implemented a review study classification for ecg signal. this work aimed to investigate and review the use of classification methods that have been used recently, such as the artificial neural network, the convolutional neural network, discrete wavelets transform, support vector machine and k-nearest neighbor. 
effective comparisons are presented in the result in terms of classification methods, feature extraction technique, data set, contribution, and some other aspects. the result also shows that convolutional neural network has been used more widely for ecg classification as it can achieve higher accuracy compared to other approaches [49]. abdulla and al-ani, explained an automatic ecg classification system which is difficult to detect, especially in manual analysis. an accurate classification and monitoring ecg system was proposed using the implementation of convolutional neural networks and long-short term memory. learned features are captured from the cnn model and passed to the lstm model. the output of the cnn-lstm model demonstrated superior performance compared to several of the more advanced ones cited in the results section. the proposed models are evaluated on the mit-bih arrhythmia and ptb diagnostics datasets. a high accuracy rate of 98.66% in the classification of myocardial infarction was obtained [50]. 4. methodology the methodology of this approach is divided into three parts: ecg signals recognition approach, ecg feature extraction al-ani: ecg signal recognition 26 uhd journal of science and technology | jan 2023 | vol 7 | issue 1 fig. 2. electrocardiograph recognition approach. fig. 3. electrocardiograph feature extraction. fig. 4. design of neural network architecture. and neural network architecture. in addition, the used data are images selected with different heart diseases. the main objective of introducing forward propagation neural networks in this work is to determine the main component values of the ecg signal, which are seven values (qrs complex, qt interval, qtcb wave, pr interval, p wave, rr interval and pp interval) and to compare that with the table that carries the standard values. depend on this comparison, it is possible to make an accurate decision about whether the ecg signal is normal or abnormal. 4.1. ecg signals recognition approach the ecg signals recognition approach is implemented through the following stages (fig. 2): • feature extraction stage in which ecg signals parameters (amplitude and time interval) will be extracted from the electrodes. • ecg recognition stage in which the extracted parameters of ecg are applied through neural network that specified the diseases associated with these parameters. • lookup table stage in which the constructed lookup table is associated with the list of specified heart diseases. • decision making stage in which take the decision of which type of heart diseases are related. 4.2. ecg feature extraction tracing of ecg signal on the special recognition is very important to extract the values of the direct parameters. the main advantage of ecg feature extraction operation is to generate a small set of features that achieve the ecg signal. ecg feature extraction operation is implemented through many steps as shown in fig. 3. the first step is preprocessing in which ecg graph will be cleaned and resized. the second step focusing on thinning filter in which the ecg signal will be better quality, in addition this step will eliminate the scattering pixels around the original signal. the third step concentrates on edge detection that detects the original ecg signal. in addition, this indicates the duration and amplitude of each ecg signal part. fourth step is to calculate the required parameters of ecg signal that related on duration and amplitude of each part of ecg signal. 4.3. 
neural network architecture
a proper neural network architecture is used so that the system is efficient and works over a wide range of conditions. it is necessary to choose the network parameters such that the obtained ecg system is acceptable in both theoretical and practical settings. furthermore, this neural network is an application-oriented system, and the design is done with the selection of the network architecture. in this case, the following parameters are selected (fig. 4):
• choice of initial weights and biases: the choice of initial weights influences how quickly the system converges. the values of the initial weights must not be too large or too small, to avoid an out-of-region condition. the weights and biases of the ecg network in the learning phase are initialized randomly between −0.5 and 0.5.
• choice of activation function: the ecg network uses the sigmoid function, which has a simple derivative and a nonlinear property. the sigmoid output range lies between zero and one.
• choice of learning and momentum rates: for a low learning rate, the neural network adjusts its weights gradually, but convergence may be slow, while for a high learning rate the neural network makes big changes that are not desirable in a trained network. the network consists of 5 nodes in the input layer, 80 nodes in the hidden layer, and 4 nodes in the output layer.
• choice of the number of hidden nodes: the number of hidden nodes in the hidden layer is varied from 5 nodes to 125 nodes, while keeping the learning rate and momentum rate constant at nominal values (learning rate = 0.7 and momentum rate = 0.9).
the backpropagation neural network algorithm is used for the ecg system to achieve a balance between correct responses to the trained patterns and good responses to new input patterns. the forward propagation algorithm starts with the presentation of an input pattern to the input layer of the network and continues as activation level calculations propagate forward through the hidden layers. every processing unit (in each successive layer) sums its inputs and applies the sigmoid function to compute its output. then the output layer units produce the output of the network. suppose the total input s_j to unit j is a linear function of the states a_i of the units in the previous layer that are connected to unit j through the weights w_ji, plus the threshold θ_j of unit j:
s_j = Σ_i a_i w_ji + θ_j (4)
the state y_j of a unit is a sigmoid function of its total input s_j:
y_j = f(s_j) = 1 / (1 + e^(−s_j)) (5)
the resulting value becomes the activation level of neuron j. once the set of outputs for a layer is found, it serves as the input to the next layer. this process is repeated layer by layer until the final set of network outputs is produced. the backward propagation algorithm is driven by error values, which are calculated for all processing units, and the weight changes are calculated for all interconnections. the calculations begin at the output layer and progress backward through the network to the input layer. the error value is simple to compute for the output layer and somewhat more complicated for the hidden layers. if unit j is in the output layer, then its error value is given by:
δ_j = (t_j − a_j) f′(s_j) (6)
fig. 5. the relation between learning rate and mean square error. fig. 6. the relation between learning rate and number of iteration. fig. 7.
the relation between momentum and mean square error. where: • tj is the target value for unit j. • f′ (sj) is the derivative of the sigmoid function f. • aj is the output value for unit j. • sj is the weighted sum of inputs to j. al-ani: ecg signal recognition 28 uhd journal of science and technology | jan 2023 | vol 7 | issue 1 fig. 8. the relation between momentum and number of iteration. fig. 9. the relation between network size and network capacity. fig. 11. normal electrocardiograph signal. fig. 12. final diagnosis of electrocardiograph signals (normal case). fig. 13. sinus tachycardia electrocardiograph signal. 5. results and discussion fig. 5 gives the relation between learning rate and mean square error (mse). the learning rate value is laying between zero and one and the commonly used range is in between 0.25 and 0.75. at this active rang of learning rate, the calculated mse is so small and laying in the range 0.1 and 3.5. fig. 6 shows the relation between learning rate and the number of iteration. at this figure it is clear that when the learning rate is equal to 0.5, then the number of iteration is 1000 and still saturated at this number of iteration as learning rate increases. fig. 7 shows the relation between momentum rate and mse. the momentum rate value is laying between zero and one and the commonly used range is around 0.9. at this figure mse still zero up to the momentum rate value is equal to 0.8 at which mse is about 1.8 and then saturated at this value. fig. 8 fig. 10. the relation between network size and generalization. shows the relation between momentum rate and number of iteration. at the momentum rate value of 0.8 the number of iteration is reached to 1000 and still saturated at this value. al-ani: ecg signal recognition uhd journal of science and technology | jan 2023 | vol 7 | issue 1 29 fig. 9 demonstrates the relation between network size and network capacity. at this figure it is clear that there is a linear relation between network size and network capacity. as the network size increases up to 140,000 it is clear that the network capacity increases up to 35000. fig. 10 shows the relation between network size and generalization. at this figure it is clear that the maximum generalization is obtained at starting point of the network size. then the generalization decreases when the network size increases. on the other hand, the zero generalization is obtained at the network size equal to 1000. fig. 11 presents the normal case of ecg signal. fig. 12 shows a normal patient having sinus normal rhythm in which the measurement ecg parameters are: qrs: 189.00 ms, qt: 292.66 ms, qtcb: 66.90 ms, pr: 137.10 ms, p: 88.85 ms, rr: 764.46 ms, and pp: 775.50 ms, this case indicated that the patient diagnosis is normal. fig. 13 deals with the sinus tachycardia case of ecg signal. fig. 14 shows another patient having sinus tachycardia rhythm in which the measurement ecg parameters are: qrs: 0.0 ms, qt: 0.0 ms, qtcb: 0.0 ms, pr: 102.5 ms, p: 66.5 ms, rr: 500 ms, and pp: 500 ms, this case indicated that the patient diagnosis is sinus tachycardia. 6. conclusions the diagnosis of heart diseases depends largely on ecg, in addition to other devices that give special properties and parameters that leading to great importance in the field of healthcare. measurements of ecg signals lead to the identification of problems experienced by people with heart disease. 
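the backpropagation relations in equations (4)-(6) and the 5-80-4 layer sizes chosen in section 4.3 can be made concrete with a short sketch. all weight values and the input vector below are random placeholders rather than trained parameters from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))                # equation (5)

def forward(a, w, theta):
    """Equation (4): s_j = sum_i a_i * w_ji + theta_j, then y_j = f(s_j)."""
    s = w @ a + theta
    return sigmoid(s), s

def output_delta(t, a_out, s_out):
    """Equation (6): delta_j = (t_j - a_j) * f'(s_j), with f'(s) = f(s)(1 - f(s))."""
    f_prime = sigmoid(s_out) * (1.0 - sigmoid(s_out))
    return (t - a_out) * f_prime

# 5 inputs -> 80 hidden -> 4 outputs, weights initialised in [-0.5, 0.5] as in section 4.3
w1, b1 = rng.uniform(-0.5, 0.5, (80, 5)), rng.uniform(-0.5, 0.5, 80)
w2, b2 = rng.uniform(-0.5, 0.5, (4, 80)), rng.uniform(-0.5, 0.5, 4)

x = rng.random(5)                                  # one ECG feature vector (placeholder)
h, _ = forward(x, w1, b1)                          # hidden layer activations
y, s2 = forward(h, w2, b2)                         # network output
delta_out = output_delta(np.array([1, 0, 0, 0]), y, s2)   # one-hot target class
```

the hidden-layer error terms and the momentum-weighted weight updates discussed in the results would build on the same quantities, propagating delta_out backward through w2.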
real-time ecg diagnosis has several advantages as it is important in sharing private information in healthcare systems especially for heart diseases. the implemented approach accompanies the properties of features extracted from the lookup table and properties of neural networks. the feature extraction step verifies the features from the received ecg and neural networks give good responses to the new input patterns. the applied approach gives accurate detection of ecg signals as well as good quality of recognized ecg signals. references [1] t. gandhi, b. k. panigrahi, m. bhatia and s. anand. (2010) “expert model for detection of epileptic activity in eeg signature”. expert systems with applications, vol. 37, pp. 3513-3520, 2010. [2] s. sanei and j. a. chambers. “eeg signal processing”. john wiley & sons ltd., chichester, 2013. [3] k. polat and s. günes. “classification of epileptic form eeg using a hybrid system based on decision treeclassifier and fast fourier transform”. applied mathematics and computation, vol. 187, pp. 1017-1026, 2007. [4] g. ouyang, x. li, c. dang and d. a. richards. “using recurrence plot for determinism analysis of eeg recordings in genetic absence epilepsy rats”. clinical neurophysiology, vol. 119, pp. 1747-1755, 2008. [5] m. ahmadlou, h. adeli and a. adeli. “new diagnostic eeg markers of the alzheimer’s disease using visibility graph”. journal of neural transmission, vol. 117, no. 9, pp. 1099-1109, 2010. [6] n. kannathal, u. r. acharya, c. m. lim, q. weiming, m. hidayat and p. k. sadasivan. “characterization of eeg: a comparative study”. computer methods and programs in biomedicine, vol. 80, no. 1, pp. 17-23, 2005. [7] n. w. willingenburg, a. daffertshofer, i. kingma and j. h. van dieen. “removing ecg contamination from emg recordings: a comparison of ica-based and other filtering procedures”. journal of electromyography and kinesiology, vol. 22, no. 3, pp. 485:493, 2010. [8] c. marque, c. bisch, r. dantas, s. elayoubi, v. brosse and c. perot. “adaptive filtering for ecg rejection from surface emg recordings”. journal of electromyography and kinesiology, vol. 15, no. 3, pp. 310-315, 2005. [9] s. abbaspour, m. linden and h. gholamhosseini. “ecg artifact removal from surface emg signal using an automated method based on wavelet-ica”. studies in health technology and informatics, vol. 211(phealth), pp. 91-97, 2015. [10] a. l. hoff. “a simple method to remove ecg artifacts from trunk muscle emg signals”. journal of electromyography and kinesiology, vol. 19, no. 6, pp. 554-555, 2009. [11] p. e. mcsharry, g. clifford, l. tarassenko and l. a. smith. “a dynamical model for generating synthetic electrocardiogram signals”. ieee transactions on biomedical engineering, vol. 50, no. 3, pp. 289-294, 2003. [12] m. s. al-ani and a. a. rawi. “ecg beat diagnosis approach for ecg printout based on expert system”. international journal of emerging technology and advanced engineering, vol. 3, no. 4, pp. 797-807, 2013. fig. 14. final diagnosis of electrocardiograph signals (sinus tachycardia case). al-ani: ecg signal recognition 30 uhd journal of science and technology | jan 2023 | vol 7 | issue 1 computers in cardiology, vol. 2000, pp. 379-382, 2000. [29] g. vijaya, v. kumar and h. k. verma. “ann-based qrs-complex analysis of ecg”. journal of medical engineering and technology, vol. 22, pp. 160-167, 1998. [30] m. ayat, m. b. shamsollahi, b. mozaffari and s. kharabian. “ecg denoising using modulus maxima of wavelet transform”. 
in: proceedings of the 31st annual international conference of the ieee engineering in medicine and biology society: engineering the future of biomedicine embc, pp. 416-419, 2009. [31] f. chiarugi, v. sakkalis, d. emmanouilidou, t. krontiris, m. varanini and i. tollis. “adaptive threshold qrs detector with best channel selection based on a noise rating system”. computers in cardiology, vol. 2007, pp. 157-160, 2007. [32] m. elgendi. “fast qrs detection with an optimized knowledgebased method: evaluation on 11 standard ecg databases”. plos one, vol. 8, p. e73557, 2013. [33] a. rehman, m. mustafa, i. israr and m. yaqoob. “survey of wearable sensors with comparative study of noise reduction ecg filters”. international journal of computing and network technology, vol. 1, pp. 61-81, 2013. [34] m. elgendi, b. eskofier and d. abbott. “fast t wave detection calibrated by clinical knowledge with annotation of p and t waves”. sensors (basel), vol. 15, pp. 17693-17714, 2015. [35] m. rahimpour, b. m. asl. “p wave detection in ecg signals using an extended kalman filter: an evaluation in different arrhythmia contexts”. physiological measurement, vol. 37, pp. 1089-1104, 2016. [36] a. z. mohammed, a. f. al-ajlouni, m. a. sabah and r. j. schilling. “a new algorithm for the compression of ecg signals based on mother wavelet parameterization and best-threshold levels selection”. digital signal processing, vol. 23, pp. 1002-1011, 2013. [37] b. m. reza, r. kaamran and k. sridhar. “robust ultra-lowpower algorithm for normal and abnormal ecg signals based on compressed sensing theory”. procedia computer science, vol. 19, pp. 206-213, 2013. [38] m. e. ann and m. a. andrés. “discriminant analysis of multivariate time series: application to diagnosis based on ecg signals”. computational statistics and data analysis, vol. 70, pp. 67-87, 2014. [39] m. h. vafaie, m. ataei and h. r. koofigar. “heart diseases prediction based on ecg signals’ classification using a genetic fuzzy system and dynamical model of ecg signals”. biomedical signal processing and control, vol. 14, pp. 291-296, 2014. [40] a. kamal and a. nader. “design and implementation of a multiband digital filter using fpga to extract the ecg signal in the presence of different interference signals”. computers in biology and medicine, vol. 62, pp. 1-13, 2015. [41] e. farideh, s. seyed-kamaledin and n. homer. “automatic sleep staging by simultaneous analysis of ecg and respiratory signals in long epochs”. biomedical signal processing and control, vol. 18, pp. 69-79, 2015. [42] s. h. shirin and m. behbood. “a new personalized ecg signal classification algorithm using block-based neural network and particle swarm optimization”. biomedical signal processing and control, vol. 25, pp. 12-23, 2016. [43] y. om prakash and r. shashwati. “smoothening and segmentation of ecg signals using total variation denoising, minimization, majorization and bottom-up approach”. procedia computer science, vol. 85, pp. 483-489, 2016. [44] m. aleksandar and g. marjan. “improve d pipeline d wavelet [13] m. s. al-ani and a. a. rawi. “rule-based expert system for automated ecg diagnosis. international journal of advances in engineering and technology, vol. 6, no. 4, pp. 1480-1493, 2013. [14] j. e. madias, r. bazaz, h. agarwal, m. win and l. medepalli. “anasarca-mediated attenuation of the amplitude of electrocardiogram complexes: a description of a heretofore unrecognized phenomenon”. journal of the american college of cardiology, vol. 38, no. 3, pp. 756-764, 2001. [15] u. r. acharya, v. 
k. sudarshan, h. adeli, j. santhosh, j. e. w. koh, s. d. puthankatti and a. adeli a. “a novel depression diagnosis index using nonlinear features in eeg signals”. european neurology, vol. 74, no. 79-83, 2015. [16] k. n. khan, k. m. goode, j. g. f. cleland, a. s. rigby, n. freemantle, j. eastaugh, a. l. clark, r. de silva, m. j. calvert, k. swedberg, m. komajda, v. mareev, f. follath and euroheart failure survey investigators. “prevalence of ecg abnormalities in an international survey of patients with suspected or confirmed heart failure at death or discharge. european journal of heart failure, vol. 9, pp. 491-501, 2007. [17] k. y. k. liao, c. c. chiu and s. j. yeh. “a novel approach for classification of congestive heart failure using relatively shortterm ecg waveforms and svm classifier. in: proceedings of the international multi-conference of engineers and computer scientists, imecs march 2015, hong kong, pp. 47-50, 2015. [18] r. j. martis, u. r. acharya and c. m. lim. “ecg beat classification using pca, lda, ica and discrete wavelet transform”. biomedical signal processing and control, vol. 8, no. 5, pp. 437-448, 2013. [19] u. orhan. “real-time chf detection from ecg signals using a novel discretization method”. computers in biology and medicine, vol. 43, pp. 1556-1562, 2013. [20] j. pan and w. j. tompkins. “a real time qrs detection algorithm”. ieee transactions on biomedical engineering, vol. 32, no. 3, 1985. [21] m. sadaka, a. aboelela, s. arab and m. nawar. electrocardiogram as prognostic and diagnostic parameter in follow up of patients with heart failure. alexandria journal of medicine, vol. 49, pp. 145152, 2013. [22] k. senen, h. turhan, a. r. erbay, n. basar, a. s. yasar, o. sahin and e. yetkin. “p wave duration and p wave dispersion in patients with dilated cardiomyopathy”. european journal of heart failure, vol. 6, pp. 567-569, 2004. [23] r. a. thuraisingham. “a classification system to detect congestive heart failure using second-order difference plot of rr intervals”. cardiology research and practice, vol. 2009, p. id807379, 2009. [24] e. d. ubeyli. “feature extraction for analysis of ecg signals”. in: annual international conference of the ieee engineering in medicine and biology society, milano, italy, pp. 1080-1083, 2008. [25] r. rodríguez, a. mexicano, j. bila, s. cervantes and r. ponce. “feature extraction of electrocardiogram signals by applying adaptive threshold and principal component analysis”. journal of applied research and technology, vol. 13, pp. 261-269, 2015. [26] h. gothwal, s. kedawat and r. kumar. “cardiac arrhythmias detection in an ecg beat signal using fast fourier transform and artificial neural network”. journal of biomedical science and engineering, vol. 4, pp. 289-296, 2011. [27] s. a. chouakri, f. bereksi-reguig, a. t. ahmed. “qrs complex detection based on multi wavelet packet decomposition”. applied mathematics and computation, vol. 217, pp. 9508-9525, 2011. [28] d. s. benitez, p. a. gaydecki, a. zaidi and a. p. fitzpatrick. “a new qrs detection algorithm based on the hilbert transform”. al-ani: ecg signal recognition uhd journal of science and technology | jan 2023 | vol 7 | issue 1 31 implementation for filtering ecg signals”. pattern recognition letters, vol. 95, pp. 85-90, 2017. [45] m. kumar, r. b. pachori and u. r. acharya. “characterization of coronary artery disease using flexible analytic wavelet transform applied on ecg signals”. biomedical signal processing and control, vol. 31, pp. 301-308, 2017. [46] m. s. al-ani. 
“electrocardiogram waveform classification based on p-qrs-t wave recognition.” uhd journal of science and technology, vol. 2, no. 2, pp. 7-14, 2018. [47] j. j. nallikuzhy and s. dandapat. “spatial enhancement of ecg using multiple joint dictionary learning”. biomedical signal processing and control, vol. 54, p. 101598, 2019. [48] c. han and l. shi. “ml–resnet: a novel network to detect and locate myocardial infarction using 12 leads ecg.” computer methods and programs in biomedicine, vol. 185, p. 105138, 2020. [49] l. a. abdulla and m. s. al-ani. “a review study for electrocardiogram signal classification”. uhd journal of science and technology (uhdjst), vol. 4, no. 1, 2020. [50] l. a. abdullah and m. s. al-ani. “cnn-lstm based model for ecg arrhythmias and myocardial infarction classification”. advances in science technology and engineering systems journal, vol. 5, no. 5, pp. 601-606, 2020. . uhd journal of science and technology | jan 2020 | vol 4 | issue 1 71 1. introduction internet of things (iot) is a long-term stream that we are currently at its earliest stage. we can consider three primary phases to achieve the first phase of iot. in the first phase, things can be identified for us and others and gradually assign a specific address on the network for themselves. in this phase, each object keeps certain information in it, but these are people who need to take out this information using tools like their smartphones [1], [2], [3]. in the second phase, each device has the ability to send information to the user at a specified time. after completing the relationship between objects and humans, it is time to relate things to each other. in the third phase, objects are associated with each other without human interference. completing these three phases will finish the first phase of iot evolution [4], [5]. at the end of the first phase, there is a world of ideas in front of developers. the problem is that each device has some information that is available on the network by other objects and its owner and developers can use their own creativity to make better use of this information; telecommunication networks communicate with each other based on technologies, spectra, and different frequency band. this technology in recent years has been more widely considered with the advent of iot technology/internet of a review of properties and functions of narrowband internet of things and its security requirements zana azeez kakarash1,2, farhad mardukhi2 1department of information technology, university of human development, sulaymaniyah, iraq, 2department of computer engineering and information technology, faculty of engineering, razi university, kermanshah, iran re v i e w a r t i c l e a b s t r a c t internet of things (iot) is a new web sample based on the fact that there are many things and entities other than humans that can connect to the internet. this fact means that machines or things can automatically be interconnected without the need for interacting with humans and thus become the most important entities that create internet data. in this article, we first examine the challenges of iot. then, we introduce features of nb-iot through browsing current international studies on narrowband iot (nb-iot) technology, in which we focus on basic theories and key technologies, such as the connection number analysis theory, the theory of delay analysis, the coating increase mechanism, low energy consumption technology, and the connection of the relationship between signaling and data. 
then, we compare some functions of nb-iot and other wireless telecommunication technologies in terms of latency, security, availability, and data transfer speed, energy consumption, spectral efficiency, and coverage area. finally, we review and summarize nb-iot security requirements that should be solved immediately. these topics are provided to overview nb-iot which can offer a complete familiarity with this area. index terms: internet of things, narrow band, internet of things, narrowband internet of things corresponding author’s e-mail: zana azeez kakarash, department of information technology, university of human development, sulaymaniyah, iraq. e-mail: zana.azeez@uhd.edu.iq received: 07-03-2019 accepted: 20-03-2020 published: 22-03-2020 access this article online doi: 10.21928/uhdjst.v4n1y2020.pp71-80 e-issn: 2521-4217 p-issn: 2521-4209 copyright © 2020 kakarash and mardukhi. this is an open access article distributed under the creative commons attribution non-commercial no derivatives license 4.0 (cc by-nc-nd 4.0) uhd journal of science and technology zana azeez kakarash and farhad mardukhi: properties and securities of nb-iot 72 uhd journal of science and technology | jan 2020 | vol 4 | issue 1 everything and the expansion of devices and communication networks with specific requirements [6], [7], [8]. narrowband iot (nb-iot) is a low power radio network (low consumption) in a wide range (low power wide area network [lpwan]), which is designed and developed to allow the connection of a large number of devices or services using cellular telecommunication band (cellular network) [9], [10]. the nb of iot focuses on network coverage in a closed space, less cost, and more battery life and has the ability to connect a large number of connected devices. the nb technology of iot can be found in the spectrum in-band of the long-term evolution (lte) network or the fourth generation in the frequency blocks of a fourth-generation operator or unused blocks (guard band) of a fourthgeneration operator. it can also be used alone for the deployment of a specific range. it is also appropriate for new combinations (re-farming of [global system for mobile (gsm) communication] spectrum) [11], [12]. the nb was first introduced and developed by sig fax (2009). this company faced the 3rd generation partnership project (3gpp) institute, which defines cellular/mobile telecommunication standards with three challenges which have the ability to answer with a nb. the challenge is that there is a vibrant market for devices that: 1. do not have a lot of abilities 2. they want to be very cheap 3. they have a low power consumption 4. require high range (cover). it can be said that the nb of iot can exist in the following three conditions: • completely independent network • in unused bands of 200 khz, which previously used in gsm networks • second and third generations of mobile/communications • at fourth-generation stations that can assign a block (frequency) to nb of the iot or can be placed in (guard band) [13], [14], 15]. finally, it can be said that the establishment of a nb of a network of iot depends on the geographic conditions of the country and region as well as facilities and conditions of telecommunication and mobile operators of these countries. 
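the three deployment conditions listed above differ mainly in where the single nb-iot carrier is placed. the carrier occupies roughly 180 khz (one lte physical resource block), which is why a freed 200 khz gsm channel, an lte guard band, or one in-band resource block can each host it. the helper below is only an illustrative fit check under those standard bandwidth figures, not operator planning logic:

```python
NB_IOT_CARRIER_KHZ = 180.0   # one LTE physical resource block
GSM_CHANNEL_KHZ = 200.0      # width of a re-farmed GSM channel (standalone mode)

def fits(available_khz: float) -> bool:
    """Can a single NB-IoT carrier be placed in the given slice of spectrum?"""
    return available_khz >= NB_IOT_CARRIER_KHZ

# Standalone: a freed 200 kHz GSM channel comfortably hosts one carrier
print("standalone (200 kHz GSM channel):", fits(GSM_CHANNEL_KHZ))

# Guard band: an operator estimating ~250 kHz of usable LTE guard band (hypothetical figure)
print("guard band (~250 kHz usable):", fits(250.0))

# In-band: one of the host LTE carrier's resource blocks is reassigned to NB-IoT
print("in-band (one 180 kHz PRB):", fits(180.0))
```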
for example, in the united states, the operators verizon and at&t can use lte-m1 because both companies have invested in their fourth-generation networks; therefore, they probably do not want to create an independent network, and they want to have a network based on their current fourth-generation network [13], [14]. in contrast, in areas of the world that have a wider gsm network than the fourth-generation network, it is rational to use an independent nb-iot network. for example, t-mobile in the united states and sprint eventually turned their attention toward the deployment of an nb-iot network on the frequency spectrum of the gsm network [13], [14], [15]. this paper recommends nb-iot applicable models for application in many places to solve many problems (smart white goods, smart coordination, smart power metering, and smart road lighting) and provides a comprehensive overview of the design changes brought in the nb-iot standardization along with the detailed research advancements from the viewpoints of security requirements and the practical presentation of nb-iot as far as achievable throughput. the rest of the paper is organized as follows. section 2 describes some background concepts relevant to our review. section 3 describes the challenges of iot. significant features of nb-iot are described in section 4. section 5 reviews the key technologies of nb-iot, and section 6 compares nb-iot with different wireless communication technologies. section 7 describes the basic requirements for nb-iot security, and section 8 concludes the paper.
2. background
2.1. brief review of nb-iot
nb-iot is a standards-based low-power wide-area (lpwa) technology developed to enable a wide range of new iot devices and services. nb-iot significantly improves the power consumption of user devices, system capacity, and spectrum efficiency, especially in deep coverage. a battery life of more than 10 years can be supported for a wide range of use cases. new physical layer signals and channels are designed to meet the demanding requirements of extended coverage – rural and deep indoor – and ultra-low device complexity. the initial cost of nb-iot modules is expected to be comparable to gsm/general packet radio services (gprs). the underlying technology is, however, much simpler than today's gsm/gprs, and its cost is expected to decrease rapidly as demand increases. supported by all major mobile equipment, chipset, and module makers, nb-iot can coexist with 2g (second-generation), 3g (third-generation), and 4g (fourth-generation) mobile networks. it likewise benefits from all the security and privacy features of mobile networks, for example, support for user identity confidentiality, entity authentication, confidentiality, data integrity, and mobile equipment identification.
2.2. benefits and constraints of nb-iot
the main properties of nb-iot technology, as defined in rel-13 3gpp tr 45.820 [10], are given in table 1.
table 1: nb-iot main properties [42]
• range: <35 km
• battery life: >10 years
• frequency bands: lte bands
• bandwidth: 200 khz or shared
• modulation: dl: ofdma with 15 khz subcarrier spacing; ul: single-tone transmissions – 3.75 and 15 khz, multi-tone sc-fdma with 15 khz subcarrier spacing
• max throughput: <56 kbps ul, <26 kbps dl
• link budget: 164 db
• capacity: +50 k iot devices per sector
(nb-iot: narrowband internet of things, ofdma: orthogonal frequency division multiple access, lte: long-term evolution, sc-fdma: single carrier frequency division multiple access)
we survey the basic advantages and consequent restrictions regarding the inherent capabilities of the nb-iot technology in order to investigate end-device operation and its integration with the iot application [11], [12], [13], [14].
since the planned low cost of the nb-iot module introduces no constraints and only brings benefits compared with other lpwa network solutions, it is not discussed further.
2.2.1. wide coverage and deep signal penetration
this feature opens up a new application class of indoor and underground applications, which includes data acquisition and control of equipment located in manholes, basements, pipelines, and other environments in which the existing communication infrastructure is unavailable. despite the improvement in signal penetration, the devices are expected to operate at the lower limits of signal reception. hence, support for the transport of reliable data should be provided as a part of the connectivity solution.
2.2.2. low power consumption of nb-iot modules
the possibility of battery-powered designs or potentially energy harvesting for end-device deployments, which results in long-life stand-alone operation, is considered the immediate advantage of the low-power property. since devices are required to work for a long time, reconfigurability is a desirable capability, which highlights the requirement for sporadic but reliable two-way communication. the two-way communication requirement is likewise reflected by 3gpp in their traffic model.
2.2.3. massive connectivity
the latent capability of the nb-iot supporting infrastructure is massive connectivity, amounting to up to 50 k devices per cell, depending on the coverage mode and the traffic mix the devices are using. since a huge number of devices are intended to be integrated into distributed applications, unbounded remote-service response time is expected, which is considered one of the common issues in increasingly large-scale integrations. the communication metrics which are affected include the persistence of data communication, schemes for automatic repeat request and flow control, and guaranteed reliability (quality of service) [38], [39], [40], [41].
3. challenges of the iot
in the iot, we face a world in which manufacturers supply their goods with their own standards, and it is not clear, if this variety continues across the billions of devices that make up the iot, where the future of networks will lead. we examine two challenges of iot in this section. one of them is standard conflicts, and the other one is the security that puts the future of iot in disorderly conditions [16], [17].
3.1. lack of standard unit
the iot of today is a different world. when the internet standards were created, they were controlled by people whose true desire was to formulate global standards. standards are equally accessible to everyone, but the internet of today is in the control of companies that each want to use these standards to defeat competitors and benefit from them. furthermore, the internet is in the hands of governments that basically want to supervise everything. how do governments and companies in this situation want to agree on global standards? in the iot, standards mean everything. each device must announce to other devices what it wants to do. without these standards, they cannot do any of this. add to this the challenge that the equipment connected to the iot is very different and varied. many companies and organizations try to set standards, and the allseen alliance, the industrial internet consortium, the ipso alliance, and the open interconnect consortium are among the main institutions. in the iot landscape, there are as yet no points on which all of them agree for a series of global standards [15], [16], [17], [18].
3.2. security
a recent discovery of a bug called bash or shellshock uncovered a serious security issue in the iot. the bug allows hackers to run their own code on unix and linux operating systems, as shown in fig. 1. the bug was announced by the national institute of standards and technology as a high-level security threat. the seriousness of the threat comes from the fact that hackers do not need to have prior knowledge of the attacked system before they add their code through the bash bug. the bug does not affect the iot only, but all devices connected to it are at risk of being attacked. devices that are attacked by the bug remain undetected and vulnerable. this discovered threat suggests that there might be many unaddressed security issues, which is good news for hackers and internet criminals and raises questions about the effectiveness and usability of iot in the future.
fig. 1. how the function of code bash is vulnerable in the environment.
another aspect of iot that contributes to security issues is its complexity, which makes it hard to identify security gaps. these gaps have been recognized by researchers, who have concluded that the connected world has many hidden risks that require intensive research to find suitable solutions [18], [19], [20]. many devices through various channels can connect to the iot, and as yet no mechanism has been put forward to alert device users to security threats and the way they can prevent attacks from bash-like bugs.
4. significant features of nb-iot
nb-iot is a rapidly developing 3gpp cellular radio connectivity standard introduced in release 13 that addresses the lpwan requirements of the iot. it is developing rapidly as the top-level driving technology in lpwan to enable a wide range of new iot devices, including smart parking, utilities, wearables, and industrial facilities. the main features of nb-iot are shown in fig. 2 and briefly described below.
fig. 2. main features of narrowband internet of things [42].
4.1. low energy consumption
using power saving mode (psm) and extended discontinuous reception (e-drx), longer standby times can be achieved in nb-iot. in this context, psm technology was added in rel-12; in psm the terminal is still registered online but is not reachable by signaling, so energy is saved by putting the terminal into a deep sleep for a longer time [20], [21].
4.2. improved coverage and low latency sensitivity
based on the simulation data of tr 45.820, it can be confirmed that the coverage capability of nb-iot can reach the target in standalone deployment mode. simulation experiments for both in-band deployment and guard-band deployment have also been completed. to improve the coverage, techniques such as repeated retransmission and low-order modulation were adopted by nb-iot. at present, nb-iot support for 16 quadrature amplitude modulation is still under discussion. at a coupling loss of 164 db, if reliable data transfer is to be provided, latency increases because of the retransmission of bulk data [13], [14], [15], [16], [17], [18].
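the 164 db link budget in table 1 and the repetition mechanism of section 4.2 can be related by a short back-of-the-envelope calculation. the following python sketch is illustrative only: the transmit power, base-station noise figure, and required sinr are assumed values chosen for the example, not figures taken from this article or from tr 45.820.

import math

# minimal uplink link-budget sketch; all numeric inputs below are assumptions.

def sensitivity_dbm(bandwidth_hz, noise_figure_db, required_sinr_db):
    """receiver sensitivity from the thermal noise floor (-174 dbm/hz)."""
    noise_floor = -174 + 10 * math.log10(bandwidth_hz) + noise_figure_db
    return noise_floor + required_sinr_db

def repetition_gain_db(repetitions):
    """ideal combining gain from repeating the same transmission n times."""
    return 10 * math.log10(repetitions)

def mcl_db(tx_power_dbm, rx_sensitivity_dbm):
    """maximum coupling loss = transmit power minus receiver sensitivity."""
    return tx_power_dbm - rx_sensitivity_dbm

if __name__ == "__main__":
    # assumed: 15 khz single-tone uplink, 5 db noise figure, -4.4 db target
    # sinr after combining, and a 23 dbm device transmit power.
    base_sensitivity = sensitivity_dbm(15e3, noise_figure_db=5, required_sinr_db=-4.4)
    for reps in (1, 2, 16, 128):
        effective = base_sensitivity - repetition_gain_db(reps)
        print(f"{reps:4d} repetitions -> mcl {mcl_db(23, effective):5.1f} db")

under these assumed numbers, each doubling of the repetition count buys roughly 3 db of additional coupling loss, which is the basic trade-off behind the latency increase noted at the end of section 4.2.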
4.3. transition mode
as shown in table 2, nb-iot development is based on lte. the adaptation is mainly based on lte-related technologies, modified to account for the unique nb-iot features. the radio-frequency bandwidth of the nb-iot physical layer is 200 khz.
table 2: main nb-iot technical characteristics
• physical layer, uplink: bpsk or qpsk modulation; sc-fdma single carrier with a subcarrier interval of 3.75 khz or 15 khz and a transmission rate of 160 kbit/s – 200 kbit/s; multi-carrier with a subcarrier interval of 15 khz and a transmission rate of 160 kbit/s – 250 kbit/s
• physical layer, downlink: qpsk modulation; ofdma with a subcarrier interval of 15 khz and a transmission rate of 160 kbit/s – 250 kbit/s
• upper layer: lte-based protocol
• core network: based on the s1 interface
(bpsk: binary phase-shift keying, nb-iot: narrowband internet of things, qpsk: quadrature phase-shift keying, lte: long-term evolution, ofdma: orthogonal frequency division multiple access, sc-fdma: single carrier frequency division multiple access)
in the downlink, nb-iot adopts quadrature phase-shift keying (qpsk) modulation and orthogonal frequency-division multiple access with a subcarrier spacing of 15 khz. in the uplink, binary phase-shift keying or qpsk modulation and single-carrier frequency-division multiple access, including both single-subcarrier and multi-subcarrier transmissions, are adopted. single-subcarrier transmission with subcarrier spacings of 3.75 khz and 15 khz is appropriate for iot terminals with ultra-low data rates and ultra-low power consumption. the protocol of the nb-iot higher layers (the layers above the physical layer) is formed through modification of a few lte features, for example, multi-connection, low power consumption, and small data volumes. the core network of nb-iot is connected through the s1 interface [16], [17], [18], [19], [20], [21], [22].
4.4. spectrum resources
iot is a core service that attracts a larger user group in the communication services market of the future. hence, nb-iot development is supported by the four major telecom operators in china, which own the relevant nb-iot spectrum resources, as shown in table 3.
table 3: spectrum classification for nb-iot by telecom operators (uplink frequency band / downlink frequency band / bandwidth, in mhz)
• china unicom: 909–915 / 954–960 / 6; and 1745–1765 / 1840–1860 / 20
• china telecom: 825–840 / 870–885 / 15
• china mobile: 890–900 / 934–944 / 10; and 1725–1735 / 1820–1830 / 10
• sarft: 700 / 700 / undistributed
(nb-iot: narrowband internet of things)
4.5. deployment supported by nb-iot
according to rp-151621, nb-iot currently supports only frequency-division duplex (fdd) transfer mode, with a carrier bandwidth of 180 khz and three types of deployment mode, shown in fig. 3:
• independent deployment (standalone mode), which utilizes a free frequency band that has no overlap with the lte frequency band
• guard band deployment (guard-band mode), which uses the edge-band frequency of an lte carrier
• in-band deployment (in-band mode), which uses an lte frequency band for deployment and takes one physical resource block from the lte frequency band for deployment [22], [23].
fig. 3. three deployments supported by narrowband internet of things.
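the three deployment options of section 4.5 can be summarized in a few lines of code. the sketch below is a hypothetical helper written for this review: the 180 khz carrier (one lte physical resource block of 12 × 15 khz subcarriers) and the 200 khz standalone channel come from the text above, while the function and its names are assumptions made for illustration.

from enum import Enum

# illustrative summary of the nb-iot deployment options; not a 3gpp api.

class DeploymentMode(Enum):
    STANDALONE = "standalone"   # dedicated (typically re-farmed gsm) channel
    GUARD_BAND = "guard band"   # unused guard band of an lte carrier
    IN_BAND = "in-band"         # one prb taken from the lte carrier itself

NB_IOT_CARRIER_KHZ = 180        # one lte prb: 12 subcarriers x 15 khz

def spectrum_set_aside_khz(mode: DeploymentMode) -> int:
    """spectrum an operator must reserve for a single nb-iot carrier."""
    if mode is DeploymentMode.STANDALONE:
        return 200              # a full 200 khz (ex-gsm) channel
    return NB_IOT_CARRIER_KHZ   # guard-band / in-band reuse lte spectrum

for mode in DeploymentMode:
    print(f"{mode.value:>10}: carrier {NB_IOT_CARRIER_KHZ} khz, "
          f"reserved {spectrum_set_aside_khz(mode)} khz")

the point is simply that only the standalone option needs a dedicated channel; the guard-band and in-band options live inside spectrum the lte carrier already occupies.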
4.6. structure and framework
in the downlink, the nb-iot enodeb supports the e-utran radio frame structure type 1 (fs1), as shown in fig. 4. the uplink supports fs1 for a subcarrier spacing of 15 khz; however, for a subcarrier spacing of 3.75 khz, a new type of frame structure is defined, shown in fig. 5.
fig. 4. narrowband internet of things frame structure for a subcarrier spacing of 15 khz for uplink and downlink [42].
fig. 5. narrowband internet of things frame structure for a subcarrier spacing of 3.75 khz for the uplink [42].
5. key technology of nb-iot
5.1. connection analysis theory
3gpp analyzes the number of connections that nb-iot can support when the network carries terminal periodic reporting services and network command reporting services. it is assumed that the services are distributed over a day, and nb-iot can support 52,547 connections per cell. indeed, this assumption is rather idealized, as it largely ignores the actual traffic behavior of nb-iot services. as a result, it is difficult to generalize it to other application scenes. at present, there are few studies of nb-iot service traffic. however, the research results on lte-m (machine-type communications [mtc]) and enhanced mtc are still valuable to learn from. to overcome the lte network access overhead when a lot of mtc terminals enter the network at the same time, researchers have focused their analysis on lte random access channel (rach) load pressure and additional load control mechanisms. studies typically model the service arrival process as a homogeneous/hybrid process with an identical distribution. they take the number of retransmissions of the packet at the head of the queue, or the channel state in a specific time slot, as state variables to obtain a steady-state model, completing a static-mode performance analysis of multi-channel s-aloha. this analysis can then be used for optimal lte rach design. however, when a lot of mtc terminals enter the network simultaneously, a large number of access requests are sent to the network within a short time in response to the same incident or to monitor the relevant components. this behavior can hardly be described by the classical homogeneous/hybrid poisson process, which rules out the direct application of network performance analysis methods based on steady-state hypotheses. hence, a transient analysis method is essential for multi-channel s-aloha with non-poisson traffic [24], [25], [26].
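to make the difference between steady arrivals and synchronized bursts concrete, the toy experiment below simulates one round of multi-channel slotted aloha. it is a minimal sketch under assumed parameters (48 preambles, 1000 devices, a single access attempt and no retransmissions); it is not a model of the real lte rach procedure.

import random

# toy multi-channel slotted-aloha experiment: compare arrivals spread evenly
# over many slots with a single synchronized burst. the preamble and device
# counts are illustrative assumptions, not 3gpp parameters.

def slot_successes(num_arrivals, num_preambles, rng):
    """devices that picked a preamble nobody else picked succeed."""
    picks = [rng.randrange(num_preambles) for _ in range(num_arrivals)]
    counts = {}
    for p in picks:
        counts[p] = counts.get(p, 0) + 1
    return sum(1 for p in picks if counts[p] == 1)

def success_ratio(total_devices, num_slots, num_preambles, burst, seed=1):
    """fraction of devices whose single access attempt succeeds (no retries)."""
    rng = random.Random(seed)
    successes = 0
    for slot in range(num_slots):
        if burst:
            arrivals = total_devices if slot == 0 else 0
        else:
            arrivals = total_devices // num_slots
        successes += slot_successes(arrivals, num_preambles, rng)
    return successes / total_devices

if __name__ == "__main__":
    devices, slots, preambles = 1000, 100, 48
    print("evenly spread arrivals:", success_ratio(devices, slots, preambles, burst=False))
    print("synchronized burst    :", success_ratio(devices, slots, preambles, burst=True))

with the arrivals spread evenly over the slots, most devices pick a preamble nobody else picked and succeed; when all devices arrive in the same slot, nearly every preamble collides. this is exactly the bursty regime in which steady-state poisson analysis stops being useful and a transient analysis is needed.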
5.2. the latency analysis theory
besides the connection-number analysis, 3gpp indicates that there is a need for a theoretical latency model capable of addressing the latency of synchronization, random access, resource allocation, and data transmission for uplink access. some of these latencies are concerned with signal detection and service behavior, as researchers in the field have concentrated on the mean and variance of random-access latency, and little attention has been paid to other crucial features such as the probability density function (pdf) of the latency. researchers such as rivero-angeles et al. [26] and liu et al. [27] have used markov processes to derive the probability generating function associated with the latency pdf. however, the complex nature of the computation has made it difficult for researchers to find a mechanism to lower latency and increase communication probability.
5.3. covering enhancement mechanism
narrowband modulation and the sub-ghz deployment of nb-iot can improve receiving sensitivity and thereby increase coverage capability. in addition, 3gpp proposes an enhancement mechanism based on coverage classes, which is a new concept introduced for nb-iot by 3gpp.
5.4. very low energy technology
a major issue with iot is energy consumption. researchers have simulated the energy consumption of terminal services within nb-iot with the aim of identifying areas for improvement, and the result showed that if the information is transmitted once a day, the life expectancy of a 5 wh battery could be much prolonged. this leads to the suggestion that an evaluation mechanism for energy efficiency is required to ensure that lower energy consumption for iot is achieved. some research on energy consumption in drx, such as liu et al. [27] and balasubramanya et al. [28], focuses on single terminals and the transitions between control signaling states and the terminal operating mode. however, more work is needed to find a holistic mechanism, which is seen as one of the tasks of 3gpp r14.
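the "one report per day on a 5 wh battery" example in section 5.4 can be turned into a rough lifetime estimate. the sketch below is illustrative only: the sleep current, active current, and active time per report are assumed values, not measurements from the article or from 3gpp.

# rough battery-lifetime estimate for a psm duty-cycled nb-iot device.
# all current and timing figures are assumptions; real values depend on
# coverage level, repetitions, and chipset.

BATTERY_WH = 5.0
VOLTAGE_V = 3.6
BATTERY_MAH = BATTERY_WH / VOLTAGE_V * 1000        # roughly 1389 mah

SLEEP_CURRENT_MA = 0.003                            # psm deep sleep, ~3 ua
ACTIVE_CURRENT_MA = 120.0                           # assumed tx/rx average
ACTIVE_SECONDS_PER_REPORT = 10.0                    # sync + rach + data + ack

def lifetime_years(reports_per_day: float) -> float:
    """battery lifetime given how often the device wakes up to report."""
    active_h = reports_per_day * ACTIVE_SECONDS_PER_REPORT / 3600.0
    sleep_h = 24.0 - active_h
    mah_per_day = ACTIVE_CURRENT_MA * active_h + SLEEP_CURRENT_MA * sleep_h
    return BATTERY_MAH / mah_per_day / 365.0

for n in (1, 4, 24):
    print(f"{n:2d} report(s)/day -> ~{lifetime_years(n):.1f} years")

with these assumptions, a single daily report gives a lifetime on the order of 10 years, while hourly reporting drains the same battery in well under a year; this is why the psm/e-drx duty cycle dominates the energy budget.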
5.5. connectivity between signaling and data
coupling simulation between data and signaling is another concern in iot that companies such as huawei technologies have indicated needs to be addressed. this is because in many simulation tools, data and signaling are separated and simulation tests are done for each independently. this leads to a result where issues in connecting the two cannot be understood, which makes it difficult to simulate real network capacity, for example, when access by mtc terminals is requested [29], [30].
6. nb-iot and different wireless communication technologies
lpwa technology is gaining popularity as iot services grow rapidly. the technology is used to deliver smart services with low data speed, which can be utilized in iot intelligent applications. these applications were classified into three groups by hekwan wu at the 2016 international internet conference in china, as shown in table 4.
table 4: distribution statistics for iot smart connection technology in 2020 (global m2m/iot connection distribution in 2020)
• 10% – high data rate (>10 mbps), e.g., cctv, ehealth; connection techniques: 3g (hspa/evdo/tds), 4g (lte/lte-a), wifi 802.11 technologies; market opportunity: big profit margin for car navigation/entertainment systems
• 30% – medium data rate (<1 mbps), e.g., pos, smart home, m2m backhaul; connection techniques: 2g (gprs/cdma2000 1x), mtc/emtc; market opportunity: 2g m2m could be replaced by mtc/emtc techniques
• 60% – low data rate (<100 kbps), e.g., sensors, meters, tracking logistics, smart parking, smart agriculture; connection techniques: nb-iot, sigfox, lora, short-distance wireless connection (e.g., zigbee); market opportunity: various application cases; main market for lpwa; market vacancy
(nb-iot: narrowband internet of things, gprs: general packet radio services, emtc: enhanced machine-type communications)
fig. 6a illustrates the position of lpwan in comparison to other communication technologies in terms of coverage area and data transmission rate. short-range technologies such as bluetooth and zigbee, in contrast, are best suited to applications that require higher bandwidth over short transmission distances [31], [32], [33]. fig. 6b shows the position of nb-iot, which makes use of both 4g/5g cellular attributes and low-power radio technology, and of the advantages of low-energy-consumption short-range wireless communication technologies (e.g., zigbee), namely concentrated transmission and low cost. we have further investigated the technology and compared it with lora, which is another type of wide-area communication technology, as shown in table 5.
fig. 6. correlation between narrowband internet of things (nb-iot) and different wireless communication technologies: (a) comparison of various wireless communication technologies; (b) nb-iot design exchanges.
table 5: comparison of nb-iot and lora (item: nb-iot / lora)
• power consumption: low (10-year battery life) / low (10-year battery life)
• cost: low / lower than nb-iot
• safety: telecom-level security / slight interference
• accuracy rate: high / high
• coverage: <25 km (resend supported) / <11 km
• deployment: rebuild (re-farming) supported, based on lte fdd or gsm / inconvenient
(nb-iot: narrowband internet of things, gsm: global system for mobile communications, fdd: frequency-division duplex)
7. requirements for nb-iot security
security requirements for nb-iot are similar to those of traditional iot, with a number of differences in hardware, energy consumption, and network connection mode. traditional iot normally has robust computing power with a strong internal security design but with high energy consumption [34], [35], [36]. there are iot technologies equipped with low-power hardware, but in return they offer low computing power with a high security risk, which may lead to denial of service. as a consequence, any security violation, even on a small scale, may leave a lasting negative effect, as terminals are simpler and it is easier for attackers to obtain information. researchers in chen et al. [35], li et al. [36], mangalvedhe et al. [37], and koc et al. [38] have analyzed nb-iot security requirements, which are distributed over three layers, as shown in fig. 7.
fig. 7. similarity between traditional internet of things (iot) and nb-iot in terms of security requirements.
the explanation below introduces the security requirements of nb-iot mapped to the three-layer design comprising the perception layer, the transition layer, and the application layer.
7.1. perception layer
the perception layer is the base layer of nb-iot and forms the foundation of the services and architecture of the higher layers. the nb-iot perception layer, like a conventional perception layer, tends to be subject to passive and active attacks. a passive attack means that an intruder steals data without altering it; the main forms include eavesdropping, traffic analysis, and so on. as nb-iot relies on an open wireless network, intruders may discover information about nb-iot terminals with techniques such as data-link eavesdropping and traffic-property analysis in order to mount a series of subsequent attacks.
7.2. transition layer
compared with the transition layer in traditional iot, nb-iot changes the complex network organization in which a relay gateway collects data and then sends it to the base station for forwarding. as a result, numerous issues, for example, multiple layers of networking, high cost, and the need for high-capacity batteries, are resolved. a network for an entire city can bring simplicity of maintenance and management, with advantages such as convenient addressing and installation through separation from property management.
7.3. application layer
the purpose of the nb-iot application layer is to store, analyze, and manage data efficiently. after the perception and transition layers, a large amount of data converges in the application layer. then, vast resources are formed to provide data support for different applications. compared to the traditional iot application layer, the nb-iot application layer carries more data [37], [38], [39], [40].
8. conclusion
in this paper, we reviewed the basic properties, benefits, background, and latest scientific findings of nb-iot. the general background of the iot was introduced. the benefits, features, basic theory, and nb-iot key technologies such as connection analysis, latency analysis, and coverage enhancement analysis were provided. subsequently, we focused on the differences between nb-iot and other types of communication technologies. finally, we made a comparison between nb-iot and other wireless communication technologies, and we examined nb-iot security requirements at three levels: the perception layer, the transition layer, and the application layer. there are many future research paths for this study. we continue to investigate a visible network model that can visually reflect the status of nb-iot network operation. such a model should complete each of the operational modules, perform link-level open simulation, and support nb-iot verification.
references
[1]. p. reininger. “3gpp standards for the internet-of-things”. technologies report, huawei, shenzhen, china, 2016. [2]. “feasibility study on new services and markets technology enablers for massive internet of things”. document tr 22.861, 3gpp, 2016. [3]. m. chen, y. qian, y. hao, y. li and j. song. “data-driven computing and caching in 5g networks: architecture and delay analysis”. ieee wireless communications, vol. 25, no. 1, pp. 70-75, 2018. [4]. 3gpp. “standardization of nb-iot completed”, 2016. available from: http://www.3gpp.org/news-events/3gpp-news/1785-nb_iot_complete. [last accessed on 2018 oct 01]. [5]. a. rico-alvarino, m. vajapeyam, h. xu, x. wang, y. blankenship, j. bergman, t. tirronen and e. yavuz. “an overview of 3gpp enhancements on machine to machine communications”.
ieee communications magazine, vol. 54, no. 6, pp. 14-21, 2016. [6]. ericsson. “cellular networks for massive iot”. technologies report, ericsson, stockholm, sweden, 2016. [7]. y. l. zou, x. j. ding and q. q. wang. “key technologies and application prospect for nb-iot”. zte technology journal, vol. 23, no. 1, pp. 43-46, 2017. [8]. a. laya, l. alonso and j. alonso-zarate. “is the random access channel of lte and lte-a suitable for m2m communications? a survey of alternatives”. ieee communications surveys and tutorials, vol. 16, no. 1, pp. 4-16, 2014. [9]. riot. “low power networks hold the key to internet of things”. technologies report, berlin, germany, 2015. [10]. x. ge, x. huang, y. wang, m. chen, q. li, t. han and c. x. wang. “energy-efficiency optimization for mimo-ofdm mobile multimedia communication systems with qos constraints”. ieee transactions on vehicular technology, vol. 63, no. 5, pp. 2127-2138, 2014. [11]. p. osti, p. lassila, s. aalto, a. larmo and t. tirronen. “analysis of pdcch performance for m2m traffic in lte”. ieee transactions on vehicular technology, vol. 63, no. 9, pp. 4357-4371, 2014. [12]. g. c. madueno, č. stefanović and p. popovski. “reengineering gsm/gprs towards a dedicated network for massive smart metering”. in: ieee international conference on smart grid communications (smartgridcomm), pp. 338-343, 2014. [13]. w. liu, j. dong, n. liu, y. l. chen, y. b. han and y. b. ren. “nb-iot key technology and design simulation method”. telecommunications science, china, pp. 144-148, 2016. [14]. m. centenaro and l. vangelista. “a study on m2m traffic and its impact on cellular networks”. in: 2015 ieee 2nd world forum on internet of things (wf-iot), pp. 154-159, 2015. [15]. g. y. lin, s. r. chang and h. y. wei. “estimation and adaptation for bursty lte random access”, vol. 65. in: ieee transactions on vehicular technology, pp. 2560-2577, 2016. [16]. q. xiaocong and m. mingxin. “nb-iot standardization technical characteristics and industrial development”. information research, vol. 5, pp. 523-526, 2016. [17]. m. t. islam, m. t. abd-elhamid and s. akl. “a survey of access management techniques in machine type communications”. vol. 52. ieee communications magazine, piscataway, pp. 74-81, 2014. [18]. g. h. dai and j. h. yu. “research on nb-iot background, standard development, characteristics and the service”. mobile communications, vol. 40, no. 7, pp. 31-36, 2016. [19]. m. a. khan and k. salah. “iot security: review, blockchain solutions, and open challenges”. future generation computer systems, vol. 82, pp. 395-411, 2018. [20]. v. kharchenko, m. kolisnyk, i. piskachova and n. bardis. “reliability and security issues for iot-based smart business center: architecture and markov model”. in: 2016 third international conference on mathematics and computers in sciences and in industry (mcsi), pp. 313-318, 2016. [21]. j. j. nielsen, d. m. kim, g. c. madueno, n. k. pratas and p. popovski. “a tractable model of the lte access reservation procedure for machine-type communications”. in: 2015 ieee global communications conference (globecom). pp. 1-6, 2015. [22]. c. h. wei, r. g. cheng and s. l. tsao. “performance analysis of group paging for machine-type communications in lte networks”.
ieee transactions on vehicular technology, vol. 62, no. 7, pp. 3371-3382, 2013. [23]. m. koseoglu. “lower bounds on the lte-a average random access delay under massive m2m arrivals”. ieee transactions on communications, vol. 64, no. 5, pp. 2104-2115, 2016. [24]. s. persia and l. rea. “next generation m2m cellular networks: ltemtc and nb-iot capacity analysis for smart grids applications”. in: 2016 aeit international annual conference (aeit), pp. 1-6, 2016. [25]. t. m. lin, c. h. lee, j. p. cheng and w. t. chen. “prada: prioritized random access with dynamic access barring for mtc in 3gpp lte-a networks”. ieee transactions on vehicular technology, vol. 63, no. 5, pp. 2467-2472, 2014. [26]. m. e. rivero-angeles, d. lara-rodriguez f. a. cruz-perez. “gaussian approximations for the probability mass function of the access delay for different backoff policies in s-aloha”. ieee communications letters, vol. 10, no. 10, pp. 731-733, 2006. [27]. j. liu, j. wan, b. zeng, q. wang, h. song and m. qiu. “a scalable and quick-response software defined vehicular network assisted by mobile edge computing”. ieee communications magazine, vol. 55, no. 7, pp. 94-100, 2017. [28]. n. m. balasubramanya, l. lampe, g. vos and s. bennett. “drx with quick sleeping: a novel mechanism for energy-efficient iot using lte/lte-a”. ieee internet of things journal, vol. 3, no. 3, pp. 398-407, 2016. [29]. k. lin, d. wang, f. xia, h. ge. “device clustering algorithm based on multimodal data correlation in cognitive internet of things”. ieee internet of things journal, vol. 5, no. 4, pp. 2263-2271, 2018. [30]. g. naddafzadeh-shirazi, l. lampe, g. vos and s. bennett. “coverage enhancement techniques for machine-to-machine communications over lte”. ieee communications magazine, vol. 53, no. 7, pp. 192-200, 2015. [31]. f. xu, y. li, h. wang, p. zhang and d. jin. “understanding mobile traffic patterns of large scale cellular towers in urban environment”. ieee/acm transactions on networking, vol. 25, no. 2, pp. 11471161, 2017. [32]. y. li, f. zheng, m. chen and d. jin. “a unified control and optimization framework for dynamical service chaining in softwaredefined nfv system”. ieee wireless communications, vol. 22, no. 6, pp. 15-23, 2015. [33]. x. ge, j. yang, h. gharavi and y. sun. “energy efficiency challenges of 5g small cell networks”. ieee communications magazine, vol. 55, no. 5, pp. 184-191, 2017. [34]. x. yang, x. wang, y. wu, l. p. qian, w. lu and h. zhou. “small-cell assisted secure traffic offloading for narrow band internet of thing (nb-iot) systems”. ieee internet of things journal, vol. 5, no. 3, pp. 1516-1526, 2018. [35]. l. chen, s. thombre, k. järvinen, e. s. lohan, a. alén-savikko, h. leppäkoski, m. z. bhuiyan, s. bu-pasha, g. n. ferrara, s. honkala and j. lindqvist. “robustness, security and privacy in location-based services for future iot: a survey”. ieee access, vol. 5, pp. 8956-8977, 2017. [36]. y. li, x. cheng, y. cao, d. wang and l. yang. “smart choice for the smart grid: narrow band internet of things (nb-iot)”. ieee internet of things journal, vol. 5, no. 3, pp. 1505-1515, 2018. [37]. n. mangalvedhe, r. ratasuk and a. ghosh. “nb-iot deployment study for low power wide area cellular iot”. in: 2016 ieee 27th annual international symposium on personal, indoor, and mobile radio communications (pimrc), pp. 1-6, 2016. [38]. a. t. koc, s. c. jha, r. vannithamby and m. torlak. “device power saving and latency optimization in lte-a networks through drx configuration”. ieee transactions on wireless communications, vol. 
13, no. 5, pp. 2614-2625, 2014. [39]. r. cheng, a. deng and f. meng. “study of nb-iot planning objectives and planning roles”. china mobile group design inst. co., technical reports telecommunications science, 2016. [40]. y. hou and j. wang. “ls-svm’s no-reference video quality assessment model under the internet of things”. in: 2017 ieee smart world, ubiquitous intelligence and computing, advanced and trusted computed, scalable computing and communications, cloud and big data computing, internet of people and smart city innovation (smart world/scalcom/uic/atc/cbdcom/iop/sci), pp. 1-8, 2017. [41]. r. aleksandar, p. ivan, p. ivan, b. đorđe, s. vlado and r. miriam. “key aspects of narrow band iot communication technology driving future iot applications”. in: 2017 ieee telecommunication forum (telfor), 2017. [42]. c. min, m. yiming, h. yixue, a. k. hwang. “narrow band internet of things”. ieee access, vol. 5, pp. 20557-20577, 2017.

blended learning mobility approach and english language learning

mazen ismaeel ghareb1, saman ali mohammed2
1department of computer science, college of science and technology, university of human development, sulaymaniyah, iraq, 2department of english, college of languages, university of human development, sulaymaniyah, iraq

review article

abstract
although the benefits of blended learning have been well documented in educational research, relatively few studies have examined blended mobilities in education in the kurdistan region government and in iraq. this study discusses a blended mobility approach for a teacher training program designed for in-service english language teachers (elt) and investigates its effectiveness by examining the recent participation of the university of human development in computer science and proposing the same program for english training for lecturers and students. the research proposes a new mobility program for teaching and learning the english language, with participants using their language skills in an ongoing business project and several software tools for communication and management of their projects. results will show the framework for new blended learning and blended mobilities covering many different english language teaching (elt) aspects.

index terms: blended aim, blended learning, blended mobility, language learning strategies, e-learning, virtual mobility

corresponding author's e-mail: mazen ismaeel ghareb, department of computer science, college of science and technology, university of human development, sulaymaniyah, iraq. e-mail: mazen.ismaeel@uhd.edu.iq received: 06-03-2019 accepted: 13-04-2019 published: 20-06-2019
access this article online doi: 10.21928/uhdjst.v3n2y2019.pp1-9 e-issn: 2521-4217 p-issn: 2521-4209 copyright © 2019 ghareb and mohammed. this is an open access article distributed under the creative commons attribution noncommercial no derivatives license 4.0 (cc by-nc-nd 4.0)

1. introduction
the significance of student mobility and interuniversity exchange programs is expanding enormously, and the issue currently occupies an important place on the agenda of educational policy makers and higher education institutions. in 2007, the erasmus program celebrated its 20th anniversary. the erasmus program is presumably one of the best-known activities of the european commission, enabling student as well as staff mobility, and intending to improve the quality and to strengthen the european dimension of higher education. the university of human development participated with erasmus as an associative partner in the erasmus program to help students and staff receive more practical mobility experience to accomplish specific tasks related to their profession [1]. mobility in space, geographical mobility, "real" mobility, and physical mobility are all terms used to refer to students and teachers in higher education "physically" moving to another institution inside or outside their own country to study or teach for a limited time. in the following paragraphs, different perspectives or kinds of mobility, for example, horizontal and vertical mobility, free-mover and program mobility, are distinguished, and some of the most well-known programs are briefly described. the majority of the variants of geographical mobility presented are types of physical mobility. student mobility can be classified by the length of the study period abroad. when students spend only part of their study program abroad or at a different institution in the same country, and complete only a few modules or courses rather than entire degrees, this is referred to as horizontal mobility (also called temporary, credit, or non-degree mobility). most national and european mobility programs promote this variant of mobility. the maximum mobility period for students and graduates in such programs is normally 1 year. with the implementation of the bologna process and the expanding introduction of bachelor's and master's programs in europe, many higher education institutions are likewise expecting an increase in what is known as vertical mobility (also called degree or diploma mobility). here, students study abroad for a full degree, accomplishing, for instance, their first degree at an institution in one country (for the most part their country of origin) and their second degree at another institution, either in their country of origin or abroad (for example, a bachelor's degree at home – a master's degree abroad). the eu erasmus mundus program, for instance, supports vertical mobility in a systematic way [2].
the virtual mobility part of blended mobility is, for the most part, supported using information and communication technologies (for example, skype, adobe connect, slack, google hangouts, and trello) to remain connected with the teachers and/or students who may be located at many distant sites. the physical mobility part is ordinarily short-term, ranging from 2 to 14 days, and there may be several periods of short-term mobility. brief periods of physical mobility enable participants to focus for several days solely on the actual project, which is difficult in everyday life in a local environment [5]. early uses of a blended mobility configuration can be found as far back as 2009. through this project an environment was created which encourages the development of students' soft skills, for example, teamwork and communication, in an international setting by means of an innovative instruction paradigm, improving such skills without costly and broad curricular changes [6].
2. literature review
2.1. blended learning
blended learning has become a buzzword in many educational settings in recent years, generally referring to courses that use a combination of face-to-face and web-based learning [7]. the term originated in the workplace learning literature but is now broadly used in higher education, regularly describing courses that have had an online segment added to them [8]. some consideration has been paid to the use of blended learning in language teaching overall [9]-[11]; however, very little work has been done explicitly in english language teaching (elt) settings. in fact, with reference to elt [12], further research is needed into what makes an effective blend. fig. 1 explains some components of blended learning.
fig. 1. blended learning components.
joining the advantages of e-learning and conventional learning situations has led to a new learning environment regularly referred to as "blended learning," which unites traditional physical classes with components of virtual learning [13]-[15]. one of the fundamental ideas behind this approach is that those who use blended approaches base their pedagogy on the assumption that there are inherent advantages in face-to-face interaction (both among students and between student and teacher) as well as the understanding that there are some inherent advantages in using web-based methods in the learning process. consequently, the aim of those using blended learning approaches is to find a workable balance between online access to learning and face-to-face human interaction [16], [17]. in a survey of research on blended learning [18], numerous studies were identified that revealed positive impacts of blended learning on (1) student performance [19]; (2) student interest and motivation [20]; (3) expanded access and flexibility [21]; (4) cost-effectiveness [22]; and (5) more active and deeper learning in comparison with conventional classes [23].
nonetheless, little research has been done on mixed learning in educator instruction explicitly [27], and distributed work has concentrated for the most part on understudies assessments and the mechanical applications presented [28]. be that as it may, mixed learning may be able to possibly improve instructor instruction regarding both availability and quality. blended learning has turned out to stand out amongst the most well-known approaches to educate english as a foreign language (efl) due to its twofold segment, which coordinates vis-à-vis classes with virtual learning so as to offer students a wide scope of materials and assets sorted out methodologically. thinking about the past viewpoints, in numerous instructive settings bl is a device accessible to students with the end goal for them to go past the homeroom and work on various intelligent exercises as an expansion of the immediate educating classes. through all the mechanical assets they have around them, students can find out about various subjects and societies, surf the web and use the technological device they access, for example, ipods, ipads, pcs, mp3s and mp4s, among others. notwithstanding, students are attacked by a lot of data from various sources. in this way, they get confounded and do not have the foggiest idea what to see first, which obstructs the proper utilization of the virtual material that may add to their english learning process. along these lines, efl instructors have the test of arranging virtual learning conditions that are engaging their students. this will enable them “to arrange” their efl learning procedure and supplement up close and personal classes or the different way can utilize the virtual stage self-governing to get readied for the eye to eye classes. along these lines, fl instructors are responsible for the methodological arranging of mixed courses which could be utilized to engage the efl students. in spite of the fact that there are a few models with which to sort out a mixed course, we consider the accompanying model recommended by khan [29], as shown in fig. 2. the institutional perspective is the primary component educators need to consider since it relies on the institutional strategies about the educational modules, the design of the material, and the organization and money related zone. the second segment, the mechanical one, is the fundamental thought when educators plan both the disconnected and online exercises. educators need a wide scope of mechanical assets so as to pull in their students’ consideration: if the face-to-face classes and the virtual ones are not testing, students may feel exhausted or baffled. it is important to show points and activities which are engaging them. the third factor to hold up under as a primary concern is a pedagogical segment, which no uncertainty is the most critical one in these cross breed courses. in the event that instructors have a methodological arrangement to sort out both their up close and personal classes and the online viewpoint, it will lead the language students to prevail in their learning procedure and acquire better outcomes since they appropriately compose the two segments. 2.2. students mobilities in last century in europe, mobility of students, educators, and staff has been a be noticeable among the most universities and education system. as the colleges of europe changed fig. 2. blended learning model. fig. 3. blended mobility framework. 
ghareb and mohammed: blended learning mobility approach 4 uhd journal of science and technology | jul 2019 | vol 3 | issue 2 to better approaches for working in the previous decades, they have kept on supporting this profitable convention. student mobility can be characterized by the length of the examination time frame abroad. when students just spend some portion of their investigation program abroad or at an alternate foundation in a similar nation, and just total a few modules or courses, however not entire degrees, it is alluded to as level mobility (additionally called brief, credit, or non-degree versatility). most national and european mobility programs advance this variation of mobility. the most extreme mobility time frame for students and graduates in such projects is typically 1 year. with the implementation of the bologna process and the expanding presentation of single guy and ace projects in europe, numerous higher instruction foundations are likewise expecting an expansion in what is known as vertical mobility (additionally called degree or certificate portability). here, students ponder abroad for a full degree, accomplishing, for instance, their first degree at an establishment in one nation (generally their nation of origin) and their second degree at another foundation, either in their nation of origin or abroad (e.g., bachelor certificate at home – ace degree abroad). the eu erasmus mundus program, for instance, bolsters vertical portability in a methodical way [30]. mobility can likewise be characterized by the method of association of the examination time frame abroad. program understudies are portable students partaking in a sorted out mobility program. “free-movers” then again do not profit by any sort of students among foundations and do not partake in a composed mobility program. “free-mover” versatility is the most seasoned type of scholastic portability. since the center of the 1970s composed mobility has increased expanding significance, with the ascent of organized national limited time programs (see for instance the daad grants) and european portability programmes. organized or program mobility is, these days, viewed as the real mobility engine for students, graduates, doctoral hopefuls and showing staff in europe (for example erasmus, leonardo, marie curie) and, progressively, the whole world (for example erasmus mundus) [30]. the topographical mobility of free-movers can occur inside a nation or crosswise over national fringes. free-mover mobility can likewise be seen on an overall scale and is commonly not restricted to specific areas or target nations. interestingly, program mobility ordinarily centers on specific areas (for example, ceepus, nordplus…) or on specific mainlands (for example, europe on account of erasmus, marie curie) [30]. the significance and prominence of specific versatility plots frequently vary among nations and in certain nations free-mover mobility still assumes an extensive job. aside from the free-mover mobility and the universal participation facilitated by explicit remotely supported projects, numerous higher instr uction organizations participate with one another on a reciprocal basis. bilateral understandings between foundations are sorted out so as to fire up joint activities or increase existing contacts, and more often than not additionally make open doors for understudy and staff versatility. 
the upsides of such respective concurrences as to portability are for instance simplicity of utilization, smooth credit exchange, and acknowledgment of studies. two-sided understandings can exist both on the dimension of the organization and the dimension of resources or divisions. at last, mobility can likewise be upheld in the structure of systems of advanced education foundations or understudy systems. the coimbra gathering understudy trade system for instance is a mobility plot supplementing the conventional erasmus mobility. 3. blended mobility proposed framework 3.1. university of human development 3.1.1. blended aim is one of those international projects that is supported and funded by the european union in the context of scientific exchange hosted by european universities. the projects are dedicated to fourth-grade students. from each university, two students and one supervising teacher participate. the participants, both the students and the supervisor, meet twice a year in one of the member universities to present any advancement in their project and plan for future projects. the students work virtually to finish the project. the students get ects for their works according to their university’s instructions. we suggest that the university of human development get intensively involved in this unique opportunity by fitting in our university in this project as it gets bigger and bigger, and it could require enlarging our team. 3.1.2. calling in other colleges university of human development is participating in the blended aim project for 2018–2019 with other colleges regarding the erasmus project calling. 3.2. virtual mobilities with the developing noteworthiness of distance learning and e-learning, virtual mobility has turned out to be progressively ghareb and mohammed: blended learning mobility approach uhd journal of science and technology | jul 2019 | vol 3 | issue 2 5 critical in the course of the most recent couple of years. it is since the second 50% of the 1990s that the idea of virtual mobility has picked up many with regard to the internationalization of advanced education foundations. be that as it may, what is comprehended by virtual portability? the elearningeuropa.info entry characterizes it as: “the utilization of data and correspondence innovations (ict) to get indistinguishable advantages from one would have with physical mobility yet without the need to travel”[20]. this definition obviously demonstrates the two distinct components of virtual mobility. virtual mobility is typically compared to the virtual mobility of “academic plagiarism” and adds to the internationalization of training by empowering (cross-border) collaboration between various instruction establishments. besides, it is connected to the new conceivable outcomes opened using data and correspondence innovation (ict) bolstered situations that incorporate, for instance, video conferencing, live spilling, community-oriented workspaces, and computer-mediated conferencing. 
in the system of the being portable task, components, for example, the improvement of (inter-) social comprehension were added to the definition to feature the wealth of the experience and the likenesses with the erasmus trade program: “virtual portability is a type of realizing which comprises virtual parts through a completely ict bolstered learning condition that incorporates cross-border coordinated effort with individuals from various foundations and societies working and contemplating together, having, as its fundamental reason, the upgrade of intercultural understanding and the trading of information” [31], [32]. the typology is, for the most part in light of the kind of movement and the conditions in which the virtual mobility movement happens: • a virtual course or workshop (arrangement): students in an advanced education foundation participate in virtual mobility for a solitary course (as a component of an entire investigation program) or a course (arrangement) and the remainder of their learning exercises happen face-to-face generally; • a virtual report program: a whole virtual investigation prog ram is offered at one advanced education establishment, giving understudies from distinctive nations the opportunity to take this investigation program without traveling to another country for an entire scholastic year; • a virtual mobility position: student’s situations are sorted out between a higher education foundation and an organization (now and again in an alternate nation). in the virtual mobility students use ict to bolster their temporary position, giving them a real-life involvement in a corporate setting without the need to move from the grounds to the organization or to move to another nation for a specific period of time, and giving them a down to earth planning for new methods for working through (global) synergistic cooperation; • virtual help exercises to physical trade: virtual mobility empowers both better planning and follow-up of students who take an interest in physical trade programs. preliminary exercises could incorporate understudy deter mination at a separation through videoor web conferencing (for checking social and language aptitudes), furthermore, online language and social combination courses. follow-up exercises will assist students with keeping in contact with their companions, dispersed the world over, to complete their normal research work as well as desk work. they could likewise appear as a so-called “virtual alumni” association, foster lifelong friendships, and networks. in spite of the fact that the term “virtual mobility” is generally new, the european commission has effectively advanced virtual mobility in the previous years, for the most part through the money related help of ventures inside the socrates/minerva and the e-learning and deep-rooted learning projects. a portion of the later tasks managing the subject incorporate the above-mentioned being versatile undertaking, reve (genuine virtual erasmus), emove (an operational origination of virtual portability), and more vm (prepared for virtual mobility), each focusing on various parts of virtual portability for various gatherings of members [3], [33]. 3.3. blended mobility while much of the time, virtual mobility speaks to an important elective answer for physical mobility, there is by all accounts general understanding that its anything but a substitute for physical mobility. 
virtual mobility is, on the other hand, becoming increasingly popular as a support and supplement to traditional physical mobility programmes. it can offer additional solutions and is a way to further improve existing traditional programmes such as erasmus. when elements of physical and virtual mobility are combined in order to maximize the benefits of both, the result is defined as "blended mobility" or, when applied to the eu erasmus programme, "blended erasmus." this blended approach is in line with the results of, for instance, the eureca project, carried out by the european student association aegee, which recommends among other things that "erasmus students could be prepared already at their home universities in 'outgoing workshops' on the one hand, but could also 'exchange experiences' in return classes" on the other hand. the report also states that "each student should have the right to attend a language course that enables him/her to follow the academic programme" at the host university, and that short-term exchanges and virtual exchanges could be further developed. in addition, the report of the workshop on "bologna and the challenges of e-learning and distance education" [34] places special emphasis on the supporting role virtual mobility can play for physical mobility and shows that "virtual mobility must be used to enhance and support physical mobility by better preparing it, providing effective follow-up means for it, and offering the possibility to stay in contact with the home institution while abroad. it can also offer (at least part of) the benefits of physical mobility for those who are otherwise unable to attend courses abroad." european projects such as sumit (supporting mobility through ict), esmos (enhancing student mobility through online support), victorious (virtual curricula through reliable interoperating university systems), and others suggest that the european commission has also recognized virtual mobility as a support instrument for physical mobility as an important topic (see annex ii for more information). in addition, the vm-base project (virtual mobility before and after student exchanges), whose results this manual presents, aims to raise the quality of student exchange by offering virtual support to physical mobility. in vm-base, virtual support is used to prepare and follow up the mobile student, as a supplement to the existing exchange programmes. the project thereby supports teachers in coaching exchange students at a distance (e-coaching). exchange students can prepare themselves for their stay at a host university through, among other support activities, virtual classes between the home and host university. preparatory language or cultural courses for the students could be given traditionally at the home university or via ict from the host university before their stay. during their stay at the host university, they could stay connected with students, colleagues, or teachers at the home university. furthermore, on their return, they could extend their stay "virtually" by staying in touch with the host university by virtual means. as fig. 3 shows, the framework of blended mobility is a combination of blended learning and mobility learning, added to blended learning pedagogy in general.
4. english language skills and information and communication technology
4.1. listening
interactive activities and visual and multimedia assets support listening. listening skills are best learned through simple, engaging activities that focus more on the learning process than on the final product. regardless of whether you are working with a large group of students or a small one, you can use any of the following examples to develop your own techniques for teaching students how to listen well. listening comprehension is an important as well as complex process of language learning. it plays a significant role in second language competence. the process and the act of comprehension are eased through context and purpose, and linguistic knowledge and experiential knowledge are also key resources listeners make use of to comprehend. there are many tools one can make use of, such as computer-assisted language learning [15]. there are several techniques for developing listening activities:
• interpersonal activities, for example, mock interviews and storytelling. assign the students to small groups of two or three, and then give them a specific listening activity to accomplish.
• larger group activities also serve as a helpful technique for teaching listening skills to students.
• you can also train listening skills through audio segments of radio programs, online podcasts, instructional lectures, and other audio messages.
• another helpful resource for teaching listening skills is video segments, including short sketches, news programs, documentary films, interview segments, and dramatic and comedic material.
4.2. speaking
technology can stimulate the playfulness of students and immerse them in a variety of scenarios. technology allows students to take part in self-directed activities, offers opportunities for self-paced interaction and privacy, and provides a safe environment in which errors are corrected and specific feedback is given. feedback from a machine offers additional value through its ability to track mistakes and link the student immediately to exercises that focus on specific errors. studies are emerging that show the importance of qualitative feedback in software programs. when links are provided to find explanations, additional help, and references, the value of technology is further increased. modern technologies available in education today include:
• communication lab
• speech recognition software
• internet and technology-enhanced language learning
• podcasting
• quick link pen
• quicktionary.
as english today is ranked number one in the world in terms of number of users, the ability to speak it fluently has become a skill of foremost importance to acquire. in an online foreign language speaking class, virtual classes are designed with the principles of elt and e-learning in mind, along with strategies that raise interaction and integrate vocabulary and use of english, while providing a relaxed environment in order to encourage even introverted students to participate and produce spoken language.
tools such as oovoo, skype, slack, google hangouts, and trello, apart from enabling users to interact with prerecorded messages, also give students the option of synchronous chat, allowing the creation of a virtual class of three to six users, depending on the type of subscription (free or paid, respectively). another advantage of these tools is that students can benefit from authentic learning experiences rather than their standard daily routine, which will in turn motivate them to ask for more real communication and, consequently, more opportunities to internalize language [8].
4.3. reading
online reading is a task that appears to be essential for 21st-century students. accordingly, an electronic reading program called "english reading online" was created to narrow the gap between reading and comprehension using web-based reading strategies. the effective use of reading strategies is known to enhance readers' comprehension. as technology has permeated our lives, the perception of reading for comprehension through technology needs to become a transformative way of doing so; the ultimate goal is to enable students to use strategies spontaneously. however, reading strategies have limitations as well as benefits; for example, the level of the participants, the classroom settings, and the set of strategies need to be considered before engaging in instruction. strategic reading instruction benefits all students, even those at a lower academic level, which may be a consequence of insufficient secondary school preparation or little preparation during their time as students. as students gain a great deal from strategy-based reading instruction, which improves their academic performance, offering it through a technology-enhanced environment increases its impact on comprehension while also enabling them to learn how to use technology. the resources offered to the students through a learning content management system called "varsite" gave them access to a larger variety of texts than those found in the college library. this essentially gives every student the autonomy to access these resources according to their own schedule, enabling them to monitor their learning even better.
4.4. writing
writing can be perplexing for many students since it requires the correct use of language. in contrast to spoken language, written language cannot rely on gestures or body language to clarify what needs to be understood or conveyed. one study attempted to identify the best way a teacher can teach the passive voice phenomenon. the authors used three types of classes: the first was the traditional face-to-face way, the second was the "integrative way" where both traditional teaching and web-based teaching were used, and the third was the "online way" where the only type of teaching and materials was electronic. what they found was that the integrated route turned out to be the most beneficial for the students, and that gender plays a non-significant role, since the results did not differ. it was also found that the level of the students improved after the use of the integrated method, so the results of the post-test significantly differed from the results of the pretest.
this study shows that teachers should use electronic materials, as they do improve their students' level using free and easy-to-use tools. blog software and twitter are tools that can help students practice written language, engage with the language they wish to learn, and of course share their thoughts or feelings and reflect on them [11]. promoting writing instruction in such an engaging way encourages more production of written language than might otherwise have been produced. students who blogged instead of attending an in-class session showed better outcomes than those who only received in-class writing instruction. teachers should use this tool as it enhances writing performance while not being confined within school walls, since it can happen anywhere. the students who blogged showed an improvement over those who did not, which demonstrates the value of integrating this tool. tweeting likewise appears to be a significant tool for initiating community bonds, allowing the students to find out more about one another and build those bonds. likewise, implementing forums, blogs, and wikis at the same time appears to have positive results on students' learning progress, since this blended approach enables them to consider the differences that may occur in ways of expressing themselves in english when using written language.
5. conclusions and recommendations
one of the primary purposes of higher education institutions is to give students the crucial tools to succeed in the global labour market. blended learning has become one of the most popular approaches to teaching efl because students can learn about various subjects and cultures, surf the web, and use the technological devices they have access to, for example, ipods, ipads, pcs, mp3s, and mp4s, among others. professional life nowadays relies heavily on mobility and requires professionals to excel in interpersonal skills in a global, cross-cultural environment. such activities have multiple benefits, both for the staff who participate and for their schools, such as enhanced language skills, innovative teaching methods, and cultural awareness. the main advantages for students can be:
• social skills development
• developing organizational skills
• learning to use online communication tools
• no disruption of regular home activities
• learning how to work as a member of a team of students, international and/or interdisciplinary
• developing self-management skills and working on a project or proof-of-concept assigned by a company, resulting in real-world, innovative projects
• experiencing cultural differences and similarities
• practicing languages other than the mother tongue
• being integrated more easily into an english language curriculum
• providing opportunities to participants with special needs (e.g., online assistance software, medical treatment, ...).
there are also some disadvantages, such as:
• it is challenging to communicate in a virtual way, especially in a language other than the mother tongue
• it is difficult to sustain for long-term mobility and is not equivalent to it
• cultural communication issues may arise earlier and faster
• students must be disciplined
• students need to have a certain level of independence.
in any case, soft skills, as well as international exposure, are rarely addressed in university classes. blended mobility overcomes the typical barriers to mobility, thereby enabling students to exploit the advantages that mobility and international exposure offer. however, regardless of its added value, blended mobility is hardly used and scarcely recognized as a genuine option with great potential to overcome the usual difficulties of international mobility. the blended-aim project sets the basis for supporting and structuring blended mobility in general. concretely, blended aim enables international blended mobility and employability by providing the resources, including training, supporting tools, and information, to help students and organizations hosting entry-level positions, and by streamlining innovative teaching paradigms intended to develop students' soft skills in an international environment. the framework of blended mobility is a combination of blended learning and mobility learning, added to blended learning pedagogy in general. we hope that higher education in kurdistan can adopt this system within the new bologna process.
references
[1] f. rizvi. "global mobility, transnationalism and challenges for education". transnational perspectives on democracy, citizenship, human rights and peace education, bloomsbury academic, london, p. 27, 2019.
[2] h. du, z. yu, f. yi, z. wang, q. han and b. guo. "group mobility classification and structure recognition using mobile devices". in: 2016 ieee international conference on pervasive computing and communications (percom). ieee, sydney, pp. 1-9, 2016.
[3] m. t. batardière, m. giralt, c. jeanneau, f. le-baron-earle and v. o'regan. "promoting intercultural awareness among european university students via pre-mobility virtual exchanges". journal of virtual exchange, vol. 2, pp. 1-6, 2019.
[4] a. baroni, m. dooly, p. g. garcía, s. guth, m. hauck, f. helm, t. lewis, a. mueller-hartmann, r. o'dowd, b. rienties and j. rogaten. "evaluating the impact of virtual exchange on initial teacher education: a european policy experiment". research-publishing.net, voillans, france, 2019.
[5] t. andersen, a. jain, n. salzman, d. winiecki and c. siebert. "the hatchery: an agile and effective curricular innovation for transforming undergraduate education". in: proceedings of the 52nd hawaii international conference on system sciences, 2019.
[6] j. o'donnell and l. fortune. "mobility as the teacher: experience based learning". in: the study of food, tourism, hospitality and events. springer, singapore, pp. 121-132, 2019.
[7] c. j. bonk and c. r. graham. "the handbook of blended learning: global perspectives, local designs". john wiley and sons, hoboken, 2012.
[8] j. macdonald. "blended learning and online tutoring: a good practice guide". gower, uk, 2006.
[9] i. falconer and a. littlejohn. "designing for blended learning, sharing and reuse". journal of further and higher education, vol. 31, no. 1, pp. 41-52, 2007.
[10] m. i. ghareb and s. a. mohammed.
“the effect of e-learning and the role of new technology at university of human development”. international journal of multidisciplinary and current research, vol. 4, pp. 299-307, 2016. [11] m. i. ghareb and s. a. mohammed. “the role of e-learning in producing independent students with critical thinking”. international journal of engineering and computer science, vol. 4, no. 12, pp. 15287, 2016. [12] p. neumeier. “a closer look at blended learning parameters for designing a blended learning environment for language teaching and learning”. recall, vol. 17, no. 2, pp.163-178, 2005. [13] d. dozier. “interactivity, social constructivism, and satisfaction with distance learning among infantry soldiers”. (doctoral dissertation), 2004. [14] m. i. ghareb, s. h. karim, z. a. ahmed and j. kakbra. “understanding student’s learning and e-learning style before university enrollment: a case study in five high schools/sulaimanikrg”. kurdistan journal of applied research, vol. 2, no. 3, pp. 161-166, 2017. [15] m. i. ghareb and s. a. mohammed. “the future of technologybased classroom”. uhd journal of science and technology, vol. 1, no. 1, pp. 27-32, 2017. [16] f. mortera-gutiérrez. “faculty best practices using blended learning in e-learning and face-to-face instruction”. international journal on e-learning, vol. 5, no. 3, pp.313-337, 2006. [17] m. i. ghareb, z. a. ahmed and a. a. ameen. “the role of learning through social network in higher education in krg”. international journal of scientific and technology research, vol. 7, no. 5, pp. 20-27, 2018. [18] m. p. menchaca and t. a. bekele. “learner and instructor identified success factors in distance education”. distance education, vol. 29, no. 3, pp. 231-252, 2008. [19] s. wichadee. “facilitating students’ learning with hybrid instruction: a comparison among four learning styles”. electronic journal of research in educational psychology, vol. 11, no. 1, pp. 99-116, 2013. [20] j. a. lencastre and c. p. coutinho. blended learning. in: “encyclopedia of information science and technology”. 3rd ed. igi global, hershey pa, pp. 1360-1368, 2015. [21] m. macedo-rouet, m. ney, s. charles and g. lallich-boidin. “students’ performance and satisfaction with web vs. paper-based practice quizzes and lecture notes”. computers and education, vol. 53, no. 2, pp. 375-384, 2009. [22] d. neubersch, h. held and a. otto. “operationalizing climate targets under learning: an application of cost-risk analysis”. climatic change, vol. 126, no. (3-4), pp. 305-318, 2014. [23] n. deutsch and n. deutsch. “instructor experiences with implementing technology in blended learning courses”. proquest, umi dissertation publishing, 2010. [24] p. mitchell and p. forer. “blended learning: the perceptions of first-year geography students”. journal of geography in higher education, vol. 34, no. 1, pp. 77-89, 2010. [25] r. b. marks, s. d. sibley and j. b. arbaugh. “a structural equation model of predictors for effective online learning”. journal of management education, vol. 29, no. 4, pp. 531-563, 2005. [26] c. w. holsapple and a. l. post. “defining, assessing, and promoting e learning success: an information systems perspective”. decision sciences journal of innovative education, vol. 4, no. 1, pp. 67-85, 2006. [27] c. greenhow, b. robelia and j. e. hughes. “learning, teaching, and scholarship in a digital age: web 2.0 and classroom research: what path should we take now”? educational researcher, vol. 38, no. 4, pp. 246-259, 2009. [28] m. v. lópez-pérez, m. c. pérez-lópez and l. rodríguez-ariza. 
“blended learning in higher education: students’ perceptions and their relation to outcomes”. computers and education, vol. 56, no. 3, pp. 818-826, 2011. [29] b. h. khan, editor. “managing e-learning: design, delivery, implementation, and evaluation”. igi global, hershey pa, 2005. [30] b. wächter and s. wuttig. “student mobility in european programmes”. eurodata: student mobility in european higher education, pp.162-181, 2006. [31] b. schreurs, s. verjans and w. van petegem. “towards sustainable virtual mobility in higher education institutions”. in: eadtu annual conference, 2006. [32] h. de wit. “global: internationalization of higher education: nine misconceptions”. in: understanding higher education internationalization. sense publishers, rotterdam, pp. 9-12. 2017. [33] k. thompson, r. jowallah and t. b. cavanagh. “solve the big problems: leading through strategic innovation in blended teaching and learning”. in: technology leadership for innovation in higher education. igi global, hershey pa, pp. 26-48, 2019. [34] s. adam. “learning outcomes current developments in europe: update on the issues and applications of learning outcomes associated with the bologna process”. in: presented to the bologna seminar: learning outcomes based higher education: the scottish experience, edinburgh: scottish government, 2008. tx_1~abs:at/tx_2:abs~at 32 uhd journal of science and technology | jan 2023 | vol 7 | issue 1 1. introduction in recent decades, many studies have demonstrated the ability of the machine to examine the environment and learn to distinguish patterns of interest from their background and make reliable and feasible decisions regarding the categories of the patterns. with huge volumes of data to be dealt with and through years of research, the design of approaches based on character recognition (cr) remains an ambiguous goal. various frameworks employed machine learning approaches which have been most comprehensively studied and applied to a large number of systems that are essential in building a high-accuracy recognition system, cr is among the most well-known techniques and methods that make use of such artificial intelligence which have received attention increasingly. moreover, in various application domains, ranging from computer vision to cybersecurity, character classifiers have shown splendid performance [1]-[3]. the application of cr is concerned with several fields of research. through those numerous applications, there is no single approach for recognition or classification that is optimal and that motivates the researchers to explore multiple methods and approaches to employ. in addition, a combination of several techniques and classifiers is popped to the surface to serve the same purpose. due to the increased attention paid to cr-based applications, noticeably there are few comprehensive overviews and systematic mappings of construction of alphabetic character recognition systems: a review hamsa d. majeed*, goran saman nariman department of information technology, college of science and technology, university of human development, kurdistan region, iraq a b s t r a c t character recognition (cr) systems were attracted by a massive number of authors’ interest in this field, and lot of research has been proposed, developed, and published in this regard with different algorithms and techniques due to the great interest and demand of raising the accuracy of the recognition rate and the reliability of the presented system. 
this work proposes a guideline for cr system construction to give authors a clear view of how to build their systems. all the required phases and steps are listed and clarified in sections and subsections, with detailed graphs and tables, alongside the techniques and algorithms that might be used, developed, or merged to create a high-performance recognition system. this guideline can also be useful for readers interested in this field by helping them extract information from such papers easily and efficiently and reach the main structure of a system along with the differences between systems. in addition, this work recommends that researchers in this field include a specific categorical table in their work to give readers the main structure of the proposed system, showing its structural layout and enabling them to easily find the information of interest.
index terms: optical character recognition, script identification, document analysis, character recognition, multiscript documents
corresponding author: hamsa d. majeed, department of information technology, college of science and technology, university of human development, kurdistan region, iraq. e-mail: hamsa.al-rubaie@uhd.edu.iq
received: 09-11-2022 accepted: 07-02-2023 published: 18-02-2023
doi: 10.21928/uhdjst.v7n1y2023.pp32-42 e-issn: 2521-4217 p-issn: 2521-4209
copyright © 2023 majeed and nariman. this is an open access article distributed under the creative commons attribution noncommercial no derivatives license 4.0 (cc by-nc-nd 4.0)
review article
1. introduction
in recent decades, many studies have demonstrated the ability of machines to examine the environment, learn to distinguish patterns of interest from their background, and make reliable and feasible decisions regarding the categories of those patterns. with huge volumes of data to be dealt with, and despite years of research, the design of approaches based on character recognition (cr) remains an elusive goal. various frameworks employ machine learning approaches, which have been most comprehensively studied and applied to a large number of systems essential to building a high-accuracy recognition system; cr is among the most well-known techniques that make use of such artificial intelligence and has received increasing attention. moreover, in various application domains, ranging from computer vision to cybersecurity, character classifiers have shown splendid performance [1]-[3]. the application of cr is concerned with several fields of research. across those numerous applications, there is no single approach for recognition or classification that is optimal, which motivates researchers to explore multiple methods and approaches to employ; in addition, combinations of several techniques and classifiers have surfaced to serve the same purpose. despite the increased attention paid to cr-based applications, there are noticeably few comprehensive overviews and systematic mappings of cr application design. instead, the existing reviews explore in detail a specific domain, technique, or system, focusing on algorithmic and methodological details [4], [5]. when starting investigations in this field, considerable confusion arises when diving into the details of each step of the recognition process, owing to the variety of paths that could be taken to reach the final goal and the pool of factors to be weighed along the way. this makes an in-depth literature review a requirement for surveying which techniques, approaches, or methodologies are available for each phase of the recognition process and for deciding whether they are suitable for the intended cr-based application. the major aim of this study is to present the main path through the various kinds of approaches to be followed before diving into the details of the framework proposed by a given piece of research. moreover, depending on the research field, options are offered and categorized, and techniques and methods are presented and summarized from multiple perspectives, all of which are investigated to answer the following queries:
1. which language will be taken as input for recognition, and what is the specified script writing style?
2. how can the data be acquired? are they captured digitally (touchscreen, scanner, or another digital device) or uploaded from a non-digital source? in printed form from a keyboard or in handwritten form?
3. which scale or level of detail is present in the data? does the script have to be taken as a whole or one character at a time?
4. from which source could the data be collected? is a preprocessing phase needed or not?
5. generally, which recognition process should be followed for optimal outcomes, considering the previously chosen phases?
this work is structured to give the most suitable roadmap to interested authors by presenting a systematic guideline that explores the multidisciplinary path starting from the script writing style, passing through the desired dataset characteristics (acquisition, granularity level, and source of the collected data), and reaching the script recognition process for cr-based applications. furthermore, this study uncovers the potential of cr applications across different domains and specifications by summarizing their purpose, methodologies, and applications. a thorough reading of several types of research, including survey articles, shows that the cr process passes through the same main stations, which can be sorted into separate categories based on specific factors, and any proposed system may stop at those main stations. this encouraged us to carry out this study, to highlight those main stations and present a guideline for interested researchers by examining the details of the sub-stations, so that cr systems can be built efficiently by authors and understood easily by readers.
2. proposed walkthrough guideline
the main goal of this study is to construct design criteria for researchers working in the field of cr systems to observe when initiating research, in both the practical and the written parts. the following classifications and assortments are proposed, as shown in fig. 1.
fig. 1. general assortments of the cr system.
2.1. script writing system
from the linguistic point of view, the scripts used throughout the world are nowadays broken down into six script classes, each of which can be used in one or more languages [6], [7]. furthermore, in the context of cr and the investigation of script character characteristics and structural properties, the script writing system has been classified into six classes, and different classes may contain the same language scripts [8], [9], [10]. fig. 2 illustrates the classification of the script writing system.
fig. 2. script writing system classifications.
2.1.1. logographic system
the oldest kind of writing system is the logographic writing system, also called an ideogram, which employs symbols to depict a whole word or morpheme. the most well-known logographic script is chinese, but logograms such as numbers and the ampersand are found in almost all languages. an ideographic writing system typically has thousands of characters; thus, the recognition of this kind of script is still a challenging and fascinating topic for researchers. han is the only script family in this class, and it includes two more languages, namely, japanese (kanji) and korean (hanja). an interesting distinguishing point between han and other languages is the text line writing direction, which is either from top to bottom or from left to right. in the literature, a lot of research can be found on handwritten cr in these scripts; for instance, [11]-[13] work on chinese, japanese (kanji), and korean (hanja), respectively. the accuracy rates for these scripts, based on the aforementioned references, are 99.39%, 99.64%, and 86.9%, respectively.
2.1.2. syllabic system
every written sign in a syllabic system, such as the one used in japanese, corresponds to a phonetic sound or syllable.
kanas, which are divided into two types, hiragana and katakana, represent japanese syllables. the japanese script combines logographic kanji and syllabic kanas, as mentioned in the previous subsection. the kanas have a visual appearance similar to chinese, except that the kanas have a lower density than chinese. a lot of recognition progress can be found in the literature for both hiragana and katakana; examples of excellent achievements in recognition accuracy are contributed in [11] for both hiragana and katakana, at 98.83% and 98.19%, respectively.
2.1.3. alphabetic system
each consonant and vowel has a distinct symbol in the alphabetic writing system, which is used to write the languages classified under this system. segmental systems are another name for alphabets. to represent spoken language, these systems combine a small number of characters called letters, which are meant to represent certain phonemes. greece is where the alphabet was first used, and it later spread around the world, especially in europe and parts of asia; [14] proposed a system for ancient greek cr that achieved an accuracy rate of 96%. latin, cyrillic, and armenian also belong to this system. numerous languages use the latin alphabet, commonly known as the roman script, with differing degrees of alteration. it is used to write a wide range of european languages, including english, french, italian, portuguese, spanish, german, and others. interested authors have presented recognition systems for the different latin-script languages, for instance, afrikaans 98.53% [15], catalan 91.97% [16], dutch 95.5% [17], english 98.4% [18], french 93.6% [19], italian 92.47% [20], luxembourgish (87.55 ± 0.24)% [21], portuguese 93% [22], spanish 97.08% [23], vietnamese 97% [24], and german 99.7% [25]. cyrillic has a separate letter set but is still relatively comparable to latin. the cyrillic writing system has been adopted by certain asian and eastern european languages, including russian, bulgarian, ukrainian, and macedonian, where the recognition rates recorded for them are as follows: russian 83.42% [26], bulgarian 89.4% [27], ukrainian 95% [28], and macedonian 93% [29]. finally, there is the armenian writing system; armenian is classified as an indo-european language belonging to an independent branch of which it is the only member, and a recent cr system for this language scored 89.95% [30].
2.1.4. abjads
when the words follow a right-to-left writing pattern along the text line, are written as closely placed consonants that leave the vowel sounds to be inferred by the reader, and have long cursive strokes with a few dots, then you are looking at an abjad writing system. it is unlike most other scripts in the world, and it resembles the alphabetic system except that it has symbols for consonantal sounds only. these unique features make the process of script identification for abjads relatively simple compared with other scripts, particularly because of the long cursive strokes with dots and the right-to-left writing direction, making it easier for recognition systems in pen computing. arabic and hebrew are considered the major categories of the abjad writing system. there are some other scripts of arabic origin, such as farsi (persian), urdu, and uyghur.
a lot of approaches have been proposed for identifying abjad-based scripts; for arabic, they use the long main stroke along with the cursive appearance resulting from conjoined words, while the more uniform stroke lengths and discrete letters are the main features relied on for hebrew script recognition. according to the latest survey of arabic recognition systems [31], the highest accuracy score is 99.98%, while 97.15% was recorded for hebrew [32]. for farsi, urdu, and uyghur, the highest accuracies achieved are 99.45%, 98.82%, and 93.94%, respectively [33]-[35].
2.1.5. abugidas
an abugida is a writing script based primarily on consonant letters with secondary vowel notation. abugidas share with alphabetic systems the property of combining character writing styles within the text line. this class belongs to the brahmic family of scripts, which can be expressed in two groups:
1. original brahmi script: this northern group is deployed in the devnagari, bangla (bengali), manipuri, gurumukhi, gujrati, and oriya languages. the most recent survey papers on cr systems for this group report the highest recognition rates of 99% for devnagari, 99.32% for bangla (bengali), 98.70% for manipuri, 99.3% for gurumukhi, 98.78% for gujrati, and 96.7% for oriya [36]-[38].
2. derived from brahmi: these scripts look quite different from the northern group and are used in:
a. south india: tamil, telugu, kannada, and malayalam, where the highest recognition accuracies for the mentioned languages were tamil 98.5%, telugu 98.6%, kannada 92.6%, and malayalam 98.1% [39].
b. southeast asia: thai, lao, burmese, javanese, and balinese; the languages of this group achieved the highest validation rates of 92.1%, 92.41%, and 96.4% for thai, lao, and burmese, while javanese and balinese attained 97.7% and 97.53%, respectively [40]-[43].
2.1.6. featural system
in this form of writing system, distinctive features are represented by symbols or characters. the main language is korean, which is described as less complex and less dense compared with chinese and japanese and is written by mixing logographic hanja and featural hangul; the highest accuracy rate scored for korean was 97.07% [44].
as a summarization of all the findings in this section, table 1 illustrates the classifications of the languages with the highest accuracy recorded so far.
table 1: summarization of languages with their recent highest accuracy rate
script writing system | main language | sub-language | accuracy rate (%)
logographic system | han | chinese | 99.39
logographic system | han | japanese (kanji) | 99.64
logographic system | han | korean (hanja) | 86.9
syllabic system | kanas | japanese (hiragana) | 98.83
syllabic system | kanas | japanese (katakana) | 98.19
alphabetic system | greek | greek | 96
alphabetic system | latin | afrikaans | 98.53
alphabetic system | latin | catalan | 91.97
alphabetic system | latin | dutch | 95.5
alphabetic system | latin | english | 98.4
alphabetic system | latin | french | 93.6
alphabetic system | latin | italian | 92.47
alphabetic system | latin | luxembourgish | (87.55±0.24)
alphabetic system | latin | portuguese | 83
alphabetic system | latin | spanish | 97.08
alphabetic system | latin | vietnamese | 97
alphabetic system | latin | german | 99.7
alphabetic system | cyrillic | russian | 83.42
alphabetic system | cyrillic | bulgarian | 89.4
alphabetic system | cyrillic | ukrainian | 95
alphabetic system | cyrillic | macedonian | 93
alphabetic system | armenian | armenian | 89.95
abjads | hebrew | hebrew | 97.15
abjads | arabic | arabic | 99.98
abjads | arabic | farsi | 99.45
abjads | arabic | urdu | 98.82
abjads | arabic | uighur | 93.94
abugidas | brahmi | devnagari | 99
abugidas | brahmi | bangla (bengali) | 99.32
abugidas | brahmi | manipuri | 98.70
abugidas | brahmi | gurumukhi | 99.3
abugidas | brahmi | gujrati | 98.78
abugidas | brahmi | oriya | 96.7
abugidas | brahmi | tamil | 98.5
abugidas | brahmi | telugu | 98.6
abugidas | brahmi | kannada | 92.6
abugidas | brahmi | malayalam | 98.1
abugidas | brahmi | thai | 92.1
abugidas | brahmi | lao | 92.41
abugidas | brahmi | burmese | 96.4
abugidas | brahmi | javanese | 97.7
abugidas | brahmi | balinese | 97.53
featural system | korean | korean | 97.07
2.2. data acquisition
the next step for the author, after selecting which language to work on, is to decide which writing style will be chosen for recognition. this step is considered one of the fixed and essential phases in all recognition studies and research. reaching this phase requires knowing how to start acquiring data to be fed into the recognition system, and the answer starts with defining the writing style: here the author has two options, either printed script or handwritten script. after making that decision, the acquisition tools must be chosen, either offline tools or online. in this section, a guideline is proposed that can be followed to help make those decisions, as fig. 3 shows.
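to make these decisions easier to track, the short sketch below (python) shows one possible way an author could record the choices that this guideline walks through, in the spirit of the "component chains" listed later in section 3. this is an illustration only; the class and field names are ours and the paper itself defines no code.

```python
# Illustrative only: a compact record of the decisions the guideline walks through.
# All names here are assumptions made for this sketch, not terms from the paper.
from dataclasses import dataclass

@dataclass(frozen=True)
class CRSystemProfile:
    language: str          # e.g., "english", "arabic", "chinese"
    writing_style: str     # "printed" or "handwritten"
    acquisition: str       # "online" or "offline"
    granularity: str       # "page", "paragraph", "line", "word", or "character"
    dataset_source: str    # "public" or "self-constructed"
    process: str           # "psfc", "pfc", "sfc", or "fc"

    def chain(self) -> str:
        """Render the profile as a component-chain string (cf. section 3)."""
        return " -> ".join([self.language,
                            f"{self.acquisition}-{self.writing_style}",
                            f"{self.granularity} level",
                            f"{self.dataset_source} dataset",
                            self.process.upper()])

# example corresponding to the first chain in section 3
profile = CRSystemProfile("english", "handwritten", "offline",
                          "character", "self-constructed", "pfc")
print(profile.chain())
# english -> offline-handwritten -> character level -> self-constructed dataset -> PFC
```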
2.2.1. printed character
printed characters are produced using inked-type tools. in recognition systems for any language, printed characters usually achieve a high recognition rate because they have a regular form, are clean, share the same style, and have similar shapes and lines, which facilitates the learning operation and therefore raises the recognition accuracy in the testing phase.
2.2.2. handwriting character
when the letters of a language are formed by hand, rather than by any typing device, the result is handwritten characters. most authors interested in cr employ handwritten characters as input to their approaches to prove the effectiveness and efficiency of their systems or techniques, because of the complexity that comes with the variety of handwriting styles and writing tools, besides the differences in lines and colors, not to mention the irregular shapes and positions.
2.2.3. online character
these characters are obtained from digital devices with a touch screen, with or without a keyboard, such as a personal digital assistant or a mobile phone, where screen sensors capture the pen-down and pen-up events in addition to the pen-tip movements over the screen.
2.2.4. offline character
this kind of character arises when image processing is involved: an input image of text (from a scanner or a camera) is converted to character codes intended to be used by a text-processing application.
it is essential for the author to choose the correct combination of writing style and writing tool. as fig. 3 illustrates, there are three combinations to decide among: offline-printed, where the input of the cr system is in offline mode with characters taken from a printing device; offline-handwritten, where the input is taken from a human hand in offline mode, already written on paper at an earlier time; and online-handwritten, where the input is fed to the cr system instantly by hand through a touch input device without a keyboard.
fig. 3. overview of data acquisition.
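the difference between the two acquisition modes can be made concrete with a small sketch: an offline character is an image that is loaded and binarized before recognition, whereas an online character is a pen trajectory. the code below is a minimal illustration assuming python with pillow and numpy; the file name and threshold are placeholders, not values from the paper.

```python
# Sketch of the offline path described in section 2.2.4: a scanned or photographed
# character image is loaded and binarized before recognition.
# Assumes Pillow and NumPy; "scan.png" and the threshold 128 are placeholders.
import numpy as np
from PIL import Image

def load_offline_character(path: str, threshold: int = 128) -> np.ndarray:
    """Return a binary (0/1) array: 1 where ink is present, 0 for background."""
    gray = np.asarray(Image.open(path).convert("L"))   # 8-bit grayscale image
    return (gray < threshold).astype(np.uint8)          # dark pixels count as ink

binary = load_offline_character("scan.png")
print(binary.shape, int(binary.sum()), "ink pixels")

# An online sample (section 2.2.3), by contrast, is not an image but a pen
# trajectory: a sequence of (x, y, pen_down) samples captured by a touch screen.
online_sample = [(12, 30, 1), (14, 33, 1), (17, 35, 1), (17, 35, 0)]
```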
some recent recognition systems for several languages are illustrated in table 2 to show some authors' choices of language, writing style, and writing tool, and how those choices affect the accuracy rate of each mechanism. furthermore, a comprehensive survey of online and offline handwriting recognition can be found in plamondon and srihari [45].
table 2: examples of recognition systems with different data acquisition mechanisms
reference | language | writing style | writing tool | accuracy rate (%)
[46] | arabic | handwritten | offline | 99.93
[18] | english | handwritten | offline | 98.4
[47] | english | printed | offline | 98
[48] | english | handwritten | online | 93.0
[49] | chinese | handwritten | online | 98
[50] | chinese | handwritten | offline | 94.9
[13] | chinese | printed | offline | 99.39
[51] | arabic | printed | offline | 97.51
[52] | arabic | handwritten | online | 96
the outcomes of table 2 show that most existing studies have focused on handwritten text, with fewer works attempting to classify or identify printed text. the high variance in handwriting styles across people and the poorer quality of handwritten text compared with printed text make handwritten cr more challenging than printed cr. on the other hand, it is noticeable that offline acquisition is used more often than online acquisition; in the online case, features can be extracted from both the pen trajectory and the resulting image, whereas in the offline case only the image is available, so offline recognition is regarded as harder than online recognition.
2.3. granularity level of documents
the third classification of character handwriting recognition is the "granularity level of documents," which describes the level of detail of the information taken as initial input to the proposed framework. this class can be split into five granularity levels, as shown in fig. 4, from a script page full of text down to a single letter or symbol.
fig. 4. granularity level classification.
in the domain of cr, if the initial input into the ocr framework is not at the character level, the process of script identification must proceed until it reaches a single character. this procedure, known as "segmentation," is covered in subsection 2.5.
2.3.1. document/page level
document-level script is the coarsest granularity level, where the entire document is exposed to the script identification procedure at once. after processing, the document is further broken down into pages, paragraphs, text lines, words, and finally characters to enable the recognition of each letter. although some researchers distinguish between the script recognition process at the document and page levels, the technical methodologies are in general very similar; because of this, some researchers refer to document-level and page-level script recognition interchangeably. finding the text region on a page is the initial step in page-level script identification. this operation can be carried out by separating the pages into text and non-text pieces [53]. several pieces of research can be found in the literature for both offline-handwritten [54] and offline-printed [55] input. after the page of script has been identified, the next level starts, which is paragraph or text block identification. it operates by dividing the entire page into equal-sized text blocks with several lines of content. text blocks can have different sizes, and padding may be necessary if characters lie on the edge of a text block; [56] is an example of segmenting pages into text blocks.
2.3.2. paragraph level
the text block is separated into lines. the white space between lines is typically used for text line segmentation. lines of script are detected and segmented to prepare them for further segmentation processing. both offline-handwritten [57] and offline-printed [58] line detection have been the subject of numerous studies in the literature.
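the white-space rule mentioned above can be illustrated with a short sketch: on a binarized page, rows whose ink count drops to zero separate consecutive text lines. the code below is a minimal illustration assuming python with numpy, not the method of any cited system; the ink threshold is a placeholder.

```python
# Horizontal projection profile line segmentation over a binarized page (1 = ink).
# Assumes NumPy; min_ink is an illustrative threshold.
import numpy as np

def segment_lines(page: np.ndarray, min_ink: int = 1) -> list[tuple[int, int]]:
    """Return (top_row, bottom_row) index pairs, one per detected text line."""
    profile = page.sum(axis=1)            # ink pixels per image row
    lines, start = [], None
    for row, ink in enumerate(profile):
        if ink >= min_ink and start is None:
            start = row                   # entering a text line
        elif ink < min_ink and start is not None:
            lines.append((start, row))    # leaving a text line
            start = None
    if start is not None:                 # a line touching the bottom edge
        lines.append((start, page.shape[0]))
    return lines

# toy 6x6 "page" with two single-row lines separated by blank rows
page = np.array([[0, 1, 1, 0, 0, 0],
                 [0, 0, 0, 0, 0, 0],
                 [0, 0, 0, 0, 0, 0],
                 [0, 1, 0, 1, 1, 0],
                 [0, 0, 0, 0, 0, 0],
                 [0, 0, 0, 0, 0, 0]], dtype=np.uint8)
print(segment_lines(page))   # [(0, 1), (3, 4)]
```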
2.3.3. text line level
a framework that receives a line of script as initial input needs segmentation processes to identify each word in the text; therefore, word identification is needed. text lines are divided into words; usually, the white space between words is used for this purpose. numerous attempts in the literature have addressed the difficulties encountered in this process: for instance, there might be noise, twisted words, missing or partial letters in words, words that do not lie on straight text lines, etc. some examples of identifying words in a text line are [59] for offline-handwritten and [60] for offline-printed input.
2.3.4. word level
character detection and segmentation are required when the initial input is a word. this usually works by combining properties of the various characters to guide the process. several attempts have been made to improve accuracy and ensure that no character inside a word is missed; for instance, [61] recently used distinct strategies and achieved a satisfactory outcome.
2.3.5. character level
finally, no segmentation is required at the character level, because the initial input into the proposed framework is already a single character. the character goes through preprocessing, which is followed by recognition procedures. in some circumstances, no preprocessing is required, as is the case when using a public character dataset; [62] gives examples of working at the character level with and without preprocessing.
in addition, to avoid confusion between granularity levels for the identification/detection and recognition processes, it is worth mentioning that, from the recognition standpoint, a text line granularity level means that the text line is already known and detection and segmentation into words and characters are still needed, whereas from the identification/detection point of view it means that the identification and detection of text lines are being performed. further details about these processes can be found in [10], [63].
2.4. source of collected dataset
the essential component of any machine learning application is the dataset. this leads us to discuss this important phase of cr as the fourth classification, named the source of the collected dataset, which is broken down into two categories, as fig. 5 illustrates.
fig. 5. categories of collected dataset sources.
2.4.1. public dataset (real-world dataset)
the term "public dataset" refers to a dataset saved in the cloud and made open to the public; mnist, keras, and kaggle are examples. almost all public datasets have been preprocessed and cleaned and, in the case of character-level data, are usually reshaped to 28 × 28 pixels and saved as csv files. many authors choose this source to skip the preprocessing step, focus more on the other steps, and easily find counterparts for comparison among those who used the same data source with different techniques.
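as a concrete illustration of consuming a public dataset in the csv layout just described (one row per sample, label in the first column, then 784 pixel values that reshape to 28 × 28), the sketch below assumes python with numpy; the file name and the presence of a header row are assumptions, not details from any specific dataset cited here.

```python
# Sketch of loading a public character dataset stored as CSV: label first,
# then 784 pixel values per row. "train.csv" is a placeholder file name;
# skiprows=1 assumes a header row and should be dropped if there is none.
import numpy as np

def load_csv_dataset(path: str) -> tuple[np.ndarray, np.ndarray]:
    raw = np.loadtxt(path, delimiter=",", skiprows=1)
    labels = raw[:, 0].astype(int)
    images = raw[:, 1:].reshape(-1, 28, 28) / 255.0   # scale pixels to [0, 1]
    return images, labels

images, labels = load_csv_dataset("train.csv")
print(images.shape, labels[:10])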
2.4.2. self-constructed dataset
this is the dataset that researchers create and prepare on their own, depending on their techniques, using an online or offline way of collection. this source of data is considered more challenging because the collected images are not processed at all in terms of resizing, denoising, color, etc. for a fair comparison, this kind of work is better compared with studies done with a self-collected source of data, not with a public one that comes clean and processed. researchers should be aware of the data to be collected and use the proper tools required for preprocessing in a way that suits the technique used for recognition.
2.5. script recognition process
the script recognition process (the implementable phase) is the fifth classification type of the alphabet handwritten recognition framework. in an in-depth study of several research articles, including survey articles, we mainly focused on the phases that an ocr system needs to accomplish its recognition goal. thus, we could conclude that four categories can be defined based on the number of phases that the whole recognition procedure comprises, as presented in fig. 6.
fig. 6. basic components of the script recognition processes.
in addition, script recognition is commonly achieved by blending traditional image processing techniques with image identification and recognition techniques. the recognition composition is formed from four primary phases, namely, preprocessing (p), segmentation (s), feature extraction (f), and classification (c). the last two phases, feature extraction and classification, are the most common in the research; there is no work without these two phases. the next few paragraphs briefly outline them.
• preprocessing (p) is a sequence of operations performed to enhance the input image. it is responsible for removing noise, resizing, thinning, contouring, transforming the dataset into a black-and-white format, edge detection, etc. every one of these operations can be performed with an appropriate technique.
• segmentation (s) performs the duty of obtaining a single character. document processing follows a hierarchy: it starts from the whole page and ends with a single character, and the required level of the hierarchy is a single character.
• feature extraction (f) is a mechanism in which each character is turned into a feature vector using specific feature extraction algorithms; the vector is then fed into a classifier to determine which class it belongs to.
• the classification (c) phase is a decision-making process that uses the features extracted in the preceding step as input and decides what the final output is.
it is worth noting that the recognition of handwritten mathematical symbols and expressions is outside our research scope; therefore, we do not consider the two additional phases (structural analysis and symbol recognition) that are included in such works. more details can be found in sakshi and kukreja [64].
2.5.1. psfc
the first category of the script recognition process class can be called psfc, meaning that all four phases have been utilized to achieve the goal, as [65] describes.
2.5.2. pfc
in the second category, the segmentation process is skipped, in most cases because the system works at the character level as initial input and therefore no segmentation is needed, as presented in parthiban et al. [66].
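to make the composition of phases concrete, the sketch below shows one possible character-level pfc chain in the sense just described: p as binarization and crude resizing, f as simple zoning densities, and c as a k-nearest-neighbours classifier. it assumes python with numpy and scikit-learn and is an illustration under those assumptions, not the method of any system cited in this review.

```python
# A minimal character-level PFC chain: P = binarize and resize, F = 4x4 zoning
# densities, C = k-nearest neighbours. Assumes NumPy and scikit-learn; names,
# thresholds, and sizes are illustrative choices, not values from cited works.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def preprocess(img: np.ndarray, size: int = 28) -> np.ndarray:
    """Binarize a grayscale character and crudely resize it to size x size."""
    binary = (img < img.mean()).astype(np.float32)              # dark pixels = ink
    rows = np.linspace(0, binary.shape[0] - 1, size).astype(int)
    cols = np.linspace(0, binary.shape[1] - 1, size).astype(int)
    return binary[np.ix_(rows, cols)]                            # nearest-neighbour resize

def zoning_features(img: np.ndarray, zones: int = 4) -> np.ndarray:
    """Average ink density in each cell of a zones x zones grid."""
    h, w = img.shape
    cells = [img[i*h//zones:(i+1)*h//zones, j*w//zones:(j+1)*w//zones].mean()
             for i in range(zones) for j in range(zones)]
    return np.array(cells)

def train_pfc(raw_images, labels):
    feats = np.stack([zoning_features(preprocess(x)) for x in raw_images])
    return KNeighborsClassifier(n_neighbors=3).fit(feats, labels)

# usage: clf = train_pfc(train_images, train_labels)
#        pred = clf.predict([zoning_features(preprocess(test_image))])
```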
2.5.3. sfc
the third category is sfc, as in the proposal of [67], where preprocessing is omitted because the input data are originally clean and no preprocessing is required.
2.5.4. fc
in the fourth and last category, as illustrated in gautam and chai [68], the first two phases, p and s, are dismissed because the granularity level is single characters and the initial input data are originally clean. for instance, works utilizing public datasets such as mnist [69] can be classified under this category.
3. examples
this section illustrates some cr systems and describes how to read their roadmap. by applying the proposed guideline, any paper in this field can be summarized in stages according to the author's choices, making it easier for the reader to figure out the main stages and for other authors to develop any desired cr system. some examples are presented here to show how a cr system can be summarized according to the proposed guideline, and creating a table resembling the one suggested here is recommended in such works to provide a comprehensive view of the proposed framework as a whole; it makes it easier for readers to find the information they are searching for before going into depth. table 3 provides two examples of how to present the suggested table.
table 3: examples of the proposed framework of character recognition
classifications | example 4 [48] nominated category | example 5 [13] nominated category
script writing system | english | chinese
data acquisition | online-handwritten | offline-printed
granularity level of documents | line level | character level
source of the collected dataset | public dataset | self-constructed dataset
script recognition process | fc | pfc
in addition, the following examples demonstrate how the systems may be constructed using the component chain:
1. [18] english → offline-handwritten → character level → self-constructed dataset → pfc
2. [46] arabic → offline-handwritten → line level → public dataset → psfc
3. [49] chinese → online-handwritten → page level → public dataset → sfc
4. [48] english → online-handwritten → line level → public dataset → fc
5. [13] chinese → offline-printed → character level → self-constructed dataset → pfc
4. conclusion
cr has stepped ahead as an eminent topic of research. exhaustive studies have continuously presented cr for different languages, with various algorithms developed to increase the reliability and accuracy of recognition. a guideline for cr system construction has been proposed for authors in this field to overcome unclear presentation and expression of ideas in this domain of science. almost all the required steps have been shown and demonstrated by graphs and tables so that they can be used in cr works, giving authors more clarity in delimiting their scope. it also allows readers to recognize the technique used directly through in-text reading and then move on to the details afterward. by reading this guideline, authors, and especially new authors in this field, will be able to order their thoughts and build their recognition systems smoothly and effectively; readers will be able to analyze other research in related fields and extract information easily from other works of interest; and those seeking new ideas or merged techniques can use this guideline to determine the exact part of the recognition system to be studied or compared. saving time, effort, and orientation of thought for other authors and readers was one of the essential aims of this work.
references
[1] m. paolanti and e. frontoni. "multidisciplinary pattern recognition applications: a review". computer science review, vol. 37, pp. 100276, 2020.
[2] m. kawaguchi, k. tanabe, k. yamada, t. sawa, s. hasegawa, m. hayashi and y. nakatani. "determination of the dzyaloshinskii-moriya interaction using pattern recognition and machine learning". npj computational materials, vol. 7, no. 1, 2021.
[3] b. biggio and f. roli. "wild patterns: ten years after the rise of adversarial machine learning". pattern recognition, vol. 84, pp. 317-331, 2018.
[4] t. s. gorripotu, s. gopi, h. samalla, a. v. prasanna and b. samira. "applications of computational intelligence techniques for automatic generation control problem-a short review from 2010 to 2018". in: computational intelligence in pattern recognition. springer singapore, singapore, 2020, pp. 563-578.
[5] m. i. sharif, j. p. li, j. naz and i. rashid. "a comprehensive review on multi-organs tumor detection based on machine learning". pattern recognition letters, vol. 131, pp. 30-37, 2020.
[6] a. nakanishi. "writing systems of the world: alphabets, syllabaries, pictograms". charles e. tuttle co., united states, 1980.
[7] f. coulmas. "the blackwell encyclopedia of writing systems". blackwell, london, england, 1999.
[8] d. sinwar, v. s. dhaka, n. pradhan and s. pandey. "offline script recognition from handwritten and printed multilingual documents: a survey". international journal on document analysis and recognition, vol. 24, no. 1-2, pp. 97-121, 2021.
[9] d. ghosh, t. dube and a. p. shivaprasad. "script recognition-a review". ieee transactions on pattern analysis and machine intelligence, vol. 32, no. 12, pp. 2142-2161, 2010.
[10] k. ubul, g. tursun, a. aysa, d. impedovo, g. pirlo and i. yibulayin. "script identification of multi-script documents: a survey". ieee access, vol. 5, pp. 6546-6559, 2017.
[11] c. tsai. "recognizing handwritten japanese characters using deep convolutional neural networks". university of stanford in stanford, california, pp. 405-410, 2016.
[12] s. purnamawati, d. rachmawati, g. lumanauw, r. f. rahmat and r. taqyuddin. "korean letter handwritten recognition using deep convolutional neural network on android platform". journal of physics conference series, vol. 978, no. 1, p. 012112, 2018.
[13] y. q. li, h. s. chang and d. t. lin. "large-scale printed chinese character recognition for id cards using deep learning and few samples transfer learning". applied sciences, vol. 12, no. 2, p. 907, 2022.
[14] b. robertson and f. boschetti. "large-scale optical character recognition of ancient greek". mouseion journal of the classical association of canada, vol. 14, no. 3, pp. 341-359, 2017.
[15] j. hocking and m. puttkammer. "optical character recognition for south african languages". in: 2016 pattern recognition association of south africa and robotics and mechatronics international conference (prasa-robmech), 2016.
[16] a. fornes, v. romero, a. baró, j. i. toledo, j. a. sánchez, e. vidal, j. lladós.
“icdar2017 competition on information extraction in historical handwritten records”. in: 2017 14th iapr international conference on document analysis and recognition (icdar), 2017. [17] h. van halteren and n. speerstra. “gender recognition on dutch tweets”. computational linguistics in the netherlands journal, vol. 4, pp. 171-190, 2019. [18] h. d. majeed and g. s. nariman. “offline handwritten english alphabet recognition (ohear)”. uhd journal of science and technology, vol. 6, no. 2, pp. 29-38, 2022. [19] k. todorov and g. colavizza. “an assessment of the impact of ocr noise on language models”. in: proceedings of the 14th international conference on agents and artificial intelligence, 2022. [20] m. del buono, l. boatto, v. consorti, v. eramo, a. esposito, f. melcarne and m. tucci. “recognition of handprinted characters in italian cadastral maps”. in: character recognition technologies. spie proceedings, 1993. vol. 1906, pp. 89-99. [21] r. barman, m. ehrmann, s. clematide, s. a. oliveira and f. kaplan, “combining visual and textual features for semantic segmentation of historical newspapers. journal of data mining and digital humanities, 2021. [22] f. lopes, c. teixeira and h. g. oliveira. “comparing different methods for named entity recognition in portuguese neurology text”. journal of medical systems, vol. 44, no. 4, p. 77, 2020. [23] n. alrasheed, p. rao and v. grieco. “character recognition of seventeenth-century spanish american notary records using deep learning”. digital humanities quarterly, vol. 15, no. 4, 2021. [24] t. q. vinh, l. h. duy and n. t. nhan. “vietnamese handwritten character recognition using convolutional neural network”. iaes international journal of artificial intelligence, vol. 9, no. 2, pp. 276283, 2020. [25] a. chaudhuri, k. mandaviya, p. badelia and s. k. ghosh. “optical character recognition systems for german language.” in: optical character recognition systems for different languages with soft computing. cham, springer international publishing, 2017, pp. 137-164. [26] d. gunawan, d. arisandi, f. m. ginting, r. f. rahmat and a. amalia. “russian character recognition using self-organizing map”. journal of physics: conference series, vol. 801, p. 012040, 2017. [27] g. georgiev, p. nakov, k. ganchev, p. osenova and k. i. simov. “feature-rich named entity recognition for bulgarian using conditional random fields”. in: proceedings of the international conference ranlp-2009. arxiv [cs.cl], 2021. [28] a. radchenko, r. zarovsky and v. kazymyr, “method of segmentation and recognition of ukrainian license plates”. in: 2017 ieee international young scientists forum on applied physics and engineering (ysf), 2017. [29] m. gjoreski, g. zajkovski, a. bogatinov, g. madjarov, d. gjorgjevikj and h. gjoreski. “optical character recognition applied on receipts printed in macedonian language”. in: international conference on informatics and information technologies (ciit), 2014. [30] t. ghukasyan, g. davtyan, k. avetisyan and i. andrianov. “pioner: datasets and baselines for armenian named entity recognition”. in: 2018 ivannikov ispras open conference (ispras), 2018. [31] n. alrobah and s. albahli. “arabic handwritten recognition using deep learning: a survey”. arabian journal for science and engineering, 2022. [32] o. keren, t. avinari, r. tsarfaty and o. levy, “breaking character: are subwords good enough for mrls after all?” arxiv [cs.cl], 2022. [33] y. a. nanehkaran, d. zhang, s. salimi, j. chen, y. tian and n. alnabhan. 
“analysis and comparison of machine learning classifiers and deep neural networks techniques for recognition of farsi handwritten digits”. journal of supercomputing, vol. 77, no. 4, pp. 3193-3222, 2021. [34] d. rashid and n. kumar gondhi. “scrutinization of urdu handwritten text recognition with machine learning approach”. in: communications in computer and information science. cham, springer international publishing, 2022, pp. 383-394. [35] y. wang, h. mamat, x. xu, a. aysa and k. ubul. scene uyghur text detection based on fine-grained feature representation”. sensors (basel), vol. 22, no. 12, p. 4372, 2022. [36] s. sharma and s. gupta. “recognition of various scripts using machine learning and deep learning techniques-a review”. in: 2021 6th international conference on signal processing, computing and control (ispcc), 2021. [37] p. d. doshi and p. a. vanjara. “a comprehensive survey on handwritten gujarati character and its modifier recognition methods”. in: information and communication technology for competitive strategies (ictcs 2020). springer singapore, singapore, 2022, pp. 841-850. [38] m. r. haque, m. g. azam, s. m. milon, m. s. hossain, m. a. a. molla and m. s. uddin. “quantitative analysis of deep cnns for multilingual handwritten digit recognition”. in: advances in intelligent systems and computing. singapore: springer singapore, 2021, pp. 15-25. [39] h. singh, r. k. sharma and v. p. singh. “online handwriting recognition systems for indic and non-indic scripts: a review”. artificial intelligence review, vol. 54, no. 2, pp. 1525-1579, 2021. [40] l. saysourinhong, b. zhu and m. nakagawa. “online handwritten lao character recognition by mrf”. ieice transactions on information and systems, vol. e95.d, no. 6, pp. 1603-1609, 2012. [41] c. s. lwin and w. xiangqian. “myanmar handwritten character recognition from similar character groups using k-means and convolutional neural network”. in: 2020 ieee 3rd international conference on electronics and communication engineering (icece), 2020. [42] m. a. rasyidi, t. bariyah, y. i. riskajaya and a. d. septyani. “classification of handwritten javanese script using random forest majeed and nariman: construction of cr systems: a review 42 uhd journal of science and technology | jan 2023 | vol 7 | issue 1 algorithm”. bulletin of electrical engineering and informatics, vol. 10, no. 3, pp. 1308-1315, 2021. [43] i. w. a. darma and n. k. ariasih. “handwritten balinesse character recognition using k-nearest neighbor”. ina-rxiv, 2018. [44] j. park, e. lee, y. kim, i. kang, h. i. koo and n. i. cho. “multilingual optical character recognition system using the reinforcement learning of character segmenter”. ieee access, vol. 8, pp. 174437174448, 2020. [45] r. plamondon and s. n. srihari. “online and off-line handwriting recognition: a comprehensive survey”. ieee transactions on pattern analysis and machine intelligence, vol. 22, no. 1, pp. 6384, 2000. [46] n. s. guptha, v. balamurugan, g. megharaj, k. n. a. sattar and j. d. rose, “cross lingual handwritten character recognition using long short term memory network with aid of elephant herding optimization algorithm”. pattern recognition letters, vol. 159, pp. 16-22, 2022. [47] g. s. katkar and m. v kapoor. “performance analysis of structure similarity algorithm for the recognition of printed cursive english alphabets”. international journal of scientific research in science and technology, vol.8, no.5, pp. 555-559, 2021. [48] s. tabassum, n. abedin, m. m. rahman, m. m. rahman, m. t. ahmed, r. i. 
maruf and a. ahmed. “an online cursive handwritten medical words recognition system for busy doctors in developing countries for ensuring efficient healthcare service delivery”. scientific reports, vol. 12, no. 1, p. 3601, 2022. [49] d. h. wang, c. l. liu, j. l. yu and x. d. zhou. “casia-olhwdb1: a database of online handwritten chinese characters”. in: 2009 10th international conference on document analysis and recognition, 2009. [50] t. q. wang, x. jiang and c. l. liu. “query pixel guided stroke extraction with model-based matching for offline handwritten chinese characters”. pattern recognition, vol. 123, p. 108416, 2022. [51] a. qaroush, b. jaber, k. mohammad, m. washaha, e. maali and n. nayef. “an efficient, font independent word and character segmentation algorithm for printed arabic text”. journal of king saud university-computer and information sciences, vol. 34, no. 1, pp. 1330-1344, 2022. [52] k. m. m. yaagoup and m. e. m. musa. “online arabic handwriting characters recognition using deep learning”. international journal of advanced research in computer and communication engineering, vol. 9, no. 10, pp. 83-92, 2020. [53] p. b. pati, s. sabari raju, n. pati and a. g. ramakrishnan. “gabor filters for document analysis in indian bilingual documents.” in: international conference on intelligent sensing and information processing, 2004. proceedings of, 2004, pp. 123-126 [54] s. m. obaidullah, c. halder, n. das sand k. roy. “numeral script identification from handwritten document images”. procedia computer science, vol. 54, pp. 585-594, 2015. [55] r. bashir and s. quadri. “identification of kashmiri script in a bilingual document image”. in: 2013 ieee second international conference on image information processing (iciip-2013), 2013. [56] s. manjula and r. s. hegadi. “identification and classification of multilingual document using maximized mutual information”. in: 2017 international conference on energy, communication, data analytics and soft computing (icecds), 2017. [57] k. roy, o. m. sk, c. halder, k. santosh and n. das. “automatic line-level script identification from handwritten document images-a region-wise classification framework for indian subcontinent”. malaysian journal of computer science, vol. 31, no. 1, p. 10, 2016. [58] g.s. rao, m. imanuddin and b. harikumar. “script identification of telugu, english and hindi document image”. international journal of advanced engineering and global technology, vol. 2, no. 2, pp. 443-452, 2014. [59] e. o. omayio, i. sreedevi and j. panda. “word segmentation by component tracing and association (cta) technique”. journal of engineering research, 2022. [60] p. k. singh, r. sarkar and m. nasipuri. “offline script identification from multilingual indic-script documents: a state-of-the-art”. computer science review, vol. 15-16, pp. 1-28, 2015. [61] y. baek, d. nam, s. park, j. lee, s. shin, j. baek, c. y. lee and h. lee. “cleval: character-level evaluation for text detection and recognition tasks”. in: 2020 ieee/cvf conference on computer vision and pattern recognition workshops (cvprw), 2020. [62] k. j. taher and h. d. majeed. “recognition of handwritten english numerals based on combining structural and statistical features”. iraqi journal of computers, communications, control and systems engineering, vol. 21, no. 1, pp. 73-83, 2021. [63] d. sinwar, v. s. dhaka, n. pradhan and s. pandey. “offline script recognition from handwritten and printed multilingual documents: a survey”. 
international journal on document analysis and recognition, vol. 24, no. 1-2, pp. 97-121, 2021. [64] sakshi and v. kukreja. “a retrospective study on handwritten mathematical symbols and expressions: classification and recognition”. engineering applications of artificial intelligence, vol. 103, p. 104292, 2021. [65] n. murugan, r. sivakumar, g. yukesh and j. vishnupriyan. “recognition of character from handwritten”. in: 2020 6th international conference on advanced computing and communication systems (icaccs), 2020, pp. 1417-1419. [66] r. parthiban, r. ezhilarasi and d. saravanan. “optical character recognition for english handwritten text using recurrent neural network”. in: 2020 international conference on system, computation, automation and networking (icscan), 2020. [67] h. q. ung, c. t. nguyen, k. m. phan, v. t. m. khuong and m. nakagawa. “clustering online handwritten mathematical expressions”. pattern recognition letters, vol. 146, pp. 267-275, 2021. [68] n. gautam and s. s. chai. “zig-zag diagonal and ann for english character recognition”. international journal of advanced trends in computer science and engineering, vol. 8, no. 1.4, pp. 57-62, 2019. [69] l. deng. “the mnist database of handwritten digit images for machine learning research [best of the web]”. ieee signal processing magazine, vol. 29, no. 6, pp. 141-142, 2012. . 40 uhd journal of science and technology | may 2018 | vol 2 | issue 2 1. introduction drinking water pollution is becoming an increasing problem in the entire world for its severity and toxic effects on human health. the continuous development of significant changes such as population growth, industrialization, expanding urbanization, and diminishing water resources made the issue much worst [1]. awareness of the quality of drinking water is expanding steadily in many countries in the world [2]. heavy metals play a reasoned approach to the classifying of drinking water quality due to their toxicity and poisonousness even at low quantities [3]. heavy metals are the most damaging and dangerous contaminants in water due to non-biodegradable nature and their accumulation in a biological system [4]. drinking water may contain essential and toxic heavy metals. the essential metals are co, cr, fe, mn, mo, ni, se, sn, v, cu, and zn, these metals are critical for sustain biological life, but still, their accumulation in the body may cause dangerous assessment of heavy metals contamination in drinking water of garmian region, kurdistan, iraq hayder mohammed issa1, azad h. alshatteri2 1college of human sciences, university of garmian, sulaymaniyah province 46021, kurdistan region, iraq, 2department of chemistry, college of education, university of garmian, sulaymaniyah province 46021, kurdistan region, iraq a b s t r a c t drinking water of safe quality is a critical issue for human survival and health. water pollution by heavy metals is very crucial because of their toxicity. this study assesses the potential of heavy metal pollution in drinking water in the three districts of garmian region, east iraq. water samples were investigated for 23 heavy metals and 6 chemical contaminants collected from 16 locations between january 1 and october 31 in 2017. the analysis was performed using inductively coupled plasma optical emission spectroscopy (icpoes, spectro arcos). high levels of al, se, sr, and fe have been detected at certain locations in the study area. statistical analysis techniques of the correlation matrix and cluster hierarchical analysis were conducted. 
the heavy metals pollution index (hpi), heavy metals evaluation index (hei), and contamination index (c d ) were applied. these indices were linked with the statistical analysis to interpret relationships among the tested parameters in water samples and to investigate pollution sources over the study region. even with the significant correlations between the hpi, c d , and hei, they showed a dissimilar impact of the examined heavy metals on water quality. it was found that concentrations of heavy metals such as al, fe, and se are elevated (0.550, 0.736, and 0.044 mg/l, respectively) at certain locations according to the last update of the who guidelines for drinking water. the most reliable pollution evaluation index for drinking water, hei, showed that 44% of the water samples are critically polluted. sources of the contamination are most likely natural geological sources; an anthropogenic impact was noticed only at several sites in the study area. index terms: drinking water, environmental risk assessment, garmian region, heavy metals, multivariate statistical analysis corresponding author’s e-mail: hayder mohammed issa, university of garmian, sulaymaniyah province 46021, iraq. phone: +9647708617536. e-mail: hayder.mohammed@garmian.edu.krd received: 27-07-2018 accepted: 25-08-2018 published: 08-09-2018 access this article online doi: 10.21928/uhdjst.v2n2y2018.pp40-53 e-issn: 2521-4217 p-issn: 2521-4209 copyright © 2018 issa, et al. this is an open access article distributed under the creative commons attribution non-commercial no derivatives license 4.0 (cc by-nc-nd 4.0) original research article effects [5]. the toxic and non-essential heavy metals such as pb, al, as, ba, hg, be, and ti can cause acute or chronic poisoning [6], [7]. over the past years, various works have been performed to identify heavy metals pollution in drinking water [8]-[11]. any reliable assessment of water quality needs to take into account additional chemical parameters such as ca, na, mg, po4, no3, so4, and total hardness in order to obtain a complete view of the drinking water quality condition, as the chemical parameters in drinking water may have important environmental and sanitary consequences [12]. assessment of drinking water quality requires recognizing the regional geogenic and anthropogenic characteristics of the area to be studied. naturally, heavy metals reach water resources by leaching from the contacting soil and underlying rocks. heavy metals may also come from anthropogenic activities such as agricultural run-off, effluents discharged from cities, industrial plants, and mining sites [13]-[15]. evaluation of heavy metal traces in drinking water has been performed by generating pollution indices. these indices describe the overall water quality in terms of heavy metals contamination. many indices are used for this purpose, such as the heavy metals pollution index (hpi), the heavy metals evaluation index (hei), and the contamination degree c d [16]-[18]. c d is distinguished by the fact that it includes heavy metals and other coexisting contaminants in the water quality evaluation [19]. many parts of iraq, including the garmian region, suffer from the low quality and pollution of their drinking water sources [20]-[22].
as a result, several attempts to have been made to define the potential risk of drinking water quality in the area concerned [23]. however, up until now, no extensive analysis has been performed to identify heavy metals levels in drinking water at garmian region. the current study tests the heavy metal concentration levels in drinking water of garmian region, east iraq by establishing a reliable dataset that aids further investigations to develop remediation strategies, to enhance the environment of the region, and to protect people health. this work investigates 23 heavy metals and six chemical parameters in drinking water samples from 16 different locations in garmian region. hpi, hei, c d , with statistical analysis approaches of anova, the correlation matrix (cm), and cluster analysis cluster hierarchical analysis (ca) have been carried out to detect the possible pollution sources. 2. materials and methods 2.1. study area garmian region is located between latitudes (34° 17’ 15”35° 10’ 35”) north and longitudes (44° 31’ 30”-45° 47’ 10”) east. (fig. 1), the study area has a total area of 6716.5 km2 in three districts kalar, kifri, and khanaqin. the region has a population of 300,000 inhabitants, with no major industrial constructions. the physicographic feature of the area is an alluvial plain in the south and west; while the area lies within foothill in the north and east. the major river systems draining the area include alwand, diyala-sirwan, and awaspi rivers. a climate of the study area is continental semiarid by potential evaporation [24]. soil order of the area is mainly aridisols [25]. the land surface is covered by sand, silt, and clay, while periodically several areas are covered by gravel [26]. many parts of the study area are rich with gypsum minerals [27]. the area is underlain by the outcropping formations of tertiary (pliocene), and the quaternary deposits (pleistocene–holocene) consist in the alternation of sandstone, siltstone, and claystone [28]. 2.2. collection of water samples water samples (surface and groundwater) were collected from the study area and sampling locations from 16 locations in garmian districts between january 1 and october 31 in 2017, three samples were collected from each location. water sample was collected from selected sites, where including different water systems and an area covered a stretch of about 60–70 km. in the field a clean pre-washed (250 ml.) polyethylene bucket, which had been connected with a long rope used for fig. 1. map of study area, and locations of water sampling (the first map of garmian region topography is courtesy of garmian region directorate 2017, the second map of garmian region location according to kurdistan region, and iraq was modified by the authors). hayder mohammed issa and  azad h. alshatteri: drinking water quality of garmian region 42 uhd journal of science and technology | may 2018 | vol 2 | issue 2 collection of water samples from different sampling sites. the water sample was allowed to pass through the bucket for a while. samples were identified in table i. all samples were acidified with 2% nitric acid (ph-2), and refrigerated and transferred to the instrumental research laboratory to analyze them. all samples were analyzed within 2 days from the time of collection by inductively coupled plasma optical emission spectroscopy (icpoes) (spectro across germany) at university of gar mian. the standard solutions were prepared by serial dilutions of the 1000 mg/l. 
distilled deionized water was used for the dilutions and for washing all glassware [4]. 2.3. heavy metals analysis various accurate analytical methods are applied to determine heavy metal concentrations in water samples, such as atomic absorption spectrometry (aas) [29], [30], icp-oes [31]-[33], and inductively coupled plasma mass spectrometry (icp-ms) [34]. all water samples were stored in polyethylene containers and returned to the laboratory under dark conditions within 1–2 h of collection time. the water samples were acidified by adding concentrated nitric acid (hno 3 ) and stored at 25°c for trace metal determination purposes. an icp-oes spectro arcos was used to analyze the 23 heavy metals. the instrument conditions were: spray chamber: scott spray; nebulizer: crossflow; rf power (w): 1400; pump speed: 30 rpm; coolant flow (l/min): 14; auxiliary flow (l/min): 0.9; nebulizer gas flow (l/min): 0.8; preflush (s): 40; measure time (s): 28; replicate measurements: 3; argon gas purity ≥ 99.99%. multi-element stock solutions containing 1000 mg/l were obtained from bernd kraft (bernd kraft gmbh, duisburg, germany), and standard solutions were prepared by serial dilution to 0.1, 0.5, and 2 ppm in 0.5% nitric acid as diluent [2]. 2.4. statistical analysis water pollution indices and statistical approaches were implemented to evaluate the potential sources and levels of heavy metals. typically, evaluation of water quality by pollution indices depends on a large dataset collected for the relevant contamination parameters in water samples at different locations. application of water pollution indices is therefore associated with statistical analytical techniques that interpret and classify the obtained water quality datasets. among the numerous available statistical techniques, univariate anova, the bivariate correlation coefficient matrix (cm), and multivariate cluster analysis (ca) are used to assess the heavy metals impact on water quality [35], [36]. these statistical methods become helpful when water quality results require additional explanation to identify the source and pathway of the contamination. the datasets obtained from the water samples were subjected to statistical analysis using excel 2013 software. two statistical analyses were performed to deduce the sources of heavy metals: anova with cm interpretation, and cluster analysis (ca). anova helps to establish the significance of the variation between sampling locations, while the cm was used to reveal the relationships between the examined heavy metals and chemical contaminants.
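the analyses above were carried out in excel 2013; as a rough, non-authoritative illustration of the same two checks (a one-way anova across sampling locations and a pearson correlation matrix between parameters), the sketch below uses pandas and scipy on a small, rounded subset of the values later reported in tables iii and iv. the specific locations and parameters chosen are illustrative only.

```python
# illustrative sketch of the anova and correlation-matrix checks described in
# section 2.4; values are a rounded subset of tables iii and iv.
import pandas as pd
from scipy import stats

# rows = sampling locations, columns = measured parameters (mg/l)
data = pd.DataFrame(
    {
        "fe": [0.009, 0.071, 0.736, 0.034],
        "sr": [1.241, 11.940, 1.190, 3.884],
        "se": [0.027, 0.042, 0.035, 0.039],
        "ca": [36.5, 142.0, 59.1, 81.3],
    },
    index=["s1", "s2", "s8", "s10"],
)

# one-way anova treating each location (row) as one group of parameter values,
# mirroring the single-factor, no-replication layout used in the study
groups = [row.values for _, row in data.iterrows()]
f_stat, p_value = stats.f_oneway(*groups)
print(f"anova: f = {f_stat:.3f}, p = {p_value:.5f}")

# pearson correlation matrix between parameters (the cm of section 2.4);
# pairs with |r| > 0.7 correspond to the "strong" relationships of section 3.2
cm = data.corr(method="pearson")
print(cm.round(2))
```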
cluster analysis was applied in this work to classify the water samples according to the spatial variation of their heavy metal and chemical parameters. the ward linkage method and euclidean distance are the basis of the statistical cluster analysis, and agglomerative hierarchical clustering is the clustering approach used. cluster analysis of the water samples was made using xlstat (version 2017 for excel 2013).

table i: the description of sources of water samples
sampling symbol | samples | location | site coordinate | source
s1 | mineral water | bani-khailan, kalar district | 35.07, 45.67 | spring water
s2 | drilled well 1 | kifri district | 34.91, 44.82 | groundwater
s3 | drilled well 2 | kifri district | 35.02, 44.63 | groundwater
s4 | water project | kifri district | 34.70, 44.96 | surface water, awaspi river
s5 | drilled well 3 | kifri district | 34.91, 45.07 | groundwater
s6 | drilled well 4 | kifri district | 34.87, 44.85 | groundwater
s7 | drilled well 5 | kalar district | 34.64, 45.30 | groundwater
s8 | water project | kalar district | 34.65, 45.36 | surface water, sirwan river
s9 | drilled well 6 | kalar district | 34.83, 45.51 | groundwater
s10 | drilled well 7 | sarqala, kifri district | 34.74, 45.06 | groundwater
s11 | drilled well 8 | sarqala, kifri district | 34.74, 45.08 | groundwater
s12 | drilled well 9 | rizgari, kalar district | 34.66, 45.26 | groundwater
s13 | drilled well 10 | rizgari, kalar district | 34.67, 45.18 | groundwater
s14 | drilled well 11 | khanaqin district | 34.57, 45.35 | groundwater
s15 | drilled well 12 | khanaqin district | 34.39, 45.35 | groundwater
s16 | water project | khanaqin district | 34.35, 45.39 | surface water, alwand river and balaju-canal

2.5. heavy metals pollution assessment 2.5.1. heavy metal pollution index (hpi) in this study, the heavy metal pollution index (hpi) was used with the formula proposed by mohan et al. [37], where the water quality is assessed according to the presence and importance of heavy metals in the water samples. many works have used this index to acquire information on the heavy metal pollution potential of tested waters [38]-[41]. hpi is an arithmetical tool computed on the basis of the weighted arithmetic mean method to transform the available water data into a single derived number that reflects the effect of the relevant heavy metals on water quality.

HPI = \frac{\sum_{i=1}^{n} W_i Q_i}{\sum_{i=1}^{n} W_i}   (1)

where q i is the subindex of the i-th parameter, w i is the weight of the i-th parameter, and n is the total number of parameters included in the test. w i for each parameter is inversely proportional to the recommended standard for the corresponding parameter. the i-th parameter subindex is calculated as follows:

Q_i = \sum_{i=1}^{n} \frac{|M_i - I_i|}{S_i - I_i} \times 100   (2)

where m i , s i , and i i are the monitored, standard, and ideal values of the i-th parameter for the investigated heavy metals. 2.5.2. hei hei is another pollution index related to heavy metals. usually, it is applied to get an overall idea of the potential water contamination caused by heavy metals. hei is calculated by the following equation [42], [43]:

HEI = \sum_{i=1}^{n} \frac{H_c}{H_{mac}}   (3)

where h c and h mac are the observed and maximum permissible concentrations of the i-th parameter, respectively. 2.5.3. c d the c d is computed to evaluate the contamination of water quality; c d is the sum of the contamination factors of the individual parameters whose values exceed the upper allowable limits [44].
c d takes into consideration the number of parameters exceeding the permissible limits and their concentrations [45]. many works have used this index to reveal potential contamination and the combined effects of harmful quality parameters in various water resources, such as [46] and [47]. c d is calculated in the following two steps:

C_d = \sum_{i=1}^{n} C_{fi}   (4)

C_{fi} = \frac{C_{Ai}}{C_{Ni}} - 1   (5)

where c fi , c ai , and c ni are the contamination factor, the analytical value, and the upper allowable concentration of the i-th parameter, respectively. 2.6. methods evaluation before going any further, it was necessary to evaluate the performance of the method applied in this study. the performance evaluation is usually made according to the limit of detection (lod), limit of quantification (loq), and linearity [38], [48]. for the elements measured by icp-oes, the calibration curves were established using the standard addition method. the linearity of the analyzed elements was tested and approved. the lod and loq were estimated from their relations with the standard deviation. the accuracy and reproducibility of the elements analyzed and measured by icp-oes were determined by spiking and homogenizing three replicates of each of three samples collected randomly from the sampling locations. 3. results and discussion 3.1. heavy metals in drinking water samples the presence of heavy metals in the drinking water samples (groundwater and surface water) from the 16 different sites in garmian region is illustrated in tables ii and iii. in this study, 23 metals (cr, cu, fe, mn, mo, al, sr, zn, ba, se, li, v, ni, cd, as, pb, co, tl, ag, be, hg, sb, and sn) were analyzed. descriptive statistics, including the maximum permissible limit (mpl) and lod with the wavelength for the investigated heavy metals at all water sampling locations, are presented in table ii.
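to make eqs. (1)-(5) concrete, the following is a minimal sketch of how the three indices can be evaluated. it assumes the weights w i are taken as 1/s i (the text only states that they are inversely proportional to the standard) and the ideal values i i are zero, since the paper does not list them; the example concentrations are the elevated fe, al, and se values reported for the study area, with their mpls from table ii.

```python
# minimal sketch of eqs. (1)-(5); w_i = 1/s_i and i_i = 0 are assumptions
def hpi(monitored, standard, ideal):
    """heavy metal pollution index, eqs. (1)-(2)."""
    weights = [1.0 / s for s in standard]                      # w_i ∝ 1/s_i
    sub_idx = [abs(m - i) / (s - i) * 100                      # q_i, eq. (2)
               for m, s, i in zip(monitored, standard, ideal)]
    return sum(w * q for w, q in zip(weights, sub_idx)) / sum(weights)

def hei(observed, mac):
    """heavy metal evaluation index, eq. (3): sum of h_c / h_mac."""
    return sum(h / m for h, m in zip(observed, mac))

def cd(analytical, upper_allowable):
    """contamination degree, eqs. (4)-(5): sum of (c_ai / c_ni - 1)."""
    return sum(a / n - 1.0 for a, n in zip(analytical, upper_allowable))

# example: fe, al, se (mg/l) with their mpls from table ii
measured = [0.736, 0.550, 0.044]
limits   = [0.2,   0.1,   0.04]
ideal    = [0.0,   0.0,   0.0]
print(hpi(measured, limits, ideal), hei(measured, limits), cd(measured, limits))
```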
strontium sr and selenium se mean values in water samples are 3.838 and 0.038 mg/l that is close to the maximum permissible limits mpl of 4, and 0.04 mg/l, respectively, hence this reveals the contribution of sr and se in the drop of drinking water quality of the area. the rest of the parameters showed lower concentrations in tested samples. table iii illustrates more details on heavy metals concentrations among the analyzed drinking water samples that collected from various locations in garmian region. the obtained results showed a sign of pollution hazards of certain heavy metals. for cr high level it was determined to be 0.021 mg/l for water samples collected from location s10 and was low or bdl in the other locations. cr was only found in groundwater samples (0.001–0.021 mg/l). in all table ii: descriptive statistics for heavy metal and chemical parameters in tested water samples parameter min max mean median standard deviation lod (mg/l) mpl (mg/l) wavelength (λ) cr 0.000 0.021 0.003 0.000 0.006 0.0010 0.05 267.7 cu 0.011 0.028 0.018 0.016 0.005 0.0010 1 324.8 fe 0.009 0.736 0.074 0.0155 0.179 0.0020 0.2a 259.9 mn 0.001 0.020 0.004 0.001 0.006 0.0010 0.05a 257.6 mo 0.001 0.006 0.003 0.002 0.001 0.0010 0.07b 202.1 al 0.000 0.550 0.038 0.000 0.137 0.0040 0.1b 396.2 sr 1.046 11.94 3.838 3.9865 2.900 0.0020 4d 407.7 zn 0.001 0.386 0.055 0.0175 0.095 0.0010 3 213.9 ba 0.006 0.094 0.034 0.0165 0.027 0.0044 0.7 455.4 se 0.027 0.044 0.038 0.038 0.004 0.0020 0.04 196.1 li 0.004 0.078 0.037 0.034 0.021 0.0010 0.01e 670.8 v 0.001 0.008 0.005 0.0045 0.002 0.0025 0.015f 292.4 as bdl bdl bdl bdl -0.0026 0.01 189.0 ag bdl bdl bdl bdl -0.0012 0.05 328.1 be bdl bdl bdl bdl -0.0010 0.004c 313.1 cd bdl bdl bdl bdl -0.0010 0.003 214.4 co bdl bdl bdl bdl -0.0010 0.1f 228.6 hg bdl bdl bdl bdl -0.0040 0.006 184.9 ni bdl bdl bdl bdl -0.0010 0.07 231.6 pb bdl bdl bdl bdl -0.0035 0.01 220.4 sb bdl bdl bdl bdl -0.0068 0.02 206.8 sn bdl bdl bdl bdl -0.0010 0.001k 190.0 tl bdl bdl bdl bdl -0.0040 0.0072g 190.9 ca 36.48 175.41 103.55 114.54 49.22 0.004 75h 315.9 k 0.78 5.04 2.24 2.16 1.24 0.031 12b 766.5 mg 9.91 69.76 37.76 47.95 20.32 0.005 50b 279.1 na 5.34 125.53 50.59 50.09 39.86 0.066 50 330.2 p 0.03 0.07 0.04 0.04 0.01 0.002 0.16m 177.5 t. hardness 139.17 724.40 413.57 470.67 203.75 -200h -lod: limit of detection, bdl: below detection limit, mpl: maximum permissible limit, aadapted from [50], b (who, 2011) adapted from [51], c (usepa, 2008) adapted from [51], d (usepa, 2008) adapted from [52], eadapted from [53], fadapted from [54], f (usepa, 008) adapted from [55], gadapted from [56], h (who, 2006) adapted from [57], kadapted from [58], madapted from [59] hayder mohammed issa and  azad h. alshatteri: drinking water quality of garmian region uhd journal of science and technology | may 2018 | vol 2 | issue 2 45 sampling locations, low levels of cu were detected ranging of 0.011–0.028 mg/l. however, in one location s8 a high level of fe 0.736 mg/l exceeding the mpl. it was found that the concentrations of mn, mo, zn, ba, and v were lower than mpl of 0.05, 0.07, 3.0, 0.7, and 0.015 mg/l, respectively, in all sampling locations that considered in this study. zn showed critical concentrations at locations s2 and s10 with the range of 0.386 and 0.103 mg/l. it was noticed at some locations that al, sr, and li concentrations are higher than mpl specified by this work. 
some heavy metals were not detected in this study in all sampling locations due to their low concentrations levels such as as, ag, be, cd, co, hg, ni, pb, sb, sn, and tl. samples from diyala-sirwan river downstream location s8 (kalar drinking water project was established to provide potable water to kalar city residents) looks like having higher concentrations than mpl values of al and fe when compared to groundwater and surface water samples from other locations. this elevation in fe and al levels is due to the fact that diyala-sirwan river flows through small building materials manufactures. therefore, the high contamination in this location may come from effluents discharged by these sites and also from aluminum-rich materials used in water treatment. considerable contaminations of sr were observed in various locations s1 to s6 and s14 to s16 for both surface and groundwater sources at khanaqin and kifri districts. according to usepa 2008 standards of mpl is 4.0 mg/l [52], many water samples contain a high level of sr parameter. these levels are generally related with environmental contamination generated by a natural occurrence of alkaline earth metal. this could be relatively distributed in groundwater as well as in surface water and that is common in such systems and crustal materials [52], [60]. se and li levels are high in water samples s2, s3, and s14 for se and s2, s3 while the concentration of li is 0.055 mg/l for s6. high se and li levels in certain groundwater samples are occurring due to geogenic sources such as weathering and leaching of rocks, dissolution of soluble salts in soils, and it might occur due to anthropogenic activities [61], [62]. several chemical parameters of the water quality were investigated in this study. according to their levels and roles in the anthropological life that called macro essential elements, five cations chemical elements were analyzed include ca, k, mg, na, and p. the statistical description for these chemical parameters of maximum, minimum, mean, median, and standard deviation for all water samples is summarized in table ii. in many locations, statistics show that the mean and median concentrations are close to or even exceed the mpl. from tables ii and iv, it can be noticed that the ranges of the studied cations of the water samples (mg/l) were ca, 36.48–175.4; k, 0.777–5.042; mg, 9.914–69.757; na, 5.34– 125.53; and p, 0.029–0.68; t. hardness, and 139.171–724.4. ca and na and t. hardness are in the first class. magnesium has shown high concentrations in water samples from most locations and exceeded the mpl. high concentrations of ca and mg exist in water samples of khanaqin district (s14 to s16), kifri district (s2, s3, s4, s5, s6, and s10), and in one location at kalar district s7. accordingly, at these locations, the total hardness is high also. sources of elevated ca, na, and mg ions are more likely to be geogenic, like natural hydro-geochemical processes of soil leaching and chemical weathering of rocks from the adjoining basement complex that causes salinized groundwater and river water [63]. 
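as a companion to the exceedance discussion above, the short sketch below screens a few of the table iii values against the mpls of table ii and lists, per sample, the parameters that exceed their limits. it is only a worked illustration of the comparison, not part of the authors' workflow.

```python
# screen a subset of table iii concentrations against the mpls of table ii
mpl = {"fe": 0.2, "al": 0.1, "sr": 4.0, "se": 0.04, "zn": 3.0}  # mg/l

samples = {
    "s2":  {"fe": 0.071, "sr": 11.940, "se": 0.042, "zn": 0.386},
    "s8":  {"fe": 0.736, "al": 0.550,  "sr": 1.190, "se": 0.035},
    "s10": {"fe": 0.034, "sr": 3.884,  "se": 0.039, "zn": 0.103},
}

for site, values in samples.items():
    exceeded = [p for p, c in values.items() if c > mpl[p]]
    print(site, "exceeds mpl for:", exceeded or "none")
# expected: s2 -> ['sr', 'se'], s8 -> ['fe', 'al'], s10 -> none
```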
table iii: concentrations of heavy metals in drinking water samples detected by icpoes sample location concentration (mg/l) cr cu fe mn mo al sr zn ba se li v s1 bdl 0.021 0.009 0.001 0.002 bdl 1.241 0.001 0.014 0.027 0.004 0.001 s2 0.009 0.012 0.071 0.009 0.005 bdl 11.940 0.386 0.006 0.042 0.075 0.005 s3 0.008 0.011 0.067 0.004 0.006 bdl 6.713 0.065 0.016 0.042 0.078 0.005 s4 bdl 0.015 0.011 0.001 0.002 bdl 4.398 0.020 0.015 0.038 0.047 0.004 s5 bdl 0.015 0.011 0.001 0.002 bdl 4.971 0.015 0.013 0.040 0.049 0.005 s6 bdl 0.013 0.013 0.001 0.001 bdl 5.804 0.037 0.015 0.038 0.055 0.005 s7 bdl 0.019 0.011 0.001 0.001 bdl 1.738 0.001 0.062 0.040 0.026 0.003 s8 bdl 0.021 0.736 0.018 0.003 0.55 1.190 0.006 0.094 0.035 0.033 0.004 s9 bdl 0.022 0.012 0.001 0.001 bdl 1.143 0.015 0.065 0.034 0.027 0.003 s10 0.021 0.016 0.034 0.003 0.002 bdl 3.884 0.103 0.017 0.039 0.040 0.008 s11 bdl 0.023 0.014 0.001 0.004 bdl 1.555 0.093 0.042 0.038 0.023 0.006 s12 0.001 0.024 0.015 0.001 0.003 bdl 1.601 0.006 0.045 0.034 0.019 0.007 s13 0.002 0.028 0.016 0.001 0.002 bdl 1.046 0.001 0.056 0.034 0.011 0.008 s14 bdl 0.016 0.017 0.001 0.004 bdl 4.631 0.091 0.012 0.044 0.035 0.003 s15 bdl 0.014 0.034 0.001 0.002 bdl 4.089 0.034 0.015 0.038 0.027 0.003 s16 bdl 0.013 0.112 0.020 0.004 0.05 5.466 0.004 0.063 0.037 0.043 0.003 hayder mohammed issa and  azad h. alshatteri: drinking water quality of garmian region 46 uhd journal of science and technology | may 2018 | vol 2 | issue 2 especially in rural areas in the study region, the agricultural runoff has happened on a limited scale. other anthropogenic activities consequences such as wastewater mixing or leakage have not considerable effects on the groundwater quality. this comes from the fact that no significant human actions present considerable accumulations of chemical elements like cations in water resources at these areas. these variations in cations concentrations are well-known phenomenon, and it has been observed by previous works [64], [65]. 3.2. statistical analysis a one-way analysis of variance anova function of excel 2013 was used in this work to validate the significant differences among sampling locations. statistically analyzed results of water samples using anova were at 95 % confidence level [2]. the variance analysis results showed that all tested heavy metals and chemicals were substantially different at p < 0.05. p = 0.00722, f value was 2.187, and f crit was 1.7058. one-way anova technique was applied in this work because there is only one variable is tested which is the spatial variance of the study area without replication for each sample. fig. 2 illustrates the most significant variance of the investigated heavy metals and chemicals in drinking water samples. fe and al levels showed an interesting deviation at location s8 as mentioned before. location s8 is a water treatment plant at kalar city that takes raw water from the nearby diyala-sirwan river. this distinction refers to the impact of discharge by the existed construction materials plants situated along the river bank. similarly, it refers to potential contamination by aluminum-rich material used in water treatment. fig. 2 shows high concentrations of particular heavy metals such as se and sr in most water samples in the study area. as there is no significant anthropogenic activity can cause these elevations in the region. it is assumed that heavy metals come from natural geogenic sources. ca and mg levels are high almost all over the study region as presented in fig. 2. 
these high levels of ca and mg are typically caused by geological properties of the region [42]. the cm analysis was performed to figure out the relationships among the water sample contaminants. a correlation coefficient nearer to 1.0 means perfect linear relation between the related parameters. normally, a correlation coefficient of 1.0 is achieved for parameters related with itself. table v illustrates the correlation coefficients matrix between heavy metals and other parameters. relationships of coefficients >0.5 between two investigated parameters at 5% level of significance and p < 0.05 are considered significant. such coefficients were generated between certain pairs of heavy metals or chemical parameters in the water samples. strong positive relationships (>0.7) between heavy metals were observed for example (fe with al), (li with sr and se). at the same time, strong negative relationships (<0.8) were found such as (sr with cu), and (cu with li). correlations at p < 0.05 were obtained for the tested heavy metals and chemical parameters. there were significant positive correlations between se, li, and sr with all tested chemical parameters in this study except p. furthermore, significant negative correlations exist between cu with all tested chemical parameters in this study except p. table iv: concentrations of chemical parameters in water samples sample location concentration (mg/l) ca k mg na p t. hardness s1 36.484 0.777 14.811 5.340 0.043 151.816 s2 141.971 2.649 47.825 79.775 0.029 551.01 s3 139.498 4.640 48.077 125.530 0.032 545.86 s4 135.276 2.591 48.693 59.302 0.037 537.712 s5 138.011 2.661 48.867 58.618 0.036 545.263 s6 157.008 2.998 61.069 66.198 0.034 642.784 s7 93.794 1.461 23.024 15.805 0.040 328.765 s8 59.108 2.396 17.636 14.340 0.054 219.959 s9 60.182 1.244 16.340 9.985 0.048 217.330 s10 81.322 1.932 48.889 112.068 0.039 403.631 s11 48.927 1.054 18.055 15.993 0.049 196.224 s12 50.530 1.081 16.222 17.907 0.065 192.716 s13 39.457 1.028 9.914 15.094 0.068 139.171 s14 139.603 1.765 52.928 41.571 0.039 565.893 s15 160.174 2.528 62.035 64.985 0.036 654.660 s16 175.406 5.042 69.757 106.877 0.045 724.400 hayder mohammed issa and  azad h. alshatteri: drinking water quality of garmian region uhd journal of science and technology | may 2018 | vol 2 | issue 2 47 these significant correlations confirm the source of the heavy metals and chemical parameters in water samples are the geological structure or composition of rocks, soil. heavy metals enrichment of al and fe in the water sample s8 is attributed to small projects constructed beside diyala-sirwan river, as most the effluents are washed by surface runoff and goes into the river. aluminum-rich materials utilized on the site of the water treatment plant could be the second source of al [66]. 3.2.1. cluster analysis the ca analysis can identify any similarity that exists among clustered results. by showing considerable internal clusters homogeneity and significant external heterogeneity concerning clusters. hierarchical agglomerative clustering is applied to find any spatial similarity between water samples regarding their locations in the study area. from the results illustrated in fig. 3, the dendrogram of hierarchical cluster analysis has generated three distinct clusters. a similarity of water samples in term of sampling locations are classified into three principal cluster groups. the main groups of sample locations are cluster 1, contains sampling locations of s2, s3, and s4, s5, s6, s10, s14, s15, and s16. 
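the dendrogram discussed in section 3.2.1 below was produced with xlstat; a minimal scipy sketch of the same agglomerative approach (ward linkage on euclidean distances, cut into three clusters) is shown here on a standardized, rounded subset of tables iii and iv. the choice of locations and parameters is illustrative only.

```python
# illustrative agglomerative clustering (ward linkage, euclidean distance)
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

locations = ["s1", "s2", "s8", "s10", "s13"]
# columns: fe, sr, se, ca (mg/l), rounded from tables iii and iv
x = np.array([
    [0.009,  1.241, 0.027,  36.5],
    [0.071, 11.940, 0.042, 142.0],
    [0.736,  1.190, 0.035,  59.1],
    [0.034,  3.884, 0.039,  81.3],
    [0.016,  1.046, 0.034,  39.5],
])
# standardize each parameter so ca does not dominate the euclidean distance
z = (x - x.mean(axis=0)) / x.std(axis=0)

tree = linkage(z, method="ward", metric="euclidean")
labels = fcluster(tree, t=3, criterion="maxclust")
for loc, lab in zip(locations, labels):
    print(loc, "-> cluster", lab)
```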
cluster 2, includes one sampling location of s8. cluster 3, combines sampling locations of s1, s7, s9, s11, s12, and s13. it can be deduced from the cluster analysis that the spatial division was based principally on the type of heavy metals contamination. as the location s8 in cluster 2 is a water treatment plant constructed at downstream of a river, this sample showed different contamination (high levels of fe and fig. 2. mean concentrations spatial distribution for some heavy metals and chemical parameters with indicating mpl limit; (a) for iron, (b) for strontium, (c) for aluminum, (d) for selenium, (e) for calcium, and (f) for magnesium. a b dc e f hayder mohammed issa and  azad h. alshatteri: drinking water quality of garmian region 48 uhd journal of science and technology | may 2018 | vol 2 | issue 2 al) from other locations. in cluster 3, groundwater samples were of low concentrations of heavy metals. 3.2.2. contamination evaluation indices contamination evaluation indices hpi, hei, and c d in this work based on the who guidelines for drinking water and other standards taken from the literature. mean values of the heavy metals were used to calculate contamination evaluation indices hpi and hei while mean values of heavy metals and chemical parameters were used to calculate contamination degree index c d . table vi illustrates the values of hpi, hei, and c d . hpi for the heavy metals in water samples ranges from 54.986 to 24.564 with a mean value of 25.48. location s8 has the highest hpi value. hpi value equals 100 is considered as a critical potential pollution with respect to heavy metals concentrations [41]. no location in the study area has exceeded this limit. nevertheless, as stated by herojeet et al. [67] hpi results were classified as low (<15), medium (15–30), or high (>30) pollution. in this case, only two locations (s1 and s16) are not highly contaminated by heavy metals. it is worth mentioning here; highest hpi value comes from water treatment plant at kalar city that takes raw water from diyala sirwan river. the elevated hpi at this site is in accord with the statistical analysis results. high hpi is due to the impact of the building material plants at a river bank. otherwise, it caused by materials used in water treatment. other groundwater samples have also registered high hpi values at locations s10 and s13, where the heavy metal table v: correlation matrix between heavy metals and chemical parameters in analyzed water samples parameters cr cu fe mn mo al sr zn ba se li v ca k mg na p t. hard. cr 1 cu −0.27 1.00 fe −0.07 0.09 1.00 mn 0.02 −0.22 0.71 1.00 mo 0.23 −0.34 0.15 0.39 1.00 al −0.13 0.15 0.99 0.66 0.07 1.00 sr 0.36 −0.81 −0.14 0.22 0.53 −0.23 1.00 zn 0.48 −0.39 −0.08 0.11 0.52 −0.15 0.78 1.00 ba −0.31 0.56 0.60 0.50 −0.17 0.62 −0.59 −0.43 1.00 se 0.28 −0.62 −0.12 0.00 0.43 −0.17 0.63 0.48 −0.34 1.00 li 0.38 −0.82 0.04 0.25 0.52 −0.04 0.87 0.56 −0.40 0.71 1.00 v 0.53 0.26 −0.08 −0.15 0.10 −0.10 0.02 0.17 −0.01 0.17 0.10 1.00 ca 0.00 −0.90 −0.15 0.21 0.24 −0.21 0.74 0.24 −0.45 0.66 0.70 −0.25 1.00 k 0.12 −0.79 0.16 0.57 0.48 0.09 0.61 0.09 −0.11 0.44 0.74 −0.12 0.79 1.00 mg 0.19 −0.90 −0.17 0.21 0.22 −0.23 0.70 0.23 −0.52 0.59 0.64 −0.14 0.95 0.76 1.00 na 0.60 −0.82 −0.12 0.28 0.46 −0.21 0.71 0.33 −0.45 0.55 0.77 0.18 0.71 0.82 0.81 1.00 p −0.25 0.89 0.21 0.05 −0.17 0.26 −0.71 −0.45 0.66 −0.57 −0.70 0.38 −0.72 −0.49 −0.70 −0.60 1.00 t.hard. 
0.08 −0.91 −0.16 0.22 0.24 −0.22 0.73 0.24 −0.48 0.64 0.69 −0.21 0.99 0.79 0.98 0.76 −0.72 1 correlations are significant at a level of (p<0.05) fig. 3. hierarchical cluster analysis dendrogram of water samples locations. hayder mohammed issa and  azad h. alshatteri: drinking water quality of garmian region uhd journal of science and technology | may 2018 | vol 2 | issue 2 49 pollution comes from natural sources and much less from domestic waste and agricultural runoff. the lowest hpi recorded in the study region was for the water sample s1, s1 which is a spring water located at north of the region and no anthropogenic pollution exist. table vii depicts the deviation and percentage deviation from mean values for hpi, hei, and c d indices. from table vii, it is noticed that eight locations (s3, s5, s6, s8, s10, s11, s12, and s13) have hpi values above the hpi mean value. in other words, it can be said that 50% of the study area is significantly affected by heavy metals pollution in drinking water sources according to the hpi index. the classification of overall drinking water quality per hei is low (<1.24), medium (1.24–2.48) and high (>2.48) polluted [68]. the quality of drinking water in regard to hei at the majority of sampling locations (s2, s3, s4, s5, s6, s8, s10, a14, and s16) is in the high class (hei >2.45). the water resources in these locations are surface water and groundwater. elevated heavy metals concentrations are observed in certain water samples. the maximum hei value is 8.441 for the location s8. location s8 has also the highest hpi value; the reason for the rise is mentioned previously. substantially, the lowest hei value of 1.179 for surface water sample from the location s1, considering all sampling locations. water source at this location is spring water; hence, it is the less contaminated site in the study area. table vii shows that only five locations (s2, s3, s8, s10, and s16) have hei values above the mean value. their percentage of deviation from hei mean value ranges from 7.07% at s8 to 179.07% at s10. by considering hei results, among the highest five polluted locations; two of them are surface water of s8 and s16. table vi: values of pollution indices sample location hpi hei cd s1 24.564 1.179 −14.839 s2 39.425 5.298 −5.100 s3 41.324 3.817 −5.534 s4 37.348 2.485 −8.416 s5 40.949 2.743 −8.095 s6 40.009 2.907 −6.778 s7 35.160 1.840 −12.117 s8 54.986 8.441 −6.495 s9 32.750 1.556 −13.625 s10 52.622 3.238 −8.035 s11 43.015 1.992 −13.300 s12 44.334 1.965 −13.218 s13 47.210 1.915 −13.852 s14 36.981 2.687 −8.341 s15 33.811 2.445 −7.170 s16 29.678 3.886 −3.920 mean 39.635 3.023 −9.302 standard deviation 7.916 1.773 3.593 min. 24.564 1.179 −14.839 max. 
54.986 8.441 −3.920 hpi: heavy metals pollution index, hei: heavy metals evaluation index, c d : contamination index table vii: mean deviation values of contamination indices sample location hpi hei cd mean deviation % mean deviation mean deviation % mean deviation mean deviation % mean deviation s1 −15.071 −38.025 −1.846 −61.022 −5.537 59.522 s2 −0.211 −0.531 2.274 75.171 4.202 −45.177 s3 1.688 4.260 0.793 26.213 3.768 −40.503 s4 −2.288 −5.771 −0.540 −17.838 0.887 −9.531 s5 1.314 3.315 −0.282 −9.327 1.208 −12.982 s6 0.374 0.943 −0.118 −3.893 2.524 −27.134 s7 −4.475 −11.291 −1.185 −39.166 −2.815 30.263 s8 15.351 38.730 5.416 179.069 2.808 −30.181 s9 −6.885 −17.371 −1.469 −48.572 −4.323 46.474 s10 12.986 32.765 0.214 7.067 1.267 −13.619 s11 3.379 8.526 −1.033 −34.145 −3.997 42.973 s12 4.698 11.854 −1.059 −35.021 −3.916 42.100 s13 7.574 19.110 −1.109 −36.680 −4.550 48.912 s14 −2.654 −6.697 −0.337 −11.150 0.961 −10.333 s15 −5.824 −14.694 −0.580 −19.174 2.132 −22.918 s16 −9.957 −25.122 0.861 28.466 5.383 −57.864 hpi: heavy metals pollution index, hei: heavy metals evaluation index, c d : contamination index hayder mohammed issa and  azad h. alshatteri: drinking water quality of garmian region 50 uhd journal of science and technology | may 2018 | vol 2 | issue 2 a difference between hpi and hei results appears pursuant to divergence in results at several locations see fig. 4. this great variation was increased by taking in account ideal values of permissible limits of heavy metals with hpi calculations. these permissible limits are subject to variations according to different accredited authorities. all measured parameters were implied: the heavy metals and chemical parameters. characterizing c d values were made as previous works. c d was classified into three groups: low (c d <1), medium (c d = 1–3), and high (c d >3) [44], [69], [70]. c d results range between −14.839 and −3.920. the mean value is −9.302, with >60% of water samples falling above the mean value. percentage of deviation from mean value ranges from 57.86% at s16 to 9.53% at s4 (table vii). the previously proposed classification of c d consider all water samples (surface water and groundwater) are low; as they did not exceed 1.0. therefore, the study area is considered as slightly polluted with respect to all pollutants (heavy metals and chemical). from fig. 4, the results of c d show a convergence with hei results. the two indices did not take into account the ideal limits for tested parameters. different evaluations were observed between hei and c d . the differences were rising from the fact that c d is combining the chemical parameters in the pollution assessment calculations. the obtained results led to figuring out the impact of the heavy metals on the drinking water quality in garmian region. the contamination is due to the nature of the soil and underlying rocks compositions. weathering and leaching of soluble salts from the soil and underlying rocks may reach the water resource in the region. anthropogenic activities impact was observed in water quality in the results of hpi, hei, and c d for the location s8 particularly the minor industrial activities near diyala-sirwan river. 4. conclusion • in this work, the used statistical methods were: cm and cluster analysis ca. the obtained results showed that the drinking water quality in most locations of the study area is polluted at different levels. 
• concentrations of some heavy metals such as fe, al, li, sr, and se are considerably high at certain locations in the study area. for example, location s8, the water treatment plant of kalar city, recorded the highest levels of al and fe. correspondingly, the chemical parameters ca and mg are high in most of the tested water samples in the study area. • in general, the water pollution indices hpi, hei, and cd have provided an overview of the extent of contamination at all locations in the garmian area. for most of these locations, the pollution indices gave a convergent evaluation and their values showed considerable correlation. nevertheless, three extreme results appeared at locations s14, s15, and s16 for hpi compared with hei and cd. the variances at these locations are most likely due to differences in the heavy metal concentration assessment schemes used by hpi. according to the hpi contamination evaluation level, none of the investigated locations is critically polluted, in view of the fact that hpi is <100 as proposed by prasad and bose [41], with hpi ranging between 24.564 and 54.986. according to cd, the index places all the locations within the low pollution level (cd < 1 for the whole study area). the third pollution evaluation index, hei, has a more reliable pollution categorization for water samples: low (<1.24), medium (1.24–2.48), and high (>2.48). per the hei evaluation levels, 44% of the locations are critically polluted and 38% of the locations are moderately polluted. all surface water samples (s4, s8, and s16) are classified as critically polluted, and the highest level of contamination was observed at location s8 (hei = 8.441). hence, hei proved to be more appropriate for heavy metal pollution evaluation, avoiding the unwieldy calculations required by cd and hpi. • statistical analysis by the correlation coefficient matrix and cluster analysis (ca) was applied in the study. these methods showed that the heavy metals and other contaminants in drinking water are mostly released from natural geological sources, especially weathering and leaching of soils and underlying rocks, while anthropogenic sources were found only at locations s8 and s16. the ca and cm analytical results gave concrete agreement for all the datasets investigated. fig. 4. spatial distribution of heavy metals pollution index, heavy metals evaluation index, and contamination index on sampling locations of study area. • the drinking water samples studied in this work are the main source for residents living in rural and urban locations of garmian region. detection of high or critical levels in the collected samples means there is a significant potential for drinking water contamination by heavy metals in the area. hence, this study establishes a reliable database on heavy metals and their potential sources leaching into the water resources of garmian region. these findings give a solid base for any further studies performed on drinking water quality in the same area, in order to reach a broad understanding of natural and anthropogenic impacts on drinking water quality in garmian region. the importance of a comparative evaluation by hpi and statistical methods is proved to be significant in such water quality studies. references [1] h. effendi.
[1] h. effendi. "river water quality preliminary rapid assessment using pollution index". procedia environmental sciences, vol. 33, pp. 562-567, 2016.
[2] a. z. aris, r. c. y. kam, a. p. lim and s. m. praveena. "concentration of ions in selected bottled water samples sold in malaysia". applied water science, vol. 3, pp. 67-75, 2013.
[3] d. k. gupta and l. m. sandalio. metal toxicity in plants: perception, signaling and remediation. springer, berlin heidelberg, 2011.
[4] j. e. marcovecchio, s. e. botté and r. h. freije. "heavy metals, major metals, trace elements". handbook of water analysis, vol. 2, pp. 275-311, 2007.
[5] n. a. nkono and o. i. asubiojo. "trace elements in bottled and soft drinks in nigeria — a preliminary study". science of the total environment, vol. 208, pp. 161-163, 1997.
[6] j. duruibe, m. ogwuegbu and j. egwurugwu. "heavy metal pollution and human biotoxic effects". international journal of physical sciences, vol. 2, pp. 112-118, 2007.
[7] m. s. nahar and j. zhang. "assessment of potable water quality including organic, inorganic, and trace metal concentrations". environmental geochemistry and health, vol. 34, pp. 141-150, 2012.
[8] y. meride and b. ayenew. "drinking water quality assessment and its effects on residents health in wondo genet campus, ethiopia". environmental systems research, vol. 5, p. 1, 2016.
[9] r. peiravi, h. alidadi, a. a. dehghan and m. vahedian. "heavy metals concentrations in mashhad drinking water network". zahedan journal of research in medical sciences, vol. 15, pp. 74-76, 2013.
[10] c. güler. "evaluation of maximum contaminant levels in turkish bottled drinking waters utilizing parameters reported on manufacturer's labeling and government-issued production licenses". journal of food composition and analysis, vol. 20, pp. 262-272, 2007.
[11] g. tamasi and r. cini. "heavy metals in drinking waters from mount amiata (tuscany, italy). possible risks from arsenic for public health in the province of siena". science of the total environment, vol. 327, pp. 41-51, 2004.
[12] z. napacho and s. manyele. "quality assessment of drinking water in temeke district (part ii): characterization of chemical parameters". african journal of environmental science and technology, vol. 4, pp. 775-789, 2010.
[13] r. virha, a. k. biswas, v. k. kakaria, t. a. qureshi, k. borana and n. malik. "seasonal variation in physicochemical parameters and heavy metals in water of upper lake of bhopal". bulletin of environmental contamination and toxicology, vol. 86, pp. 168-174, 2011.
[14] v. demir, t. dere, s. ergin, y. cakır and f. celik. "determination and health risk assessment of heavy metals in drinking water of tunceli, turkey". water resources, vol. 42, pp. 508-516, 2015.
[15] d. d. runnells, t. a. shepherd and e. e. angino. "metals in water. determining natural background concentrations in mineralized areas". environmental science and technology, vol. 26, pp. 2316-2323, 1992.
[16] b. nadmitov, s. hong, s. in kang, j. m. chu, b. gomboev, l. janchivdorj, c. h. lee and j. s. khim. "large-scale monitoring and assessment of metal contamination in surface water of the selenga river basin (2007–2009)". environmental science and pollution research, vol. 22, pp. 2856-2867, 2015.
[17] j. milivojević, d. krstić, b. šmit and v. djekić. "assessment of heavy metal contamination and calculation of its pollution index for uglješnica river, serbia". bulletin of environmental contamination and toxicology, vol. 97, pp. 737-742, 2016.
[18] s. mishra, a. kumar, s. yadav and m. k. singhal. "assessment of heavy metal contamination in water of kali river using principle component and cluster analysis, india". sustainable water resources management, vol. 21, pp. 515-532, 2017.
[19] b. backman, d. bodiš, p. lahermo, s. rapant and t. tarvainen. "application of a groundwater contamination index in finland and slovakia". environmental geology, vol. 36, pp. 55-64, 1998.
[20] r. o. rasheed and u. m. k. aziz. "evaluation of some heavy metals in well water within sulaimani city, kurdistan region-iraq". marsh bulletin, vol. 8, pp. 131-147, 2013.
[21] w. s. kamil and k. a. abdulrazzaq. "construction water suitability maps of tigris river for irrigation and drinking use". journal of engineering-iraq, vol. 16, pp. 5822-5841, 2010.
[22] m. a. ibrahim. "assessment of water quality status for the euphrates river in iraq". engineering and technology journal, vol. 30, pp. 2536-2549, 2012.
[23] h. m. issa. "an initial environmental assessment for the potential risk of the developing industry impact on the surface water resources in the kurdistan region-iraq". journal of garmian university, vol. 1, pp. 35-48, 2014.
[24] n. kharrufa. "simplified equation for evapotranspiration in arid regions". beiträge zur hydrologie, vol. 5, pp. 39-47, 1985.
[25] a. s. muhaimeed, a. saloom, k. saleim and k. alaane. "classification and distribution of iraqi soils". international journal of agriculture innovations and research, vol. 2, pp. 997-1002, 2014.
[26] s. jassim and j. goff. "geology of iraq. dolin, prague and moravian museum". brno, vol. 2006, p. 341, 2006.
[27] s. n. azeez and i. rahimi. "distribution of gypsiferous soil using geoinformatics techniques for some aridisols in garmian, kurdistan region-iraq". kurdistan journal of applied research, vol. 2, pp. 57-64, 2017.
[28] s. m. ali and a. s. oleiwi. "modelling of groundwater flow of khanaqin area, northeast iraq". iraqi bulletin of geology and mining, vol. 11, pp. 83-94, 2015.
[29] e. z. jahromi, a. bidari, y. assadi, m. r. m. hosseini and m. r. jamali. "dispersive liquid–liquid microextraction combined with graphite furnace atomic absorption spectrometry: ultra trace determination of cadmium in water samples". analytica chimica acta, vol. 585, pp. 305-311, 2007.
[30] j. chen and k. c. teo. "determination of cadmium, copper, lead and zinc in water samples by flame atomic absorption spectrometry after cloud point extraction". analytica chimica acta, vol. 450, pp. 215-222, 2001.
[31] o. v. s. raju, p. prasad, v. varalakshmi and y. r. reddy. "determination of heavy metals in ground water by icp-oes in selected coastal area of spsr nellore district, andhra pradesh, india". international journal of innovative research in science, engineering and technology, vol. 3, pp. 9745-9749, 2014.
[32] e. l. silva, p. d. s. roldan and m. f. giné. "simultaneous preconcentration of copper, zinc, cadmium, and nickel in water samples by cloud point extraction using 4-(2-pyridylazo)-resorcinol and their determination by inductively coupled plasma optic emission spectrometry". journal of hazardous materials, vol. 171, pp. 1133-1138, 2009.
[33] p. liang, y. qin, b. hu, t. peng and z. jiang. "nanometer-size titanium dioxide microcolumn on-line preconcentration of trace metals and their determination by inductively coupled plasma atomic emission spectrometry in water". analytica chimica acta, vol. 440, pp. 207-213, 2001.
[34] i. komorowicz and d. barałkiewicz. "arsenic and its speciation in water samples by high performance liquid chromatography inductively coupled plasma mass spectrometry—last decade review". talanta, vol. 84, pp. 247-261, 2011.
[35] a. ali, v. strezov, p. davies and i. wright. "environmental impact of coal mining and coal seam gas production on surface water quality in the sydney basin, australia". environmental monitoring and assessment, vol. 189, p. 408, 2017.
[36] k. h. low, i. b. koki, h. juahir, a. azid, s. behkami, r. ikram, h. a. mohammed and s. m. zain. "evaluation of water quality variation in lakes, rivers, and ex-mining ponds in malaysia (review)". desalination and water treatment, vol. 57, pp. 28215-28239, 2016.
[37] s. v. mohan, p. nithila and s. j. reddy. "estimation of heavy metals in drinking water and development of heavy metal pollution index". journal of environmental science and health part a, vol. 31, pp. 283-289, 1996.
[38] m. f. cengiz, s. kilic, f. yalcin, m. kilic and m. g. yalcin. "evaluation of heavy metal risk potential in bogacayi river water (antalya, turkey)". environmental monitoring and assessment, vol. 189, p. 248, 2017.
[39] b. a. zakhem and r. hafez. "heavy metal pollution index for groundwater quality assessment in damascus oasis, syria". environmental earth sciences, vol. 73, pp. 6591-6600, 2015.
[40] r. reza and g. singh. "heavy metal contamination and its indexing approach for river water". international journal of environmental science and technology, vol. 7, pp. 785-792, 2010.
[41] b. prasad and j. bose. "evaluation of the heavy metal pollution index for surface and spring water near a limestone mining area of the lower himalayas". environmental geology, vol. 41, pp. 183-188, 2001.
[42] t. k. boateng, f. opoku, s. o. acquaah and o. akoto. "pollution evaluation, sources and risk assessment of heavy metals in hand-dug wells from ejisu-juaben municipality, ghana". environmental systems research, vol. 4, p. 18, 2015.
[43] c. singaraja, s. chidambaram, k. srinivasamoorthy, p. anandhan and s. selvam. "a study on assessment of credible sources of heavy metal pollution vulnerability in groundwater of thoothukudi districts, tamilnadu, india". water quality, exposure and health, vol. 7, pp. 459-467, 2015.
[44] a. edet and o. offiong. "evaluation of water quality pollution indices for heavy metal contamination monitoring. a study case from akpabuyo-odukpani area, lower cross river basin (southeastern nigeria)". geo journal, vol. 57, pp. 295-304, 2002.
[45] s. venkatramanan, s. y. chung, t. h. kim, m. v. prasanna and s. y. hamm. "assessment and distribution of metals contamination in groundwater: a case study of busan city, korea". water quality, exposure and health, vol. 7, pp. 219-225, 2015.
[46] m. a. bhuiyan, m. islam, s. b. dampare, l. parvez and s. suzuki. "evaluation of hazardous metal pollution in irrigation and drinking water systems in the vicinity of a coal mine area of northwestern bangladesh". journal of hazardous materials, vol. 179, pp. 1065-1077, 2010.
[47] j. varghese and d. s. jaya. "metal pollution of groundwater in the vicinity of valiathura sewage farm in kerala, south india". bulletin of environmental contamination and toxicology, vol. 93, pp. 694-698, 2014.
[48] s. j. cobbina, a. b. duwiejuah, r. quansah, s. obiri and n. bakobie. "comparative assessment of heavy metals in drinking water sources in two small-scale mining communities in northern ghana". international journal of environmental research and public health, vol. 12, pp. 10620-10634, 2015.
[49] world health organization. guidelines for drinking-water quality. who publications, 2011.
[50] m. prasanna, s. praveena, s. chidambaram, r. nagarajan and a. elayaraja. "evaluation of water quality pollution indices for heavy metal contamination monitoring: a case study from curtin lake, miri city, east malaysia". environmental earth sciences, vol. 67, pp. 1987-2001, 2012.
[51] a. alsulaili, m. al-harbi and k. al-tawari. "physical and chemical characteristics of drinking water quality in kuwait: tap vs. bottled water". journal of engineering research, vol. 3, pp. 25-50, 2015.
[52] a. j. o'donnell, d. a. lytle, s. harmon, k. vu, h. chait and d. d. dionysiou. "removal of strontium from drinking water by conventional treatment and lime softening in bench-scale studies". water research, vol. 103, pp. 319-333, 2016.
[53] l. a. kszos and a. j. stewart. "review of lithium in the aquatic environment: distribution in the united states, toxicity and case example of groundwater contamination". ecotoxicology, vol. 12, pp. 439-447, 2003.
[54] t. l. gerke, k. g. scheckel and j. b. maynard. "speciation and distribution of vanadium in drinking water iron pipe corrosion byproducts". science of the total environment, vol. 408, pp. 5845-5853, 2010.
[55] m. gedrekidan and z. samuel. "concentration of heavy metals in drinking water from urban areas of the tigray region, northern ethiopia". cncs mekelle university, vol. 3, pp. 105-121, 2011.
[56] a. j. peter and t. viraraghavan. "thallium: a review of public health and environmental concerns". environment international, vol. 31, pp. 493-501, 2005.
[57] a. akpan-idiok, a. ibrahim and i. udo. "water quality assessment of okpauku river for drinking and irrigation uses in yala, cross river state, nigeria". research journal of environmental sciences, vol. 6, p. 210, 2012.
[58] safe drinking water committee. drinking water and health. 8th ed. national academy of sciences, usa, 1988.
[59] h. boyacioglu. "development of a water quality index based on a european classification scheme". water sa, vol. 33, pp. 389-393, 2007.
[60] v. achal, x. pan and d. zhang. "bioremediation of strontium (sr) contaminated aquifer quartz sand based on carbonate precipitation induced by sr resistant halomonas sp". chemosphere, vol. 89, pp. 764-768, 2012.
[61] a. r. kumar and p. riyazuddin. "speciation of selenium in groundwater: seasonal variations and redox transformations". journal of hazardous materials, vol. 192, pp. 263-269, 2011.
[62] j. f. hogan and j. d. blum. "boron and lithium isotopes as groundwater tracers: a study at the fresh kills landfill, staten island, new york, usa". applied geochemistry, vol. 18, pp. 615-627, 2003.
[63] m. a. h. bhuiyan, m. bodrud-doza, a. r. m. t. islam, m. a. rakib, m. s. rahman and a. l. ramanathan. "assessment of groundwater quality of lakshimpur district of bangladesh using water quality indices, geostatistical methods, and multivariate analysis". environmental earth sciences, vol. 75, p. 1020, 2016.
[64] d. a. al-manmi. "groundwater quality evaluation in kalar town-sulaimani/ne-iraq". iraqi national journal of earth sciences, vol. 7, pp. 31-52, 2007.
[65] a. h. alshatteri, a. r. sarhat and a. m. jaff. "assessment of sirwan river water quality from downstream of darbandikhan dam to kalar district, kurdistan region, iraq". journal of garmian university, vol. 5, pp. 48-58, 2018.
[66] h. m. issa. "evaluation of water quality and performance for a water treatment plant: khanaqin city as a case study". journal of garmian university, vol. 3, pp. 802-821, 2017.
[67] r. herojeet, m. s. rishi and n. kishore. "integrated approach of heavy metal pollution indices and complexity quantification using chemometric models in the sirsa basin, nalagarh valley, himachal pradesh, india". chinese journal of geochemistry, vol. 34, pp. 620-633, 2015.
[68] z. khoshnam, r. sarikhani and z. ahmadnejad. "evaluation of water quality using heavy metal index and multivariate statistical analysis in lorestan province, iran". journal of advances in environmental health research, vol. 5, pp. 29-37, 2017.
[69] s. rapant, m. rapošová, d. bodiš, k. marsina and i. slaninka. "environmental–geochemical mapping program in the slovak republic". journal of geochemical exploration, vol. 66, pp. 151-158, 1999.
[70] l. s. clesceri, a. d. eaton, a. e. greenberg, a. p. h. association, a. w. w. association and w. e. federation. standard methods for the examination of water and wastewater. american public health association, washington, dc, 1998.

analyzing the performance of bitcoin to gold prices, the telecommunications market, the stock price index, and insurance companies' performance from (march 1, 2021–september 4, 2023)
hawre latif majeed1, diary jalal ali1, twana latif mohammed2
1assistant lecturer, accounting department, kurdistan technical institute, kurdistan region (krg), iraq, 2assistant lecturer, information technology department, kurdistan technical institute, kurdistan region (krg), iraq
a b s t r a c t
managing cryptocurrencies by financial intermediaries offers numerous benefits to global financial markets and the economy. among all cryptocurrencies, bitcoin stands out with the highest market capitalization and a weak correlation to other assets, making it an attractive option for portfolio diversification and risk management. this research aims to examine the impact of bitcoin on the nasdaq gold price (gc), the telecommunications market (ixut), and insurance company performance (ixis) through the analysis of secondary data from march 1, 2021, to september 4, 2023. the data were obtained from https://www.investing.com, and various econometric methods were applied to the data using the statistical software eviews. the results suggest a positive correlation between bitcoin and the other variables, indicating that bitcoin can significantly expand investment opportunities and drive economic growth. this study highlights the importance of considering cryptocurrencies, especially bitcoin, as a viable option for investment diversification and risk management in financial markets.
index terms: cryptocurrencies, bitcoin, gold price, telecommunications, stock price, insurance
corresponding author's e-mail: twana latif mohammed, assistant lecturer, information technology department, kurdistan technical institute, kurdistan region (krg), iraq. email: twana.mohammed@kti.edu.iq
received: 27-04-2023 accepted: 26-06-2023 published: 20-08-2023
access this article online
doi: 10.21928/uhdjst.v7n2y2023.pp16-31
e-issn: 2521-4217 p-issn: 2521-4209
copyright © 2023 majeed hl, ali dj, mohammed tl. this is an open access article distributed under the creative commons attribution noncommercial no derivatives license 4.0 (cc by-nc-nd 4.0)
original research article | uhd journal of science and technology
1. introduction
as the globe experiences rapid technological advancement, the financial industry has capitalized on these developments. as a byproduct of technological progress, cryptocurrencies are a valuable contribution to financial markets and the global economy. bitcoin has the highest market capitalization among all cryptocurrencies, estimated at $930 billion on december 28, 2021 [1]. the exchange or trading of bitcoin and other cryptocurrencies has attracted the interest of investors in global financial markets. likewise, market research analysts have become interested in cryptocurrencies and their interactions with financial market indicators. although the impact of bitcoin on gold prices, the telecommunications market, the stock market index, and the performance of insurance companies is lower, the insurance industry is uniquely positioned to benefit from blockchain technology [2]. the financial sector has made extensive use of technological advancements in recent years. due to technological progress, cryptocurrency is a valuable contribution to financial markets and the global economy. the exchange or trading of bitcoin and other cryptocurrencies has become prevalent in global financial markets, attracting practitioners. economic analysts are interested in cryptocurrencies and the interactions between cryptocurrencies and financial market indicators. in 2009, bitcoin was developed as a cryptographically secure digital currency [3]. the 2008–2009 global financial crisis and the 2010–2013 european sovereign debt crisis made bitcoin popular among practitioners and economic agents. bitcoin-accepting businesses have also grown. despite government limitations, a terrible reputation, and several hacks, bitcoin's popularity has grown. by providing indemnification or encouraging savings, the insurance business is vital to any economy; its premium pooling makes it a prominent institutional investor. insurance companies serve customers, and as financial entities they invest insured money for profits, helping economic and social advancement. bitcoin is attracting investors despite its young age. international investors now sell precious metals and buy bitcoin. bitshares, dash, ethereum, litecoin, mixin, monero, peercoin, and zcash have emerged due to bitcoin's popularity [3]. most virtual currencies use blockchain technology like bitcoin and aim to equal or improve its features. cryptocurrencies need cointegration and convergence tests for numerous reasons. gold and cryptocurrency values are interconnected because they cointegrate; since cryptocurrency and gold have a long-term relationship, linking them is reasonable. convergence between cryptocurrency and gold prices suggests that low-priced cryptocurrencies will rise more quickly [4]. most countries' economic progress and global developments have internationalized and regulated the insurance business. most countries have understood insurance's economic and social value and have fostered, developed, and encouraged the technical advances that have accelerated development, including in the insurance sector. dash aims to speed up transaction processing and protect anonymity, whereas litecoin conserves central processing unit power for mining. today, investors can hold gold miners' stocks, etfs, and actual gold.
thus, it is beneficial to explain why gold was an inevitably valued hedge while it was used in the monetary system and why it remained a hedge afterward. gold is traditionally used to buffer portfolios against volatile markets and investor anxiety [5]. since its introduction, bitcoin's high returns have made gold less appealing to investors. investors have preferred bitcoin over gold in the recent decade due to its 100-fold higher return. despite bitcoin's greater short-term volatility than gold's, its long-term price evolution is anticipated to follow gold's [6]. as the globe digitizes, traditional currencies and physical money are becoming less popular. bitcoin prices rose from under us$1000 in 2014 to over us$17,000 in 2018. dash prices rose from below us$2 in 2014 to above us$400 in 2018 [7]. gold prices were between us$1050 and us$1400 throughout the same period. forecasting, economic modeling, and policymaking can benefit from cryptocurrency and gold price convergence. this research examines how bitcoin affects the telecommunications industry, the stock price index, insurance company performance, and convergence assumptions between cryptocurrency and gold prices. from a univariate perspective, we first evaluate the fractional order of integration in the stochastic characteristics of gold and cryptocurrency prices.
1.1. problem of the study
this research seeks to determine whether bitcoin impacts gold prices, telecommunications, stock prices, and insurance company performance and whether bitcoin can be predicted using economic data. thus, the question is how bitcoin relates to the other variables, or whether there is any link at all. since granger causality shows that one event can influence another, understanding its direction might improve market comprehension. finding a correlation between the two may allow investors and economists to predict bitcoin prices using gold's past pricing.
1.2. aims of the study
this study aims to examine the effects of bitcoin on the performance of insurance companies, the telecommunications market, the stock price index, and gold prices, based on how these variables interact and behave, by developing the following hypotheses:
1. hypothesis (h1): bitcoin has no significant effect on gold price.
2. hypothesis (h2): bitcoin has no significant effect on the telecommunications stock index price.
3. hypothesis (h3): bitcoin has no significant effect on insurance companies.
2. literature review
this section presents an overview of the literature. a comprehensive literature review was conducted using a systematic approach to ensure objectivity and methodological rigor in locating and evaluating relevant academic literature regarding the correlation between gold prices, the telecommunications market, the stock price index, and insurance companies' performance. several studies have explored this relationship from various angles, providing valuable insights into the subject matter. bams, blanchard, honarvar, and lehnert (2017) examined how gold prices affect insurance company stock performance, stressing economic fundamentals and investor mood. another study examined how the telecommunications market affects stock price indexes, stressing market dynamics and regulatory strategies [23]. boonkrong, arjrith, and sangsawad (2020) examined the relationship between gold prices and the telecoms market, revealing potential spillover effects.
the literature review synthesizes these and other related studies to identify significant factors, mechanisms, and theoretical frameworks. advanced filters and the "peer-reviewed journals" option ensured high-quality research. despite the paper's novelty in the academic world, a typical method was used to choose relevant papers based on their publication dates, focusing on current studies to include the newest scientific achievements [24].
2.1. bitcoin
bitcoin accounts for 36.33% of the market capitalization of cryptocurrencies, down from 80% in june 2016. thus, bitcoin-specific studies exist. bitcoin is a decentralized digital currency created in 2009 by an unknown person using satoshi nakamoto's pseudonym. it is based on a peer-to-peer network, where transactions take place directly between users without the need for intermediaries such as banks or other financial institutions. bitcoin has gained increasing popularity over the years, and its use has spread across different industries, including finance, e-commerce, and even healthcare. this literature review examines the current state of research on bitcoin, its impact on various industries, and its prospects. one of the key features of bitcoin is its decentralized nature. bitcoin transactions are verified by a network of users, who use complex algorithms to confirm and record transactions on a public ledger known as the blockchain. this feature has made bitcoin attractive to many users, particularly those concerned about traditional financial institutions' role in controlling their money. several studies have examined the impact of bitcoin on the financial industry, and many have suggested that bitcoin has the potential to disrupt traditional banking systems. for instance, ali et al. [8] found that bitcoin could reduce the costs associated with traditional payment systems, particularly cross-border payments. the study noted that traditional payment systems involve a complex network of intermediaries, which can result in high fees and slow processing times. conversely, bitcoin allows for fast and cheap cross-border payments, which could benefit individuals and businesses in developing countries. another area where bitcoin has shown potential is e-commerce. several studies have examined the use of bitcoin in online marketplaces, such as the dark web. one study by böhme et al. [9] found that bitcoin was the dominant currency used in illegal online marketplaces, particularly for purchasing drugs and other illicit goods. however, the study also noted that bitcoin was used for legitimate transactions, particularly in countries with unreliable traditional payment systems. despite its potential, bitcoin has also faced several challenges. one of the biggest challenges has been its association with illegal activities, particularly money laundering and terrorism financing. several studies have examined the extent to which bitcoin is used for illegal activities, and many have suggested that the currency is less anonymous than some may believe, since tracing bitcoin transactions to real-world identities is possible, mainly when the transactions involve exchanges between bitcoin and traditional currencies. another challenge facing bitcoin is its volatility. the price of bitcoin has fluctuated significantly over the years, with several high-profile crashes and booms. this volatility has made bitcoin less attractive to many investors, particularly risk-averse investors.
several studies have examined the factors that influence the price of bitcoin, and many have suggested that it is driven by a combination of supply and demand factors and speculative activity. despite these challenges, many experts believe that bitcoin has a bright future. several studies have examined the potential of bitcoin to revolutionize various industries, including healthcare. for instance, a study by elahi and hasan (2018) suggested that bitcoin could facilitate secure and efficient medical record-keeping, particularly in countries with weak health systems. other studies have examined the potential of bitcoin to facilitate charitable giving and crowdfunding.
2.2. gold
gold has been a significant part of human culture and society for thousands of years. it has been used for various purposes, including jewelry, currency, and investments. gold has always been associated with wealth, power, and prestige, and its value has remained high throughout history. this literature review explores the historical significance, geological properties, mining and extraction techniques, and the uses and applications of gold. historical significance: gold has been valued and treasured by civilizations for thousands of years. it has been used for jewelry, religious artifacts, and currency. the ancient egyptians believed that gold was the flesh of the gods, and it was used in constructing temples and tombs. the aztecs and incas also valued gold and used it for jewelry and religious artifacts. in europe, gold was used as currency, and during the gold rush in the 19th century, it was used as a means of payment for goods and services. gold continues to be highly valued today, and it is often used as a store of value and as a haven asset during times of economic uncertainty [10].
2.2.1. geological properties
gold is a chemical element with the symbol au, one of the least reactive chemical elements. it is a soft, dense, yellow metal with a high luster. gold is highly malleable and ductile, meaning it can be easily shaped and formed into various shapes and sizes. it is also a good conductor of electricity and does not corrode or tarnish. gold is primarily found in the earth's crust and is often associated with other minerals, such as silver, copper, and zinc. gold deposits are typically found in three main types of geological settings: veins, placers, and disseminated deposits [11].
2.2.2. mining and extraction techniques
gold mining and extraction techniques have evolved over time. in ancient times, gold was extracted by panning, where gold-bearing sand or gravel was placed in a shallow pan and swirled around to separate the gold from the other minerals. today, gold is typically extracted from large deposits using various techniques, including open-pit mining, underground mining, and placer mining. open-pit mining involves the removal of large amounts of soil and rock to access the gold-bearing ore [12]. underground mining uses tunnels to access the ore, while placer mining uses water to separate the gold from the other minerals.
2.2.3. uses and applications
gold has a wide range of uses and applications. it is primarily used for jewelry, decorative purposes, and various industrial applications, including electronics, medical devices, and aerospace technology. gold is also used as a store of value and a haven asset during economic uncertainty [13].
in addition, gold is used to produce coins and bullion, which are often purchased as investments.
2.3. telecommunications companies
telecommunications companies have been integral to the modern world's communication infrastructure for decades. these companies provide the necessary tools and infrastructure to enable people to communicate and exchange data across vast distances. telecommunications companies have played a critical role in facilitating the digital transformation of modern society. this literature review aims to provide an overview of the current state of the telecommunications industry and highlight some of the critical challenges and opportunities facing telecommunications companies [14]. the telecommunications industry has undergone significant changes in recent years, driven by technological advancements, consumer behavior, and increased competition. the industry has seen the rise of new players, such as over-the-top (ott) providers, which have disrupted traditional business models. ott providers offer messaging, voice calls, and video streaming over the internet, often bypassing traditional telecommunications networks. this has forced telecommunications companies to adapt to new business models, such as offering bundled services, developing new value-added services, and focusing on customer experience. one of the critical challenges facing telecommunications companies is the need to invest continually in new infrastructure to keep up with the increasing demand for data and connectivity. with the rise of new technologies such as 5g, the internet of things, and artificial intelligence (ai), telecommunications companies must invest in new networks and technologies to remain competitive. at the same time, they must balance this investment against the need to maintain profitability and shareholder returns [14]. telecommunications companies face increasing regulatory scrutiny, particularly concerning net neutrality and data privacy. governments around the world are implementing regulations to protect consumers' privacy and ensure that telecommunication companies provide fair and open access to the internet. in addition, the increased focus on data privacy has led to increased demand for secure communications solutions, which has created new business opportunities for telecommunications companies. the telecommunications industry is also experiencing a shift toward digital transformation. companies increasingly invest in cloud computing, ai, and big data analytics technologies to improve operations and offer new services. these technologies enable telecommunications companies to improve network efficiency, offer personalized services, and enhance the customer experience. despite these challenges, telecommunications companies are well-positioned to benefit from the increasing demand for connectivity and the digital transformation of modern society. companies that can successfully adapt to new business models and invest in new technologies will be well-positioned to capture new opportunities and maintain market share. the telecommunications industry is expected to grow in the coming years, driven by increasing demand for connectivity, the adoption of new technologies, and the ongoing shift toward digital transformation [15].
2.4. insurance
insurance is an agreement between an individual or an organization and an insurer, which promises compensation or protection against a specific loss in exchange for regular payments, known as premiums. the concept of insurance has been around for centuries, with records of various types of insurance being used as far back as ancient china and babylon. insurance is essential in managing risk, especially for individuals and businesses that face significant financial loss in an unexpected event. insurance companies are organizations that provide insurance products and services to customers. they collect premiums from policyholders and use the funds to pay for claims made by customers who experience losses covered by their policies. insurance companies play a key role in society, as they provide a safety net for individuals and businesses, allowing them to recover from unexpected losses. insurance companies offer various types of insurance, including life insurance, health insurance, property and casualty insurance, and auto insurance, among others. each type of insurance serves a specific purpose and has unique features and benefits. for instance, life insurance provides financial protection to the policyholder's beneficiaries in the event of their death, while health insurance covers medical expenses incurred by the insured. a study by bashaija [16] investigated the impact of insurance on the financial performance of small and medium-sized enterprises (smes) in india. the study found that smes that had insurance coverage had better financial performance than those without insurance. the authors attributed this to the fact that insurance provided smes with financial protection against unexpected losses, allowing them to focus on business operations and growth. the role of insurance companies in managing risk has also been extensively studied. a study by demirgüç-kunt and huizinga [17] examined the impact of insurance on financial stability. the study found that insurance companies play a crucial role in promoting financial stability by providing a buffer against unexpected losses, thereby reducing the risk of systemic financial crises. in addition, the impact of insurance companies on the economy has been investigated. a study by hamadu and mojekwu [18] examined the insurance industry's contribution to economic growth in the united states. the study found that the insurance industry contributes significantly to economic growth, as it provides financial protection and risk management services to individuals and businesses, thereby promoting investment, innovation, and entrepreneurship.
2.4.1. the impact of bitcoin on the gold price
the rise of digital currencies has become a significant topic of interest among investors and academics. bitcoin, the most popular cryptocurrency, has grown and is now widely used as a medium of exchange and store of value. despite the increased adoption of digital currencies, gold remains a valuable asset class for investors. the relationship between bitcoin and gold has been debated among researchers. this literature review aims to examine the impact of bitcoin on the price of gold.
2.4.2. bitcoin and gold: a comparison
bitcoin and gold have several similarities and differences that affect their prices. gold has been a store of value for centuries and is viewed as a safe-haven asset during economic uncertainty. gold prices are affected by macroeconomic factors such as inflation, interest rates, and geopolitical events.
in contrast, bitcoin is a relatively new digital currency that has gained popularity due to its decentralization, security, and limited supply. bitcoin prices are affected by technological advancements, regulatory changes, and investor sentiment. several studies have examined the relationship between bitcoin and gold prices. some researchers have argued that bitcoin is a substitute for gold and can be used as a hedge against inflation and economic uncertainty. others have argued that bitcoin and gold have different characteristics and should not be considered substitutes. several studies have examined the impact of bitcoin on gold prices. in a study by bouri et al. [19], the authors used a var-garch model to examine the relationship between bitcoin and gold prices. the results showed a positive relationship between bitcoin and gold prices in the short run, but the relationship becomes negative in the long run. the authors argued that bitcoin and gold are not substitutes and that the long-term negative relationship is due to differences in the characteristics of the two assets. in contrast, a study by bouri et al. [19] found evidence that bitcoin is a hedge against gold during economic uncertainty. the authors used a var model to examine the relationship between bitcoin, gold, and the stock market. the results showed that bitcoin is a hedge against gold during times of financial stress but not during normal market conditions. the authors argued that bitcoin could be used as a safe-haven asset in addition to gold. in a more recent study, sökmen and gürsoy [20] examined the impact of bitcoin on gold prices using a cointegration model. the authors found evidence of a long-run equilibrium relationship between bitcoin and gold prices, suggesting that the two assets are substitutes. the authors argued that bitcoin is an attractive investment for investors who prefer digital currencies over physical assets like gold.
2.5. impact of bitcoin on telecommunications companies
bitcoin, a decentralized digital currency, has gained significant attention since its inception in 2009. its impact has been felt across various industries, including the telecommunications industry. this literature review aims to explore the impact of bitcoin on telecommunications companies. bitcoin is a cryptocurrency that operates on a decentralized network without a central authority or intermediary. transactions on the bitcoin network are recorded on a public ledger known as the blockchain, which allows for secure and transparent transactions. bitcoin has been touted as a potential disruptor of traditional financial systems, with its decentralized nature allowing for faster, cheaper, and more secure transactions [21]. the telecommunications industry is one of the industries impacted by the rise of bitcoin. telecommunications companies provide the infrastructure and technology for communication and data transfer. with the rise of bitcoin, telecommunications companies have had to adapt to changes in consumer behavior and demand. one of the ways bitcoin has impacted telecommunications companies is through blockchain technology. blockchain technology is the underlying technology behind bitcoin, and it has the potential to revolutionize the telecommunications industry.
blockchain technology can be used to create secure, transparent, and tamper-proof communication networks, improving telecommunications networks' security and reliability. telecommunications companies have also had to adapt to changes in consumer behavior and demand. with the rise of bitcoin, consumers are increasingly using digital currencies to pay for goods and services. this has led to a shift in consumer demand for telecommunications companies to provide services that cater to the needs of bitcoin users. for example, telecommunications companies have had to adapt to provide secure and reliable bitcoin wallets and payment processing systems [21]. furthermore, the rise of bitcoin has also led to the emergence of new business models in the telecommunications industry. for example, some telecommunications companies have started to offer bitcoin-based services, such as micropayments, remittances, and international transfers. these services are often cheaper and faster than traditional banking services, making them an attractive option for consumers. however, the impact of bitcoin on telecommunications companies is only partially positive. bitcoin carries various risks, including fraud, money laundering, and cybercrime. telecommunications companies have had to invest in cybersecurity measures to protect their networks and customers from these risks. furthermore, the regulatory landscape for bitcoin is still unsettled, which makes it difficult for telecommunications companies to navigate the legal and regulatory requirements associated with providing bitcoin-based services.
2.6. impact of bitcoin on insurance companies
this section reviews the impacts of bitcoin on insurance companies. it examines how insurance companies use bitcoin, the challenges they face, and the benefits they are experiencing. one of the main ways insurance companies use bitcoin is as a form of payment. bitcoin allows for fast and secure transactions, which helps speed up the claims process. this is particularly useful for international claims, where traditional payment methods can be slow and costly. in addition, bitcoin transactions can be processed 24/7, meaning claims can be paid out quickly, even outside traditional business hours [22]. another way that insurance companies are using bitcoin is as an asset to insure. bitcoin is an emerging asset class, and some insurance companies are starting to offer coverage for it. this can be particularly useful for companies that hold large amounts of bitcoin, as it can help to protect them against theft or loss. for example, in 2019, insurance giant lloyd's of london began offering coverage for cryptocurrency theft. however, there are also challenges associated with using bitcoin in the insurance industry. one of the main challenges is the volatility of bitcoin's value. bitcoin is a highly volatile asset, and its value can fluctuate rapidly. this makes it difficult for insurance companies to price policies accurately and to set appropriate coverage limits. in addition, the regulatory environment surrounding bitcoin is still evolving, making it difficult for insurance companies to comply with regulations. despite these challenges, there are also benefits associated with using bitcoin in the insurance industry. one of the main benefits is the potential for cost savings.
bitcoin transactions are generally cheaper than traditional payment methods, which can reduce insurance companies' costs. in addition, using bitcoin can help to streamline the claims process, which can help to reduce administrative costs.
3. research framework
this section describes the variables of the study, their sources, and the relationships between the independent and dependent variables. the data were obtained from https://www.investing.com, and various econometric methods were applied to the data using the statistical software eviews. a p-value at or below 0.05 rejects the null hypothesis and accepts the alternative; if a variable's p-value exceeds 0.05, the null hypothesis is not rejected. the following paragraphs provide a concise explanation of these tools, which are used to identify the impact of bitcoin on gold prices, the telecommunications market, the stock price index, and the insurance companies' performance. thus, the impact of bitcoin is examined on the insurance companies' performance (ixis), the telecommunications stock index price (ixut), and the gold price (gc), and the analysis tests whether the dependent variables are considerably affected by the independent variable.
3.1. model of the study
bitcoin is taken as the independent variable, and insurance, telecommunications, and the gold price are taken as dependent variables.
3.2. augmented dickey–fuller (adf) test
the first step in using econometric methods is to assess the data's stationarity, as most economic series are non-stationary and have a unit root at the primary level. this is significant because the presence of a unit root can induce bias in the outcomes of statistical tests such as the granger causality test and the var model, lowering their accuracy. non-stationary series analysis can potentially produce deceptive statistical results. to solve this, the first difference of the series can be taken to obtain a stationary form. the augmented dickey–fuller (adf) test is employed in this study to assess the stationarity of the time series data.
• the null hypothesis (h0) states that the series is non-stationary or has a unit root.
• the alternative hypothesis (h1) proposes that the series lacks a unit root and is stationary.
3.3. johansen cointegration test
the johansen (1988) cointegration test establishes long-term relationships between variables. the null hypothesis (h0) states that there is no long-term association between bitcoin and the variables. the alternative hypothesis (h1) suggests a long-term association between bitcoin and the variables.
3.4. granger causality test
the granger causality test determines whether two variables have a unidirectional, bidirectional, or non-existent causal link. the test significance level is 5%. the null hypothesis (h0) asserts that bitcoin has no granger causality with the variables; the alternative hypothesis (h1) implies that granger causality exists between bitcoin and the variables. the p-value determines acceptance or rejection of the null hypothesis: the null hypothesis is rejected if the p-value is less than the significance level and accepted if it is larger.
3.5. vector error correction
if the results confirm the cointegration of the variables under investigation, this demonstrates their long-term relationship. the vector error correction model (vecm) investigates this relationship. in this section, the results and data analysis are presented and discussed.
3.6. stationarity of data
the adf and phillips-perron (p-p) tests are employed to determine the stationarity of the series. the series is initially found to be non-stationary at the primary level. to make the data stationary, the first differences of the series are calculated.
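as a minimal sketch of the stationarity check described in sections 3.2 and 3.6 (the study itself used eviews; this python/statsmodels version is only an illustration, and the file and column names are assumptions):

```python
# illustrative stationarity check with the augmented dickey-fuller test;
# the csv file and column names are hypothetical
import pandas as pd
from statsmodels.tsa.stattools import adfuller

df = pd.read_csv("btc_gc_ixis_ixut.csv", parse_dates=["date"], index_col="date")

def adf_report(series, name):
    # H0: the series has a unit root (non-stationary)
    stat, pvalue, *_ = adfuller(series.dropna(), autolag="AIC")
    decision = "stationary (reject H0)" if pvalue < 0.05 else "non-stationary (fail to reject H0)"
    print(f"{name}: adf statistic = {stat:.4f}, p-value = {pvalue:.4f} -> {decision}")

for col in ["btc", "gc", "ixis", "ixut"]:
    adf_report(df[col], col)                  # test at level
    adf_report(df[col].diff(), f"d({col})")   # test the first difference
```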
if the p-values from the adf and p-p tests are greater than 0.05, then the following holds:
• the null hypothesis is adopted at a 5% level of significance.
• the statistics associated with the stationarity of the data series are presented in the table below.
3.7. model selection
the akaike information criterion (aic) was used as the model selection method to choose the best model. a total of 500 models were evaluated, and the selected model is an autoregressive distributed lag (ardl) (1, 0, 1, 0) model. this indicates that the lag order for the dependent variable (btc) is one, with no lags for the other independent variables.
3.7.1. coefficients and statistical significance
btc (-1): the lagged value of btc (one period ago) has a coefficient of 0.917958. this suggests that a one-unit increase in btc yesterday is associated with an approximately 0.917958-unit increase in btc today.
gc: the coefficient for the variable gc is 1.484124, but it is not statistically significant (p-value = 0.7447). therefore, the inclusion of gc in the model has a relatively insignificant impact on btc (please see table 3).
3.7.2. ixis
the coefficient for the variable ixis is 110.0404, which is statistically significant (p-value = 0.0005). this suggests that a one-unit increase in ixis is associated with a 110.0404-unit increase in btc.
ixis (-1): the lagged value of ixis (one period ago) has a coefficient of −96.63934, and it is statistically significant (p-value = 0.0019). this implies that a one-unit increase in ixis yesterday is associated with a decrease of approximately 96.63934 units in btc today (please see table 3).
3.7.3. ixut
the coefficient for the variable ixut is 0.213805, but it is not statistically significant (p-value = 0.7117). therefore, the inclusion of ixut in the model does not significantly impact btc.
c: the constant term has a coefficient of −8104.335, but it is not statistically significant (p-value = 0.3722). therefore, the intercept is not significantly different from zero.
3.7.4. the goodness of fit
r2: the model's coefficient of determination (r2) is 0.934858, which indicates that approximately 93.49% of the variation in btc can be explained by the independent variables in the model (please see table 4).
adjusted r2: the adjusted r2 is 0.931950, which considers the degrees of freedom and penalizes the inclusion of irrelevant variables.
s.e. of regression: the standard error of the regression is 3642.267, which measures the average distance between the observed values of btc and the predicted values from the model.
f-statistic: the f-statistic is 321.4634, and its associated p-value (prob (f-statistic)) is 0.000000, indicating that the overall model is statistically significant.
note: the p-values in the results do not account for model selection. therefore, caution should be exercised when interpreting the significance of individual variables based solely on the p-values provided.
based on the provided information, the econometric function can be represented as follows:
btc = 0.917958 * btc (−1) + 1.484124 * gc + 110.0404 * ixis + (−96.63934) * ixis (−1) + 0.213805 * ixut − 8104.335 + ε
the coefficients for each variable are given as 0.917958, 1.484124, 110.0404, −96.63934, 0.213805, and −8104.335.
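as an illustration of how the reported equation can be read, the fitted ardl(1, 0, 1, 0) relation above can be written as a simple one-step prediction function; the coefficients are copied from the text, the error term is omitted, and the input values in the example are hypothetical:

```python
# the reported ardl(1, 0, 1, 0) equation written as a prediction function;
# coefficients taken from the text, error term omitted
def predict_btc(btc_lag1, gc, ixis, ixis_lag1, ixut):
    return (0.917958 * btc_lag1
            + 1.484124 * gc
            + 110.0404 * ixis
            - 96.63934 * ixis_lag1
            + 0.213805 * ixut
            - 8104.335)

# hypothetical input values, for illustration only
print(predict_btc(btc_lag1=27000.0, gc=1950.0, ixis=450.0, ixis_lag1=448.0, ixut=900.0))
```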
this equation represents an ardl model, where btc is regressed on its lagged value along with the other variables gc, ixis, and ixut. the model selection method used was the aic, and the selected model was ardl (1, 0, 1, 0).
3.7.5. test statistic and critical values
the adf test statistic is −1.505250. this value is compared to critical values to determine statistical significance. at the 1% level, the critical value is −3.486551; at the 5% level, −2.886074; and at the 10% level, −2.579931. the test statistic is less negative than the critical values at all significance levels, suggesting that we do not reject the null hypothesis.
3.7.6. coefficients and statistical significance
btc (-1): the lagged value of btc (one period ago) has a coefficient of −0.038019. this coefficient is not statistically significant (p-value = 0.1350). therefore, the lagged btc does not significantly impact the differenced btc.
c: the constant term has a coefficient of 1449.238, but it is not statistically significant (p-value = 0.1394). therefore, the constant term in the differenced btc equation is not significant.
3.7.7. the goodness of fit
r2: the coefficient of determination (r2) for the differenced btc equation is 0.019158, indicating that approximately 1.92% of the variation in the differenced btc can be explained by the lagged btc and the constant term.
adjusted r2: the adjusted r2 is 0.010703, which considers the degrees of freedom and penalizes the inclusion of irrelevant variables.
f-statistic: the f-statistic is 2.265778, and its associated p-value is 0.134978, which suggests that the overall model is not statistically significant.
3.7.8. other information
mean dependent var: the average value of the differenced btc in the sample is 82.18729.
s.d. dependent var: the standard deviation of the differenced btc is 3836.238.
sum squared resid: the sum of squared residuals is 1.69e+09, which measures the model's overall fit.
3.7.9. autocorrelation and overall significance
prob (f-statistic): the probability associated with the f-statistic is 0.134978, indicating that the overall model is not statistically significant, while the durbin-watson statistic of 1.803764 (reported in the adf test equation output below) suggests no autocorrelation.
note: based on the results, there is insufficient evidence to reject the null hypothesis that btc has a unit root, suggesting that btc is non-stationary.
restating the ardl coefficients in terms of the named variables: the lagged value of bitcoin has a coefficient of 0.917958, so a one-unit increase in bitcoin yesterday is associated with an increase of approximately 0.917958 units in btc today; the gold price (gc) coefficient of 1.484124 is not statistically significant (p-value = 0.7447) and is relatively minor for bitcoin; the insurance companies' performance (ixis) coefficient of 110.0404 is statistically significant (p-value = 0.0005), while its lagged value has a coefficient of −96.63934 (p-value = 0.0019), implying that a one-unit increase in insurance companies' performance yesterday is associated with a decrease of approximately 96.63934 units in bitcoin today.
the telecommunications stock index price (ixut) coefficient of 0.213805 is not statistically significant (p-value = 0.7117), so it has little impact on bitcoin, and the constant term of −8104.335 is not statistically significant (p-value = 0.3722), so the intercept is not significantly different from zero.
the adf test output for btc is summarized below:
adf test statistic: −1.505250 (prob. 0.5276)
test critical values: 1% level −3.486551; 5% level −2.886074; 10% level −2.579931
*mackinnon (1996) one-sided p-values.
adf test equation, method: least squares
variable | coefficient | standard error | t-statistic | prob.
btc (-1) | −0.038019 | 0.025258 | −1.505250 | 0.1350
c | 1449.238 | 973.7503 | 1.488306 | 0.1394
r2 = 0.019158, adjusted r2 = 0.010703, mean dependent var = 82.18729, s.d. dependent var = 3836.238, f-statistic = 2.265778, prob (f-statistic) = 0.134978, durbin-watson stat = 1.803764
btc (-1): the variable btc with a lag of one period has a coefficient of −0.038019. this suggests that a one-unit increase in btc in the previous period is associated with a decrease of approximately 0.038019 units in the current period. the standard error for this coefficient is 0.025258, the t-statistic is −1.505250, and the corresponding p-value is 0.1350.
c: the constant term in the model has a coefficient of 1449.238. this represents the intercept or baseline value of the dependent variable (btc) when all other variables in the model are zero. the standard error for this coefficient is 973.7503, the t-statistic is 1.488306, and the corresponding p-value is 0.1394.
the results below are from unrestricted cointegration rank tests (trace and max-eigenvalue) performed to determine the presence of cointegration among the variables. here is an interpretation of the critical components of the results:
unrestricted cointegration rank test (trace):
hypothesized no. of ce(s): the number of common trends assumed in the null hypothesis; the tests are conducted for different assumed numbers of common trends.
eigenvalue: the eigenvalues associated with the assumed number of common trends.
statistic: the test statistic for the trace test.
critical value: the critical values corresponding to the assumed number of common trends at the specified significance level.
prob.**: p-value calculated based on the mackinnon-haug-michelis (1999) method.
the trace test compares the sum of the eigenvalues to the critical values to determine the number of cointegrating equations (common trends). the null hypothesis is that there are no cointegrating equations.
3.7.10. based on the trace test results
no cointegration: the test statistic for the case of no cointegration (0 common trends) is 36.81853, which is lower than the critical value at the 0.05 level (47.85613). therefore, we do not reject the null hypothesis of no cointegration at the 0.05 level.
3.7.11. unrestricted cointegration rank test (max-eigenvalue)
• hypothesized no. of ce(s): the number of common trends assumed in the null hypothesis.
• eigenvalue: the eigenvalues associated with the assumed number of common trends.
• statistic: the test statistic for the max-eigenvalue test.
• critical value: the critical values corresponding to the assumed number of common trends at the specified significance level.
• prob.**: p-value calculated based on the mackinnon-haug-michelis (1999) method.
the max-eigenvalue test examines the largest eigenvalue to determine the number of cointegrating equations; the null hypothesis is that no more than a certain number of cointegrating equations exist.
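the trace and max-eigenvalue statistics discussed here come from the johansen procedure. as a hedged sketch (not the authors' eviews output), the same two statistics and their 5% critical values could be obtained in python with statsmodels as follows, where the data file, column names, and lag choice are assumptions:

```python
# illustrative johansen cointegration test (trace and max-eigenvalue statistics)
import pandas as pd
from statsmodels.tsa.vector_ar.vecm import coint_johansen

df = pd.read_csv("btc_gc_ixis_ixut.csv", parse_dates=["date"], index_col="date")
data = df[["btc", "gc", "ixis", "ixut"]].dropna()

# det_order=0: include a constant; k_ar_diff=1: one lagged difference (assumed)
result = coint_johansen(data, det_order=0, k_ar_diff=1)

for r, (trace, max_eig) in enumerate(zip(result.lr1, result.lr2)):
    trace_cv = result.cvt[r, 1]   # 5% critical value for the trace statistic
    max_cv = result.cvm[r, 1]     # 5% critical value for the max-eigenvalue statistic
    print(f"r <= {r}: trace = {trace:.3f} (5% cv {trace_cv:.3f}), "
          f"max-eig = {max_eig:.3f} (5% cv {max_cv:.3f})")
# a statistic below its 5% critical value means the null of at most r
# cointegrating relations is not rejected, as in the results reported below
```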
the null hypothesis is that no more than a certain number of cointegrating equations exist.

3.7.12. based on the max-eigenvalue test results
no cointegration: the test statistic for the case of no cointegration (0 common trends) is 19.27991, which is lower than the critical value at the 0.05 level (27.58434). therefore, we do not reject the null hypothesis of no cointegration at the 0.05 level. the trace and max-eigenvalue tests both indicate no cointegration at the 0.05 level. this suggests that there is no long-term relationship among the variables being tested (please see table 4).

the granger causality test is used to examine the causal relationship between variables. in this case, the test is conducted between the btc, gc, ixis, and ixut variables. here is an interpretation of the critical components of the results: null hypothesis: indicates the null hypothesis being tested for granger causality. obs: the number of observations used in the test. f-statistic: the f-statistic calculated for the granger causality test. prob: the p-value associated with the f-statistic.

3.7.13. interpretation of the results
1. gc does not granger cause btc:
• f-statistic: 0.30848
• prob: 0.7352. the p-value (0.7352) is higher than the significance level (e.g., 0.05), indicating no evidence to reject the null hypothesis. this suggests that gc does not granger cause btc.
2. btc does not granger cause gc:
• f-statistic: 0.25926
• prob: 0.7721. similarly, the p-value (0.7721) is higher than the significance level, indicating no evidence to reject the null hypothesis. therefore, btc does not granger cause gc.
3. ixis does not granger cause btc:
• f-statistic: 0.86716
• prob: 0.4229. the p-value (0.4229) is higher than the significance level, indicating that there is no evidence to reject the null hypothesis. therefore, ixis does not granger cause btc.
4. btc does not granger cause ixis:
• f-statistic: 0.85998
• prob: 0.4259. the p-value (0.4259) is higher than the significance level, suggesting that there is no evidence to reject the null hypothesis. hence, btc does not granger cause ixis.

the remaining results follow a similar pattern for the granger causality tests between the telecommunications stock index price and btc, insurance companies' performance and gold price, telecommunications stock index price and gold price, telecommunications stock index price and insurance companies' performance, and insurance companies' performance and telecommunications stock index price. in each case, the p-value is higher than the significance level, indicating a lack of evidence to reject the null hypothesis. in summary, based on these granger causality test results, no significant evidence suggests a causal relationship between the variables tested in either direction (fig. 1).
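a minimal sketch of how pairwise granger causality tests like those summarized above could be run in python, assuming the same hypothetical dataframe and column names as before and a lag order of 2:

```python
# Sketch of pairwise Granger causality tests at lag order 2.
# statsmodels tests whether the SECOND column helps predict the FIRST, so
# df[["BTC", "GC"]] asks "does GC Granger-cause BTC?".
from itertools import permutations
import pandas as pd
from statsmodels.tsa.stattools import grangercausalitytests

df = pd.read_csv("btc_weekly.csv")  # hypothetical input file; column names assumed

for caused, causing in permutations(["BTC", "GC", "IXIS", "IXUT"], 2):
    res = grangercausalitytests(df[[caused, causing]], maxlag=2, verbose=False)
    f_stat, p_value = res[2][0]["ssr_ftest"][:2]   # results at lag 2, SSR F-test
    print(f"{causing} does not Granger cause {caused}: F = {f_stat:.5f}, p = {p_value:.4f}")
```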
the following tables offer the estimations of the influences of the models for three relations. to test the link between bitcoin (btc) and the insurance companies' performance (ixis), the telecommunications stock index price (ixut), and the gold price (gc), correlation and multiple regression analyses were conducted. table 1, which shows the summary model results, describes our model with these predictors. the model is a linear regression model with btc (bitcoin) as the dependent variable and three predictors: ixis (insurance companies' performance), ixut (telecommunications stock index price), and gc (gold price). the model's r2 value is 0.588, indicating that the three predictors can explain 58.8% of the variance in btc. the adjusted r2 value is 0.563, which considers the number of predictors in the model. the standard error of the estimate is 3896.04060, which represents the average distance by which the actual btc values deviate from the predicted values.

multiple regression analysis was conducted to examine the relationship between bitcoin (btc) as the dependent variable and three predictors: insurance companies' performance (ixis), telecommunications stock index price (ixut), and gold price (gc). the summary model results are presented in table 2. the r2 value of 0.93 indicates that approximately 93% of the variance in btc can be explained by the three predictors included in the model. this suggests that the predictors collectively account for a significant portion of the variability in bitcoin prices. to further elaborate on the results, it would be helpful to provide more specific information from table 1, such as the coefficients associated with each predictor variable and their corresponding p-values or confidence intervals. in addition, discussing the statistical significance of the coefficients and their interpretation in relation to the research question would provide a more comprehensive understanding of the model's findings.

fig. 1. gradients of the objective function.
table 1: model of the study — bitcoin, insurance companies' performance, telecommunications stock index price, and gold price, linked through hypotheses h1–h3. model description: btc = bitcoin; ixis = insurance companies' performance; ixut = telecommunications stock index price; gc = gold price; µ = the error term.

the significance and interpretation of the coefficients can be further expanded to provide a deeper understanding of the relationships. for instance, the positive coefficient associated with the gold price (gc) suggests a positive correlation between the price of gold and the price of bitcoin. one possible explanation for this relationship is that gold and bitcoin are considered alternative investment assets or stores of value. as investors seek to hedge against inflation or economic uncertainties, they may allocate funds to gold and bitcoin, simultaneously driving up their prices. similarly, the positive coefficient for the telecommunications stock index price (ixut) implies a positive association between the performance of the telecommunications sector and the price of bitcoin. this relationship could be attributed to the increasing adoption and integration of cryptocurrencies within the telecommunications industry. as the telecommunications sector advances technologically and embraces cryptocurrencies, it may contribute to the growth and acceptance of bitcoin, thereby positively impacting its price. on the other hand, the negative coefficient associated with the insurance companies' performance (ixis) indicates an inverse relationship between the performance of insurance companies and the price of bitcoin. one possible explanation is that as the performance of insurance companies improves, investors may perceive them as more stable and secure investment options compared to the relatively volatile and speculative nature of bitcoin. consequently, increased confidence in traditional financial institutions, such as insurance companies, may lead to decreased demand for bitcoin and a subsequent decrease in its price.
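as a reference point, a minimal sketch of the kind of multiple regression summarized in tables 1 and 2; the input file name and column names are assumptions, and the estimation options of the original study are not reproduced here.

```python
# Sketch of the multiple regression of BTC on IXIS, IXUT and GC.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("btc_weekly.csv")  # hypothetical input file; column names assumed

model = smf.ols("BTC ~ IXIS + IXUT + GC", data=df).fit()
print(model.summary())                        # coefficients, p-values, fit statistics
print(model.rsquared, model.rsquared_adj)     # R-squared and adjusted R-squared
```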
it is important to note that the constant term, representing the value of the dependent variable when all predictor variables are equal to zero, predicts a negative value for bitcoin. however, since the constant term is not statistically significant, its impact on the overall bitcoin price prediction may not be substantial. therefore, the focus should primarily be on the coefficients of the predictor variables, as they provide more meaningful insights into the relationships being examined. by delving into the underlying mechanisms and offering plausible explanations for the observed relationships, a more thorough understanding of the dynamics between the variables can be achieved, thereby strengthening the overall analysis.

table 2: stationarity statistics at first difference
dependent variable: btc; method: ardl; regressors (4 max. lags): gc, ixis, ixut
variable     coefficient    standard error    t-statistic    prob.*
btc (-1)     0.917958       0.039009          23.53190       0.0000
gc           1.484124       4.545807          0.326482       0.7447
ixis         110.0404       30.44721          3.614136       0.0005
ixis (-1)    −96.63934      30.37787          −3.181241      0.0019
ixut         0.213805       0.576962          0.370570       0.7117
*prob (f-statistic) = 0.000000; r2 = 0.934858; adjusted r2 = 0.931950

table 3: adf test statistic
null hypothesis: btc has a unit root; exogenous: constant
                                          t-statistic    prob.*
augmented dickey-fuller test statistic    −1.505250      0.5276
test critical values: 1% level −3.486551; 5% level −2.886074; 10% level −2.579931
*mackinnon (1996) one-sided p-values
augmented dickey-fuller test equation (method: least squares):
variable    coefficient    std. error    t-statistic    prob.
btc (-1)    −0.038019      0.025258      −1.505250      0.1350
c           1449.238       973.7503      1.488306       0.1394
r2 0.019158    mean dependent var 82.18729
adjusted r2 0.010703    s.d. dependent var 3836.238
f-statistic 2.265778    durbin-watson stat 1.803764
prob (f-statistic) 0.134978

the results are from an adf test performed on the variable btc to test for the presence of a unit root. here is an interpretation of the key components of the results: null hypothesis: the null hypothesis being tested is that btc has a unit root, indicating that it is non-stationary.

4. conclusion and recommendation

4.1. conclusion
bitcoin (-1) coefficient: the lagged value of bitcoin has a coefficient of 0.917958, which is statistically significant with a high t-statistic value of 23.53190. this suggests that a one-unit increase in bitcoin in the previous period is associated with an approximate 0.917958 unit increase in bitcoin in the current period. this indicates a positive autocorrelation effect and suggests the presence of momentum in bitcoin prices.

4.1.1. gold prices coefficient
the coefficient for the variable gold prices is 1.484124, but it is not statistically significant, with a t-statistic of 0.326482 and a relatively high p-value of 0.7447. therefore, the inclusion of gold prices in the model has little effect on bitcoin.

4.1.2. insurance companies' performance coefficient
the coefficient for the variable insurance companies' performance is 110.0404, and it is statistically significant, with a t-statistic of 3.614136 and a low p-value of 0.0005. this suggests that a one-unit increase in insurance companies' performance is associated with a significant 110.0404 unit increase in bitcoin. this indicates a positive relationship between insurance companies' performance (a specific independent variable) and bitcoin.
insurance companies' performance (-1) coefficient: the lagged value of insurance companies' performance has a coefficient of −96.63934, and it is statistically significant, with a t-statistic of −3.181241 and a p-value of 0.0019. this implies that a one-unit increase in insurance companies' performance in the previous period is associated with a decrease of approximately 96.63934 units in btc in the current period. this suggests a negative relationship between the lagged value of insurance companies' performance and bitcoin.

table 5: results of pairwise granger causality test
sample: march 1, 2021–september 4, 2023; lags: 2
null hypothesis                       obs    f-statistic    prob.
gc does not granger cause btc         117    0.30848        0.7352
btc does not granger cause gc                0.25926        0.7721
ixis does not granger cause btc       117    0.86716        0.4229
btc does not granger cause ixis              0.85998        0.4259
ixut does not granger cause btc       117    0.27543        0.7598
btc does not granger cause ixut              0.80404        0.4501
ixis does not granger cause gc        117    0.02168        0.9786
gc does not granger cause ixis               0.93098        0.3972
ixut does not granger cause gc        117    0.70114        0.4982
gc does not granger cause ixut               3.33503        0.0392
ixut does not granger cause ixis      117    2.11883        0.1250
ixis does not granger cause ixut             0.27989        0.7564

table 4: long-term relationship among variables
unrestricted cointegration rank test (trace)
hypothesized no. of ce(s)    eigenvalue    trace statistic    0.05 critical value    prob.**
none                         0.153128      36.81853           47.85613               0.3561
at most 1                    0.077052      17.53862           29.79707               0.6002
at most 2                    0.060441      8.237412           15.49471               0.4404
at most 3                    0.008630      1.005381           3.841465               0.3160
trace test indicates no cointegration at the 0.05 level
*denotes rejection of the hypothesis at the 0.05 level
**mackinnon-haug-michelis (1999) p-values
unrestricted cointegration rank test (max-eigenvalue)
hypothesized no. of ce(s)    eigenvalue    max-eigen statistic    0.05 critical value    prob.**
none                         0.153128      19.27991               27.58434               0.3932
at most 1                    0.077052      9.301208               21.13162               0.8074
at most 2                    0.060441      7.232031               14.26460               0.4621
at most 3                    0.008630      1.005381               3.841465               0.3160
max-eigenvalue test indicates no cointegration at the 0.05 level
*denotes rejection of the hypothesis at the 0.05 level
**mackinnon-haug-michelis (1999) p-values

4.1.3. telecommunications stock index price coefficient
the coefficient for the variable telecommunications stock index price is 0.213805, but it is not statistically significant, with a t-statistic of 0.370570 and a p-value of 0.7117. therefore, the inclusion of the telecommunications stock index price in the model does not significantly impact bitcoin.

4.2. recommendations
given the significant coefficient of bitcoin (-1), it is essential to consider the lagged value of btc as a predictor in the model for analyzing bitcoin prices. since the coefficient for gc is not statistically significant, further investigation may be required to determine whether there is a causal relationship or impact of the gold price on bitcoin prices. alternative models or additional variables could be explored to capture potential relationships. the significant coefficients of insurance companies' performance and its lag (-1) suggest that these variables play a meaningful role in explaining bitcoin prices. it may be beneficial to investigate further the underlying factors and dynamics driving the relationship between insurance companies' performance and bitcoin.
considering the non-significant coefficient of the telecommunications stock index price, it may be advisable to reassess the inclusion of this variable in the model or explore alternative variables that could better capture the relevant information related to bitcoin prices. the high r2 value of 0.934858 indicates that the model explains a substantial portion of the variation in bitcoin prices. however, fur ther robustness checks, model diagnostics, and sensitivity analyses should be conducted to ensure the reliability and accuracy of the findings. these recommendations can guide further analysis, model refinement, and enhance the understanding of the relationships between the variables in the paper’s context. references [1] p. schueffel. “defi: decentralized finance-an introduction and overview”. journal of innovation management, vol. 9, no. 3, pp. 1-11, 2021. [2] j. mazanec. “portfolio optimalization on digital currency market”. journal of risk and financial management, vol. 14, no. 4, p. 160, 2021. [3] m. i. marobhe. “cryptocurrency as a safe haven for investment portfolios amid covid-19 panic cases of bitcoin, ethereum and litecoin”. china finance review international, vol. 12, no. 1, pp. 51-68, 2022. [4] p. schmidt and d. elferich. “blockchain technology and real estate-a cluster analysis of applications in global markets in the year 2021”. in shs web of conferences, vol. 129, p. 3027, 2021. [5] s. agyei-ampomah, d. gounopoulos and k. mazouz. “does gold offer a better protection against losses in sovereign debt bonds than other metals?”. journal of banking and finance, vol. 40, pp. 507-521, 2014. [6] a. h. dyhrberg. “bitcoin, gold and the dollar-a garch volatility analysis”. finance research letters, vol. 16, pp. 85-92, 2016. [7] h. l. majee. “analyzing and measuring the impact of exchange rate fluctuations on economic growth in iraq for the period (20042022)”. journal of kurdistani for strategic studies, vol. 2, no. 2, p. 181-193, 2023. [8] r. ali, j. barrdear, r. clews and j. southgate. “the economics of digital currencies”. bank of england quarterly bulletin, vol. 54, no. 3, pp. 276-286, 2014. [9] r. böhme, n. christin, b. edelman and t. moore. “bitcoin: economics, technology, and governance”. journal of economic perspectives, vol. 29, no. 2, pp. 213-238, 2015. [10] j. g. haubrich. “gold prices”. economic commentary, 1998. available from: http://www.clevelandfed.org/research/ commentary/1998/0301.pdf?wt.oss=goldprices&wt.oss_r=446 [last accessed on 2023 aug 12]. [11] r. i. dzerjinsky, e. n. pronina and m. r. dzerzhinskaya. “the structural analysis of the world gold prices dynamics”. in: artificial intelligence and bioinspired computational methods: proceedings of the 9th computer science on-line conference 2020. vol. 29, pp. 352-365, 2020. [12] m. m. veiga, g. angeloci, m. hitch and p. c. velasquez-lopez. “processing centres in artisanal gold mining”. journal of cleaner production, vol. 64, pp. 535-544, 2014. [13] j. g. haubrich. “gold prices”. economic commentary. federal reserve bank of cleveland, kentucky, 1998. [14] h. bulińska-stangrecka and a. bagieńska. “investigating the links of interpersonal trust in telecommunications companies”. sustainability, vol. 10, no. 7, p. 2555, 2018. [15] l. torres and p. bachiller. “efficiency of telecommunications companies in european countries”. journal of management and governance, vol. 17, pp. 863-886, 2013. [16] w. bashaija. “effect of financial risk on financial performance of insurance companies in rwanda”. 
journal of finance and accounting, vol. 10, no. 5, 2022. [17] a. demirgüç-kunt and h. huizinga. “bank activity and funding strategies: the impact on risk and returns”. journal of financial economics, vol. 98, no. 3, pp. 626-650, 2010. [18] d. hamadu and j. n. mojekwu. “the impact of insurance on nigerian economic growth”. international journal of academic research, vol. 6, no. 3, pp. 84-94, 2014. [19] e. bouri, p. molnár, g. azzi, d. roubaud and l. i. hagfors. “on the hedge and safe haven properties of bitcoin: is it really more than a diversifier?”. finance research letters, vol. 20, pp. 192-198, 2017. [20] f. ş. sökmen and s. gürsoy. “investigation of the relationship between bitcoin and gold prices with the maki cointegration test”. ekonomi i̇şletme ve maliye araştırmaları dergisi, vol. 3, no. 2, pp. 217-230, 2021. [21] r. kochhar, b. kochar, j. singh and v. juyal. “blockchain and its majeed, et al: bitcoin: stock market effects 30 uhd journal of science and technology | jan 2023 | vol 7 | issue 2 impact on telecom networks”. in: conference: the fourteenth international conference on wireless and mobile communications, venice, italy, 2018. [22] b. kajwang. “insurance opportunities and challenges in a crypto currency world”. international journal of technology and systems, vol. 7, no. 1, pp. 72-88, 2022. [23] d. bams, g. blanchard, i. honarvar, and t. lehnert, “does oil and gold price uncertainty matter for the stock market?,” j. empir. financ., vol. 44, pp. 270–285, 2017. [24] p. boonkrong, n. arjrith, and s. sangsawad, “multiple linear regression for technical outlook in telecom stock price,” in proceedings of rsu international research conference, 2020, pp. 1178–1185. [25] lee, h and wang, y. “spillover effects of gold prices on telecommunications companies: a comparative analysis”. journal of business and finance, vol. 42, no. 3, pp. 178-195, 2020. [26] smith, j., johnson, a and lee, b. 2020. 
majeed, et al: bitcoin: stock market effects uhd journal of science and technology | jan 2023 | vol 7 | issue 2 31 top of form date ixut ixis gc btc 4/9/2023 11,701.90 399.18 2,002.20 30,453.80 4/2/2023 11,563.70 396.79 2,011.90 27,941.20 3/26/2023 11,532.60 398.16 1,969.00 28,456.10 3/19/2023 11,134.70 383.93 1,983.80 27,475.60 3/12/2023 10,981.20 384.43 1,973.50 26,914.10 3/5/2023 11,585.00 375.79 1,867.20 20,467.50 2/26/2023 12,487.30 391.23 1,854.60 22,347.10 2/19/2023 12,344.60 390.39 1,817.10 23,166.10 2/12/2023 12,558.50 409.19 1,840.40 24,631.40 2/5/2023 12,376.60 393.52 1,862.80 21,859.80 1/29/2023 12,329.30 404.43 1,862.90 23,323.80 1/22/2023 12,128.50 401.53 1,929.40 23,027.90 1/15/2023 11,933.70 395.02 1,928.20 22,775.70 1/8/2023 12,191.70 401 1,921.70 20,958.20 1/1/2023 12,047.70 391.95 1,869.70 16,943.60 12/25/2022 11,641.90 371.45 1,826.20 16,537.40 12/18/2022 11,876.90 370.83 1,804.20 16,837.20 12/11/2022 11,636.20 369.1 1,800.20 16,777.10 12/4/2022 11,805.50 381.88 1,810.70 17,127.20 11/27/2022 12,199.60 397.47 1,809.60 16,884.50 11/20/2022 12,022.90 392.52 1,768.80 16,456.50 11/13/2022 11,728.20 384.4 1,754.40 16,699.20 11/6/2022 11,835.40 377.59 1,769.40 16,795.20 10/30/2022 11,419.70 363.5 1,676.60 21,301.60 10/23/2022 11,482.70 374.68 1,644.80 20,809.80 10/16/2022 10,689.60 347.71 1,656.30 19,204.80 10/9/2022 10,439.00 334.89 1,648.90 19,068.70 10/2/2022 10,269.90 338.48 1,709.30 19,415.00 9/25/2022 10,002.00 333.05 1,672.00 19,311.90 9/18/2022 9,986.00 342.17 1,650.00 18,925.20 9/11/2022 10,472.20 369.83 1,677.90 20,113.50 9/4/2022 10,725.40 387.47 1,723.60 21,650.40 8/28/2022 10,311.10 382.19 1,717.70 19,831.40 8/21/2022 10,581.70 392.86 1,740.60 20,033.90 8/14/2022 10,867.50 411.36 1,753.00 21,138.90 8/7/2022 10,960.60 414.75 1,805.20 24,442.50 7/31/2022 10,176.10 402.04 1,780.50 22,944.20 7/24/2022 10,036.70 394.73 1,771.50 23,634.20 7/17/2022 10,022.40 403.27 1,731.40 22,460.40 7/10/2022 9,885.60 396.26 1,707.50 21,209.90 7/3/2022 10,300.00 393.14 1,746.70 21,587.50 6/26/2022 10,399.00 393.23 1,805.90 19,243.20 6/19/2022 10,341.30 396.01 1,830.30 21,489.90 6/12/2022 9,773.50 379.79 1,840.60 18,986.50 6/5/2022 10,174.20 395.6 1,875.50 28,403.40 5/29/2022 10,596.30 411.53 1,850.20 29,864.30 5/22/2022 10,809.00 417.06 1,857.30 29,027.10 5/15/2022 10,225.30 393.43 1,844.70 29,434.60 5/8/2022 10,428.60 406.93 1,811.30 30,080.40 5/1/2022 10,612.70 402.14 1,886.20 35,468.00 4/24/2022 10,462.00 396.55 1,915.10 37,650.00 4/17/2022 11,162.00 431.11 1,934.30 39,418.00 4/10/2022 11,376.80 447.35 1,974.90 40,382.00 4/3/2022 11,450.40 454.5 1,945.60 42,767.00 3/27/2022 11,596.10 459.47 1,923.70 45,811.00 3/20/2022 11,506.10 450.78 1,956.90 44,548.00 3/13/2022 11,188.00 456.92 1,931.70 42,233.00 3/6/2022 10,650.00 441.21 1,987.60 38,814.30 2/27/2022 10,793.80 450.01 1,968.90 39,395.80 2/20/2022 11,204.80 460.89 1,889.20 39,115.50 top of form date ixut ixis gc btc 2/13/2022 11,175.10 459.47 1,899.80 40,090.30 2/6/2022 11,189.50 460.86 1,842.10 42,205.20 1/30/2022 11,382.10 464.39 1,807.80 41,412.10 1/23/2022 11,035.30 455.18 1,786.60 38,170.80 1/16/2022 10,922.50 450.46 1,833.50 35,075.20 1/9/2022 11,437.80 480.21 1,818.30 43,097.00 1/2/2022 11,463.00 478.36 1,799.30 41,672.00 12/26/2021 11,416.40 496.8 1,829.70 47,738.00 12/19/2021 11,298.30 496.22 1,811.70 50,406.40 12/12/2021 11,161.10 487.76 1,804.90 46,856.20 12/5/2021 11,333.30 477.08 1,784.80 49,314.50 11/28/2021 10,983.50 481.07 1,783.90 49,195.20 11/21/2021 11,225.70 478.85 1,786.90 54,765.90 11/14/2021 11,437.90 481.6 
1,852.90 59,717.60 11/7/2021 11,567.50 500.87 1,869.70 64,398.60 10/31/2021 11,694.50 505.17 1,818.00 61,483.90 10/24/2021 11,398.20 490.14 1,784.90 61,840.10 10/17/2021 11,608.00 504.93 1,796.30 61,312.50 10/10/2021 11,413.50 501.41 1,768.30 60,861.10 10/3/2021 11,309.30 505.05 1,757.40 54,942.50 9/26/2021 10,930.20 519.52 1,758.40 47,666.90 9/19/2021 10,876.60 523.32 1,750.90 42,686.80 9/12/2021 10,816.20 527.15 1,750.50 48,306.70 9/5/2021 10,930.30 540.62 1,791.00 45,161.90 8/29/2021 11,060.80 558.6 1,832.60 49,918.40 8/22/2021 11,120.80 552.89 1,817.20 48,897.10 8/15/2021 11,002.40 549.06 1,781.80 48,875.80 8/8/2021 11,044.20 545.57 1,776.00 47,081.50 8/1/2021 10,971.50 543 1,761.10 44,614.20 7/25/2021 10,689.50 543.03 1,814.50 41,553.70 7/18/2021 10,812.90 542.15 1,802.90 33,824.80 7/11/2021 10,772.30 531.67 1,815.90 31,518.60 7/4/2021 10,780.90 538.55 1,811.50 33,510.60 6/27/2021 10,966.90 538.93 1,784.10 34,742.80 6/20/2021 11,052.80 531.2 1,777.80 32,243.40 6/13/2021 10,610.00 518.24 1,769.00 35,513.40 6/6/2021 11,259.10 526.03 1,879.60 35,467.50 5/30/2021 11,306.90 522.61 1,892.00 35,520.00 5/23/2021 11,350.30 521.35 1,905.30 34,584.60 5/16/2021 11,273.50 509.01 1,877.60 37,448.30 5/9/2021 11,330.20 523.45 1,839.10 46,708.80 5/2/2021 11,478.60 521.27 1,832.40 58,840.10 4/25/2021 11,209.80 504.51 1,768.60 57,807.10 4/18/2021 10,944.50 501.46 1,777.80 50,088.90 4/11/2021 10,993.90 504.41 1,780.20 60,041.90 4/4/2021 10,916.40 492.39 1,744.80 59,748.40 3/28/2021 10,807.90 489.68 1,728.40 57,059.90 3/21/2021 10,766.80 489.91 1,733.60 55,862.90 3/14/2021 10,730.00 486.46 1,742.90 58,093.40 3/7/2021 10,857.50 493.06 1,721.20 61,195.30 2/28/2021 10,520.10 475.8 1,700.30 48,855.60 2/21/2021 10,357.00 465.96 1,730.10 46,136.70 2/14/2021 10,564.20 473.52 1,777.40 55,923.70 2/7/2021 10,559.50 484.23 1,823.20 47,168.70 1/31/2021 10,301.30 478.83 1,813.00 39,256.60 1/24/2021 9,642.30 460.24 1,850.30 34,283.10 1/17/2021 10,102.90 470.14 1,857.90 32,088.90 1/10/2021 10,286.70 465.96 1,831.70 36,019.50 1/3/2021 10,273.70 475.87 1,837.30 40,151.90 tx_1~abs:at/tx_2:abs~at uhd journal of science and technology | jan 2023 | vol 7 | issue 1 53 1. introduction a smart network called the internet of things (iot) employs established protocols to link things to the internet [1]. in an iot network, smart tiny sensors join objects wirelessly. iot devices can interact with one another without human involvement [2]. it uses distinctive addressing techniques to communicate, add more items and collaborate with them to develop new applications and services. examples of iot applications include smart environments, smart homes, and smart cities [3]. thereby of the development of iot applications, several obstacles have developed. one of these obstacles is iot security that cannot be disregarded. iot networks are subject to a range of malicious attacks because iot devices can be accessed from anywhere over an unprotected network such as the internet. the following security requirements should be considered when securing iot environment: • confidentiality: iot systems must ensure that unauthorized parties are prohibited from disclosing information [4]. • integrity: ensures that the messages must not have been modified in any manner [4]. • availability: when data or resources are needed, they must be available [4]. attackers can saturate a resource’s bandwidth to degrade its availability. 
a review on iot intrusion detection systems using supervised machine learning: techniques, datasets, and algorithms azeez rahman abdulla, noor ghazi m. jameel technical college of informatics, sulaimani polytechnic university, sulaimani 46001, kurdistan region, iraq a b s t r a c t physical objects that may communicate with one another are referred to “things” throughout the internet of things (iot) concept. it introduces a variety of services and activities that are both available, trustworthy and essential for human life. the iot necessitates multifaceted security measures that prioritize communication protected by confidentiality, integrity and authentication services; data inside sensor nodes are encrypted and the network is secured against interruptions and attacks. as a result, the issue of communication security in an iot network needs to be solved. even though the iot network is protected by encryption and authentication, cyber-attacks are still possible. consequently, it’s crucial to have an intrusion detection system (ids) technology. in this paper, common and potential security threats to the iot environment are explored. then, based on evaluating and contrasting recent studies in the field of iot intrusion detection, a review regarding the iot idss is offered with regard to the methodologies, datasets and machine learning (ml) algorithms. in this study, the strengths and limitations of recent iot intrusion detection techniques are determined, recent datasets collected from real or simulated iot environment are explored, high-performing ml methods are discovered, and the gap in recent studies is identified. index terms: internet of thing, intrusion detection, intrusion detection system techniques, intrusion detection system datasets, supervised machine learning r e v i e w a r t i c l e uhd journal of science and technology corresponding author’s e-mail: azeez rahman abdulla, technical college of informatics, sulaimani polytechnic university, sulaimani 46001, kurdistan region, iraq. azeez.rahman.a@spu.edu.iq received: 18-10-2022 accepted: 23-12-2022 published: 01-03-2023 access this article online doi: 10.21928/uhdjst.v7n1y2023.pp53-65 e-issn: 2521-4217 p-issn: 2521-4209 copyright © 2023 abdulla and jameel. this is an open access article distributed under the creative commons attribution non-commercial no derivatives license 4.0 (cc by-nc-nd 4.0) abdulla and jameel: a review on iot intrusion detection systems 54 uhd journal of science and technology | jan 2023 | vol 7 | issue 1 • authenticity: the word “authenticity” relates to the ability to prove one’s identity. the system should be able to recognize the identity of the entity with whom it is communicating [5]. • non-repudiation: this guarantees that nothing can be rejected. in an iot context, a node cannot reject a message or piece of data that has already been sent to another node or a user [6]. • data freshness: ensures that no outdated messages are retransmitted by an attacker [7]. in the last few years, advancement in artificial intelligent (ai) such as machine learning (ml) techniques has been used to improve iot intrusion detection system (ids). numerous studies as [8,9], reviewed and compared different applied ml algorithms and techniques through various datasets to validate the development of iot idss. however, it’s still not clear a recent dataset collected from iot environment, and which ml model was more effective for building an efficient iot ids. 
therefore, the current requirement is to do an upto-date review to identify these critical points. in this study, a survey of the iot idss is given. this paper aims to further the knowledge in regard to iot cyber attacks’ characteristics (motivation and capabilities). then, strengths and limitations of different categories of idss techniques (hybrid, anomaly-based, signature-based, and specificationbased) are compared. moreover, the study presents a review on the recent researches in the area of iot intrusion detection using ml algorithms for iot network based on the datasets, algorithms and evaluation metrics to identify the recent iot dataset and the outperformed ml algorithm in terms of accuracy used for iot intrusion detection. the paper is structured as follows: in section 2, common cyber-attacks in iot the environment are clarified. in section 3 the strengths and limitations of iot intrusion detection techniques are discussed. section 4 discussed, analyzed and compared recent iot intrusion detection researches’ performance metrics, datasets and supervised ml algorithms. finally, section 5 illustrates the conclusions of the paper. 2. iot cyber attacks recently, iot has developed quickly, making it the fastestg rowing enor mous impact of technolog y on social interactions and workplace environments, including education, healthcare and commerce. this technology is used for storing the private data of people and businesses, for financial data transactions, for product development and for marketing. due to the widespread adoption of linked devices in the iot, there is a huge global demand for strong security. millions or perhaps billions of connected devices and services are now available [10-13]. every day, there are more risks and assaults have gotten more frequent and sophisticated. in addition, sophisticated technologies are becoming more readily available to potential attackers [14,15]. to realize its full potential, iot must be secured against threats and weaknesses [16]. by maintaining the confidentiality and integrity of information about the object and making that information easily accessible whenever it is needed, security is the act of avoiding physical injury, unauthorized access, theft, or loss to the item [17]. to ensure iot security, it is crucial to maintain the greatest inherent value of both tangible items (devices) and intangible ones (services, information and data). system risks and vulnerabilities must be identified in order to provide a comprehensive set of security criteria to assess if the security solution is secure against malicious assaults or not [18]. attacks are performed to damage a system or obstruct regular operations by utilizing various strategies and tools to exploit vulnerabilities. attackers launch attacks to achieve goals, either for their personal satisfaction or to exact revenge [19]. common iot cyber-attack types are: • physical attacks: these assaults tamper with hardware elements. most iot devices often operate in outdoor areas which are extremely vulnerable to the physical assaults [20]. • attacks known as reconnaissance include the illegal identification of systems, services, or vulnerabilities. the scanning of network ports is an example of a reconnaissance attack [21]. • denial-of-service (dos): this type of attack aims to prevent the targeted users from accessing a computer or network resource. the majority of iot devices are susceptible to resource enervation attacks due to their limited capacity for memory and compute resources [22]. 
• access attacks happen when unauthorized users get access to networks or devices that they are not allowed to use. two types of access assaults exist: the first is physical access, in which a hacker gains access to a real object. the second is using ip-connected devices for remote access [22]. • attacks on privacy: iot privacy protection has grown more difficult as a result of the volume of information that is readily accessible via remote access techniques [14]. • cyber-crimes: users and data are used for hedonistic activities including fraud, brand theft, identity theft, and abdulla and jameel: a review on iot intrusion detection systems uhd journal of science and technology | jan 2023 | vol 7 | issue 1 55 theft of intellectual property using internet and smart products [14,15,23]. • destr uctive attacks: space is exploited to cause widespread disturbance and property and human life loss. terrorism and retaliation are two examples of damaging assaults. • supervisory control and data acquisition (scada) attacks: scada systems are connected to industrial iot networks; they are active devices in real-time industrial networks, which allow the remote monitoring and control of processes, even when the devices are located in remote areas. the most specific and common types of scada attacks are eavesdropping, man-in-the middle, masquerading, and malware [24]. 3. iot intrusion detection system despite the investment and potential it holds, there are still issues that prevent iot from becoming a widely utilized technology. the security challenges with iot are thought to be solvable via intrusion detection, which has been established for more than 30 years. intrusion detection is often a system (referred to as ids) which consists of tools or methods that analyze system activity to find assaults or unauthorized access. an ids typically comprises of sensors, and a tool to evaluate the data from these sensors. efficient and accurate intrusion detection solutions are necessary in the iot environment to identify various security risks [25]. 3.1. iot intrusion detection types ids types can be categorized in a variety of ways, particularly ids for iot as the majority of them are still being studied. according to das et al., [26] the research distinguishes three types of ids: • host-based ids (hids): to keep an eye on the system’s harmful or malicious activity, hids is connected to the server. specifically, hids examines changes in fileto-file communication, network traffic, system calls, running processes, and application logs. this sort of ids’s drawback is that it can only identify attacks on the systems it supports. • network-based ids (nids): nids analyzes network traffic for attack activities and identifies harmful behavior on network lines. • distributed ids (dids): dids will have a large number of linked and dispersed idss for attack detection, incident monitoring and anomaly detection. to monitor and respond to outside actions, dids needs a central ser ver with strong computing and orchestration capabilities. 3.2. iot intrusion detection techniques there are four basic types or methodologies for deploying iot intrusion detection. • anomaly based ids in iot. it uses anomaly based ids to find intrusions and monitor abusive behavior. it employs a threshold to determine if this behavior is typical or abnormal. these idss have the ability to monitor a typical iot network’s activity and set a threshold. 
to detect abnormalities, the network’s activity is compared to a threshold and any deviation from this number is considered abnormal [27]. table 1 compares and contrasts the strength and limitations of several anomaly-based idss methodologies based on resource and energy usage, detection accuracy and speed. • signature based ids in iot signature based detections compare the network’s current activity to pre-defined attack patterns. each signature is connected to a particular assault since signatures are originally established and stored on the iot device. signature based approaches are commonly used and require a signature for each assault [27]. the strengths and limitations of different signature based idss techniques have been presented and compared in table 2 based on resource consumption, energy, detection accuracy, and speed. • specification based ids in iot specification-based approaches detect intrusions when network behavior deviates from specification definitions. therefore, specification-based detection has the same purpose of anomaly-based detection. however, there is one important difference between these methods: in specification-based approaches, a human expert should manually define the rules of each specification [36]. the main aspects of specification-based idss have been outlined and then compared in table 3 based on resource consumption, energy, detection accuracy, and speed. • hybrid ids in iot signature based ids has a large usable capacity and limited number of attack detections while anomaly based ids has a high false positive rate and significant computation costs. a hybrid technique was suggested to solve the flaws of both systems [42]. the main characteristics of hybrid idss have been defined and then compared in table 4 based on resource consumption, energy, detection accuracy, and speed. abdulla and jameel: a review on iot intrusion detection systems 56 uhd journal of science and technology | jan 2023 | vol 7 | issue 1 4. supervised ml based iot intrusion detection ml enables computer systems to predict events more correctly without being explicitly taught to do so. it is a subset of artificial intelligence (ai). ml algorithms use historical data as input to anticipate new output values. ml algorithms are mainly divided into three categories: reinforcement learning, unsupervised learning, and supervised learning. in this paper, recent researches using supervised ml algorithms table 1: comparison of different anomaly based ids techniques reference no. technique strength limitations [28] utilizing a fusion based technique to decrease the damage caused by strikes. ● low communication overhead ● high energy consumption [29] detecting wormhole attacks using node position and neighbor information. ● low resource consumption ● real time ● energy efficient ● only one type of attack can be detected [30] detecting sinkhole attacks by analyzing the behavior of devices ● detection accuracy is high ● detect limited number of attacks [31] a lightweight technique for identifying normal and deviant behavior ● lightweight implementation ● detection accuracy is high ● high computational overhead [32] a request-response method’s correlation functions are used to look for unusual network server activity ● consuming modest resources ● lightweight detection system ● high computational overhead ids: intrusion detection system table 2: comparison of different signature based ids techniques reference no. 
technique strength limitations [33] detecting network attacks by signature code in ip based ubiquitous sensor networks ● high detection accuracy ● low energy and resource consumption ● can detect limited number of intrusions [34] the pattern-matching engine is used to detect malicious nodes using auxiliary shifting and early decision techniques ● low memory and computational complexity ● maximum speed up ● not real‑time ● can detect limited number of intrusions [35] detection of malware signature detection using reversible sketch structure based on cloud. ● fast ● low communication consumption ● high detection accuracy ● high memory requirement ● has a limited ability to identify assaults ids: intrusion detection system table 3: comparison of different specification based ids techniques reference no. technique strength limitations [37] mitigation of black hole attacks using an effective strategy in routing protocol for low‑power and lossy (rpl) networks ● low delay ● high detection accuracy of the infected node ● only black hole attacks can be detected [38] detecting internal attacks by designing a secure routing protocol based on reputation mechanism ● detection accuracy is acceptable ● low delay ● needs skilled administration [39] topology assaults detection on rpl using semi‑automated profiling tool. ● detection accuracy is high ● low energy consumption ● low computation overhead ● high overhead [40] sinkhole attacks are detected using a constraint based specification intrusion detection approach. ● low overhead ● minimal energy usage ● not real‑time [41] using a game-theoretic method to identify deceptive attacks in iot network with honeypots. ● high detection accuracy ● real‑time ● needs additional resources. ● high converge time ids: intrusion detection system abdulla and jameel: a review on iot intrusion detection systems uhd journal of science and technology | jan 2023 | vol 7 | issue 1 57 in the area of iot intrusion detection were studied, analyzed and compared. supervised learning emphasis on discovering patterns while utilizing labeled datasets. in supervised learning, the machine must be fed sample data with different characteristics (expressed as “x”) and the right value output of the data (represented as “y”). the dataset is considered “labeled” because the output and feature values are known. then, the algorithm analyzes data patterns to develop a model that can replicate the same fundamental principles with new data [46]. 4.1. datasets used for iot intrusion detection models for supervised ml are trained and evaluated using datasets. any ids’s performance ultimately depends on the dataset’s quality including whether it can reliably identify assaults or not [47]. here, six datasets named nsl-kdd, unswnb15, cicids 2017, bot-iot, ds2os, and iotid20 are considered and used by researchers to train and test iot intrusion detection models. descriptions of the datasets are given below and their characteristics are summarized in table 5. • nsl-kdd the nsl-kdd dataset is an improved version of the kdd99. it does not include redundant records in the train set, so the classifiers will not be biased towards more frequent records. the number of selected records from each difficulty level group is inversely proportional to the percentage of records in the original kdd data set [47]. the nsl-kdd dataset has 41 characteristics, classified into three categories: basic characteristics, content characteristics, and traffic characteristics. 
table 4: comparison of different hybrid ids techniques reference no. technique strength limitations [42] employing a game theoretic approach to identify attackers by using anomaly detection only when a new attack pattern is anticipated and using signature based detection otherwise. • detection accuracy is high • low energy consumption • high resource consumption • delay [43] the denial of service prevention manager is proposed, which uses aberrant activity detection and matching with attack signatures. • real time • high resource consumption [44] real-time attack detection using knowledgeable, self‑adapting expert intrusion detection system. • high detection accuracy • real time • low resource consumption • high computational overhead [45] attackers can be found by looking for timing irregularities while broadcasting the most recent rank to nearby nodes and using a timestamp. • real time • low overhead • low delay • high detection accuracy • high computation overhead • high resource consumption [27] targeting the routing attacks with an ids with integrated mini‑firewall which uses anomaly-based ids in the intrusion detection and signature‑based ids in the mini‑firewall • real time • high availability • low overhead • limited in dynamic network topology • high‑resource consumption • low detection accuracy ids: intrusion detection system table 5: dataset characteristics dataset year dataset link (url) no. of instances no. of features dataset collection performed on iot environment type of dataset nslkdd 2009 https://www.unb.ca/cic/datasets/nsl. html 148,519 41 no imbalanced unsw-nb15 2015 https://research.unsw.edu.au/ projects/unsw‑nb15‑dataset 2,540,044 49 no imbalanced cicids2017 2017 https://www.unb.ca/cic/datasets/ ids-2017.html 2,830,743 83 no imbalanced botiot 2019 https://ieee-dataport.org/documents/ bot-iot-dataset 73,370,443 29 yes imbalanced ds2os 2018 https://www.kaggle.com/datasets/ francoisxa/ds2ostraffictraces 409,972 13 yes imbalanced iotid20 2020 https://sites.google.com/view/ iot‑network‑intrusion‑dataset/home 625,783 83 yes imbalanced https://www.unb.ca/cic/datasets/nsl.html https://www.unb.ca/cic/datasets/nsl.html https://research.unsw.edu.au/projects/unsw-nb15-dataset https://research.unsw.edu.au/projects/unsw-nb15-dataset https://www.unb.ca/cic/datasets/ids-2017.html https://www.unb.ca/cic/datasets/ids-2017.html https://ieee-dataport.org/documents/bot-iot-dataset https://ieee-dataport.org/documents/bot-iot-dataset https://sites.google.com/view/iot-network-intrusion-dataset/home https://sites.google.com/view/iot-network-intrusion-dataset/home abdulla and jameel: a review on iot intrusion detection systems 58 uhd journal of science and technology | jan 2023 | vol 7 | issue 1 • unsw-nb15 the unsw-nb15 dataset was published in 2015. it was created by establishing the synthetic environment at the unsw cyber security lab. unsw-nb15 represents nine major families of attacks by utilizing the ixia perfect storm tool. ixia tool has provided the capability to generate a modern representative of the real modern normal and the abnormal network traffic in the synthetic environment. there are 49 features and nine types of attack categories known as the analysis, fuzzers, backdoors, dos, exploits, reconnaissance, generic, shellcode, and worms [48]. • cicids 2017 the cicids 2017 dataset generated in 2017. it includes benign and seven common family of attacks that met real worlds criteria such as dos, ddos, brute force, xss, sql injection, infiltration, port scan, and botnet. 
the dataset is completely labeled with 83 network traffic features extracted and calculated for all benign and attack network flows [49]. • bot-iot the bot-iot dataset was created by designing a testbed network environment in the research cyber range lab of unsw canberra. this dataset consists of legitimate and simulated iot network traffic along with various types of attacks such as information gathering (probing attacks), denial of service and information theft. it has been labeled with the label features indicating an attack flow, the attacks category and subcategory for possible multiclass classification purposes [50]. • ds2os this dataset includes traces that were recorded using the iot platform ds2os. labeled and unlabeled datasets come in two varieties. the only characteristics in an unlabeled dataset that can be used describe the data objects for unsupervised ml models. in addition, a labeled dataset includes information about each data instance’s class and utilized for supervised ml models [51]. • iotid20 iotid20 dataset is used for anomalous activity detection in iot networks. the testbed for the iotid20 dataset is a combination of iot devices and interconnecting structures. the dataset consists of various types of iot attacks and a large number of flow-based features. the flow-based features can be used to analyze and evaluate a flow-based ids. the final version of the iotid20 dataset consists of 83 network features and three label features [52]. 4.2. supervised ml algorithms used for iot intrusion detection for iot intrusion detection, many supervised ml methods are employed. the list of used algorithms with corresponding descriptions is presented below: • logistic regression (lr): it is a probability-based method for predictive analysis. it is a more effective strategy for binary and linear classification issues because it employs the sigmoid function to translate expected values to probabilities between 0 and 1. it is a classification model that is relatively simple to implement and performs extremely well with linearly separable data classes [53]. • naïve base (nb): are a group of bayes’ theorem-based categorization methods. it is a family of algorithms rather than a single method and they all operate under the same guiding principle in which each pair of characteristics is categorized standalone [53]. • artificial neural networks (ann): the biological neural network in the human brain served as the model for the widely used ml technology known as (ann). each artificial neuron’s weight values are sent to the following layer as an output. feed-forward neural network form of ann that processes inputs from neurons in the previous layer. multilayer perception is a significant type of feed forward neural networks (mlp). the most well-known mlp training method that modifies the weights between neurons to reduce error is called the back propagation algorithm. the system can display sluggish convergence and run the danger of a local optimum, but it can rapidly adapt to new data values [54]. • support vector machine (svm): this algorithm looks for a hyperplane to optimize the distance involving two classes. a learning foundation for upcoming data processing is provided by the categorization. the groups are divided into several configurations by the algorithm through hyperplanes (lines). a learning model that splits up new examples into several categories is produced by svm. based on these functions, svms are referred to as non-probabilistic, or binary linear classifiers. 
in situations that use probabilistic classification, svms can use methods such as platt scaling [53]. • decision tree (dt) is a tree in which each internal node represents an assessment of an attribute. each branch represents the result of an assessment and each leaf node denotes the classification outcome. algorithms such as id3, cart, c4.5, and c5.0 are frequently used to generate decision trees. by analyzing the samples, a decision tree is obtained and used to correctly classify new data [55]. abdulla and jameel: a review on iot intrusion detection systems uhd journal of science and technology | jan 2023 | vol 7 | issue 1 59 • random forest (rf) is a technique used to create a forest of decision trees. this algorithm is frequently used due to its fast operation. countless decision trees can be used to create a random forest. by averaging the outcomes of each component tree’s forecast, this method generates predictions. random forests exhibit compelling accuracy results and are less likely to overfit the data than a traditional decision tree technique. this method works well while examining plenty of data [53]. • ensemble learning (which includes bag ging and boosting). the boosting method is a well-known ensemble learning method for improving the performance and accuracy of ml systems. the fundamental idea behind the boosting strategy is the successive addition of models to the ensemble. weak learners (base learners) are efficiently elevated to strong learners. as a consequence, it aids in reducing variation and bias and raising prediction accuracy. boosting is an iterative method that alters the findings of an observation’s weight depending on the most recent categorization. adaboost (ab), gradient boosting machines (gbm), and extreme gradient boosting (xgboost) are examples of boosting techniques. bag ging (also known as bootstrap aggregating). it is one of the earliest and most basic ensemble ml approaches and it works well for issues requiring little in the way of training data. in this approach, a collection of original models with replacement are trained using random subsets of data acquired using the bootstrap sampling method. the individual output models derived from bootstrap samples are combined by majority voting [56]. 4.3. evaluation metrics the efficiency of ml algorithms can be measured using metrics such as accuracy, precision, recall, and f1-score [57]. performance metrics are calculated using different parameters called true positive (tp), false positive (fp), true negative (tn), and false negative (fn). for ids s, these parameters are described as follow: tp = the number of cases correctly identified as attack. fp = the number of cases incorrectly identified as attack. tn = the number of cases correctly identified as normal. fn = the number of cases incorrectly identified as normal. • precision (also called positive predictive value) is the percentage of retrieved occurrences that are relevant. model performance is considered better if the precision is higher [58]. precision is computed using (1) [59]. p r e c i s i o n � = t r u e � p o s t i v e t r u e p o s t i v e +f a l s e � p o s t i v e (1) • recall (also known as sensitivity) is the percentage of occurrences that were found to be relevant. it also goes by the name true positive rate (tpr) and calculated using (2) [58]. 
r e c a l l � =� t r u e � p o s t i v e t r u e p o s t i v e +f a l s e � n e g a t i v e (2) • accuracy is the most intuitive performance measure and it is simply a ratio of correctly predicted observation to the total observations. model accuracy is calculated using (3) [57]. a c c u r a c y � = t r u e � p o s t i v e +t r u e � n e g a t i v e t o t a l (3) • f1-score is the harmonic mean of recall and accuracy [60] which defines a the weighted average of recall and precision and calculated using (4) [57]. f 1 s c o r e � =� 2 � * p r e c i s i o n � *� r e c a l l p r e c i s i o n � +� r e c a l l (4) • roc curve is a receiver operating characteristic curve which shows the performance of a classifier at various thresholds level [57]. • area under curve (auc): is closely associated with the concept of roc. it represents the area under the roc curve. it has been extensively used as a performance measure for classification models in ml. its values range from 0 to 1. the higher the value, the better the model is [61]. 4.4. analysis and comparison of supervised ml algorithms for iot intrusion detection in this section, the analysis of the used ml algorithms has been presented and discussed. researchers used many supervised ml algorithms specifically in classification and they performed well in some cases with very high accuracy. to review researches in the area of intrusion detection using ml in the iot environment, various recent studies are examined and compared based on the ml algorithms (classifier), datasets, type of classification, and performance of the classifier. the performance of these algorithms depends on various metrics. in this study, the comparison among the algorithms is focused on accuracy metric. detailed abdulla and jameel: a review on iot intrusion detection systems 60 uhd journal of science and technology | jan 2023 | vol 7 | issue 1 review of 21 papers (published between 2019 and 2022) was analyzed in this section and compared in table 6. mahmudul et al. [62] employed the ds2os dataset with several ml algorithms (lr, svm, dt, rf, ann). accuracy, table 6: comparison of the selected supervised ml based iot ids reference no. year ml algorithm (classifier) dataset classification type classifier accuracy [62] 2019 lr, svm, dt, rf, ann ds2os multiclass lr=0.983, svm=0.982, dt=0.994, rf=0.994, ann=0.994. [63] 2019 rf unsw-nb15 binary rf=99.34 [64] 2019 lr, nb, dt, rf, knn, svm kdd99, nsl-kdd, unsw-nb15 binary accuracy of the algorithms depend on the used dataset [65] 2019 for the level‑1 model, dt for level 2 model, rf cicids2017, unsw-15 2 level classification (binary then multiclass) both datasets’ specificity was 100% for the model, while its precision, recall, and f score were all 100% for the cicids2017 dataset and 97% for the unsw‑nb15 dataset [66] 2019 rf, ab, gbm, xgb, dt (cart), mlp, extremely randomized trees (etc) cidds-001, unsw-nb15, nsl-kdd binary average accuracy value for 4 datasets using holdout are: rf=94.94, gbm=92.98, xgb=93.15%, ab=90.37, cart=91.98, mlp=82.76, etc=82.99 [67] 2019 dt, nn, svm unsw-nb15 multiclass dt=89.76, nn=86.7و svm=78.77, proposed model: 88.92 [68] 2019 nb, qda, rf, id3, ab, mlp, knn bot-iot binary. 
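to illustrate how the metrics in equations (1)-(4) are typically obtained in practice, the following is a minimal sketch using scikit-learn; the synthetic data is purely an assumption standing in for a labeled iot traffic dataset (0 = normal, 1 = attack) and does not correspond to any of the datasets in table 5.

```python
# Minimal sketch: train a classifier and compute the metrics of eqs. (1)-(4).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for a labeled intrusion dataset.
X, y = make_classification(n_samples=5000, n_features=20, weights=[0.8, 0.2],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
y_pred = clf.predict(X_test)
y_prob = clf.predict_proba(X_test)[:, 1]

print("precision:", precision_score(y_test, y_pred))   # eq. (1): TP / (TP + FP)
print("recall   :", recall_score(y_test, y_pred))      # eq. (2): TP / (TP + FN)
print("accuracy :", accuracy_score(y_test, y_pred))    # eq. (3)
print("f1-score :", f1_score(y_test, y_pred))          # eq. (4)
print("auc      :", roc_auc_score(y_test, y_prob))     # area under the ROC curve
```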
nb=0.78, qda=0.88, rf=0.98, id3=0.99, adaboost=1.0, mlp=0.84, knn=0.99 [69] 2019 svm, lr, d t, knn, rf unsw-nb15, their own dataset binary the accuracy depends on the dataset and the algorithm [58] 2020 rf, xgb, dt, mlp, gb, et, lr unsw-nb15 binary results with all features: rf=0.9516, xgb=0.9481, dt=0.9387, mlp=0.9371, gb=0.9331, et=0.9501, lr=0.8984 [53] 2020 knn, svm, dt, nb, rf, ann, lr bot-iot binary, multiclass on binary classification: knn=0.99, svm=0.99, dt=1.0, nb=0.99, rf=1.0 ann=0.99, lr=0.99 [70] 2020 svm, nb, dt, adaboost their own synthetic called (sensor480) binary svm=0.9895, nb=0.9789, dt=1.0000, adaboost=0.9895 [71] 2020 rf iotid20 dataset binary based on the attack type the accuracy result depends on the attack type [72] 2021 svm nsl-kdd, unsw-nb15. binary, multiclass the accuracy depends on the dataset, the type of classification and number of features [55] 2021 rf, svm, ann unsw-nb15. binary, multiclass all features: rf with binary=98.67, multi‑class=97.37, svm in binary=97.69, multiclass=95.67, ann in binary=94.78, multiclass=91.67 [73] 2021 lr, svm, dt, ann iotid20, bot-iot multiclass the results are based on the dataset and the categories of attacks [74] 2021 slfn iotid20 binary the proposed model=0.9351 [75] 2021 svm, gbdt, rf nsl kdd binary svm=32.38, gbdt=78.01, rf=85.34 [76] 2021 b-stacking cicids2017, nsl-kdd multiclass accuracy for cicids2017 is 99.11% accuracy for nsl-kdd approximately is 98.5% [77] 2022 dt, rf, gbm iot2020 binary dt=0.978305, rf=0.978443, gbm=0.9636 [78] 2022 shallow neural networks (snn), bagging trees (bt), dt, svm, knn iotid20 binary, multiclass for binary classification all models achieved 100% for multiclass: snn=100%, dt=99.9%, bt=99.9%, svm=99,8%, knn=99.4% [79] 2022 ann, dt (c4.5), bagging, knn, ensemble iotid20, nsl-kdd binary, multiclass accuracy depends on feature selection approaches, datasets, and attack type for multiclass classification abdulla and jameel: a review on iot intrusion detection systems uhd journal of science and technology | jan 2023 | vol 7 | issue 1 61 precision, recall, f1 score, and area under the receiver operating characteristic curve are the assessment measures used to compare performance. the measurements show that rf performs comparably higher performance, and the system acquired excellent accuracy (ibrahim et al. [63]). an intelligent anomaly detection system called anomaly detection iot (ad-iot) which used the unsw-nb15 dataset and rf to identify binary labeled categorization had been proposed. the results demonstrated that the ad-iot could successfully produce the best classification accuracy while minimizing the false positive rate. samir et al. in [64] used the datasets kdd99, nsl-kdd, and unsw-nb15 to assess number of ml models. the kkn and lr algorithms produced the best results on the unsw-nb15 dataset while the nb algorithm produced the worst results. on the nsl-kdd dataset, the dt classifier outperformed the others in terms of various metrics while on the kdd99 dataset, svm and mlp produced a low false positive rate in comparison to other algorithms. the findings of this study showed that the dt and knn algorithms outperformed the other algorithms. however, the knn required more time to categorize data than the dt. imtiaz and qusay [65] conducted a two-level framework experiment for iot intrusion detection. 
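the accuracy values collected in table 6 are all derived from the confusion-matrix counts defined in section 4.3. as a quick recap of equations (1)-(4), the following minimal python sketch computes the four metrics from a set of predictions; the label vectors are hypothetical placeholders rather than results from any of the reviewed studies.

```python
# minimal sketch: metrics of equations (1)-(4) from tp/fp/tn/fn counts;
# the example label vectors below are hypothetical placeholders.

def ids_metrics(y_true, y_pred, attack=1):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == attack and p == attack)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != attack and p == attack)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != attack and p != attack)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == attack and p != attack)
    precision = tp / (tp + fp)                           # equation (1)
    recall = tp / (tp + fn)                              # equation (2)
    accuracy = (tp + tn) / (tp + fp + tn + fn)           # equation (3)
    f1 = 2 * precision * recall / (precision + recall)   # equation (4)
    return precision, recall, accuracy, f1

# hypothetical binary labels: 1 = attack, 0 = normal
y_true = [1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 1, 1, 0]
print(ids_metrics(y_true, y_pred))
```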
to determine the category of the anomaly, they chose a dt classifier for the level-1 model, which categorized the network flow as normal or anomalous and forwarded the network flow to the level-2 model. rf was used as the level-2 model for multiclass categorization. abhishek and virender [66] employed both ensemble and single classifiers, two different types of classification techniques. the selection of the aforementioned classification algorithms was primarily influenced by the huge number of input characteristics, which are vulnerable to overfitting. as a result, random search was used to determine the best input parameters for rf, ab, xgb, and gbm. in terms of precision, rf beats the other classifiers; however, ab performs the worst of all the classifiers. using friedman test statistics and 10-fold validation, the results showed that the classifiers' performances are considerably varied. regarding the average time required by the classifiers to categorize a single instance, cart classifies instances of cidds-001, unsw-nb15, kddtrain+, and kddtest+ faster than the other classifiers. vikash et al. [67] proposed uids, an ids using the unsw-nb15 dataset. network traffic accuracy and attack detection rate were improved by the suggested approach. in addition, it examined the data using several ml techniques (c5, neural network, svm, and the uids model) and came to the conclusion that uids compared favorably with the other ml techniques. analysis showed that the false alarm rate (far) on the unsw-nb15 dataset was reduced with only 13 characteristics. jadel and khalid [68] tested seven ml algorithms. all the algorithms, except naive bayes (nb) and quadratic discriminant analysis (qda), achieved the highest success in detecting almost all attack types. it can be seen that adaboost was the best performing algorithm, followed by knn and id3; id3 is noticeably faster than knn. the accuracy of the algorithms depends on the entire dataset with the seven best features obtained in the feature selection step. aritro et al. [69] analyzed the role of a set of chosen ml techniques for iot intrusion detection based on datasets/flows from two layers: the application layer (host based) and the network layer (network based). for the application layer they created their own dataset from an iot environment, while for the network layer they used the unsw-nb15 dataset. according to the results for both datasets, rf was the best algorithm in terms of accuracy and lr was the fastest in terms of speed. mohammad [58] used different algorithms. the classifiers random forest (rf) and extra trees (et) performed better than the others, and rf is the best of the two. only 14 features were chosen by rf utilizing feature selection, but the performance results were remarkably similar to those achieved with all features. in addition, compared to the others, the lr classifier had the lowest accuracy. andrew et al. [53] employed different methods; nevertheless, the findings show that rf performed better regarding precision and accuracy on the non-weighted dataset. however, ann performed more accurately in binary classification using the weighted dataset. knn and ann performed extremely well in multi-classification for the weighted and non-weighted datasets, respectively. the findings made it clear that ann accurately predicted the kind of attack. k. v. v. n. l et al. [70] tested four ml techniques on iot traffic in order to distinguish between genuine and attack traffic; their findings are summarized after the following sketch.
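most of the reviewed studies follow a broadly similar supervised pipeline: load a labeled flow dataset, split it into training and test sets, fit a tree-based classifier, and report the metrics from section 4.3. a minimal, hypothetical scikit-learn sketch of that pipeline is shown below; the csv file name and the label column are placeholders and do not correspond to the code or datasets of the cited papers.

```python
# hypothetical sketch of the common supervised ids pipeline
# (placeholder file name and label column; not a specific study's code).
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

df = pd.read_csv("iot_flows.csv")       # placeholder dataset of labeled flows
X = df.drop(columns=["label"])          # placeholder feature columns
y = df["label"]                         # e.g., 0 = normal, 1 = attack

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
print("accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
```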
using decision trees, all of the analyzed data may be precisely categorized into the correct classes. decision trees also had the greatest accuracy compared to the other classifiers. pascal et al. [71] suggested a new anomaly-based detection using hybrid feature selection for iot networks using iotid20 dataset. the relevant features were fed to the rf algorithm. based on the attack category, the network traffic is classified as normal and attack category as dos, scan, or mitm. nsikak et al. [72] tested svm with dataset nsl-kdd and unsw-nb15 datasets. the results using different numbers of features for both datasets were varied. the classification accuracy using binary classification was greater than multi-class according to the evaluation results. muhammad et al. [55], the unsw-nb15 dataset had been subjected to supervise ml including rf, svm, and ann. abdulla and jameel: a review on iot intrusion detection systems 62 uhd journal of science and technology | jan 2023 | vol 7 | issue 1 the application of rf using mean imputation produced the greatest accuracy in binary classification. overall, there were not many differences in accuracy across the different imputation strategies. by using rf on a regression-imputed dataset, the greatest accuracy in multi-class classification was also attained. in addition, as compared to other cutting-edge supervised ml-based techniques, rf achieved greater accuracy with less training time for clustered based classification. khalid et al. [73] for classification objectives, the performance of four ml methods was assessed. the bot-iot dataset and the iotid20 dataset were both utilized in the study, 5% of bot-iot dataset was selected with a full set of features, while the second dataset was fully selected in the experiment. the accuracy results were based on the dataset and the categories of attacks. raneem et al. [74] developed an intrusion detection method using a single layer forward neural network (slfn) classifier with iotid20. the results showed that the slfn classification approach outperformed other classification algorithms. maryam et al. [75] proposed that three ml algorithms rf, gdbt, and svm were applied to the nsl-kdd dataset using binary classification. the results showed that the rf obtained the highest accuracy on the fog layer while svm obtained lowest accuracy. souradipst et al. [76] proposed b-stacking approach as an intrusion detection model to detect cyber-attacks and anomalies in iot networks. b-stacking is based on a combination of two ensemble algorithms; boosting and stacking. it chose knn, rf, and xgboost as the level-0 weak learners. xgboost is also used as the level-1 learner. the experimental results on two popular datasets showed that the model had a high detection rate and a low false alarm rate. most importantly, the proposed model is lightweight and can be deployed on iot nodes with limited power and storage capabilities. jingyi et al. [77] used dt, rf, and gbm ml algorithms with a dataset generated from the iotid20 dataset known as iot2020 dataset. according to the results, the dt algorithm performed more accurately than the other algorithms, but rf had better auc score. abdulaziz et al. [78] proposed an anomaly intrusion detection in an iot system. five supervised ml models were implemented to characterize their performance in detecting and classifying network activities with feature engineering and data preprocessing framework. 
based on experimental evaluation, the accuracy 100% recorded for the detection phase that distinguishes the normal and anomaly network activities. while for classifying network traffic into five attack categories, the implemented models of achieved 99.4-99.9%. khalid et al. [79] proposed and implemented an iot anomaly-based ids based on novel feature selection and extraction approach. the model framework was trained and tested on iotid20 and nslkdd datasets using four ml algorithms. the system scored a maximum detection accuracy of 99.98% for the proposed ml ensemble-based hybrid feature selection approach. from the literature, it is observed that there are extensive efforts on developing ids s for iot. several researchers have assessed the effectiveness of their systems using common datasets like nsl-kdd, unsw-nb15, and cicids2017. these datasets were not used captured traffic from iot environment. hence, an extensive work should be conducted using recent datasets such as iotid20 which consists of iot network traffic features. the state of the art also shows that some models perform well, particularly tree-based algorithms such as boosting, random forest and decision trees. ml algorithms’ performance outcomes vary depending on the used dataset, features, and classification category. 5. conclusion one of the most important technological progresses over the past decade was the widespread adoption of iot devices across industries and societies. with the development of iot, several obstacles have been raised. one of these obstacles is iot security which cannot be disregarded. iot networks are vulnerable to a variety of threats. although the iot network is protected by encryption and authentication, cyber attacks are still possible. therefore, using iot ids is important and necessary. this paper conducted an in-depth comprehensive analysis and comparison of various recent researches which used different techniques, datasets, ml algorithms and their performance for detecting iot intrusions. based upon the analysis, the recent iot dataset for intrusion detection is identified which is iotid20 dataset. furthermore, the ml algorithms that outperformed in most researches are treebased algorithms such as dt, rf, and boosting algorithms. many points were observed and needed further study like using and collecting real iot intrusion detection datasets for training and testing ml models, real time, and lightweight idss are required that need less detection time and resources consumption. all these factors should be taken into account while developing new iot idss. in addition, further study should be conducted to address recent iot threats, and the need to identify the best ids placement techniques that improve iot security while lowering the risk of cyber attacks. references [1] s. chen, h. xu, d. liu, b. hu and h. wang. “a vision of iot: applications, challenges, and opportunities with china perspective.” abdulla and jameel: a review on iot intrusion detection systems uhd journal of science and technology | jan 2023 | vol 7 | issue 1 63 ieee internet of things journal, vol. 1, no. 4, pp. 349-359, 2014. [2] s. li, l. d. xu and s. zhao. “the internet of things: a survey”. information systems frontiers, vol. 17, no. 2, pp. 243-259, 2015. [3] t. sherasiya and h. upadhyay. “intrusion detection system for internet of things”. international journal of advance research and innovative ideas in education, vol. 2, no. 3,  pp. 2244‑2249, 2016. [4] m. m. patel and a. aggarwal. 
“security attacks in wireless sensor networks: a survey”. in: 2013 international conference on intelligent systems and signal processing (issp). institute of electrical and electronics engineers, piscataway, new jersey, pp. 329-333, 2013. [5] s. n. kumar. “review on network security and cryptography”. international transaction of electrical and computer engineers system, vol. 3, no. 1, pp. 1-11, 2015. [6] r. s. m. joshitta, l. arockiam. “security in iot environment: a survey”. international journal of information technology and mechanical engineering, vol. 2, no. 7, pp. 1-8, 2016. [7] m. m. hossain, m. fotouhi and r. hasan. “towards an analysis of security issues, challenges, and open problems in the internet of things”. in: 2015 ieee world congress on services. institute of electrical and electronics engineers, piscataway, new jersey, pp. 21-28, 2015. [8] a. khraisat and a. alazab. “a critical review of intrusion detection systems in the internet of things: techniques, deployment strategy, validation strategy, attacks, public datasets and challenges”. cybersecurity, vol. 4, no. 1, pp. 1-27, 2021. [9] n. mishra and s. pandya. “internet of things applications, security challenges, attacks, intrusion detection, and future visions: a systematic review”. ieee access, vol. 9, pp. 59353-59377, 2021. [10] l. atzori, a. iera and g. morabito. “the internet of things: a survey,” journal of computer network, vol. 54, no. 15, pp. 2787-2805, 2010. [11] s. andreev and y. koucheryavy. “internet of things, smart spaces, and next generation networking”. vol. 7469. in: lecture notes in computer science. springer, berlin, germany, p. 464, 2012. [12] s. j. kumar and d. r. patel. “a survey on internet of things: security and privacy issues”. international journal of computer applications, vol. 90, no. 11, pp. 20-26, 2014. [13] j. du and s. chao. “a study of information security for m2m of iot”. in: 2010 3rd international conference on advanced computer theory and engineering (icacte). vol. 3. institute of electrical and electronics engineers, piscataway, new jersey, pp. v3‑ 576-v3-579, 2010. [14] b. schneier. secrets and lies: digital security in a networked world. john wiley and sons, hoboken, new jersey, 2015. [15] j. m. kizza. guide to computer network security. springer, berlin, germany, 2013. [16] m. taneja. “an analytics framework to detect compromised iot devices using mobility behavior”. in: 2013 international conference on ict convergence (ictc). institute of electrical and electronics engineers, piscataway, new jersey, pp. 38‑43, 2013. [17] g. m. koien and v. a. oleshchuk. “aspects of personal privacy in communications: problems, technology and solutions”. river publishers, denmark, 2013. [18] n. r. prasad. “threat model framework and methodology for personal networks (pns)”. in: 2007 2nd international conference on communication systems software and middleware. institute of electrical and electronics engineers, piscataway, new jersey, pp. 1-6, 2007. [19] s. o. amin, m. s. siddiqui, c. s. hong, and j. choe. “a novel coding scheme to implement signature based ids in ip based sensor networks”. in: 2009 ifip/ieee international symposium on integrated network management‑workshops. institute of electrical and electronics engineers, piscataway, new jersey, pp. 269‑274, 2009. [20] j. deogirikar and a. vidhate. “security attacks in iot: a survey”. in: 2017 international conference on i‑smac (iot in social, mobile, analytics and cloud) (i‑smac). 
institute of electrical and electronics engineers, piscataway, new jersey, pp. 32‑37, 2017. [21] s. ansari, s. rajeev and h. s. chandrashekar. “packet sniffing: a brief introduction”. ieee potentials, vol. 21, no. 5, pp. 17-19, 2003. [22] l. liang, k. zheng, q. sheng and x. huang. “a denial of service attack method for an iot system”. in: 2016 8th international conference on information technology in medicine and education (itme). institute of electrical and electronics engineers, piscataway, new jersey, pp. 360‑364, 2016. [23] c. wilson. “botnets, cybercrime, and cyberterrorism: vulnerabilities and policy issues for congress”. library of congress, congressional research service, washington, dc, 2008. [24] k. tsiknas, d. taketzis, k. demertzis, and c. skianis. “cyber threats to industrial iot: a survey on attacks and countermeasures”. iot, vol. 2, no. 1, pp. 163-186, 2021. [25] n. chakraborty and b. research. “intrusion detection system and intrusion prevention system: a comparative study”. international journal of computing and business research, vol. 4, no. 2, pp. 1-8, 2013. [26] n. das, t. sarkar. “survey on host and network‑based intrusion detection system”. international journal of advanced networking and applications, vol. 6, no. 2, p. 2266, 2014. [27] s. raza, l. wallgren and t. voigt. “svelte: real‑time intrusion detection in the internet of things”. ad hoc networks, vol. 11, no. 8, pp. 2661-2674, 2013. [28] p. y. chen, s. m. cheng and k. c. chen. “information fusion to defend intentional attack in internet of things”. ieee internet of things journal, vol. 1, no. 4, pp. 337-348, 2014. [29] p. pongle and g. chavan. “real time intrusion and wormhole attack detection in internet of things”. international journal of computer applications, vol. 121, no. 9, pp. 1-9. 2015. [30] c. cervantes, d. poplade, m. nogueira and a. santos. “detection of sinkhole attacks for supporting secure routing on 6lowpan for internet of things”. in: 2015 ifip/ieee international symposium on integrated network management (im). institute of electrical and electronics engineers, piscataway, new jersey, pp. 606‑611, 2015. [31] d. h. summerville, k. m. zach and y. chen. “ultra-lightweight deep packet anomaly detection for internet of things devices”. in: 2015 ieee 34th international performance computing and communications conference (ipccc). institute of electrical and electronics engineers, piscataway, new jersey, pp. 1‑8, 2015. [32] v. eliseev and a. gurina. “algorithms for network server anomaly behavior detection without traffic content inspection”. in: proceedings of the 9th international conference on security of information and networks. association for computing machinery, new york, pp. 67‑71, 2016. [33] s. o. amin, m. s. siddiqui, c. s. hong and s. lee. “implementing signature based ids in ip‑based sensor networks with the help of signature‑codes”. ieice transactions on communications, vol. 93, abdulla and jameel: a review on iot intrusion detection systems 64 uhd journal of science and technology | jan 2023 | vol 7 | issue 1 no. 2, pp. 389-391, 2010. [34] d. oh, d. kim and w. w. ro. “a malicious pattern detection engine for embedded security systems in the internet of things”. sensors, vol. 14, no. 12, pp. 24188-24211, 2014. [35] h. sun, x. wang, r. buyya and j. su. “cloudeyes: cloud‑based malware detection with reversible sketch for resource‑constrained internet of things (iot) devices”. journal of software practice and experience, vol. 47, no. 3, pp. 421-441, 2017. [36] l. santos, c. rabadao and r. 
gonçalves. “intrusion detection systems in internet of things: a literature review”. in: 2018 13th iberian conference on information systems and technologies (cisti). institute of electrical and electronics engineers, piscataway, new jersey, pp. 1‑7, 2018. [37] f. ahmed, y. b. ko. “mitigation of black hole attacks in routing protocol for low power and lossy networks”. security and communication networks, vol. 9, no. 18, pp. 5143-5154, 2016. [38] y. xia, h. lin and l. xu, “an agv mechanism based secure routing protocol for internet of things”. in: 2015 ieee international conference on computer and information technology; ubiquitous computing and communications; dependable, autonomic and secure computing; pervasive intelligence and computing. institute of electrical and electronics engineers, piscataway, new jersey, pp. 662-666, 2015. [39] a. le, j. loo, k. k. chai and m. aiash. “a specification‑based ids for detecting attacks on rpl‑based network topology”. information, vol. 7, no. 2, p. 25, 2016. [40] m. surendar and a. umamakeswari. “indres: an intrusion detection and response system for internet of things with 6lowpan.” in: 2016 international conference on wireless communications, signal processing and networking (wispnet). institute of electrical and electronics engineers, piscataway, new jersey, pp. 1903-1908, 2016. [41] q. d. la, t. q. s. quek, j. lee, s. jin and h. zhu. “deceptive attack and defense game in honeypot‑enabled networks for the internet of things”. ieee internet of things journal, vol. 3, no. 6, pp. 1025-1035, 2016. [42] h. sedjelmaci, s. m. senouci and m. al‑bahri. “a lightweight anomaly detection technique for low-resource iot devices: a game-theoretic methodology”. in: 2016 ieee international conference on communications (icc). institute of electrical and electronics engineers, piscataway, new jersey, pp. 1‑6 2016. [43] p. kasinathan, c. pastrone, m. a. spirito and m. vinkovits. “denialof-service detection in 6lowpan based internet of things.” in: 2013 ieee 9th international conference on wireless and mobile computing, networking and communications (wimob). institute of electrical and electronics engineers, piscataway, new jersey, pp. 600-607, 2013. [44] d. midi, a. rullo, a. mudgerikar, and e. bertino. “kalis-a system for knowledge-driven adaptable intrusion detection for the internet of things”. in: 2017 ieee 37th international conference on distributed computing systems (icdcs). ieee. institute of electrical and electronics engineers, piscataway, new jersey, pp. 656‑666, 2017. [45] t. matsunaga, k. toyoda and i. sasase. “low false alarm attackers detection in rpl by considering timing inconstancy between the rank measurements”. ieice communications express, vol. 4, no. 2, pp. 44-49, 2015. [46] m. praveena and v. jaiganesh. “a literature review on supervised machine learning algorithms and boosting process”. international journal of computer applications, vol. 169, no. 8, pp. 32-35, 2017. [47] m. tavallaee, e. bagheri, w. lu, and a. a. ghorbani. “a detailed analysis of the kdd cup 99 data set”. in: 2009 ieee symposium on computational intelligence for security and defense applications. institute of electrical and electronics engineers, piscataway, new jersey, pp. 1‑6, 2009. [48] n. moustafa and j. slay. “unsw-nb15: a comprehensive data set for network intrusion detection systems (unsw-nb15 network data set)”. in: 2015 military communications and information systems conference (milcis). ieee. 
institute of electrical and electronics engineers, piscataway, new jersey, pp. 1‑6, 2015. [49] i. sharafaldin, a. h. lashkari and a. a. ghorbani. “toward generating a new intrusion detection dataset and intrusion traffic characterization”.in: the international conference on information systems security and privacy. vol. 1, pp. 108-116, 2018. [50] n. koroniotis, n. moustafa, e. sitnikova and b. turnbull. “towards the development of realistic botnet dataset in the internet of things for network forensic analytics: bot‑iot dataset”. future generation computer systems, vol. 100, pp. 779-796, 2019. [51] f. x. aubet. “machine learning‑based adaptive anomaly detection in smart spaces”. b.sc. thesis, department of informatics, technische universität münchen, germany, 2018. [52] i. ullah and q. h. mahmoud. “a scheme for generating a dataset for anomalous activity detection in iot networks”. in: canadian conference on artificial intelligence. springer, berlin, germany, pp. 508-520, 2020. [53] a. churcher, r. ullah, j. ahmad, s. u. rehman, f. masood, m. gogate, f. alqahtani, b. nour and w. j. buchanan. “an experimental analysis of attack classification using machine learning in iot networks”. sensors, vol. 21, no. 2, p. 446, 2021. [54] r. olivas. “decision trees,” rafael olivas, san francisco, 2007. [55] m. ahmad, q. riaz, m. zeeshan, h. tahir, s. a. haider, m. s. khan. “intrusion detection in internet of things using supervised machine learning based on application and transport layer features using unsw‑nb15 data‑set”. journal on wireless communications and networking, vol. 2021, no. 1, pp. 1-23, 2021. [56] j. dou, a. p. yunus, d. t. bui, a. merghadi, m. sahana, z. zhu, c. w. chen, z. han, b. t. pham. “improved landslide assessment using support vector machine with bagging, boosting, and stacking ensemble machine learning framework in a mountainous watershed, japan”. landslide, vol. 17, no. 3, pp. 641-658, 2020. [57] t. saranya, s. sridevi, c. deisy, t. d. chung, and m. k. a. a. khan. “performance analysis of machine learning algorithms in intrusion detection system: a review”. procedia computer science, vol. 171, pp. 1251-1260, 2020. [58] m. shorfuzzaman. “detection of cyber attacks in iot using treebased ensemble and feedforward neural network”. in: 2020 ieee international conference on systems, man, and cybernetics (smc). institute of electrical and electronics engineers, piscataway, new jersey, pp. 2601‑2606, 2020. [59] d. l. streiner and g. r. norman. “precision” and “accuracy”: two terms that are neither”. journal of clinical epidemiology, vol. 59, no. 4, pp. 327-330, 2006. [60] d. chicco and g. jurman. “the advantages of the matthews correlation coefficient (mcc) over f1 score and accuracy in binary classification evaluation”. bmc genomics, vol. 21, no. 1, p. 6, 2020. [61] w. ma and m. a. lejeune. “a distributionally robust area under curve maximization model”. operations research letters, vol. 48, no. 4, pp. 460-466, 2020. [62] m. hasan, m. m. islam, m. i. i. zarif and m. m. a. hashem. “attack and anomaly detection in iot sensors in iot sites using machine abdulla and jameel: a review on iot intrusion detection systems uhd journal of science and technology | jan 2023 | vol 7 | issue 1 65 learning approaches”. internet of things, vol. 7, p. 100059, 2019. [63] i. alrashdi, a. alqazzaz, e. aloufi, r. alharthi, m. zohdy and h. ming. “ad-iot: anomaly detection of iot cyberattacks in smart city using machine learning”. 
in: 2019 ieee 9th annual computing and communication workshop and conference (ccwc). institute of electrical and electronics engineers, piscataway, new jersey, pp. 0305-0310, 2019. [64] s. fenanir, f. semchedine and a. baadache. “a machine learning‑ based lightweight intrusion detection system for the internet of things”. revue d intelligence artificielle, vol. 33, no. 3, pp. 203211, 2019. [65] i. ullah and q. h. mahmoud. “a two-level hybrid model for anomalous activity detection in iot networks”. in: 2019 16th ieee annual consumer communications and networking conference (ccnc). institute of electrical and electronics engineers, piscataway, new jersey, pp. 1‑6, 2019. [66] a. verma and v. ranga. “machine learning based intrusion detection systems for iot applications”. wireless personal communications, vol. 111, no. 4, pp. 2287-2310, 2020. [67] v. kumar, a. k. das, and d. sinha. “uids: a unified intrusion detection system for iot environment”. evolutionary intelligence, vol. 14, no. 1, pp. 47-59, 2021. [68] j. alsamiri and k. alsubhi. “internet of things cyber attacks detection using machine learning”. international journal of advanced computer science and applications, vol. 10, no. 12, pp. 628-634, 2019. [69] a. r. arko, s. h. khan, a. preety and m. h. biswas. “anomaly detection in iot using machine learning algorithms”. brac university, bangladesh, 2019. [70] k. v. v. n. l. s. kiran, r. n. k. devisetty, n. p. kalyan, k. mukundini, and r. karthi. “building a intrusion detection system for iot environment using machine learning techniques”. procedia computer science, vol. 171, pp. 2372-2379, 2020. [71] p. maniriho, e. niyigaba, z. bizimana, v. twiringiyimana, l. j. mahoro and t. ahmad. “anomaly-based intrusion detection approach for iot networks using machine learning”. in: 2020 international conference on computer engineering, network, and intelligent multimedia (cenim). institute of electrical and electronics engineers, piscataway, new jersey, pp. 303‑308, 2020. [72] n. p. owoh, m. m. singh, z. f. zaaba, and applications. “a hybrid intrusion detection model for identification of threats in internet of things environment”.international journal of advanced computer science and applications, vol. 12, no. 9, pp. 689-697, 2021. [73] k. albulayhi, a. a. smadi, f. t. sheldon and r. k. abercrombie. “iot intrusion detection taxonomy, reference architecture, and analyses”. sensors, vol. 21, no. 19, p. 6432, 2021. [74] r. qaddoura, a. m. al‑zoubi, h. faris and i. almomani. “a multi‑ layer classification approach for intrusion detection in iot networks based on deep learning”. sensors, vol. 21, no. 9, p. 2987, 2021. [75] m. anwer, s. m. khan, m. u. farooq and w. nazir. “attack detection in iot using machine learning”. engineering technology and applied science research, vol. 11, no. 3, pp. 7273-7278, 2021. [76] s. roy, j. li, b. j. choi and y. bai. “a lightweight supervised intrusion detection mechanism for iot networks”. future generation computer systems, vol. 127, pp. 276-285, 2022. [77] j. su, s. he and y. wu. “features selection and prediction for iot attacks”. high confidence computing, vol. 2, no. 2, p. 100047, 2022. [78] a. a. alsulami, q. abu al‑haija, a. tayeb, and a. alqahtani, “an intrusion detection and classification system for iot traffic with improved data engineering”. applied sciences, vol. 12, no. 23, p. 12336, 2022. [79] k. albulayhi, q. a. al‑haija, s. a. alsuhibany, a. a. jillepalli, m. ashrafuzzaman and f. t. sheldon. 
“iot intrusion detection using machine learning with a novel high performing feature selection method”. applied sciences, vol. 12, no. 10, p. 5015, 2022. . uhd journal of science and technology | april 2017 | vol 1 | issue 1 23 1. introduction electrical loads can be divided into several categories, including residential, industrial, commercial, and government. these components vary in the electrical system depending on the economic, political, social state of the country, etc. in previous research, diversity factor was been studied in the iraqi electricity distribution system. the study shows that household electrical loads have grown at high rates exceeded the standard values for stable systems [1]. another study conducted aims to use artificial neural network technology to guess household electrical loads [2]. residential loads represent biggest components in the iraqi electrical systems, due to low industrial and commercial loads components. residential electrical loads consist of many components, household appliances, lighting, space heating, cooling, and water heating. a previous field survey study was conducting in the city of mosul to specify these components. the study found that the water heating component was the largest component, 32.29% [3]. the current research aims to test the possibility of using solar water heaters to supply hot water in the housing units. solar water heaters were added with low rating electrical heater to provide supplementary heating for a number of residential units in mosul city. the total energy consumed, the energy consumed in the supplementary heaters, the amount of water consumed, and the water temperature in the solar heater tank registration were recorded. readings recorded daily for 1 full year. the readings were analyzed to find the percentage of water heating component. furthermore, the change with the months of the year is compared with the previous utilization of solar water heaters to reduce residential electrical load majid s. al-hafidh1, mudhafar a. al-nama2 and azher s. al-fahadi3 1department of electrical engineering, mosul university, mosul, iraq 2department of computer engineering and technology, al hadbaa university, mosul, iraq 3department of electrical engineering, mosul university, mosul, iraq a b s t r a c t residential electrical load in iraq can be divided into five components, lighting, home appliances, heating, cooling, and water heating. water heating component represents the largest residential electric load component in iraq. the current research aims to test the possibility of using solar water heaters to supply hot water to residential units. solar water heaters were added to a number of residential units in mosul city, with the addition of a small electrical heater to provide supplementary heating. the readings of the total energy consumed, the energy consumed in the supplementary heating, the amount of water consumed and the water temperature in the solar heated tank, were recorded each day for a full year. the results were analyzed and compared with the case without the addition of solar heaters. the addition of solar water heater with supplementary heating leads to the reduction of the total consumption up to one-fifth of the total energy (19.19%). 
index terms: iraqi residential electrical load, residential electrical load, solar energy, solar water heaters corresponding author’s e-mail: el_noor2000@yahoo.com received: 10-03-2017 accepted: 25-03-2017 published: 12-04-2017 access this article online doi: 10.21928/uhdjst.v1n1y2017.pp23-26 e-issn: 2521-4217 p-issn: 2521-4209 copyright © 2017 al-hafidh, et al. this is an open access article distributed under the creative commons attribution non-commercial no derivatives license 4.0 (cc by-nc-nd 4.0) o r i g i n a l re se a rc h a rt i c l e uhd journal of science and technology majid s. al-hafidh et al.: utilization of solar water heaters to reduce residential electrical load 24 uhd journal of science and technology | april 2017 | vol 1 | issue 1 study (without solar heated). the results, the analysis, and comparison are listed in the following paragraphs. 2. theoretical basis renewable energies represent suitable alternatives to solve the problems resulting from high energy consumption rates (especially electricity). renewable energies (wind energy, solar energy, hydropower, etc.) can be used to generate thermal energy, kinetic energy, electrical energy, etc. many researchers have been conducted to study the solar energy falling in different areas in iraq. the studies included which study all iraqi areas and gave illustrations of solar energy for different seasons of the year [4]. other studies al-salihi et al. and ali have been conducted to certain areas in iraq such as baghdad, mosul and kirkuk, ramady, etc. [5], [6]. solar energy has been used for water heating in many developing countries. mohammed, et al., 2011, study the possibility of using solar energy to heat water for the use of 25 people in baghdad using a solar panels collectors of 10 m2 capacity and a storage tank of 600 l capacity [7]. the study concluded the possibility of using solar energy to provide 69% of the hot water using solar heaters, by providing more than 60% in the winter, and more than 70% in the summer. it is well known that in summer no hot water is needed (june, july, august, and september). another study uses trnsys software to model and verify a direct solar water heating system in baghdad, iraq. the study aims to meet the demand of hot water for 25 persons using 10 m2 of a flat plate collector and 600 l storage tank [7]. 3. recording readings a group of houses (8 homes) was selected in the technical institute’s foundation. a solar water heater has been added for each residential unit. a low rating electrical heater of 1 kw is used to provide supplementary heating. fig. 1 shows one of the solar water heating systems used in the study. the system consists of two flat plate collectors each of the dimension 80 cm × 150 cm and storage tank of hot water with a capacity of 180 l. the flat plate collector and the storage tank capacity can be changed to match the consumers hot water demand. each solar water heater was equipped with a set of meters. the meters measure the energy consumed in the supplementary heaters, the amount of hot water consumed, and the water temperature in the storage tank. the supplementary heating energy and hot water consumed were recorded, once every 2 days. furthermore, an ammeter is used to measure the current drawn in the house units. the current and the water temperature in the storage tank readings have been registered at three different times at the morning, afternoon, and at night. as well as the total energy consumption in the residential units reading was recorded. 
total energy reading was recorded, once every 2 days. previous readings were recorded for the entire year. recorded readings were used in the analysis to get the results described in the following paragraphs. calculations can be performed based on the distribution of the foundations weekly or monthly. the calculations discussed in the results based on a semimonthly period, as well as monthly. 4. results and analysis electrical load in iraq is strongly influenced by weather climate changes where the high temperature in the summer leads to increase the electrical load as a result of using the cooling devices. furthermore, the low temperature in winter leads to increase electrical load as a result of using space heating and water heating whereas mild temperatures in spring and autumn lead to a reduction of electrical load, which represents the lowest throughout the year. in general, a large amount of solar energy falls on iraq, and especially in mosul city. it is clear that the amount of solar energy falling vary with seasons, where maximum energy fig. 1. one of the solar water heating systems used in the study majid s. al-hafidh et al.: utilization of solar water heaters to reduce residential electrical load uhd journal of science and technology | april 2017 | vol 1 | issue 1 25 falls in the summer. the minimum fallen energy is winter. the statistics show that the hours of solar brightness in iraq, during the winter is represented 50-60% of daylight hours. lowest rate happens to solar brightness hours in the month of january with 4.87 h. while the hours of solar brightness in summer represents 90% of the daylight hours, with a maximum brightness of the sun hours in the month of july 12.31 h. fig. 2 shows the average daily solar energy falling and the rate of solar brightness hours versus months of the year in the city of mosul. less solar energy occurs in the month of january and reaches 7.22 mj/m2-day. the maximum solar fallen in the month of june and reach 26.32 mj/m2-day (iraqi air adversity 1989). supplementary heating energy changes with temperature in the proportion of different seasons, as well as with the intensity of incoming solar radiation changes. therefore, the need for supplementary energy heating in winter becomes the highest in the whole year. fig. 3 illustrates the monthly supplementary rated heating in the year. it shows that there is no need for supplementary heating during summer. as well as it decreases in spring and autumn. the supplementary heating energy rate was calculated for a period of semimonthly to compare it with the energy consumed in the water heating (without the addition of solar heated). fig. 4 shows the energy consumed in the water heating with and without solar water heating. it is clear from the figure the great difference between the amount of supplementary heating energy used with solar water heater and the case without using it. also the times in which maximum benefits of adding solar heater is achieved. as well as there is no need to heat the water the majority of the summer. the maximum reduction ratio result during the spring and autumn. while reduction rates are less during winter than in the case of spring and autumn. table i summarizes the percentage of water heating component with the addition of solar water heaters and without them. the table includes the amount of the proportion of water heating component for the case of high consumption, the average consumption and the low fig. 2. 
average daily solar energy falling and the rate of solar brightness hours versus months of the year in the city of mosul fig. 3. supplementary heating rate of the months of the year fig. 4. energy consumed in the water heating with and without solar water heating table i percentage of the components of household electrical load and the amount of little and average total and high consumption case rate % consumption low % average % high % with solar heater 13.1 13.1 9.85 11.61 without solar heater 32.3 32.29 13.39 30.4 majid s. al-hafidh et al.: utilization of solar water heaters to reduce residential electrical load 26 uhd journal of science and technology | april 2017 | vol 1 | issue 1 consumption, in addition to the average consumer. evidenced by the average value of the added solar water heater leads to a reduction in the total consumption by 19.19%. 5. conclusion the current research shows a reduction in water heating component in residential units electrical load using solar water heaters. a solar water heater was added to a number of residential units in the city of mosul in northern iraq. a small rating heater was added to the solar water heaters to provide supplementary heating. the percentage of supplementary heating component compared with total electrical energy consumption in each housing units was calculated. as well as the percentage of supplementary heating component compared with total electrical energy consumption in all housing units. the results illustrate the possibility of obtaining a holistic reduced by 19.19%. the solar water heaters used have a standard specifications, while the hot water needs of the residential units vary as a result of differing ages and number of occupants (consumer), etc. which must leads to change some specifications of the solar water heaters, such as solar heater flat plane area and hot water storage volume. the solar water heaters specification can be studied to get a further reduction in the supplementary heating component. 6. future work the registered readings of the water consumed can be used to find a general model for hot water demand for different houses units. this general model can be used to design the suitable heating system to meet the hot water demand for any consumer. furthermore, the water temperature of the storage tank can be used to wake a general model for heat transfer for the heating system. this model can be used to improve the system efficiency. 7. thanks and appreciations the researchers express their deep gratitude and thanks to the administration and the engineers of the general management for north region electrical distribution for their valuable help and cooperation throughout this work and fruitful discussion and suggestions made among the different study stages. references [1] m. a. al-nama, m. s. al-hafid and a. s. al-fahadi. “estimation of the diversity factor for the iraqi distribution system using intelligent methods.” al-rafidain engineering, mosul, vol. 17, no.1, pp. 14-21, 2009. [2] m. a. al-nama, m. s. al-hafid and a. s. al-fahadi. “estimation of the consumer peak load for the iraqi distribution system using intelligent methods.” iraqi journal for electrical and electronic engineering, vol. 7, no. 2, pp. 180-184, 2011. [3] m. s. al-hafid, m. a. al-nama and a. s. al-fahadi. “determination of residential electrical load components in iraqi north region.” iraqi journal for electrical and electronic engineering, basra, iraq: sent for publication. [4] m. z. mohammed, m. a. al-nema and d. a. al-nema. 
“seasonal distribution of solar radiation in iraq.” proceeding of the conference on the physics of solar energy. arab development institute, special publication, tripoli, 1977. [5] a. m. al-salihi, m. m. kadum and a. j. mohammed. “estimation of global solar radiation on horizontal surface using routine meteorological measurements for different cities in iraq.” asian journal of scientific research, vol. 3, no. 4, pp. 240-248, 2010. [6] f. a. ali. “computation of solar radiation on horizontal surface over some iraqi cities.” engineering and technical journal, vol. 29, no.10, pp. 2026-2042, 2011. [7] m. n. mohammed, m. a. alghoul, k. abulqasem, a. mustafa, k. glaisa, p. ooshaksaraei, m. yahya, a. zaharim and k. sopian. “trnsys simulation of solar water heating system in iraq.” recent researches in geography, geology, energy, environment and biomedicine, pp. 153-156, jul. 2011. . uhd journal of science and technology | april 2017 | vol 1 | issue 1 17 1. introduction it is not easy to decide precisely when and how the first data converter was established. the most primitive documented binary analog-to-digital adaptor recognized is not electronic at all but hydraulic. to the best of our knowledge, the optimum historical review regarding the analog-to-digital adapters, in general, can be found in the study of kester et al. [1]. the analog domain is unceasing with both time and signal magnitude, while the digital domain is independent on both time and magnitude. a single binary value signifies a variety of analog values in the quantization band nearby its code center point. analog values that are not precisely at the code center point have an allied amount of quantization error [2]. it can be stated that sigma-delta [3] analog-to-digital adapter is a most common approach of over-sampling analog-to-digital adapter. the map processor of a sigma-delta analog-to-digital adapter is displayed in fig. 1 [4]. the sigma-delta analog-to-digital adapter can be divided into two lumps, the quantizing and the decimating parts. essentially, decimation is the act of decreasing the data rate down from the over-sampling rate without losing information. the quantizing part contains the analog integrator, the 1-bit analog-to-digital adapter, and the 1-bit digital-to-analog adapter [5]. the task of the quantizing part is to adapt the data in the analog input into digital shape. the input-output relationship of the sigma-delta quantizer is mathematical modeling of sampling, quantization, and coding in sigma delta converter using matlab azeez abdullah azeez barzinjy, haidar jalal ismail, and mudhaffer mustafa ameen department of physics, college of education, salahaddin university-erbil, zanko, erbil, iraq a b s t r a c t the received analog signal must be digitized before the digital signal processing can demodulate it. sampling, quantization, and coding are the separate stages for the analog-to-digital adaptation procedure. the procedure of adapting an unceasing time-domain signal into a separate time-domain signal is called sampling. while, the procedure of adapting a separatetime, continuous-valued signal into a discrete-time, discrete-valued signal is known as quantization. thus, quantization error is the mismatch between the unquantized sample and the quantized sample. the method of demonstrating the quantized samples in binary form is known as coding. 
this investigation utilized matlab® program to recommend a proper scheme for a wireless-call button network of input signal, normalized frequency, and over-sampling ratio against signalto-quantization noise ratio. two vital characteristics of this wireless network design are cost-effective and low-power utilization. this investigation, through reducing the in-band quantization error, also studied how oversampling can enhance the accomplishment of an analog-to-digital adapter. index terms: analog-to-digital adapter, coding, matlab, quantization error, wireless network corresponding author’s e-mail: azeez.azeez@su.edu.krd received: 10-03-2017 accepted: 25-03-2017 published: 12-04-2017 access this article online doi: 10.21928/uhdjst.v1n1y2017.pp17-22 e-issn: 2521-4217 p-issn: 2521-4209 copyright © 2017 barzinjy, et al. this is an open access article distributed under the creative commons attribution non-commercial no derivatives license 4.0 (cc by-nc-nd 4.0) o r i g i n a l re se a rc h a rt i c l e uhd journal of science and technology azeez abdullah azeez barzinjy et al.: mathematical modeling of sampling, quantization, and coding 18 uhd journal of science and technology | april 2017 | vol 1 | issue 1 non-linear; nevertheless, the capacity of frequency depression for the analog input, explicitly x(t), might be retrieved from the quantizer yield, namely, y[n] as shown in fig. 1. y[n] is a restricted order with sample values equivalent to −1 or +1. y[n] may just stay restricted if the output collector, w[n], is bounded similarly. as a result, the typical value of y[n] is needed to be equivalent to the mean value of the input x(t). accordingly, the authors have been capable of solving w[n] to obtain the constant of x(t) [6]. this investigation suggests a 1-bit analog-to-digital converter which can be utilized as an alternative of a more costly multibit analog-to-digital converter. this can be done through studying two divergent procedures that permit a 1-bit analogto-digital converter to attain the enactment of a multi-bit analog-to-digital converter. the authors will also investigate the superiority and drawbacks of both these procedures. relying on this exploration, one of the two procedures is selected for our data radios. 2. theory xa(t) is an analog signal ant it behaves similar to the input to an analog-to-digital adapter. equations 1 and 2 describe xa(t) and the average power in xa(t), correspondingly [7]. ( ) cos 2 tx a a t= (1) 2 2 0 1 [ ( )] 8 t x a a x t dt t = =∫σ (2) to prototype this analog signal, assume a b-bit analog-todigital adapter. conditionally, if the analog signal possesses peak-to-peak amplitude of a, and subsequently, the minimum potential step, δv, by means of b bits is given by equation 3: ( 2 1) ( 2 ) ∆ b b a a v = ≅ − (3) quantization noise, or quantization error, is a unique restricting parameter for the effective range of an analog-todigital adaptor [8]. this error is essentially the “round-off ” error that happens when an analog signal is quantized. a quantized signal may be different as of the analog-signal by just about ±(δv/2). supposing a quantization error is equivalently distributed ranging from −δv/2 to δv/2, then the root mean square of the quantization noise power, σe, is identified by equation 4 [9]. 
\sigma_e^2 = \frac{(\Delta v)^2}{12} = \frac{a^2}{12 \cdot 2^{2b}} \quad (4)

using equations 2 and 4, the signal-to-quantization noise ratio (sqnr) for our b-bit analog-to-digital converter might be assessed [10] as follows:

\text{sqnr} = \frac{\sigma_x^2}{\sigma_e^2} = \frac{3}{2} \cdot 2^{2b} \quad (5)

the sqnr in decibels is assumed through equation 6 [11].

\text{sqnr}_{db} = 10\log_{10}(\text{sqnr}) = 1.76 + 6.02\,b \quad (6)

equation 6 is an illustration of the sqnr of an analog-to-digital adaptor, which rises by almost 6 db per every single added bit. one can realize that the assessed signal through the analog-to-digital adaptor is at baseband. thus, a uniform spectrum in the frequency range from 0 to fs/2 is the characteristic of the root mean square quantization noise power, σe. the noise power per unit of bandwidth can be assumed using equation 7.

n_o = \frac{\sigma_e^2}{f_s/2} = \frac{a^2}{6 \cdot 2^{2b} f_s} \quad (7)

fig. 1. sigma-delta analog-to-digital converter

the nyquist frequency, which is termed after harry nyquist, is basically twice the input signal bandwidth fm, recalling that a baseband signal expands from 0 to fm [12]. the entire quantization noise power in the concerned band, or the in-band noise, is specified through equation 8.

n_o\,f_m = \frac{a^2}{6 \cdot 2^{2b} f_s}\,f_m = \frac{a^2}{6 \cdot 2^{2b}}\left(\frac{f_m}{f_s}\right) \quad (8)

equation 8 might be examined to classify which parameters disturb the in-band quantization error in an analog-to-digital adaptor. the amplitude of the signal, a, and fm, which is half the nyquist frequency, both rely on the signal, and the analog-to-digital adaptor has no dominance over these. nevertheless, the available analog-to-digital bits number, b, and the specimen frequency, fs, are organized through the analog-to-digital adaptor scheme [13]. the act of sampling the input signal at frequencies considerably higher than the nyquist frequency is called oversampling. by means of equation 8, a relation might be derived for the over-sampling ratio, m = fs/2fm, such that the two analog-to-digital adaptors (a b-bit adaptor over-sampled at fs and a β-bit adaptor sampled at the nyquist rate) offer an identical in-band error power [14]. after allowing fs = 2fm for the β-bit adaptor, equation 8 becomes:

\frac{a^2}{6 \cdot 2^{2b}}\left(\frac{f_m}{f_s}\right) = \frac{a^2}{12 \cdot 2^{2\beta}} \quad (9)

by means of equation 9, one can acquire equation 10.

\beta - b = \frac{1}{2}\log_2(m) \quad (10)

(β−b) means the additional resolution bits which are in consequence gained out of a b-bit adaptor by means of oversampling. the above equation also indicates that each doubling of the over-sampling ratio raises the actual bits at the nyquist frequency by 0.5 [15]. the mth-band low-pass filter used in the decimation stage has the transfer function given in equation 11.

h(z) = \frac{1 - z^{-m}}{1 - z^{-1}} \quad (11)

mitra [16] relates the enactment of a sigma-delta analog-to-digital adaptor to that of a linear over-sampling analog-to-digital adaptor. the enhancement in enactment achieved by means of a sigma-delta analog-to-digital adaptor is illustrated in equation 12 [17].

\text{enhancement}(m) = -5.1718 + 20\log_{10}(m) \quad (12)

the power spectral density, s_x(f), is the strength of the variations as a function of frequency. for an arbitrary time signal x_a(t), the power spectral density can be given by equation 13.

s_x(f) = \lim_{t \to \infty}\frac{1}{2t}\,\mathrm{e}\!\left[\left|\int_{-t}^{t} x_a(\tau)\,e^{-i2\pi f\tau}\,d\tau\right|^2\right] \quad (13)

power spectral density computation can be made straightforwardly through the fast fourier transform method.

3. results and discussions
a. power spectral density
a first-order integrator, which might be modeled as an accumulator, has been utilized by the straightforward sigma-delta quantizer [18].
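a minimal numerical sketch of such a first-order sigma-delta quantizer (an accumulator driving a 1-bit quantizer with feedback), followed by a psd estimate, is given below in python; the amplitude, signal frequency, and over-sampling ratio are illustrative assumptions and are not the exact settings of the authors' matlab® program.

```python
# sketch of a first-order sigma-delta modulator driven by a sinusoid,
# followed by a psd estimate; all parameter values are illustrative.
import numpy as np
from scipy.signal import welch

m = 50                       # over-sampling ratio (illustrative)
fm = 1.0e3                   # signal bandwidth in hz (illustrative)
fs = 2 * fm * m              # sampling rate = m times the nyquist rate
n = 2 ** 15
t = np.arange(n) / fs
x = 0.5 * np.cos(2 * np.pi * fm * t)   # input x_a(t) with amplitude a/2 = 0.5

w = 0.0                      # integrator (accumulator) state
y = np.zeros(n)              # 1-bit output, values -1 or +1
for i in range(n):
    w += x[i] - y[i - 1]     # integrate the difference between input and feedback
    y[i] = 1.0 if w >= 0 else -1.0

f, psd = welch(y, fs=fs, nperseg=4096)
inband = psd[f <= fm].sum() / psd.sum()
print(f"fraction of output signal+noise power below fm: {inband:.3f}")
```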
the matlab® program imitates the occupied first-order sigma-delta adaptor. the power spectral density, equation 13, of the stimulus signal can be illustrated in fig. 2. in addition, the stimulus signal has been over sampled at a level of 50 times nyquist. it can be notice in fig. 2 indicated that the normalized frequency is schemed against power spectral density possessing a range from 0 to 1 knowing that 1 indicating 50 nyquist frequency [19]. analogous tendency has been obtained by belitski et al. [20] which indicates that the proposed model is appropriate for sampling, quantization, and coding in sigma-delta converter. y[n] is the digital signal characterized by means of 1-bit. the power spectral density of y[n] is plotted in fig. 3 against the normalized frequency. fig. 3 shows the noise forming aptitudes of the sigmadelta analog-to-digital converter. as stated previously, a straightforward over-sampling analog-to-digital converter is capable to diffuse the overall quantization error power fig. 2. power spectral density of input signal after oversampling azeez abdullah azeez barzinjy et al.: mathematical modeling of sampling, quantization, and coding 20 uhd journal of science and technology | april 2017 | vol 1 | issue 1 above a longer band, thus reducing the in-band error power. alternatively, overney et al. [21] utilize josephson voltage standard to achieve logical description of higher level resolution analog-to-digital adaptor. their method might be used in many metrological applications for different analogto-digital adaptors with frequencies up to a few khz. in addition, posselt et al. [22] utilized a reconfigurable analogto-digital converter which was suggested with aptitudes of digitalizing completely related wireless facilities for vehicular usage with frequency ranging from of 600 mhz to 6 ghz. in addition, sigma-delta adaptors are normally capable to achieve error modeling just like that the error power is centralized in upper frequencies [23]. fig. 3 demonstrated that the bottom error is significantly sophisticated at upper frequencies and rather beneath the concern band. furthermore, the signal y[n] is the quantizer yield and is low pass clean by means of a mth band low-pass filter, where m = 50 is the over-sampling ratio. the transmission function, h(z), of the mth band low-pass filter is known from equation 10. the power spectral density of the clarified yield is shown in fig. 4. fig. 5 displays the power spectral density of the real analogto-digital converter production. it can be obvious that, at this point, the signal has been downsampled to the nyquist frequency once again. taking into consideration that for the sigma-delta adaptor, exhibited in the matlab® program, the utilized over-sampling ratio, m, was 50. it can be realized that, through equation 12, a sigma-delta adaptor might offer an enhancement of about 29 db once likened through a straight over-sampling adaptor that similarly works at 50 times nyquist frequency. b. sqnr the sqnr generated using a 1-bit sigma-delta adaptor can be linked through fig. 6 with the sqnr of a straight over-sampling analog-to-digital converter at numerous oversampling ratios. moreover, it can be noticed that from fig. 6 when the over-sampling ratio is 15, then the sqnr is just round 20 db. accordingly, the nyquist frequency for an analog signal modulator utilizing a binary phase shift keying and conveying 80 kbps of data will be 160 khz. on the other hand, oversampling in 15 would necessitate sampling at 2.4 mhz. 
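the figures quoted above can be reproduced directly from equations (6), (10), and (12); the short python sketch below evaluates the extra resolution bits and the sigma-delta improvement for two over-sampling ratios, chosen simply to mirror the ratios m = 15 and m = 50 discussed in the text.

```python
# numeric check of equations (6), (10) and (12); the ratios are illustrative.
import math

def sqnr_db(bits):
    # equation (6): sqnr of a b-bit converter sampled at the nyquist rate
    return 1.76 + 6.02 * bits

def extra_bits(m):
    # equation (10): resolution gained by over-sampling by a factor m
    return 0.5 * math.log2(m)

def sigma_delta_gain_db(m):
    # equation (12): improvement of a sigma-delta converter over straight over-sampling
    return -5.1718 + 20 * math.log10(m)

for m in (15, 50):
    print(f"m={m}: extra bits = {extra_bits(m):.2f}, "
          f"1-bit sqnr with over-sampling = {sqnr_db(1 + extra_bits(m)):.1f} db, "
          f"sigma-delta gain = {sigma_delta_gain_db(m):.1f} db")
```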
b. sqnr

fig. 6 compares the sqnr generated using a 1-bit sigma-delta converter with the sqnr of a straight over-sampling analog-to-digital converter at numerous over-sampling ratios. moreover, it can be noticed from fig. 6 that when the over-sampling ratio is 15, the sqnr is only around 20 db. accordingly, the nyquist frequency for an analog modulator utilizing binary phase shift keying and conveying 80 kbps of data will be 160 khz; over-sampling by 15, on the other hand, would necessitate sampling at 2.4 mhz.

fig. 6. over-sampling ratio versus signal-to-quantization noise ratio

otherwise stated, considering that present digital signal processing can treat samples at rates on the order of nearly 2.4 mhz, one would be able to realize a regular over-sampling analog-to-digital converter. in similar work, brooks et al. [24] stated that their analog-to-digital converter works at 20 mhz and attains a signal-to-noise ratio of about 90 db over a 1.25 mhz signal bandwidth. fig. 7 displays a plot of the extra accuracy bits achieved against the over-sampling ratio. fig. 7 indicates that, for an input with a 10 db signal-to-noise ratio, once the signal-to-quantization noise ratio of the analog-to-digital converter is 20 db, the digitized output possesses a signal-to-noise ratio of approximately 9.5 db. this shows that, with a sqnr of 20 db, the digitizing process raises the noise floor by only 0.5 db; the noise floor is raised by even less than 0.5 db if the corresponding input signal-to-noise ratio is less than 10 db. accordingly, the first and last goal behind this study was to direct the over-sampling analog-to-digital converter to possess a sqnr of 20 db or less. fujimori et al. [25], alternatively, stated that in their study no signal-to-noise ratio decay caused by digital switching error was observed, showing the strength of the error-cancellation methods in combination with the low master clock of 20 mhz.

fig. 7. analog-to-digital converter output signal-to-noise ratio (snr) (with input snr = 10 db)

4. conclusion

running the analog-to-digital converter above the input signal's nyquist frequency improves the performance of a low-accuracy analog-to-digital converter. this is the principle behind the operation of conventional over-sampling analog-to-digital converters. shaping the quantization noise added by the analog-to-digital conversion procedure further enhances performance. sigma-delta analog-to-digital converters apply both noise shaping and oversampling, and they offer significantly superior performance compared with plain over-sampling converters. nevertheless, it has been recommended that a straightforward sampling converter be utilized because of the complexity of a sigma-delta analog-to-digital converter. similarly, it has been observed that a straight over-sampling analog-to-digital converter can be utilized without any kind of signal degradation.

5. acknowledgment

the authors would like to extend their sincere acknowledgment to salahaddin university for supporting them with the available tools. anyone who needs the matlab codes may contact the corresponding author for additional help.

references

[1] w. a. kester. "analog devices," in data conversion handbook, burlington, ma: elsevier, 2005.
[2] k. fowler. "part 7: analog-to-digital conversion in real-time systems." ieee instrumentation and measurement magazine, vol. 6, no. 3, pp. 58-64, 2003.
[3] r. schreier and g. c. temes. understanding delta-sigma data converters, vol. 74. piscataway, nj: ieee press, 2005.
[4] j. m. de la rosa and r. río. cmos sigma-delta converters: practical design guide, hoboken, nj: wiley, 2013.
[5] m. pelgrom. analog-to-digital conversion, switzerland: springer international publishing, 2016.
[6] j. keyzer, j. hinrichs, a. metzger, m. iwamoto, i. galton and p. asbeck. "digital generation of rf signals for wireless communications with band-pass delta-sigma modulation." in microwave symposium digest, ieee mtt-s international, 2001.
[7] j. j. wikner and n. tan. "modeling of cmos digital-to-analog converters for telecommunication." ieee transactions on circuits and systems ii: analog and digital signal processing, vol. 46, no. 5, pp. 489-499, 1999.
[8] j. kim, t. k. jang and y. g. yoon. "analysis and design of voltage-controlled oscillator based analog-to-digital converter." ieee transactions on circuits and systems i: regular papers, vol. 57, no. 1, pp. 18-30, 2010.
[9] b. s. song. microcmos design, hoboken, nj: taylor & francis, 2011.
[10] j. g. proakis and d. g. manolakis. digital signal processing: principles, algorithms, and applications, new jersey, usa: prentice hall, 1996.
[11] g. j. foschini and m. j. gans. "on limits of wireless communications in a fading environment when using multiple antennas." wireless personal communications, vol. 6, no. 3, pp. 311-335, 1998.
[12] k. sudakshina. analog and digital communications, singapore: pearson education india, 2010.
[13] j. s. chitode. principles of communication, pune: technical publications, 2009.
[14] d. r. morgan, z. ma, j. s. kenney, j. kim and c. r. giardina. "a generalized memory polynomial model for digital predistortion of rf power amplifiers." ieee transactions on signal processing, vol. 54, no. 10, pp. 3852-3860, 2006.
[15] b. le, t. w. rondeau, j. h. reed and c. w. bostian. "analog-to-digital converters." ieee signal processing magazine, vol. 22, no. 6, pp. 69-77, 2005.
[16] s. k. mitra. digital signal processing: a computer-based approach, new york: mcgraw-hill, 2011.
[17] v. mladenov, p. karampelas, g. tsenov and v. vita. "approximation formula for easy calculation of signal-to-noise ratio of sigma-delta modulators." isrn signal processing, vol. 2011, article id: 731989, pp. 7, 2011.
[18] k. francken and g. g. gielen. "a high-level simulation and synthesis environment for /spl delta//spl sigma/ modulators." ieee transactions on computer-aided design of integrated circuits and systems, vol. 22, no. 8, pp. 1049-1061, 2003.
[19] b. e. bammes, r. h. rochat, j. jakana, d. h. chen and w. chiu. "direct electron detection yields cryo-em reconstructions at resolutions beyond 3/4 nyquist frequency." journal of structural biology, vol. 177, no. 3, pp. 589-601, 2012.
[20] a. belitski, a. gretton, c. magri, y. murayama, m. a. montemurro, n. k. logothetis and s. panzeri. "low-frequency local field potentials and spikes in primary visual cortex convey independent visual information." journal of neuroscience, vol. 28, no. 22, pp. 5696-5709, 2008.
[21] f. overney, a. rufenacht, j. p. braun and b. jeanneret. "josephson-based test bench for ac characterization of analog-to-digital converters." in precision electromagnetic measurements (cpem), 2010 conference on, ieee, 2010.
[22] a. posselt, d. berges, o. klemp and b. geck. "design and evaluation of frequency-agile multi-standard direct rf digitizing receivers for automotive use." in vehicular technology conference (vtc spring), ieee 81st, ieee, 2015.
[23] p. m. aziz, h. v. sorensen and j. van der spiegel. “an overview of sigma-delta converters.” ieee signal processing magazine, vol. 13, no. 1, pp. 61-84, 1996. [24] t. l. brooks, d. h. robertson, d. f. kelly, a. del muro and s. w. harston. “a cascaded sigma-delta pipeline a/d converter with 1.25 mhz signal bandwidth and 89 db snr.” ieee journal of solid-state circuits, vol. 32, no. 12, pp. 1896-1906, 1997. [25] i. fujimori, l. longo, and a. hirapethian. “a 90-db snr 2.5-mhz output-rate adc using cascaded multibit delta-sigma modulation at 8/spl times/oversampling ratio.” ieee journal of solid-state circuits, vol. 35. no. 12. pp. 1820-1828, 2000. . uhd journal of science and technology | april 2017 | vol 1 | issue 1 11 1. introduction the power flow analysis is one of the important and extensively used studies in electrical power system engineering. it is considered a fundamental tool for many other power system studies such as stability, reliability, fault, and contingency study. the main objective of a power flow study is to find the bus voltages and the power flow in the transmission system for a particular loading condition. the steady-state performance of an electrical power system is described by a system of non-linear algebraic equations. these equations represent the active and reactive power balance. the inherent difficulty of the power flow problem is the task of obtaining analytical solutions to the power flow equations. an extensive research has been carried out since the latter half of the twentieth century [1], [2] to solve this problem. the solution of power flow problem has been based on numerical technique methods such as gauss-seidel [3], newton-raphson method [4]-[14], and fast-decoupled method [15]-[18]. although some of these methods are widely used in power utilities, they are sensitive to the starting (guess) values. in some cases, especially in heavily loaded conditions, they fail to converge. it was found that the factors affecting the convergence of the previous methods are the r/x ratio of the transmission systems and the singularity of the jacobian matrix for a heavily loaded system. different attempts have been done to improve the reliability of these methods [19], [20]. artificial intelligence techniques had been applied to power flow study [21]-[23]. recently, the fields of swarm intelligence have attracted many researches as a branch of artificial intelligence that deals with the collective behavior of swarms such as flocks of bird, colonies of aunts, schools of fish, and swarm of bees [24], [25]. the important features of swarm intelligence are selforganization, scalability, adaptation, and speed. the swarm intelligence techniques have been applied in many power system studies [26]-[28]. in this paper, the load flow problem is approached as an optimization problem using application of artificial bee colony algorithm in power flow studies kassim al-anbarri and husham moaied naief department of electrical engineering, faculty of engineering, mustansiriyah university, bab al-muadham campus, 46049 baghdad, iraq a b s t r a c t artificial bee colony (abc) algorithm is one of the important artificial techniques in solving general-purpose optimization problems. this paper presents the application of abc in computing the power flow solution of an electric power system. the objective function to be minimized is the active and reactive power mismatch at each bus. the proposed algorithm has been applied on typical power systems. 
the results obtained are compared with those obtained by the conventional method. the results reveal that the abc algorithm is very effective for solving the power flow problem in the maximum loadability region.

index terms: artificial bee colony, maximum loadability, power flow, swarm artificial technique
corresponding author's e-mail: alanbarri@yahoo.com
received: 10-03-2017 accepted: 25-03-2017 published: 12-04-2017
access this article online doi: 10.21928/uhdjst.v1n1y2017.pp11-16 e-issn: 2521-4217 p-issn: 2521-4209
copyright © 2017 al-anbarri and naief. this is an open access article distributed under the creative commons attribution non-commercial no derivatives license 4.0 (cc by-nc-nd 4.0)
original research article
uhd journal of science and technology

swarm intelligence. the objective function is to minimize the power mismatch. this paper is organized as follows: section 2 reviews the newton-raphson (nr) technique for solving the load flow problem. the basic model of the artificial bee colony (abc) algorithm is presented in section 3. section 4 discusses the results obtained by applying the proposed algorithm to a typical system. finally, section 5 presents the conclusion.

2. power flow formulation

for an n-bus electrical power system, the bus power si can be expressed by the following equation:

$s_i=s_{gi}-s_{di}=\left(p_{gi}-p_{di}\right)+j\left(q_{gi}-q_{di}\right)$  (1)

where pgi is the active power generation at bus i, pdi is the active power demand at bus i, qgi is the reactive power generation at bus i, and qdi is the reactive power demand at bus i. the current balance equation at bus i is:

$s_i^{*}=v_i^{*}\,i_i=v_i^{*}\sum_{k=1}^{n}y_{ik}\,v_k$  (2)

where yik is the (i, k)th element of the bus admittance matrix and vk is the bus voltage at bus k. by substituting (1) into (2) and resolving the resulting equation into its real and imaginary parts, the following two real equations are obtained:

$p_i=\sum_{k=1}^{n}\left|v_i\right|\left|v_k\right|\left|y_{ik}\right|\cos\left(\gamma_{ik}+\delta_k-\delta_i\right)$  (3)

$q_i=-\sum_{k=1}^{n}\left|v_i\right|\left|v_k\right|\left|y_{ik}\right|\sin\left(\gamma_{ik}+\delta_k-\delta_i\right)$  (4)

for an n-bus power system, there are 2n real non-linear algebraic equations of the form of (3) and (4). these equations are non-linear functions of the state variables (|v|, δ). the conventional way to solve these equations is to use a numerical technique, and the most widely used method is the nr method. this method is based on expanding the above equations by a taylor series. the compact linearized form of the above equations is as follows:

$\begin{bmatrix}\Delta p_i\\\Delta q_i\end{bmatrix}=\begin{bmatrix}j_{p\delta}&j_{p|v|}\\j_{q\delta}&j_{q|v|}\end{bmatrix}\begin{bmatrix}\Delta\delta_i\\\Delta\left|v_i\right|\end{bmatrix}$  (5)

where the left-hand side of (5) is the vector of power mismatches, which can be calculated as:

$\begin{bmatrix}\Delta p_i\\\Delta q_i\end{bmatrix}=\begin{bmatrix}p_i^{sp}-p_i^{cal}\\q_i^{sp}-q_i^{cal}\end{bmatrix}$  (6)

the traditional algorithm for obtaining the power flow solution is as follows:
1. assume guess values for the state variables (flat start: |v| = 1.0 pu; δ = 0)
2. evaluate the vector of power mismatches and the elements of the jacobian matrix
3. calculate the vector of state-variable corrections
4. update the state variables at the end of the iteration
5. check the absolute values of the elements of the vector of power mismatches; if they are less than a specified tolerance, calculate the line flow in each transmission line. otherwise, go to step 2.

the previous algorithm works reliably under ordinary loading conditions. unfortunately, it is found in some cases (e.g., heavily loaded conditions and high r/x ratio systems) that the above algorithm fails to converge. this is because of the singularity of the jacobian matrix. for this purpose, a swarm intelligence technique is presented to avoid the singularity of the jacobian matrix.
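as a concrete illustration of equations 3, 4, and 6, a short hedged python sketch of the mismatch evaluation is given below; the admittance matrix ybus, the scheduled injections, and the per-unit conventions are assumed inputs for illustration rather than data taken from the paper.

```python
import numpy as np

def power_mismatch(ybus, v, delta, p_sched, q_sched):
    """return (delta_p, delta_q) of equation 6 for every bus.

    ybus    : complex n-by-n bus admittance matrix
    v       : bus voltage magnitudes |v_i|
    delta   : bus voltage angles (radians)
    p_sched : specified active injections p_i^sp
    q_sched : specified reactive injections q_i^sp
    """
    n = len(v)
    ymag, gamma = np.abs(ybus), np.angle(ybus)
    p_calc = np.zeros(n)
    q_calc = np.zeros(n)
    for i in range(n):
        for k in range(n):
            ang = gamma[i, k] + delta[k] - delta[i]
            p_calc[i] += v[i] * v[k] * ymag[i, k] * np.cos(ang)   # equation 3
            q_calc[i] -= v[i] * v[k] * ymag[i, k] * np.sin(ang)   # equation 4
    return p_sched - p_calc, q_sched - q_calc
```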
3. power flow algorithm using the abc method

the foraging behavior, learning, and memorizing characteristics of honey bees have attracted many researchers in the area of swarm intelligence. the pioneering work of karaboga [24], which describes an abc algorithm based on the behavior of honey bees, is the first model of this kind. one of the main features of the abc algorithm is its ability to conduct both a global search and a local search in each iteration. according to the abc algorithm, there are three categories of artificial bees in the colony: employed bees, onlooker bees, and scout bees. the bee colony is divided into two halves; the first half of the colony includes the employed bees, and the second half includes the onlookers. the onlooker bees are those waiting in the dance area of the hive, acting as decision-makers for choosing a suitable food source. the employed bees are those collecting the nectar from the food sources, while the scout bees are those searching for new food sources. the searching cycle in the abc algorithm consists of the following steps [29]:
• at the initialization step, the bees select a set of food source positions randomly. after determining the nectar amount, the bees return to the hive to share the information with those waiting in the dance area.
• at the second step, the employed bees use the gained information to choose new food sources in the neighborhood of the old positions they have visited previously.
• at the third stage, the onlooker bees choose a particular area of food sources depending on the information given by the employed bees in the dance area.

to utilize the abc algorithm, some control parameters should be set [30]; they are the number of variables, the lower bound of the variables (lb), the upper bound of the variables (ub), the population size (colony size) (npop), the number of onlooker bees (nonlooker), the maximum number of iterations (maxit) (the stopping criterion), the abandonment limit parameter (limit), and the acceleration coefficient upper bound (a).

a. steps of abc implementation

the steps of abc can be outlined as follows [31]:
1. generate a randomly distributed initial population of solutions (food source positions).
2. evaluate the population, which represents the nectar quantity. the population of positions (solutions) is subjected to iterated cycles, c = 1, 2, …, cmax, of the search processes of the employed bees, the onlooker bees, and the scout bees. based on a probabilistic approach, an artificial employed or onlooker bee makes a change to the position (solution) in her memory to find a new food source and tests the nectar amount (fitness value) of the new source (new solution).
3. apply roulette wheel selection (choose the best-fit individuals).
4. calculate the probability rate (pi) related to the solutions:

$p_i=\frac{fit_i}{\sum_{i=1}^{npop}fit_i}$  (7)

the fitness values (fit) are computed by the following expression:

$fit_i=\begin{cases}\dfrac{1}{1+f_i}&\text{if }f_i\geq 0\\[4pt]1+\left|f_i\right|&\text{if }f_i<0\end{cases}$  (8)

usually, the value of pi is between {0, 1}.
5. find the new solutions for the onlookers depending on the probability pi related to the solutions.
6. reapply roulette wheel selection.
7. find the abandoned solution, if it exists, and replace it with a new randomly generated solution.
8. register the best solution achieved so far.
9. set c = c + 1 (until the maximum cycle number is reached).
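the cycle above can be summarized in a compact, hedged python sketch; the fitness transformation follows equation 8 and the selection probability equation 7, while the population size, abandonment limit, and neighbour-update rule are illustrative assumptions rather than the authors' exact settings.

```python
import numpy as np

rng = np.random.default_rng(0)

def abc_minimize(f, lb, ub, n_pop=50, max_it=200, limit=20):
    """minimize f over box bounds [lb, ub] with a simple abc search cycle."""
    dim = len(lb)
    foods = rng.uniform(lb, ub, size=(n_pop, dim))      # initial food sources
    cost = np.array([f(s) for s in foods])
    trials = np.zeros(n_pop)

    def fitness(c):                                     # equation 8
        return 1.0 / (1.0 + c) if c >= 0 else 1.0 + abs(c)

    for _ in range(max_it):
        for phase in ("employed", "onlooker"):
            fit = np.array([fitness(c) for c in cost])
            prob = fit / fit.sum()                      # equation 7
            for i in range(n_pop):
                if phase == "onlooker" and rng.random() > prob[i]:
                    continue                            # roulette-wheel style selection
                k = rng.integers(n_pop)                 # random neighbour source
                j = rng.integers(dim)                   # random dimension to perturb
                cand = foods[i].copy()
                cand[j] += rng.uniform(-1, 1) * (foods[i, j] - foods[k, j])
                cand = np.clip(cand, lb, ub)
                c = f(cand)
                if c < cost[i]:                         # greedy replacement
                    foods[i], cost[i], trials[i] = cand, c, 0
                else:
                    trials[i] += 1
        # scout phase (step 7): abandon sources that stopped improving
        worn = trials > limit
        foods[worn] = rng.uniform(lb, ub, size=(worn.sum(), dim))
        cost[worn] = [f(s) for s in foods[worn]]
        trials[worn] = 0
    best = np.argmin(cost)                              # step 8: best solution so far
    return foods[best], cost[best]
```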
b. abc implementation for the power flow study

abc optimization is applied to obtain the bus voltage magnitudes (|vi|) and voltage phase angles (δi) by minimizing the following objective function:

$\min f\left(\delta,\left|v\right|\right)$  (9)

where δ = (δ1, …, δn) and |v| = (|v1|, …, |vn|). this objective function is constrained by the lower and upper bounds lb and ub:

lb < |v| < ub
lb < δ < ub

the optimization process starts by setting the number of solutions (food sources) in the abc algorithm, which represents the number of flowers; the bees will reach the food sources and then compute the nectar quantity. the food sources are initialized using a random number generator. the voltage magnitudes and voltage phase angles are limited to the following ranges:

0.5 < |vi| < 1.05
−5 < δi < 5

the objective function (f) designed for the load flow problem using the abc algorithm is as follows:

$f=\sum_i\Delta p_i^{2}+\sum_i\Delta q_i^{2}$  (10)

where i = 1, 2, 3, …, number of buses. in the abc algorithm, the objective function (fitness value) describes the quality of a food source (solution). the food source that has the best quality is registered in memory as the best food source (solution) found so far. the neighborhood search process used to obtain the best fitness value is continued by the employed bees and onlookers. the fitness value is computed for each new solution (food source), and the new solution (food source) with the best fitness value becomes the new reference in memory. the optimization process continues to look for food sources near the hive, depending on the probability computed previously from the fitness values. the new solution (food source) found after the neighborhood search is registered if its fitness is better. the optimization process continues until the best fitness value is reached or the maximum cycle number is reached, after which the solution converges and the power mismatch is close to zero.
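as a rough sketch of how equation 10 and the bounds above could be wired to the two earlier illustrative helpers (power_mismatch and abc_minimize, both hypothetical names introduced in this rewrite), one might write the following; slack-bus and pv-bus handling, and the actual 6-bus data, are deliberately left as assumptions.

```python
import numpy as np

n_bus = 6   # the 6-bus test system described below

def objective(x):
    """x packs the voltage angles and magnitudes; returns f = sum(dp^2) + sum(dq^2) (equation 10).
    ybus, p_sched, and q_sched are assumed to be defined for the test system; slack/pv
    buses would in practice be pinned rather than left free as they are here."""
    delta = x[:n_bus]
    v = x[n_bus:]
    dp, dq = power_mismatch(ybus, v, delta, p_sched, q_sched)
    return np.sum(dp ** 2) + np.sum(dq ** 2)

# bounds quoted in the text: -5 < delta_i < 5 and 0.5 < |v_i| < 1.05
lb = np.concatenate([np.full(n_bus, -5.0), np.full(n_bus, 0.5)])
ub = np.concatenate([np.full(n_bus, 5.0), np.full(n_bus, 1.05)])
best_x, best_f = abc_minimize(objective, lb, ub)
```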
the flow chart of the abc approach for the load flow computation is shown in fig. 1.

fig. 1. flow chart of application of artificial bee colony technique in power flow study

4. results and discussion

the abc algorithm is applied to the 6-bus system as follows:

a. six bus test power system with normal load

the abc method is applied to the 6-bus system with a particular normal loading condition [32]. the test system, shown in fig. 2, consists of three generating stations and three load stations. after initializing the control parameters of the abc algorithm, each variable was initialized with a random number using a random number generator. the elements of the power mismatch vectors δp and δq are computed using equation 6. the best food source (load flow solution) is selected by applying roulette wheel selection. a comparison between the results obtained from the conventional nr method and those found from the abc algorithm is given in table i.

fig. 2. six bus test power system

table i: the bus voltages using the nr method and the abc technique
bus no.    nr method (vi, δi)       abc algorithm (vi, δi)
1          1.05, 0                  1.05, 0.0
2          1.05, −3.635             1.05, −3.635016
3          1.07, −4.117             1.07, −4.117675
4          0.989, −4.18             0.989013, −4.180701
5          0.9813, −5.306           0.981364, −5.306448
6          1.004, −5.856            1.004079, −5.856946
abc: artificial bee colony, nr: newton-raphson

as shown in table i, the results obtained are identical. the final objective function value in the abc optimization process after 15000 iterations is 1.5902×10−11. the performance of the abc algorithm in obtaining the best solution is clarified by the graph shown in fig. 3.

fig. 3. the performance of artificial bee colony algorithm, best solution versus iteration

b. six bus test power system with heavy load

to simulate the heavy load condition, the loads at buses 4, 5, and 6 are increased as shown in appendix a. it is found that the conventional nr method fails to converge. however, when the abc algorithm is applied, the solution converges as shown in table ii. the final objective function value in the abc optimization process after 5000 iterations is 8.99 × 10−4. the performance of the abc method in obtaining the best solution is clarified by the graph shown in fig. 4.

table ii: the bus voltages for the heavily loaded case using the abc technique where the nr method fails to solve
bus no.    vi        δi
1          1.0500    0
2          1.0500    −52.3630
3          1.0700    −63.0819
4          0.7575    −46.0854
5          0.6721    −58.3285
6          0.8341    −68.0498
abc: artificial bee colony, nr: newton-raphson

fig. 4. the performance of artificial bee colony algorithm, best solution versus iteration

5. conclusion

a meta-heuristic approach to solve the power flow problem has been presented. the proposed algorithm is based on the abc technique, which is considered one type of swarm intelligence technique. the proposed algorithm is applied to the six-bus system with different loading conditions, and the results obtained have been compared with the results of the nr method. the main advantages of the abc algorithm are its flexibility of modeling, accuracy, strong convergence, and reliability, which make it a reasonable and acceptable optimization process. in addition, the presented algorithm shows promising results for heavily loaded systems.

6. acknowledgment

the author would like to thank the department of electrical engineering, faculty of engineering, mustansiriyah university (www.uomustansiriyah.edu.iq), baghdad, iraq, for its support in the present work.

references

[1] m. a. laughton and m. w. h. davies, "numerical techniques in solution of power system load flow problems." proceedings of the institution of electrical engineers, vol. 111, no. 9, pp. 1575-1588, sept. 1964.
[2] l. l. freris and a. m. sasson, "investigation of the load-flow problem." proceedings of the institution of electrical engineers, vol. 115, no. 10, pp. 1459-1465, oct. 1968.
[3] s. moorthy, m. al-dabbagh and m. vawser, "improved phase-coordinate gauss-seidel load flow algorithm." electric power system research, vol. 34, pp. 91-95, 1995.
[4] w. f. tinney and c. e. hart, "power flow solution by newton's method." ieee transactions on power apparatus and systems, vol. pas-86, pp. 1449-1460, nov. 1967.
[5] b.
stott, “effective starting process for newton-raphson load flows.” proceedings of the institution of electrical engineers, kassim al-anbarri and husham moaied naief: application of artificial bee colony algorithm in power flow studies 16 uhd journal of science and technology | april 2017 | vol 1 | issue 1 vol. 118, no. 8. pp. 983-987, aug. 1971. [6] r. g. wasley and m. a. shlash, “newton-raphson algorithm for 3-phase load flow.” proceedings of the institution of electrical engineers, vol. 121, no. 7, pp. 630-638, jul. 1974. [7] m. e. el-hawary and o. k. wellon, “the alpha-modified quasisecond order newton-raphson method for load flow solutions in rectangular form.” proceedings of the institution of electrical engineers, vol. pas-101, pp. 854-866, 1982. [8] a. j. flueck and h. d. chiang, “solving the nonlinear power flow equations with an inexact newton method using gmres.” ieee transactions on power systems, vol. pwrs-13, no. 2. pp. 267273, may. 1998. [9] m. d. schaffer and d. j. tylavsky, “a non-diverging polar form newton based power flow.” ieee transactions on industry applications, vol. 24, no. 5, pp. 870-877, 1998. [10] v. m. da costa, n. martins and j. l. r. pereira, “developments in the newton-raphson power flow formulation based on current injections.” ieee transactions on power systems, vol. 14, no. 4, pp. 1449-1456, nov. 1999. [11] y. xiao and y. h. song, “power flow studies of a large practical power network with embedded facts devices using improved optimal multiplier newton-raphson method.” european transaction on electrical power, vol. 11, no. 4, pp. 247-256, jul. aug. 2001. [12] m. irving, “pseudo-load flow as a starting process for newton raphson algorithm.” international journal of electrical power and energy systems, vol. 32, pp. 835-839, 2010. [13] t. kulworawanichpong, “simplified newton-raphson power flow solution method.” international journal of electrical power and energy systems, vol. 32, pp. 551-558, 2010. [14] s. m. r. slochanal and k. r. mohanram, “a novel approach to large scale system load flows newton-raphson method using hybrid bus.” electric power systems research, vol. 41, pp. 219223, 1997. [15] b. stott, “decoupled newton load flow.” ieee transactions on power apparatus and systems, vol. pas-91, pp. 1955-1959, 1972. [16] b. stott and o. alsac, “fast decoupled load flow.” ieee transactions on power apparatus and systems, vol. pas-93, no. 3. pp. 859869, may. jun. 1974. [17] k. behnam-guilani, “fast decoupled load flow: the hybrid model,” ieee transactions on power systems, vol. pwrs-3, no. 2, pp. 734-742, may. 1988. [18] a. v. garcia and m. g. zago, “three-phase fast decoupled power flow for distribution networks.” iee proceedings-generation, transmission and distribution, vol. 143, no. 2, pp. 188-192, mar. 1997. [19] s. c. tripathy, g. d. prasad, o. p. malik and g. s. hope, “load flow solutions for ill-conditioned power systems by a newton-like method,” ieee transactions on power apparatus and systems, vol. pas-101, pp. 3648-3657, oct. 1982. [20] d. rajicic and a. bose, “a modification to the fast decoupled power flow for networks with high r/x ratios.” ieee transactions on power systems, vol. pwrs-3, no. 2, pp. 743-746, may. 1988. [21] x. yin, “application of genetic algorithms to multiple load flow solution problem in electrical power systems.” proceedings of the 32nd conference on decision and control, texas, pp. 3734-3738, 1993. [22] w. l. chan, a. t. p. so and l. l. 
lai, “initial applications of complex artificial neural networks to load-flow analysis.” iee proceedingsgeneration, transmission and distribution, vol. 147, no. 6, pp. 361-366, nov. 2000. [23] p. acharjee and s. k. goswami, “simple but reliable two-stage ga based load flow,” electric power components and systems, vol. 36, pp. 47-62, 2008. [24] d. karaboga, “an idea based on honey bee swarm for numerical optimization,” technical report-tr06, erciyes university, kayseri, turkey, 2005. [25] b. akay and d. karaboga, “a modified artificial bee colony algorithm for real-parameter optimization” information science, vol. 1, pp. 120-142, jun. 2010. [26] s. j. huang, x. z. liu, w. f. su, and t. c. ou, “application of enhanced honey-bee mating optimization algorithm to fault section estimation in power systems.” ieee transaction on power delivery, vol. 28, no. 3, pp. 1944-1951, jul. 2013. [27] m. afzalan and m. a. taghikhani, “placement and sizing dg using pso and hbmo algorithms in radial distribution networks” international journal of intelligent system and applications, vol. 4, no. 10, pp. 43-49, sep. 2012. [28] n. t. linh and d. x. dong, “optimal location and size of distributer generation in distribution system by artificial bees colony algorithm” international journal of information and electronics engineering, vol. 3, no. 1, pp. 63-67, jan. 2013. [29] d. karaboga and b. basturk, “a powerful and efficient algorithm for numerical function optimization: artificial bee colony(abc) algorithm.” journal of global optimization, vol. 39, pp. 459-471, 2007. [30] m. s. kiran and m. gündüz, “the analysis of peculiar control parameters of artificial bee colony algorithm on the numerical optimization problems.” journal of computer and communications, vol. 2, pp. 127-136, mar. 2014. [31] k. al-anbarri, a. h. miri, and s. a. hussain, “load frequency control of multi-area hybrid power system by artificial intelligence techniques.” international journal of computer applications, vol. 138, no. 7, mar. 2016. [32] a. j. wood and b. f. wollenberg, power generation, operation and control, 2nd ed, new york, ny: john wiley & sons ltd., 1996. appendix a buses data of 6-bus case with heavily loaded condition bus no. |v|pu δ (°) pg (mw) qg (mvar) pd (mw) qd (mvar) 1 1.05 0 0 0 0 0 2 1.05 0 80 0 0 0 3 1.07 0 90 0 0 0 4 1.0 0 0 0 255 130 5 1.0 0 0 0 255 130 6 1.0 0 0 0 255 152 . uhd journal of science and technology | jul 2019 | vol 3 | issue 2 33 1. introduction today, addiction to narcotics is one of the main health challenges of the world resulting in serious threats for social, economic, and cultural structures destroying a part of the active force of the society. on the other hand, it is one of the main factors of growth of diseases such as hiv and hepatitis. according to the social analyzers, addiction to narcotics is one of the complicated problems of the current age resulting in many social damages and violations. in other words, the relationship of addiction with social issues is two-sided; on the one hand, addiction results in recession and degeneration of the society. on the other hand, it is a phenomenon originating from social, economic, and cultural issues [1]. today, addiction is known as a disease and there are some centers for its treatment which have complete information about addicts. therefore, despite the large volume of data, data mining can be used to explore knowledge in data, and its results can be used as the knowledge database of the decision support system to prevent and treat addiction. 
data mining tools analyze data and explore data pattern which can be used in applications to determine the strategy for business, employing data mining techniques for predicting opioid withdrawal in applicants of health centers raheleh hamedanizad1, elham bahmani2, mojtaba jamshidi3*, aso mohammad darwesh3 1department of computer engineering, kermanshah branch, islamic azad university, kermanshah, iran, 2department of computer engineering, malayer branch, islamic azad university, malayer, iran, 3department of information technology, university of human development, sulaymaniyah, iraq a b s t r a c t addiction to narcotics is one of the greatest health challenges in today’s world which has become a serious threat for social, economic, and cultural structures and has ruined a part of an active force of the society and it is one of the main factors of growth of diseases such as hiv and hepatitis. today, addiction is known as a disease and welfare organization, and many of the dependent centers try to help the addicts treat this disease. in this study, using data mining algorithms and based on data collected from opioid withdrawal applicants referring to welfare organization, a prediction model is proposed to predict the success of opioid withdrawal applicants. in this study, the statistical population is comprised opioid withdrawal applicants in a welfare organization. this statistical population includes 26 features of 793 instances including men and women. the proposed model is a combination of meta-learning algorithms (decorate and bagging) and j48 decision tree implemented in weka data mining software. the efficiency of the proposed model is evaluated in terms of precision, recall, kappa, and root mean squared error and the results are compared with algorithms such as multilayer perceptron neural network, naive bayes, and random forest. the results of various experiments showed that the precision of the proposed model is 71.3% which is superior over the other compared algorithms. index terms: addiction, data mining, decision tree, meta-learning algorithm corresponding author’s e-mail: mojtaba jamshidi, department of information technology, university of human development, sulaymaniyah, iraq, e-mail: jamshidi.mojtaba@gmail.com received: 04-08-2019 accepted: 14-08-2019 published: 20-08-2019 o r i g i n a l re se a rc h a rt i c l e uhd journal of science and technology access this article online doi: 10.21928/uhdjst.v3n2y2019.pp33-40 e-issn: 2521-4217 p-issn: 2521-4209 copyright © 2019 ghareb and mohammed. this is an open access article distributed under the creative commons attribution noncommercial no derivatives license 4.0 (cc by-nc-nd 4.0) raheleh hamedanizad, et al.: predicting opioid withdrawal using data mining techniques 34 uhd journal of science and technology | jul 2019 | vol 3 | issue 2 knowledge database, medical, and scientific studies. the gap between data and information has necessitated data mining tools to convert useless data into useful knowledge [2-4]. in this study, data mining techniques such as neural network, bayesian network, and decision tree are used to present prediction models to predict the success of opioid withdrawal applicants referring to a welfare organization. in this paper, also a hybrid prediction model comprised j48, decorate, and bagging algorithms are proposed for predicting the success of opioid withdrawal applicants to the welfare organizations. 
the statistical population of this study is comprised opioid withdrawal applicants referring to a welfare organization. the rest of this paper is organized as follows: section 2 reviews related works and in section 3, we have introduced the dataset. section 4 introduces the proposed model to predict the success of opioid withdrawal applicants, while section 5 presents the simulation results. the paper is concluded in section 6. 2. related work in lu et al. [5] authors obtained data from reddit, an online collection of forums, to gather insight into drug use/misuse using text snippets from user’s narratives. they used users’ posts to trained a binary classifier which predicts a user’s transitions from casual drug discussion forums to drug recovery forums. they also presented a cox regression model that outputs likelihoods of such transitions. furthermore, they founded that utterances of select drugs and certain linguistic features contained in one’s posts can help predict these transitions. in fan et al. [6], a novel framework named autodoa is proposed to automatically detect the opioid addicts from twitter. the authors first introduced a structured heterogeneous information network (hin) to model the users and posted tweets as well as their rich relationships. then, a meta-path based mechanism is used to formulate similarity measures over users, and different similarities are aggregated using laplacian scores. finally, based on hin and the combined meta-path, a classification model is proposed for automatic opioid addict detection to reduce the cost of acquiring labeled examples for supervised learning. in zhang et al. [7], an intelligent system named iopu has been developed to automate the detection of opioid users from twitter. like fan et al. [6] authors first introduced a hin; then they used a meta-graph based technique to characterize the semantic relatedness over users. furthermore, they have integrated content-based similarity and relatedness obtained by each meta-graph to formulate a similarity measure over users. finally, they proposed a classifier combining different similarities based on different meta-graphs to make predictions. in kim [8] authors used a text mining approach to explore how opioid-related research themes have changed since 2000. the textual data were obtained from pubmed, and the research periods were divided into three periods. while a few topics appear throughout each period, many new health problems emerged as opioid abuse problems magnified. topics such as hiv, methadone maintenance treatment, and world health organization appear consistently but diminish over time, while topics such as injecting drugs, neonatal abstinence syndrome, and public health concerns are rapidly increasing. the study kaur and bawa [9] is aimed at uncovering and analyzing a range of data mining tools and techniques for optimally predicting the numerous medical diseases to endow the health-care section with high competence and more effectiveness. after preparing the dataset, data are loaded into the weka tool. then, naïve bayes, decision tree (j48), multilayer perceptron (mlp), logistic regression are selected to build the prediction models. data are then cross-validated using performance classifier measure; the results of each algorithm are then compared to each other. in rani [10], neural networks have been used to classify medical data sets. back propagation error method with variable learning rate and acceleration has been used to train the network. 
to analyze the performance of the network, various training data have been used as input of the network. to speed up the learning process, parallelization is performed in each neuron at all output and hidden layers. results showed that the multi-layer neural network is trained faster than a single-layer neural network with high classification efficiency. in shajahaan et al. [11], the application of decision trees in predicting breast cancer has been investigated. it has also analyzed the performance of conventional supervised learning algorithms, namely, random tree, id3, cart, c4.5, and naive bayes. then, data are transferred to rapid miner data mining tool and breast cancer diagnosis for each sample in the test set is predicted with seven different algorithms which are discriminant analysis, artificial neural networks (ann), decision trees, logistic regression, support vector machines, naïve bayes, and knn. results showed that random tree achieves higher accuracy in cancer prediction. raheleh hamedanizad, et al.: predicting opioid withdrawal using data mining techniques uhd journal of science and technology | jul 2019 | vol 3 | issue 2 35 in kaur and bawa [12], the motive is proposing an expert system which can predict whether the person is addicted prone to drugs so as to control and aware every drug abuser as they can test repeatedly to cure them without hesitation. the proposed expert system is developed using decision tree id3 algorithm. in ji et al. [13], a framework has been developed to predict potential risks for medical conditions as well as its progression trajectory to identify the comorbidity path. the proposed framework utilizes patients’ publicly available social media data and presents a collaborative prediction model to predict the ranked list of potential comorbidity incidences, and a trajectory prediction model to reveal different paths of condition progression. in salleh et al. [14], ann algorithms have been used to propose a framework for relapse prediction using among drug addicts at pusat rawatan inabah. the data collected will be mining through ann algorithms to generate patterns and useful knowledge and then automatically classifying the relapse possibility. authors have been mention that among the classification algorithms, ann is one of the best algorithms to predict relapse among drug addicts. 3. dataset of study and pre-processing the statistical population employed in this study is comprised the opioid withdrawal applicants of a welfare center. samples include 793 applicants including men and women. the table 1: features and their values in the dataset of study. 
feature value city two different cities welfare center name 12 different centers welfare center type private-admitted, private-outpatient, government-admitted, government-outpatient, community treatment, drop-in, midterm resident recovery age 17–70 years sex man-woman place of residence city-village number of children 0–8 education illiterate, primary, middle school, high school diploma, associate degree, bachelors, masters, phd marital status single-married job worker, farmer, service jobs, urban driver, inner city driver, unofficial jobs, storekeeper, unemployed, housewife, soldier, student, retired, self-employed, other housing situation lease or mortgage, father’s house, father-in-law’s house, organizational, other income 0–6000 (usd) consumption opium, heroin, crack, cannabis, ecstasy, buprenorphine, lsd, cocaine, tramadol, burnt water, methadone, combination consumption method smoke, eat, inject, snuff, drink, inhalation, other consumption frequency daily, weekly, monthly, a few times consumption amount half (gram), one, one and a half, two, two and a half, three, four, five, more average cost of supplying narcotics in the last week 0.7–134 (usd) the first one who offered narcotics strangers, friends, without offer, coworkers, family member, relatives where was the first consumption family party, friends party, park, school or university, street, casern, workplace, home, other history of consumption in the family members none, spouse, father, mother, sister, brother, two or more history of injection yes-no history of joint injection yes-no prisoner history yes-no type of previous withdrawals outpatient referral to the doctor’s office, admission in private ward, referring to outpatient centers, residence in rehabilitation centers, residence in rehabilitation camps, anesthesia method, conventional method, self-helping groups, licensed centers, other side effects physical effects, mental effects, financial effects, pregnancy, pressure, legal problems, marriage, physical, mental and financial effects, others the number of previous withdrawals 0–10 times raheleh hamedanizad, et al.: predicting opioid withdrawal using data mining techniques 36 uhd journal of science and technology | jul 2019 | vol 3 | issue 2 number of features in this dataset is 26. table 1 shows the features of this dataset and the value of each feature. here, some necessary preprocessing is provided on the dataset to increase the efficiency of the prediction models: i. the output field of this dataset is “the number of referrals” which is obtained by a small change in the field of “the number of previous withdrawals.” the purpose of this study is to predict the number of referrals of the addicts to withdrawal centers. this knowledge helps us understand that an individual succeeds to withdraw after how many referrals to the opioid withdrawal centers. to calculate the output filed, it is sufficient to add a unit to the field of “the number of previous withdrawals.” then, the output field is converted into four classes, as shown in table 2. ii. the field “family income” includes a numerical value between 0 and 6000 usd. to increase the efficiency of the proposed prediction model, the values of this field are divided into 10 degrees. in other words, individuals are categorized into 10 groups in terms of family income. this categorization is represented in table 3. iii. the field of “average cost of supplying the narcotics in the last week” also includes values in the range of 0.7–134 usd. 
all of these samples are also categorized into eight groups, as shown in table 4.
iv. the field of "age" includes values between 17 and 70. in another categorization, all samples are categorized into six different groups, as given in table 5.
v. the field of "side effects" includes the same value for all samples. therefore, this feature cannot affect the prediction results, and it is eliminated from the dataset.
vi. furthermore, the values of some features, such as housing situation, amount of consumption, and consumption frequency, are missing for some individuals; these are called missing values, and random initialization is used to resolve this problem.

table 2: classifying the output field of the dataset
the number of referrals    class
1                          a
2, 3                       b
4, 5                       c
x>5                        d

table 3: categorizing the samples into 10 degrees in terms of "family income"
the family income (x) per usd    group
x=0             j
0<x≤34          i
34<x≤100        h
100<x≤167       g
167<x≤334       f
334<x≤666       e
666<x≤1000      d
1000<x≤1666     c
1666<x≤3333     b
3333<x          a

table 4: categorizing the samples in terms of "average cost of supplying the narcotics in the last week" into eight groups
the average cost of supplying the narcotics in the past week (x) per usd    group
x≤3         h
3<x≤7       g
7<x≤10      f
10<x≤14     e
14<x≤17     d
17<x≤34     c
34<x≤165    b
x>165       a

table 5: categorizing the samples in terms of "age" into six different groups
age (x)     group
x≤20        a
20<x≤25     b
25<x≤30     c
30<x≤40     d
40<x≤50     e
x>50        f

4. the proposed model

the proposed prediction model for determining the type of addicts in terms of the classes presented in table 2 is a hybrid model of the j48, decorate, and bagging algorithms. the structure of the proposed hybrid model is as follows:

decorate(bagging(j48(dataset)))

4.1. j48

this algorithm is the implementation of the c4.5 decision tree. in this algorithm, additional grafted branches are considered on a tree in a post-processing phase. the grafting process tries to capture some of the capabilities of ensemble methods such as bagged and boosted trees, while a single tree structure is maintained. this algorithm identifies areas that are either empty or contain only misclassified samples and explores another (alternative) class [3,4].
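the paper builds this decorate(bagging(j48)) pipeline in weka (sections 4.2 and 4.3 below describe the two meta-learners); as a hedged, approximate illustration only, the sketch below uses scikit-learn, which has no decorate implementation and whose trees are cart rather than c4.5, so bagged entropy-based trees stand in for the real pipeline. the csv file name and column names are placeholders, and the categorical features are assumed to be already label-encoded.

```python
import pandas as pd
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import precision_score, recall_score, f1_score, cohen_kappa_score

# placeholder file and column names; the real study uses 793 records with 26 features
data = pd.read_csv("withdrawal_dataset.csv")
x = data.drop(columns=["referral_class"])          # predictive features (label-encoded)
y = data["referral_class"]                         # classes a-d from table 2

# bagged entropy-split trees as a rough stand-in for bagging(j48); decorate is omitted
model = BaggingClassifier(
    DecisionTreeClassifier(criterion="entropy"),
    n_estimators=15,
)

# 10-fold cross-validation, as described in section 5.1
pred = cross_val_predict(model, x, y, cv=10)
print("precision:", precision_score(y, pred, average="weighted"))
print("recall   :", recall_score(y, pred, average="weighted"))
print("f-measure:", f1_score(y, pred, average="weighted"))
print("kappa    :", cohen_kappa_score(y, pred))
```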
4.2. decorate

ensemble learning is the use of a set of classifiers to learn partial solutions for a given problem and then integrate these solutions, using some strategy, to construct a final solution to the original problem. recently, ensemble learning has become one of the most popular fields in the data mining and machine learning communities and has been applied successfully in many real classification applications. diverse ensemble creation by oppositional relabeling of artificial training examples (decorate) is a simple meta-learner that can use any strong learner as a base classifier to build diverse committees in a fairly straightforward way. the motivation for decorate is based on the fact that combining the outputs of multiple classifiers is only useful if they disagree on some inputs. decorate is designed to use additional artificially generated training data, adding differently and randomly constructed instances to the training set to generate highly diverse ensembles [2,15,16].

decorate can also be used effectively for the following [16]:
i. active learning, to reduce the number of training examples required to learn an accurate model;
ii. exploiting unlabeled data to improve accuracy in semi-supervised learning;
iii. combining both active and semi-supervised learning for improved results;
iv. obtaining improved class membership probability estimates, to assist in cost-sensitive decision-making;
v. reducing the error of regression methods; and
vi. improving the accuracy of relational learners.

furthermore, the advantages of decorate are as follows [16]:
• ensembles of classifiers are often more accurate than their component classifiers if the errors made by the ensemble members are uncorrelated. the decorate method reduces the correlation between ensemble members by training classifiers on oppositely labeled artificial examples. furthermore, the algorithm ensures that the training error of the ensemble is always less than or equal to the error of the base classifier, which usually results in a reduction of the generalization error; and
• on average, combining the predictions of decorate ensembles will improve on the accuracy of the base classifier.

4.3. bagging

this algorithm was proposed in 1994 by leo breiman to improve classification by combining randomly generated training sets. this methodology is a meta-algorithm designed to improve the stability and accuracy of machine learning algorithms used in statistical classification and regression. variance is reduced and over-fitting is mitigated through the use of this algorithm. bagging implicitly creates ensemble diversity by training classifiers on different subsets of the data. although this method is commonly used with decision trees, it can be used with any kind of model. bagging is a special case of the model averaging approach and can be applied to the prediction of continuous values by taking the average value of each vote rather than the majority [2,17,18].

the advantages of bagging are as follows [16]:
• bagging works well if the base classifiers are unstable;
• it increases accuracy because it reduces the variance of the individual classifier;
• bagging seeks to reduce the error due to the variance of the base classifier; and
• it is noise-tolerant, but not especially accurate.

5. simulation results

in this section, we first present the simulation model for the proposed model and the other algorithms. then, the common evaluation metrics in data mining problems are introduced. finally, the experiment results of the proposed model and some common algorithms are presented.

5.1. simulation model

in this research, the weka tool is used to perform the pre-processing operations and construct the proposed predictive models. this software has been developed at waikato university in new zealand and is an open-source tool implemented in an object-oriented programming language. this tool includes several machine learning and data mining algorithms, such as regression, classification, clustering, exploring association rules, pre-processing tools (filters), and attribute selection methods. furthermore, to train and test the proposed model, the k-fold (k = 10) method is employed. in this type of test, data are divided into k subsets. from these k subsets, one subset is used for testing, and k−1 subsets are used for training. this procedure is repeated k times so that all data are used once for testing and once for training. finally, the average of these k tests is taken as the final estimate. in the k-fold method, the ratio of each class in each subset and in the main set is the same [19].

5.2. evaluation metrics

one of the common tools used for evaluating classification algorithms is the confusion matrix.
as shown in table 6, the confusion matrix includes the results of the predictions of the classifier algorithm in four different classes: true positive, false negative, false positive, and true negative. true positive refers to the positive samples that were correctly labeled by the classifier. true negative refers to the negative samples that were correctly labeled by the classifier. a false positive is an error in which a test result improperly indicates the presence of a condition, such as a disease (the result is positive), when in reality it is not present. a false negative is an error in which a test result improperly indicates no presence of a condition (the result is negative) when, in reality, it is present. considering the confusion matrix, the following measures can be defined and evaluated:

• precision is the fraction of retrieved instances that are relevant. this measure is calculated by equation (1):

$precision=\frac{tp}{tp+fp}$  (1)

• accuracy is the proportion of true results (both true positives and true negatives) among the total number of cases examined. this measure is calculated by equation (2):

$accuracy=\frac{tp+tn}{tp+tn+fp+fn}$  (2)

• recall is the fraction of relevant instances that are retrieved. this measure is calculated by equation (3):

$recall=\frac{tp}{tp+fn}$  (3)

• f-measure combines precision and recall (harmonic mean) and is calculated by equation (4):

$f\text{-}measure=\frac{2\times recall\times precision}{recall+precision}=\frac{2\,tp}{2\,tp+fp+fn}$  (4)

• root mean squared error (rmse) is a frequently used measure of the differences between the values predicted by a model or an estimator and the values observed. the rmse is the square root of the second sample moment of the differences between predicted values (pi) and observed values (ai), or the quadratic mean of these differences. this measure is calculated by equation (5):

$rmse=\sqrt{\frac{\left(p_1-a_1\right)^2+\dots+\left(p_n-a_n\right)^2}{n}}$  (5)

• mean absolute error (mae) measures how far predicted values (pi) are from observed values (ai) and is calculated by equation (6):

$mae=\frac{\left|p_1-a_1\right|+\dots+\left|p_n-a_n\right|}{n}$  (6)

5.3. experiment results

the experiment results in fig. 1 show that the precision of the proposed model is 71.3%, which is improved compared to j48, random forest, naive bayes, and the mlp neural network with precisions of 66%, 65.3%, 51.2%, and 56.3%, respectively. furthermore, the results show that using hybrid models based on the meta-learning algorithms decorate and bagging increases the precision of the prediction models. for instance, the precision of j48 alone is 66%, while its combination with the decorate and bagging algorithms increases the precision to 69% and 66.7%, respectively. in the proposed model, since the features of j48, decorate, and bagging are all employed, higher precision is obtained.

fig. 2 shows the experiment results in terms of recall. the results show that the proposed model has a recall of 71.5%, which is higher than j48, random forest, naive bayes, and the mlp neural network with recalls of 66%, 65.2%, 50.9%, and 56.2%, respectively. besides, the proposed hybrid model gives better results compared to the hybrid algorithms decorate-j48 and bagging-j48 with recalls of 69.1% and 67%. moreover, in terms of f-measure, as shown in fig. 3, the proposed model has obtained an f-measure of 71.2% which

table 6: the confusion matrix
                predicted
observed        true     false
true            tp       fn
false           fp       tn

fig. 1.
comparing the proposed model with other algorithms in terms of precision. raheleh hamedanizad, et al.: predicting opioid withdrawal using data mining techniques uhd journal of science and technology | jul 2019 | vol 3 | issue 2 39 outperforms j48, random forest, naive bayes, and mlp neural network with f-measures of 65.5%, 64.9%, 50.9%, and 56.2%, respectively. furthermore, combining j48 with decorate and bagging algorithms has increased f-measure but not more than the hybrid proposed model. furthermore, the proposed model is compared with other models in terms of kappa in fig. 4. the experiment results show that the kappa values obtained for j48, decorate-j48, and bagging-j48 are 0.50, 0.54, and 0.56, respectively, while its value for the proposed model is about 0.58, which indicates the superiority of the proposed model over the other models. furthermore, fig. 5 compares the performance of the proposed model with other algorithms in terms of mae. in terms of mae, j48 outperforms other algorithms with a value of 0.1884. the mae of the proposed model is 0.2427 which is a bit higher than j48. indeed, this negligible shortcoming of the proposed model can be ignored compared to its superiority in terms of other measures. in addition, the proposed model is compared with other algorithms in terms of rmse. the results presented in fig. 6 show that the proposed model with rmse of 0.3265 has the minimum rmse along with decorate-j48 with rmse of 0.3239 compared to other model indicating the desirable performance of the proposed model. fig. 5. comparing the proposed model with other algorithms in terms of mean absolute error. fig. 2. comparing the proposed model with other algorithms in terms of precision. fig. 3. comparing the proposed model with other algorithms in terms of f-measure. fig. 4. comparing the proposed model with other algorithms in terms of kappa. fig. 6. comparing the proposed model with other algorithms in terms of root mean squared error. raheleh hamedanizad, et al.: predicting opioid withdrawal using data mining techniques 40 uhd journal of science and technology | jul 2019 | vol 3 | issue 2 finally, the proposed hybrid model is compared with other algorithms in terms of time required to build the model. the results of this experiment in fig. 7 showed that the time required to build the proposed model is 3.42 s while this time is less for other algorithms. the reason is very clear because the proposed model is based on a combination of three different algorithms, and therefore, it takes longer to construct. 6. conclusion in this paper, a hybrid prediction model based on data mining techniques is proposed to predict the number of times that opioid withdrawal applicants refer to welfare organizations. the statistical population of this study is comprised opioid withdrawal applicants referring to a welfare organization. the proposed model is a combination of decorate, bagging, and j48 algorithms, which benefits from the advantages of all three algorithms. the efficiency of the proposed hybrid model is evaluated in terms of precision, recall, f-measure, kappa, rmse, and mae. the results show that the proposed model with a precision of 71.3% outperforms other algorithms such as random forest, naïve bayes, and mlp neural network. references [1] a. m. trescot, s. datta, m. lee and h. hansen. “opioid pharmacology”. pain physician, vol. 11, no. 2 suppl, pp. s133-s153, 2008. [2] h. jiawei and k. micheline. “data mining: concepts and techniques”. 2nd ed. 
morgan kaufmann publishers, elsevier, burlington, 2006. [3] g. r. murray and a. scime. “data mining. emerging trends in the social and behavioral sciences: an interdisciplinary, searchable, and linkable resource”. john wiley and sons, hoboken, nj, pp. 1-15, 2015. fig. 7. comparing the proposed model with other algorithms in terms of the time required to build the model. [4] l. zeng, l. li, l. duan, k. lu, z. shi, m. wang and p. luo. “distributed data mining: a survey”. information technology and management, vol. 13, no. 4, pp. 403-409, 2012. [5] j. lu, s. sridhar, r. pandey, m. a. hasan and g. mohler. “investigate transitions into drug addiction through text mining of reddit data”. in: proceedings of the 25th acm sigkdd international conference on knowledge discovery and data mining, acm, pp. 2367-2375, 2019. [6] y. fan, y. zhang, y. ye and w. zheng. “social media for opioid addiction epidemiology: automatic detection of opioid addicts from twitter and case studies”. in: proceedings of the 2017 acm on conference on information and knowledge management, acm, pp. 1259-1267, 2017. [7] y. zhang, y. fan, y. ye, x. li and w. zheng. “detecting opioid users from twitter and understanding their perceptions toward mat”. in: 2017 ieee international conference on data mining workshops, ieee, pp. 502-509, 2017. [8] y. m. kim. “discovering major opioid-related research themes over time: a text mining technique”. amia summits on translational science proceedings, vol. 2019, pp. 751-760, 2019. [9] s. kaur and r. k. bawa. “future trends of data mining in predicting the various diseases in medical healthcare system”. international journal of energy, information and communications, vol. 6, no. 4, pp. 17-34, 2015. [10] k. u. rani. “parallel approach for diagnosis of breast cancer using neural network technique”. international journal of computer applications, vol. 10, no. 3, pp. 1-5, 2010. [11] s. s. shajahaan, s. shanthi and v. m. chitra. “application of data mining techniques to model breast cancer data”. international journal of emerging technology and advanced engineering, vol. 3, no. 11, pp. 362-369, 2013. [12] s. kaur and r. k. bawa. “implementation of an expert system for the identification of drug addiction using decision tree id3 algorithm”. in: 2017 3rd international conference on advances in computing, communication and automation, ieee, pp. 1-6, 2017. [13] x. ji, s. a. chun, j. geller and v. oria. “collaborative and trajectory prediction models of medical conditions by mining patients’ social data”. in: 2015 ieee international conference on bioinformatics and biomedicine, ieee, pp. 695-700, 2015. [14] a. k. m. salleh, m. makhtar, j. a. jusoh, p. l. lua and a. m. mohamad. “a classification framework for drug relapse prediction”. journal of fundamental and applied sciences, vol. 9, no. 6s, pp. 735-750, 2017. [15] m. c. patel, m. panchal and h. p. bhavsar. “decorate ensemble of artificial neural networks with high diversity for classification”. international journal of computer science and mobile computing, vol. 2, no. 5, pp. 134-138, 2013. [16] p. s. adhvaryu and m. panchal. “a review on diverse ensemble methods for classification”. iosr journal of computer engineering, vol. 1, no. 4, pp. 27-32, 2012. [17] l. breiman. “bagging predictors”. machine learning, vol. 24, no. 2, pp. 123-140, 1996. [18] h. zhang, y. song, b. jiang, b. chen and g. shan. “two-stage bagging pruning for reducing the ensemble size and improving the classification performance”. mathematical problems in engineering, vol. 
2019, pp. 1-17, 2019. [19] s. s. a. poor and m. e. shiri. “a genetic programming based algorithm for predicting exchanges in electronic trade using social networks’ data”. international journal of advanced computer science and applications, vol. 8, no. 5, pp. 189-196, 2017. . 16 uhd journal of science and technology | jul 2019 | vol 3 | issue 2 1. introduction an epileptic seizure is one of the most common neurological disorders caused by brain activity impulses that escape their boundaries and affect other areas of the brain through creating a storm of electrical activities [1-3]. an epileptic seizure is the result of excessive neuronal spontaneous and synchronized discharge in the group of the brain cells. to detect symptoms of epileptic seizure in a patient, electroencephalography (electroencephalogram [eeg]) is used commonly [3,4]. eeg measures the electrical activity of the brain and generates a dynamic visual image of the brain activities that can be scanned for abnormalities that may indicate whether the patient is suffering from epileptic seizure or not. visually inspection of eeg recordings is time consuming and requires specialists such as neurophysiologists to analyze the recordings and diagnose the case. to facilitate the detection of epileptic seizure signs with high accuracy and reduce the time taken to make diagnostics, it is essential that an automated computer-based system to be utilized [5,6]. to use eeg recordings and make a diagnostic, the following steps will have to be taken: 1. preprocessing 2. analyzing eeg recordings using the time, frequency, and joint time-frequency domain 3. identify patterns that indicate seizure activities (feature extraction) 4. classify identified patterns to make correct diagnostics. the state of the art in feature extraction methods for electroencephalogram epileptic classification mokhtar mohammadi1, hoger mahmud2  1department of information technology, college of science and technology, university of human development, sulaymaniyah, kurdistan region, iraq, 2department of computer science, college of science and technology, university of human development, sulaymaniyah, kurdistan region, iraq a b s t r a c t epilepsy is a neurological disease that is common around the world, and there are many types (e.g., focal aware seizures and atonic seizure) that are caused by epileptic seizures. an epileptic seizure is a transient of symptoms because of abnormal excessive or synchronous neural activity in the brain. electroencephalogram (eeg) is a common way to record brain activity brain activities generated by nerve cells in the cerebral cortex. automatic epileptic seizure detection or prediction system can classify normal from abnormal eeg signal. selection of discriminant features is a matter of the performance of an automatic system. in this paper, we review several features extracted from the time, frequency, and time-frequency domains proposed by different researches for the purpose of epileptic seizure detection, also analyze, and compare the performance of the proposed features. index terms: classification, electroencephalogram, epileptic seizure detection, feature extraction, time-frequency analysis corresponding author’s e-mail:  hoger mahmud, department of computer science, college of science and technology, university of human development, sulaymaniyah, kurdistan region, iraq. 
e-mail: hoger.mahmud@uhd.edu.iq received: 27-06-2019 accepted: 13-07-2019 published: 25-07-2019 access this article online doi: 10.21928/uhdjst.v3n2y2019.pp16-23 e-issn: 2521-4217 p-issn: 2521-4209 copyright © 2019 mohammadi and mahmud. this is an open access article distributed under the creative commons attribution noncommercial no derivatives license 4.0 (cc by-nc-nd 4.0) re v i e w a r t i c l e uhd journal of science and technology mohammadi and mahmud: feature extraction methods for eeg epileptic classification uhd journal of science and technology | jul 2019 | vol 3 | issue 2 17 feature extraction as an important step in automatic epileptic seizure detection system has attracted lots of attentions. some of the researchers such as [7] used time domain to extracts features from eeg data; [8] used frequency domain for feature extraction; and the combination of time and frequency domains are also used by researchers in boubchir et al. [6], boubchir et al. [8], and boubchir et al. [9]. reference mohammadi et al. [10] proposed time-frequency features to detect epileptic seizure using eeg recordings. depend on the selected domain, different features have been proposed for epileptic seizure detection. amplitude modulation (am)-frequency modulation (fm) signals can be used to model real-life signals and the model is normally recognized by attributes such as instantaneous phase, instantaneous frequency (if), and instantaneous amplitude (ia); these attributes, when extracted, can yield good classification outcomes [11,12]. to extract if or ia, methods such as ted and empirical mode decomposition can be used [13], as for differentiating signals with high signal energy, measures such as renyi entropy and time-frequency flatness are good choices of use. the two measures are types of time-frequency entropy measures that are good indicators of seizure activity when eeg signals are considered [12]. in identifying seizure activities, the shape and direction of energy distribution in time-frequency signals are important; these features can be extracted using directional or wavelet decomposition filters and the features can be captured in a number of images. the images can later be used to obtain statistical features [14]. there are other methods that can be used for the same extraction purpose, such as dimensionality reduction methods proposed by sameh and lachiri [14]. in this paper, we analyze and compare different features that have been proposed by various researchers as proofs for seizure activities and classifications. we found out that the features extracted from the joint time-frequency domain provide more advantageous than those extracted from either time or frequency domain. the structure of the paper is as follows: in the following, we discuss the framework for eeg classification. in the next section, we review the feature extraction methods, and then, we discuss the performance of several eeg features. finally, we conclude the paper. 2. eeg seizure detection and classification framework in this section, we explain the steps of seizure detection process using eeg that includes five steps as below: 2.1. the preprocessing stage in this stage, we remove noise and excess features from eeg recordings, using techniques such as band-pass filtering and bayesian denoising [15]. this stage will prepare the data for processing and facilitate correct seizure-related signal detection and clears away the unwanted artifacts that may distort the real result. 2.2. 
eeg signal representation
to ensure that the best eeg signal representation is used in seizure diagnostics, it is necessary to decide in which representation domain the signals are analyzed. the typical representation domains used by researchers are the time, frequency, and joint time-frequency domains. fig. 1a shows a normal eeg signal in the time domain, and fig. 1b shows the time-frequency representation of the same signal in the joint time-frequency domain. fig. 2a shows an abnormal eeg signal, and fig. 2b shows its time-frequency representation, in which we can observe a train of spikes characteristic of an epileptic eeg signal.

2.3. feature extraction
at this stage, features and patterns indicating seizure activities are extracted from the preprocessed eeg data. the common features proposed by researchers are classified into four categories: the first (amplitude-based) is extracted from the eeg signal in the time domain, the second (spectrum-based) in the frequency domain, the third (if-based) in the time-frequency domain [9], and the fourth (image descriptor-based) also in the time-frequency domain [6,8].

2.4. feature selection
as a result of the previous steps, a number of features are extracted; however, not all features are decisive in seizure diagnostics, as there will be redundant or irrelevant features in the extraction. in this step, it is required to filter out the most irrelevant features and eliminate the unwanted ones to ensure the quality and correctness of the final seizure diagnostics and classification.

2.5. classification
the features extracted and filtered in the previous step are now ready to be used for the final diagnostics and classification. in this step, classifiers such as an artificial neural network or a support vector machine are used. a cross-validation method is used to evaluate the performance of the classifier; the leave-one-out technique can be used for validation, which gives an almost unbiased approximation of the true generalization error. the performance of the classifier can be assessed based on specificity, sensitivity, and total accuracy:
\[ \text{specificity} = \frac{\text{true negatives}}{\text{total number of negatives}} \]
\[ \text{sensitivity} = \frac{\text{true positives}}{\text{total number of positives}} \]
\[ \text{total accuracy} = \frac{\text{true positives} + \text{true negatives}}{\text{total number of examples}} \]
fig. 3 shows a flowchart indicating the input, computational steps, and output of a classification system.

fig. 3. the flowchart of the electroencephalogram classification system.
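to make the classification step concrete, the sketch below runs leave-one-out cross-validation of an svm on a matrix of eeg feature vectors and reports the three measures above. scikit-learn, the rbf kernel, and the function and variable names are illustrative assumptions, not details taken from the reviewed studies.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import LeaveOneOut

def loo_evaluate(features, labels):
    """Leave-one-out cross-validation of an SVM on EEG feature vectors.

    features: (n_signals, n_features) array; labels: 0 = normal, 1 = seizure.
    """
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    predictions = np.empty_like(labels)
    for train_idx, test_idx in LeaveOneOut().split(features):
        clf = SVC(kernel="rbf")                          # classifier choice is illustrative
        clf.fit(features[train_idx], labels[train_idx])
        predictions[test_idx] = clf.predict(features[test_idx])

    tp = np.sum((labels == 1) & (predictions == 1))
    tn = np.sum((labels == 0) & (predictions == 0))
    sensitivity = tp / np.sum(labels == 1)               # true positives / total positives
    specificity = tn / np.sum(labels == 0)               # true negatives / total negatives
    accuracy = np.mean(predictions == labels)            # total accuracy
    return sensitivity, specificity, accuracy
```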
3. eeg feature extraction review
features extracted from the time, frequency, or time-frequency representations are the most widely proposed features for eeg seizure detection; in this section, we present a brief literature review of these features.

a. time-domain features: to extract seizure indicator features, the median absolute deviation, root mean square, or inter-quartile range of the amplitude of the eeg signal is scanned in the time domain [16-18]. below, we present some of the features suggested by researchers for extraction, such as statistical moments [16,17,19].
1. features based on statistical moments:
• first moment and second central moment of the eeg signal [16,19]
mean: \[ f_1^{(t)} = \mu = \frac{1}{N}\sum_{n=1}^{N} \left| z[n] \right| \quad (1) \]
variance: \[ f_2^{(t)} = \sigma^2 = \frac{1}{N}\sum_{n=1}^{N} \left( \left| z[n] \right| - \mu \right)^2 \quad (2) \]
• normalized moments: third and fourth central moments of the eeg signal [16,19]
skewness: \[ f_3^{(t)} = \frac{1}{N\sigma^3}\sum_{n=1}^{N} \left( \left| z[n] \right| - \mu \right)^3 \quad (3) \]
kurtosis: \[ f_4^{(t)} = \frac{1}{N\sigma^4}\sum_{n=1}^{N} \left( \left| z[n] \right| - \mu \right)^4 \quad (4) \]
• coefficient of variation of the eeg signal [19]
\[ f_5^{(t)} = \frac{\sigma}{\mu} = \frac{\sqrt{f_2^{(t)}}}{f_1^{(t)}} \quad (5) \]

fig. 1. (a) an electroencephalogram signal with normal activity and (b) time-frequency representation of the signal.
fig. 2. (a) an electroencephalogram signal with epileptic seizure activity and (b) time-frequency representation of the signal.

2. features based on amplitude:
• median absolute deviation of the eeg amplitude [16]
\[ f_6^{(t)} = \frac{1}{N}\sum_{n=1}^{N} \left| z[n] - \mu \right| \quad (6) \]
• root mean square amplitude [17]
\[ f_7^{(t)} = \sqrt{\frac{1}{N}\sum_{n=1}^{N} z[n]^2} \quad (7) \]
• interquartile range [18]
\[ f_8^{(t)} = z\!\left[ \tfrac{3(N+1)}{4} \right] - z\!\left[ \tfrac{N+1}{4} \right] \quad (8) \]
3. features based on entropy:
• shannon entropy [16,17,20]
\[ f_9^{(t)} = -\sum_{n=1}^{N} z[n]\,\log_2\!\left( z[n] \right) \quad (9) \]

b. frequency-domain features: the frequency representation of the eeg signal is scanned in this domain to identify seizure indicator features based on spectral information (e.g., power spectrum, spectral roll-off) [16,17,19]. below, we summarize some features extracted in the frequency domain, where Z[k] denotes the magnitude of the fourier transform of the eeg segment z[n].
1. features based on power spectrum:
• maximum power of the frequency bands [17,19]
\[ f_1^{(f)} = \sum_{k=1}^{M_\delta} \left| Z[k] \right|^2 \quad (10) \]
\[ f_2^{(f)} = \sum_{k=M_\delta+1}^{M} \left| Z[k] \right|^2 \quad (11) \]
where M corresponds to the maximum frequency.
2. features based on spectral information:
• spectral centroid: the average signal frequency weighted by the magnitude of the spectrum [16]
\[ f_3^{(f)} = \frac{\sum_{k=1}^{M} k \left| Z[k] \right|}{\sum_{k=1}^{M} \left| Z[k] \right|} \quad (12) \]
• spectral flux: the difference between normalized spectral magnitudes [16]
\[ f_4^{(f)} = \sum_{k=1}^{M} \left( Z^{(l)}[k] - Z^{(l-1)}[k] \right)^2 \quad (13) \]
where Z^{(l)} and Z^{(l-1)} are the normalized magnitudes of the fourier transform at frames l and l-1.
• spectral flatness: indicates whether the distribution is smooth or spiky [16]
\[ f_5^{(f)} = \frac{\left( \prod_{k=1}^{M} Z[k] \right)^{1/M}}{\frac{1}{M}\sum_{k=1}^{M} Z[k]} \quad (14) \]
• spectral roll-off: the spectral concentration below a threshold λ [16]
\[ f_6^{(f)} = \sum_{k=1}^{m_\lambda} \left| Z[k] \right| \quad (15) \]
3. feature based on entropy:
• spectral entropy: measures the regularity of the power spectrum of the eeg signal [17]
\[ f_7^{(f)} = -\frac{1}{\log(M)}\sum_{k=1}^{M} p\!\left( Z[k] \right)\log\!\left( p\!\left( Z[k] \right) \right) \quad (16) \]
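a minimal sketch of how a few of the time-domain features (equations 1-9) and spectral features (equations 12, 14, 16) above could be computed for one eeg segment. numpy/scipy, the small regularization constants, and the function names are assumptions made for illustration only.

```python
import numpy as np
from scipy.stats import skew, kurtosis

def time_domain_features(z):
    """A few amplitude/statistical features of one EEG segment z[n]."""
    z = np.asarray(z, float)
    a = np.abs(z)
    return {
        "mean": a.mean(),                                     # equation (1)
        "variance": a.var(),                                  # equation (2)
        "skewness": skew(a),                                  # equation (3)
        "kurtosis": kurtosis(a, fisher=False),                # equation (4)
        "rms": np.sqrt(np.mean(z ** 2)),                      # equation (7)
        "iqr": np.percentile(z, 75) - np.percentile(z, 25),   # equation (8)
    }

def spectral_features(z):
    """Spectral centroid, flatness and entropy of one EEG segment."""
    z = np.asarray(z, float)
    Z = np.abs(np.fft.rfft(z))
    k = np.arange(1, len(Z) + 1)
    p = Z ** 2 / np.sum(Z ** 2)                               # normalized power spectrum
    return {
        "centroid": np.sum(k * Z) / np.sum(Z),                                   # equation (12)
        "flatness": np.exp(np.mean(np.log(Z + 1e-12))) / (np.mean(Z) + 1e-12),   # equation (14)
        "entropy": -np.sum(p * np.log(p + 1e-12)) / np.log(len(p)),              # equation (16)
    }
```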
c. time-frequency domain features: the joint time-frequency domain representation is more informative for the analysis of real-life signals. this indicates that, if the additional information provided by the time-frequency representation is properly extracted in the form of time-frequency features, then better classification accuracy can be achieved. in this domain, the eeg signals are scanned for features that indicate seizure activities based on information extracted from both the time and frequency domains. several techniques are available for the extraction, and the most widely used ones are:
1. if features: many real-life signals can be modeled as am-fm signals. such signals are completely characterized by the parameters of the am-fm model, that is, the instantaneous phase, the if, the ia, and the total number of components. for such signals, parameters extracted from the if or ia can lead to good classification results [15-18]. the if- or ia-related parameters can be extracted either from time-frequency distributions (tfds) or from empirical mode decomposition based methods [9,18,19].
2. image descriptor features: image descriptors and image processing techniques, such as shape and texture descriptors and the local binary pattern (lbp) descriptor, are used to scan the time-frequency image representation of eeg signals for seizure indicator features [6,8].
3. entropy features: time-frequency entropy measures such as renyi entropy and time-frequency flatness can be used for discriminating signals having a high concentration of signal energy from signals having energy spread in the time-frequency domain [9,21]; for example, in the case of eeg signals, seizure activity is sparse in the time-frequency domain, while the background is not.
4. texture features: texture time-frequency features are related to the direction and shape of the energy distribution. these features can be obtained by convolving a tfd with a set of convolution masks, such as wavelet decomposition filters or directional filters, to obtain a number of filtered images.
5. other approaches: these include dimensionality reduction methods for directly extracting features from given tfds, time-frequency matched filtering, and statistical features [10,20,22,23].

in the following, we describe the relevant time-frequency eeg features that we have identified. these features are based on the if [5], entropy [5], flux, flatness, and energy information of the eeg signal (e.g., sub-band energies and energy localization) [9,5], which are computed from the tfd ρ. time-frequency image related features have also been proposed recently, based on image descriptors capable of visually describing the seizure activity pattern observed in the tfd of the eeg signal, ρ, considered and processed as an image using image processing techniques. the proposed time-frequency image features include shape and texture descriptors [6], the haralick descriptor [24], and the lbp descriptor [8].
1. features based on energy:
• sub-band energies [9]
\[ f_1^{(tf)} = \sum_{n=1}^{N}\sum_{k=1}^{M_\delta} \rho[n,k] \quad (17) \]
\[ f_2^{(tf)} = \sum_{n=1}^{N}\sum_{k=M_\delta+1}^{M} \rho[n,k] \quad (18) \]
where M_δ = M/f_δ and M corresponds to the maximum frequency component in the signal.
• energy localization [5]
\[ f_3^{(tf)} = \frac{1}{NM}\,\frac{\left( \sum_{n=1}^{N}\sum_{k=1}^{M} \sqrt{\left| \rho[n,k] \right|} \right)^{2}}{\sum_{n=1}^{N}\sum_{k=1}^{M} \left| \rho[n,k] \right|} \quad (19) \]
2. features based on if:
• mean and deviation of the if of the eeg signal
\[ f_4^{(tf)} = \frac{1}{N}\sum_{n=1}^{N} f_i[n] \quad (20) \]
\[ f_5^{(tf)} = \max_n f_i[n] - \min_n f_i[n] \quad (21) \]
where
\[ f_i[n] = \frac{\sum_{k=1}^{M} k\,\rho[n,k]}{2\sum_{k=1}^{M} \rho[n,k]} \]
3. features based on entropy:
• renyi entropy of order α [10]
\[ f_6^{(tf)} = \frac{1}{1-\alpha}\log_2\!\left( \sum_{n=1}^{N}\sum_{k=1}^{M} \rho^{\alpha}[n,k] \right) \quad (22) \]
• normalized renyi entropy (of order 3)
\[ TF_{RE} = -\frac{1}{2}\log_2\!\left( \sum_{n=1}^{N}\sum_{k=1}^{M} \left( \frac{\rho[n,k]}{\sum_{n'=1}^{N}\sum_{k'=1}^{M} \rho[n',k']} \right)^{3} \right) \quad (23) \]
• time-frequency flatness
\[ TF_{flatness} = \frac{\left( \prod_{n=1}^{N}\prod_{k=1}^{N} \rho[n,k] \right)^{1/N^{2}}}{\frac{1}{N^{2}}\sum_{n=1}^{N}\sum_{k=1}^{N} \rho[n,k]} \quad (24) \]
4. time-frequency flux:
\[ TF_{flux} = \sum_{n=1}^{N}\sum_{k=1}^{N} \left| \rho[n+l,k+m] - \rho[n,k] \right| \quad (25) \]
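as an illustration of the time-frequency features above, the sketch below computes a spectrogram as a simple stand-in for the tfd ρ[n,k] and derives the renyi entropy, flatness, and flux of equations (22)-(25). the spectrogram parameters and function names are assumptions; the cited works use higher-resolution tfds rather than a plain spectrogram.

```python
import numpy as np
from scipy.signal import spectrogram

def tf_features(z, fs, alpha=3):
    """Renyi entropy, flatness and flux computed from a spectrogram used as the TFD rho[n, k]."""
    z = np.asarray(z, float)
    _, _, rho = spectrogram(z, fs=fs, nperseg=128, noverlap=96)
    rho = rho.T                                    # rho[n, k]: time index n, frequency index k
    rho_hat = rho / rho.sum()                      # energy-normalized TFD

    renyi = np.log2(np.sum(rho_hat ** alpha)) / (1 - alpha)       # equations (22)/(23)
    geo_mean = np.exp(np.mean(np.log(rho + 1e-20)))
    flatness = geo_mean / (np.mean(rho) + 1e-20)                  # equation (24)
    flux = np.sum(np.abs(rho[1:, 1:] - rho[:-1, :-1]))            # equation (25) with l = m = 1
    return renyi, flatness, flux
```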
analysis and discussion
we have compared and analyzed the eeg seizure detection and classification methods that use the eeg features described in section 3. we have considered here only the state-of-the-art methods that have been assessed on the bonn university eeg database [23] and the freiburg eeg dataset [25], which are freely available and widely used. the bonn database includes five eeg sets, referred to as sets a-e, where each set contains 100 artifact-free eeg signals of 23.6 s duration acquired from normal subjects and patients with epileptic seizures. all the eeg signals in the database have been recorded at a sampling rate of fs = 173.6 hz, thus resulting in 4096 samples (≈ 23.6 × fs), and have a spectral bandwidth varying from 0.5 to 85 hz (see [23] for more detail). for all the methods considered in this review that used the bonn database, the desired classification is given in two different classes of eeg signals, normal and seizure, denoted by n and s, respectively. the class n includes set a, which contains 100 eeg signals without seizure acquired from five healthy volunteers with eyes open, while the class s includes set e, which contains 100 eeg signals with seizure acquired from five patients.

the freiburg dataset includes 24-h-long continuous presurgical invasive recordings of 21 patients suffering from epilepsy. the sampling rate of the recorded data is 256 hz. a 16-bit analog-to-digital converter is used to record the data over 128 channels; out of these channels, six are selected based on the visual analysis of an eeg specialist. for each patient, there are at least three ictal files such that at least one of them contains a seizure event. among the ictal files, the files preceding the seizure event are called pre-ictal files, and the ones which come immediately after the seizure segment are called post-ictal. inter-ictal files have recordings of signals that are at least 50 min away from seizure events. both ictal and inter-ictal files are stored in ascii format and contain six channels of eeg time series.

table 1 presents a comparison of the performance of some state-of-the-art methods in terms of the best total classification accuracy (acc) using the eeg database {n, s}.

table 1: performance comparison of different methods using different features
method | eeg representation | feature extraction | classification | best acc (%)
redelico et al. (2017) [20] | time domain | entropy-based features | logistic regression | 94.5
polat and günecs (2007) [27] | frequency domain | fourier transform-based features | decision tree | 98.72
kannathal et al. (2005) [26] | time domain/frequency domain | time-domain features/frequency-domain features | anfis | 92.22
subasi (2007) [28] | time-frequency domain | wavelet-based features | me network | 95
boubchir et al. (2014) [22] | time-frequency domain | combined time-frequency signal and time-frequency image related features | svm | 97.5
boubchir et al. (2014) [17,24] | time-frequency domain | haralick descriptor-based features | svm | 99
boubchir et al. (2015) [6] | time-frequency domain | image texture descriptor-based features | svm | 98
boubchir et al. (2015) [8] | time-frequency domain | lbp descriptor-based features | svm | 99.33
mohammadi et al. (2017) [10] | time-frequency domain | time-frequency flux, time-frequency flatness, and time-frequency entropy | - | 97.5
svm: support vector machine, eeg: electroencephalogram, lbp: local binary pattern

by analyzing the acc results in the table, we notice that the methods using time-frequency features, such as the methods in boubchir et al. [8] and boubchir et al. [24], provide a higher acc (up to 99.33%) than the methods using time-domain and/or frequency-domain features, such as the methods in redelico et al. [20], kannathal et al. [26], and polat and günecs [27]. this indicates that time-frequency features are relevant and discriminative features that significantly improve the classification results. moreover, the use of time-frequency image related features [6,8,24] achieves better performance than the use of time-frequency signal related features [22]. in addition, another type of time-frequency feature based on wavelet coefficients was proposed in subasi [28], providing an acc result (of 95%) lower than the results achieved by the time-frequency features used in boubchir et al. [6], boubchir et al. [8], boubchir et al. [22], and boubchir et al. [24].
finally, the methods in boubchir et al. [8] and mohammadi et al. [10] are the most promising methods for detecting and classifying eeg seizures with high accuracy; the time-frequency flux is the best-performing time-frequency feature, as it achieved an auc of 0.94 in mohammadi et al. [10].

5. conclusion
a discriminant feature plays a crucial role in the performance of an automatic epileptic seizure detection system. features can be extracted from the time, frequency, and joint time-frequency domains, and different features such as if, entropy, texture, and statistical features have been proposed by different researchers. in this paper, we presented a review of eeg features that have been proposed to characterize epileptic seizure activities for the purpose of eeg seizure detection and classification. the analysis of these features has shown that time-frequency features, especially those based on time-frequency image description, are the most relevant and discriminative features for detecting and classifying eeg seizures with high accuracy. our future work will focus on adapting the eeg time-frequency features to classify epileptic seizure activities with their degree of severity.

references
[1] r. s. fisher, w. v. e. boas, w. blume, c. elger, p. genton, p. lee and j. jr. engel. "epileptic seizures and epilepsy: definitions proposed by the international league against epilepsy (ilae) and the international bureau for epilepsy (ibe)." epilepsia, vol. 46, no. 4, pp. 470-472, 2005.
[2] r. s. fisher, c. acevedo, a. arzimanoglou, a. bogacz, j. h. cross, c. e. elger, j. engel, l. forsgren, j. a. french, m. glynn, d. c. hesdorffer, b. i. lee, g. w. mathern, s. l. moshé, e. perucca, i. e. scheffer, t. tomson, m. watanabe and s. wiebe. "ilae official report: a practical clinical definition of epilepsy." epilepsia, vol. 55, no. 4, pp. 475-482, 2014.
[3] b. abou-khalil and k. e. misulis. "atlas of eeg and seizure semiology: text with dvd." butterworth-heinemann, united kingdom, 2005.
[4] h. r. mohseni, a. maghsoudi and m. b. shamsollahi. "seizure detection in eeg signals: a comparison of different approaches." international conference of the ieee engineering in medicine and biology society, vol. 2006, pp. 6724-6727, 2006.
[5] b. boashash, l. boubchir and g. azemi.
“a methodology for timefrequency image processing applied to the classification of nonstationary multichannel signals using instantaneous frequency descriptors with application to newborn eeg signals.” eurasip journal on advances in signal processing, vol. 2012, no. 1, pp. 117, 2012. [6] l. boubchir, s. al-maadeed, a. bouridane and a. a. chérif. “time-frequency image descriptors-based features for eeg epileptic seizure activities detection and classification.” in: 2015 ieee international conference on acoustics, speech and signal processing (icassp), pp. 867-871, 2015. [7] j. gotman. “automatic seizure detection: improvements and evaluation.” electroencephalography and clinical neurophysiology, vol. 76, no. 4, pp. 317-324, 1990. [8] l. boubchir, s. al-maadeed, a. bouridane, and a. a. chérif. “classification of eeg signals for detection of epileptic seizure activities based on lbp descriptor of time-frequency images.” in: 2015 ieee international conference on image processing (icip), 2015, pp. 3758-3762. [9] l. boubchir, s. al-maadeed and a. bouridane. “on the use of time-frequency features for detecting and classifying epileptic seizure activities in non-stationary eeg signals.” in: 2014 ieee international conference on acoustics, speech and signal processing (icassp), 2014, pp. 5889-5893. [10] m. mohammadi, n. a. khan and a. a. pouyan. “automatic seizure detection using a highly adaptive directional time-frequency distribution.” multidimensional systems and signal processing, vol. 29, no. 4, pp. 1661-1678, 2018. [11] s. dong, b. boashash, g. azemi, b. e. lingwood and p. b. colditz. “automated detection of perinatal hypoxia using time-frequencybased heart rate variability features.” medical and biological engineering and computing, vol. 52, no. 2, pp. 183-191, 2014. [12] b. boashash, n. a. khan and t. ben-jabeur. “time-frequency features for pattern recognition using high-resolution tfds: a tutorial review.” digital signal processing, vol. 40, pp. 1-30, 2015. [13] b. boashash, g. azemi and n. a. khan. “principles of timefrequency feature extraction for change detection in non-stationary signals: applications to newborn eeg abnormality detection.” pattern recognition, vol. 48, no. 3, pp. 616-627, 2015. [14] s. sameh and z. lachiri. “multiclass support vector machines for environmental sounds classification in visual domain based on loggabor filters.” international journal of speech technology, vol. 16, no. 2, pp. 203-213, 2013. [15] l. boubchir and b. boashash. “wavelet denoising based on the map estimation using the bkf prior with application to images and eeg signals.” ieee transactions on signal processing, vol. 61, no. 8, pp. 1880-1894, 2013. [16] j. löfhede, m. thordstein, n. löfgren, a. flisberg, m. rosazurera, i. kjellmer and k. lindecrantz. “automatic classification of background eeg activity in healthy and sick neonates.” journal of neural engineering, vol. 7, no. 1, p. 16007, 2010. [17] b. r. greene, s. faul, w. p. marnane, g. lightbody, i. korotchikova and g. b. boylan. “a comparison of quantitative eeg features for neonatal seizure detection.” clinical neurophysiology, vol. 119, no. 6, pp. 1248-1261, 2008. [18] l. boubchir, b. daachi and v. pangracious. “a review of feature extraction for eeg epileptic seizure detection and classification.” in: telecommunications and signal processing (tsp), 2017 40th international conference on, pp. 456-460, 2017. [19] a. aarabi, f. wallois and r. grebe. 
“automated neonatal seizure detection: a multistage classification system through feature selection based on relevance and redundancy analysis.” clinical neurophysiology, vol. 117, no. 2, pp. 328-340, 2006. [20] f. redelico, f. traversaro, m. garcia, w. silva, o. rosso and m. risk. “classification of normal and pre-ictal eeg signals using permutation entropies and a generalized linear model as a classifier.” entropy, vol. 19, no. 2, p. 72, 2017. [21] m. mohammadi, a. a. pouyan, n. a. khan and v. abolghasemi. “locally optimized adaptive directional time-frequency mohammadi and mahmud: feature extraction methods for eeg epileptic classification uhd journal of science and technology | jul 2019 | vol 3 | issue 2 23 distributions.” circuits, systems, and signal processing, vol. 37, no. 8, pp. 3154-3174, 2018. [22] l. boubchir, s. al-maadeed and a. bouridane. “effectiveness of combined time-frequency imageand signal-based features for improving the detection and classification of epileptic seizure activities in eeg signals.” in: 2014 international conference on control, decision and information technologies (codit), 2014, pp. 673-678. [23] r. g. andrzejak, k. lehnertz, f. mormann, c. rieke, p. david and c. e. elger. “indications of nonlinear deterministic and finite-dimensional structures in time series of brain electrical activity: dependence on recording region and brain state.” physical review e, vol. 64, no. 6, p. 61907, 2001. [24] l. boubchir, s. al-maadeed, and a. bouridane. “haralick feature extraction from time-frequency images for epileptic seizure detection and classification of eeg data.” in: 26th international conference on microelectronics (icm), 2014, pp. 32-35. [25] m. ihle, h. feldwisch-drentrup, c. a. teixeira, a. witon, b. schelter, j. timmer and a. schulze-bonhage. “epilepsiae-a european epilepsy database.” computer methods and programs in biomedicine, vol. 106, no. 3, pp. 127-138, 2012. [26] n. kannathal, m. l. choo, u. r. acharya and p. k. sadasivan. “entropies for detection of epilepsy in eeg.” computer methods and programs in biomedicine, vol. 80, no. 3, pp. 187-194, 2005. [27] k. polat and s. günecs. “classification of epileptiform eeg using a hybrid system based on decision tree classifier and fast fourier transform.” applied mathematics and computation, vol. 187, no. 2, pp. 1017-1026, 2007. [28] a. subasi. “eeg signal classification using wavelet feature extraction and a mixture of expert model.” expert systems with applications, vol. 32, no. 4, pp. 1084-1093, 2007. . uhd journal of science and technology | jul 2019 | vol 3 | issue 2 51 1. introduction today, wireless sensors networks are used in many areas such as environment, military operations, and explorations. since the sensor nodes have low processing, energy, and memory capabilities and it is important to establish security in such networks due to their application in critical environments especially military, this area has attracted the attention of many researchers yick et al. [1] and jamshidi et al. [2]. so far, various attacks [2]-[11] have been proposed against these networks. each type of attack has a different mechanism with a different destructive effect and affects various operations and protocols; thus, each one has a different defense mechanism. two of such dangerous attacks which are ver y common include black hole (bh) and selective forwarding (sf) attacks [10], [11]. 
to establish these attacks, the adversary enters the network environment, captures one or several legal nodes of the network, reprograms and injects them in the network as a restricted multipath routing algorithm in wireless sensor networks using a virtual cylinder: bypassing black hole and selective forwarding attacks elham bahmanih1, aso mohammad darwesh2, mojtaba jamshidi2*, somaieh bali3 1department of computer engineering, malayer branch, islamic azad university, malayer, iran, 2department of information technology, university of human development, sulaymaniyah, iraq, 3department of computer engineering, kermanshah branch, islamic azad university, kermanshah, iran a b s t r a c t in this paper, a simple and novel routing algorithm is presented to improve the packet delivery in harsh conditions such as selective forwarding and black hole attacks to the wireless sensor networks. the proposed algorithm is based on restricted multipath broadcast based on a virtual cylinder from the source node (sn) to the sink node (sk). in this algorithm, when a packet is broadcast by a sn, a virtual cylinder with radius w is created from the sn to a sk. all the nodes located in this virtual cylinder are allowed to forward the packet to the sink. thus, data are forwarded to sink through multiple paths, but in a restricted manner so that the nodes do not consume a high amount of energy. if there are some compromised nodes in this virtual cylinder, the packets may be forwarded to the sink through other nodes of the virtual cylinder. the proposed algorithm is simulated and evaluated in terms of packet delivery rate and energy consumption. the experiment results show that the proposed algorithm increases packet delivery rate 7 times compared to the single path routing method and reduces energy consumption up to 3 times compared to flooding routing method. index terms: black hole attack, restricted multipath, routing, selective forwarding attack, virtual cylinder, wireless sensor network access this article online doi: 10.21928/uhdjst.v3n2y2019.pp51-58 e-issn: 2521-4217 p-issn: 2521-4209 copyright © 2019 jamshidi, et al. this is an open access article distributed under the creative commons attribution non-commercial no derivatives license 4.0 (cc by-nc-nd 4.0) re v i e w a r t i c l e uhd journal of science and technology corresponding author’s e-mail: mojtaba jamshidi, department of information technology, university of human development, sulaymaniyah, iraq. e-mail: jamshidi.mojtaba@gmail.com received: 30-07-2019 accepted: 18-08-2019 published: 22-8-2019 bahmanih, et al.: a restricted multipath routing algorithm in wsns 52 uhd journal of science and technology | jul 2019 | vol 3 | issue 2 compromised nodes. as shown in fig. 1, such compromised nodes are located along data paths to prevent the packets to reach the sink. that is, the received packets are not forwarded to the sink or base station, but they are dropped. in the sf attack, the compromised node only drops some of the received packets, but in the bh attack, the compromised node drops all of the received packets. these two attacks are very destructive for the multi-hop routing algorithms, particularly, if the compromised nodes are located along the data flow paths. on the other hand, establishing these two attacks are very simple and low cost for the adversary. 
because the compromised nodes which start these two attacks do not need to perform suspicious operations such as injecting incorrect packet to the network, manipulate data, broadcast false or multiple ids, and establish a high-speed link; they only need to refuse to forward some of the received packets to the destination. in most cases, even if there is no compromised node in the network, the packets are dropped due to various reasons such as collision [10], [11]. the algorithms which have been proposed to defend against these two attacks like [10], [12]-[23] employ methods such as multiple flow topologies and nodes with special capabilities and multi-hop verification. in general, problems of the presented algorithms include non-scalability, security, complexity, high cost, and slow reaction. in this paper, a restricted multipath routing algorithm is presented for reliable and low-cost delivery of the data to the destination and defending against sf and bh attacks in wireless sensor networks such that shortcomings of the previous algorithms are resolved and it can be employed in resource-constraint sensor networks. 2. related work sf attack was first presented in karlof and wagner [10], and the first approach to defend against this attack is to use multipath routing protocols. in this method, the packets are forwarded from the source node (sn) to the destination through n independent paths. this method is completely robust against sf attack until maximum of n-1 nodes is captured by the adversary; but, if more than n+1 nodes are captured, this method might not operate properly. it has been mentioned in satyajayant et al. [12] that the bh attack is one of the dangerous attacks against wireless sensor networks which can be easily established by the adversary with a low cost. then, an algorithm has been presented to defend against this attack which is based on multi-sink to protect data flows against bh attacks. in this algorithm, sensor nodes employ a set of control messages throughout the network to explore a subset of the sinks. then, they transmit data to accessible sinks. in abasikeleş-turgut et al. [13], an algorithm has been presented to defend against bh and sinkhole attacks. this algorithm can be applied to low-energy adaptive clustering hierarchy clustering-based sensor networks. in this study, three different models have been considered, and different mechanisms have been presented to handle them. in sheela et al. [14], it has been mentioned that bh attack is one of the denial of service attacks which drops packets. then, an algorithm based on mobile agents has been proposed to defend against this attack. a mobile agent is a program segment with self-controllability which can be applied to distributed applications, especially dynamic networks. the agents proposed in this study are loaded on several sinks and detect the compromised nodes by patrolling the network. in nitesh and diwaker [15], another algorithm has been proposed based on multiple sinks to defend against bh attack. the purpose of this study is to ensure that data are delivered to at least one sink node (sk) by establishing several sinks in different areas of the network and transmitting data to different sinks, simultaneously. in deepali and gupta [16], an adaptive exponential trustbased algorithm has been proposed for calculating the trust factor of each node in each computational cycle for detecting the balckhole (bh) attack. furthermore, it has been claimed that the proposed mechanism not only reduces energy fig. 1. 
different cases of establishing selective forwarding and black hole attacks [11]. bahmanih, et al.: a restricted multipath routing algorithm in wsns uhd journal of science and technology | jul 2019 | vol 3 | issue 2 53 consumption but also it reduces the time required to detect the bh attack. furthermore, an adaptive threshold has been used to reduce false alarm rates. in baishali [17], an algorithm has been proposed to defend against bh attack in mobile ad hoc networks. this algorithm can detect the compromised nodes in a network which employs the ad hoc on-demand distance vector algorithm. in this algorithm, nodes use a timer to wait for the acknowledgment (acks) to return. if the timer is reset and the ack is not received, bh attack along data transmission path is detected. in shila et al. [18], a channel-aware algorithm has been proposed, which detects compromising behavior of the nodes from the bad behavior of the transmission channel. this algorithm is based on channel estimation and traffic monitoring strategies. if the supervised missing rate is higher than the estimated normal missing rate, those nodes are detected as an adversary. in li et al. [19], an algorithm based on sequential mesh test based has been presented for detection of sf attack in sensor networks. this scheme is centralized and operates on clusterbased networks. the sensor node u transmits its packets to the next hop, node v, and if node v does not transmit the packet in constant time, node u reports to the cluster head that the packet has been dropped. after receiving packet drop reports, the cluster head applies the sequential mesh test based on the suspicious node. in hu et al. [20], a secure routing algorithm based on monitoring node and trust mechanism has been proposed. in this algorithm, trust is adjusted based on the transmission rate of the packets and residual energy of the node. this detection and routing algorithm is general because it considers both the lifetime of the network and its security. in yu and xiao [21], another algorithm has been presented in which a multi-hop acknowledgment scheme is used considering the responses received from the intermediate nodes to broadcast the warning messages in the network. in this algorithm, each inter mediate node along the data transmission route cooperates in detecting the compromised node. if an intermediate node observes a bad behavior from its upstream or downstream node, it generates an alarm packet and transmits it to the sn or the base station. then, the sn and the base station can make a decision and respond using a complicated intrusion detection system. in xiao et al. [22], a technique has been proposed for detecting the compromised nodes in the sf attack. this algorithm is the improved version of the previous technique [21]. in this algorithm, some of the intermediate nodes along the route are selected as checkpoint nodes randomly which have to generate acks for each received packet. furthermore, each node requires a one-way hash key chain to ensure that the packets are authenticated. furthermore, delay mechanisms are used to transmit this one-way hash key. in this algorithm, each intermediate node along the packet transmission route has the potential to explore the packets which are lost abnormally, and if the intermediate node does not receive enough acks from the downstream checkpoints, it can detect the compromised nodes. 3. 
multi-sink architecture in each wireless sensor network, there is usually one or more sensor nodes called sk which collects total data of the network. destination of all reported packets in sensor networks is the sks. when sensor nodes observe a determined event, they generate the required report packets and deliver them to the sink through multi-hop routing algorithms. considering the number of the sks, their location, and they are being mobile or stationary, various architectures are created in the network which might change mechanism of the routing algorithms [23,24]. one of the common architectures is multi-sink architecture. in this architecture, as shown in fig. 2, there are several sks established on one side of the network. the sks might either communicate directly with each other or communicate with the base station (where the network manager is located). in this architecture, the sufficient condition for data delivery is that data are delivered to at least one of the sinks. in other words, the packet generated by one sensor node does not have to be delivered to a specific sk, but it is sufficient that is delivered to one of the sinks. this architecture has three main advantages [23,24]: 3.1. increasing network lifetime if there is only one sink in the network, its neighboring sensor nodes transmit high traffic, and their energy is discharged very soon, which reduces network lifetime. however, if the multi-sink architecture is used, this problem is resolved. 3.2. load balancing multiple sinks in the network might result in several routing trees in the network which might balance load among nodes of the network. bahmanih, et al.: a restricted multipath routing algorithm in wsns 54 uhd journal of science and technology | jul 2019 | vol 3 | issue 2 3.3. high packet delivery rate it is obvious that when there are several sinks in the network, the probability of delivering the packets to the destination increases because of the probability that there exists a route from the source to the destination (sink), especially in lowdensity networks, increases. considering the features and advantages of the multi-sink architecture, it is used in this paper to design the proposed routing algorithm to defend against sf and bh attacks. 4. system assumptions sensor networks are categorized into three main groups including sks, sns, and forwarding nodes (fns). the sns generate the report packets. the packets generated by the sns are delivered to the sks through fns. for instance, sns might be deployed in the boundary areas or the adversary environment and generate the required reports in case of adversary operations and deliver the reports to the sink through multiple hops. each node has a unique id and is aware of its location. the network area of interest is a 2d environment in which sns, sks, and fns are deployed randomly. all nodes have the same radio range equal to r. furthermore, it is assumed that the nodes communicate with each other through the wireless radio channel and employ the omnidirectional broadcast. furthermore, it is assumed that the sns are not only aware of their location but also they are aware of the location of the sks. the network environment is not safe, and the adversary can capture some sensor nodes and reprogram them as compromised nodes. 5. 
the proposed algorithm
although using single-path approaches for delivering data to the destination in sensor networks imposes low overhead on the nodes, it cannot guarantee that data reach the destination, particularly in the case of sf and bh attacks. the flooding method, although more reliable, imposes a heavy overhead on the nodes of the network; thus, neither approach is cost-effective. in this section, a restricted multipath approach is proposed for reliable and cost-effective delivery of data to the destination, such that it overcomes sf and bh attacks in the sensor network and delivers data to the sinks with high reliability.

in the proposed method, a virtual cylinder (with diameter 2w) is created from the sn to a sk, and only the nodes in this cylinder are allowed to forward packets to the destination. each sn inserts its spatial coordinate (ls) in the packet while generating a data packet. then, the packet is broadcast so that all of its neighbors receive it; however, only the nodes inside the virtual cylinder forward the received packet. each node which has received the packet, for example, node v, first extracts the spatial coordinate of the sn which generated this packet and compares it with its own spatial coordinate, lv, to find out whether it is in the virtual cylinder or not. if node v is inside the virtual cylinder, it forwards the packet; otherwise, it drops the received packet. this process continues until the packet is delivered to the sink through restricted multiple paths. therefore, if there is a compromised node along one path, the packets can be delivered to the sink through adjacent paths.

in the following, details of the proposed algorithm are described considering fig. 3. it is assumed that the sn a intends to transmit a report packet to the sink s1. in this case, the routing vector is defined as \( \overrightarrow{AS_1} \). the packet is passed, within the range of the routing vector, through a virtual routing cylinder with a predetermined radius of w to reach the destination.

fig. 2. an example of a sensor network with multi-sink architecture.

assuming that the sk s1 is located at the spatial coordinate (x_s, y_s) and the sn a is located at the spatial coordinate (x_a, y_a), the equation of the line passing through these points, which is the line passing through the center of the virtual cylinder, is calculated as equation (1):
\[ y - y_0 = m(x - x_0) \;\Rightarrow\; y - y_a = m(x - x_a) \quad (1) \]
here, m is the slope of the line, calculated using equation (2):
\[ m = \frac{y_a - y_s}{x_a - x_s} \quad (2) \]
the sn a calculates the equation of the line passing through itself and the sk s1 and inserts it, along with the report data, in a packet and broadcasts it. this packet is forwarded to the sink by the nodes located in the virtual cylinder. assuming that the radio range of the nodes is r, the packet broadcast by each sensor node is broadcast within a radius of r, and each node located in this area receives the packet. each node that receives the report packet extracts the equation of the line inserted in this packet and calculates its distance from this line. it should be noted that the line equation is calculated only once by the sn and inserted in the report packets; the fns do not need to calculate the line equation, but it is sufficient to calculate their distance from the line.
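a small sketch of the forwarding rule just described: the source inserts the line from its position to the sink into the packet, and every receiving node forwards only if its perpendicular distance to that line is at most w (the distance computation is made precise in theorem 1 below). python and the helper names are illustrative assumptions; the authors simulated the algorithm in matlab.

```python
import numpy as np

def point_to_line_distance(p, a, b):
    """Perpendicular distance of point p from the line through points a and b."""
    p, a, b = np.asarray(p, float), np.asarray(a, float), np.asarray(b, float)
    m = b - a                                   # direction vector of the line
    t = np.dot(p - a, m) / np.dot(m, m)         # running parameter of the foot of the perpendicular
    return np.linalg.norm(p - (a + t * m))      # distance from p to the line

def should_forward(node_pos, source_pos, sink_pos, w):
    """Virtual-cylinder rule: forward only if the node lies within radius w of the source-sink line."""
    return point_to_line_distance(node_pos, source_pos, sink_pos) <= w

# example: a node 12 m off the source-sink line is outside a cylinder of radius w = 10 m
print(should_forward((50, 62), (0, 50), (100, 50), w=10))   # False
```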
when a packet is transmitted by a node inside the virtual cylinder, like node u in fig. 3, all nodes located in its radio range receive the packet. node u has four neighbors, where two of them are outside the virtual cylinder and the other two are inside it. each node adjacent to node u, like node v, calculates its distance from the central line of the virtual cylinder according to theorem 1.

theorem 1: assuming that \( \vec{a} \) and \( \vec{b} \) are two points in the 2d space on the line \( l \), and \( \vec{p} \) is an independent point which is not on this line, the vector equation of the line is as in (3):
\[ \vec{a} + t\,\vec{m} \quad (3) \]
here, \( \vec{m} \) is the direction vector obtained using (4):
\[ \vec{m} = \vec{b} - \vec{a} \quad (4) \]
and t is a running parameter calculated according to (5):
\[ t = \frac{\left( \vec{p} - \vec{a} \right) \cdot \vec{m}}{\vec{m} \cdot \vec{m}} \quad (5) \]
now, the distance of \( \vec{p} \) from the line \( l \) is calculated as (6):
\[ p_\perp = \left\| \vec{p} - \left( \vec{a} + t\,\vec{m} \right) \right\| \quad (6) \]
the proof of this theorem is given in [25].

for each node v which receives the report packet, if its distance from the central line of the virtual cylinder is smaller than or equal to w, that is, \( p_\perp \le w \), the receiving node v is inside the virtual cylinder and is allowed to forward the received packet. otherwise, \( p_\perp > w \), node v is outside the virtual cylinder and should not forward the received packet.

fig. 3. the proposed virtual cylinder.

furthermore, in the proposed algorithm, each sensor node has a buffer which stores the id of the last packet forwarded from each sn, as a result of which the sensor node refuses to forward repetitive packets. since all the sensor nodes located in the virtual cylinder forward the packets, a packet moves toward the corresponding sink through multiple paths (inside the cylinder). thus, if some of the nodes located inside the cylinder are compromised by an adversary and drop the report packets (sf and bh attacks), it is still probable that the packets are delivered to the sink through other paths. on the other hand, the multipath forwarding is restricted by the virtual cylinder to prevent a packet from being broadcast in the whole network and to avoid high energy consumption.

6. simulation results
in this section, the proposed algorithm is evaluated and its simulation results are presented. matlab is used to simulate the proposed algorithm.

6.1. simulation model
in the simulation model, the network comprises n = 300 sensor nodes which are deployed randomly in a 100 × 100 m area. the network includes three sns, where each sn generates a packet at each simulation instant and transmits it to the sink. the nodes have gps and are aware of their location. the sns are located in a fixed and specific location in the boundary area of the network. the network includes m = 3 sks which are located in the network environment, and the sns are aware of the location of the sks. the radio range of the nodes is r = 10 m, and the radius of the virtual cylinder is w meters. furthermore, it is assumed that the network contains sf compromised nodes which establish the sf attack and bh compromised nodes which establish the bh attack. the compromised nodes which establish the bh attack drop all the received packets, while the compromised nodes which establish the sf attack drop only 50% of the received packets. the initial energy of every node is set to 10 joules. the energy consumption for transmission and reception of a packet is 0.016 and 0.0016 joules, respectively. each experiment is repeated 50 times, and the final result is the average of these 50 runs.
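to make the simulation model concrete, the following sketch deploys a random network with the parameters listed above and debits the stated per-packet energies. it is only an illustrative reimplementation in python (the paper's simulations were done in matlab), and the constant and function names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# network parameters taken from the simulation model above
N_NODES, AREA = 300, 100.0                  # 300 nodes in a 100 x 100 m field
RADIO_RANGE = 10.0                          # r = 10 m
CYLINDER_RADIUS = 15.0                      # w (varied between 5 and 20 m in experiment 2)
E_INIT, E_TX, E_RX = 10.0, 0.016, 0.0016    # joules

positions = rng.uniform(0.0, AREA, size=(N_NODES, 2))   # random deployment
energy = np.full(N_NODES, E_INIT)

def charge_transmission(sender, receivers):
    """Debit the per-packet transmission/reception energies used in the experiments."""
    energy[sender] -= E_TX
    energy[np.asarray(receivers)] -= E_RX

def packet_delivery_rate(delivered, generated):
    """Packets received by the sinks divided by packets generated by the SNs."""
    return delivered / generated
```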
6.2. evaluation metrics and compared algorithms
the evaluation metrics are as follows:
6.2.1. packet delivery rate
the packet delivery rate is the ratio of the number of packets received by the sinks to the number of packets generated by the sns.
6.2.2. average residual energy
the average residual energy is the average residual energy of the sensor nodes after the simulation time. to calculate this metric, the energy of the compromised nodes and the sks is not considered; it is assumed that the compromised nodes and the sks have no energy limitation.
since the proposed algorithm is a restricted multipath algorithm, that is, packets are transmitted to the sinks only through the nodes located in the virtual cylinder, its efficiency is compared with the single-path and flooding methods. in the single-path algorithm, the sn transmits data to a sk through one path (the shortest path). in the flooding method, packets are broadcast in the whole network to reach the sks. it is clear that the single-path method consumes the minimum energy, but its packet delivery rate might be low, especially if compromised nodes of the sf or bh attack exist along the data path. the flooding method, in contrast, keeps the data delivery rate high even when there is a large number of compromised nodes in the network, but its energy consumption is very high. the proposed algorithm lies between these two algorithms; by adopting a restricted multipath method, it tries to control energy consumption while keeping the packet delivery rate at an acceptable level when sf and bh attacks exist.
6.3. experiment results
6.3.1. experiment 1
in this experiment, the efficiency of the proposed algorithm is evaluated in terms of packet delivery rate and energy consumption, and the results are compared with the single-path and flooding methods. in this experiment, sf = 25, bh = 25, and w = 15, and the experiment is executed for periods of 25-100 s. at each simulation instant, each sn generates and broadcasts a packet. the results of this experiment in terms of packet delivery rate and average residual energy are given in figs. 4 and 5, respectively. the results show that the packet delivery rate of the flooding algorithm is almost 100%, while it is 90% for the proposed algorithm and 12% for the single-path algorithm. in the flooding algorithm, since each packet is broadcast in the whole network, it is forwarded to the sinks through different paths; thus, despite the sf and bh attacks, at least one copy of each packet reaches one of the sinks. however, as shown in fig. 5, the flooding algorithm discharges the energy of the nodes significantly because a large number of transmission and reception operations are performed for each packet. in the single-path algorithm, packets are forwarded to a sk greedily through the shortest path. thus, if there is a compromised node along this path, it prevents the packet from reaching the destination, as a result of which the packet delivery rate is reduced significantly. however, as shown in fig. 5, this method is very efficient in terms of energy consumption because no additional or repetitive packet is broadcast in the network. the proposed algorithm, in turn, tries to forward the packets toward a sk through restricted multiple paths inside the virtual cylinder; hence, even if there exist compromised nodes, the packets can be delivered through different paths.
6.3.2. experiment 2
in this experiment, the effect of the radius of the virtual cylinder, w, on the efficiency of the proposed algorithm in terms of packet delivery rate and energy consumption is investigated. here, sf = 25, bh = 25, and w is varied from 5 to 20 m, and the simulation time is 50 s. the results of this experiment are given in figs. 6 and 7 for the packet delivery rate and the average residual energy of the nodes, respectively. as can be seen, w affects the efficiency of the proposed algorithm significantly. by increasing w, the virtual cylinder becomes larger and covers more nodes; as a result, packets are forwarded to the sink through more paths, which increases both the packet delivery rate and the energy consumption.

fig. 4. comparing the efficiency of the proposed algorithm with the single-path and the flooding algorithms in terms of the packet delivery rate (packet delivery rate vs. simulation time in seconds).
fig. 5. comparing the efficiency of the proposed algorithm with the single-path and the flooding algorithms in terms of the average residual energy (average residual energy in joules vs. simulation time in seconds).
fig. 6. the effect of the radius of the virtual cylinder, w, on the packet delivery rate of the proposed algorithm (packet delivery rate vs. w in meters).
fig. 7. the effect of the radius of the virtual cylinder, w, on the average residual energy of the proposed algorithm (average residual energy in joules vs. w in meters).

7. conclusion
in this paper, a novel and simple routing algorithm is presented to reduce the destructive effects of the sf and bh attacks. the proposed algorithm is based on restricted multipath broadcast over a virtual cylinder from the sn to the sk. in this algorithm, when a packet is broadcast by a sn, a virtual cylinder is created from the sn to one of the sks of the network, and all the nodes located inside this virtual cylinder are allowed to forward packets to the sink. thus, data are forwarded to the sinks through multiple restricted and adjacent paths. the proposed algorithm is simulated and evaluated in terms of packet delivery rate and energy consumption, and its results are compared with the single-path and flooding algorithms. the comparison results show that the proposed algorithm is 7 times better than the single-path algorithm in terms of packet delivery rate, and 3 times better than the flooding algorithm in terms of energy consumption.

references
[1] j. yick, b. mukherjee and d. ghosal. wireless sensor network survey. computer networks, vol. 52, no. 12, pp. 2292-2330, 2008.
[2] m. jamshidi, m. esnaashari, a. m. darwesh and m. r. meybodi. detecting sybil nodes in stationary wireless sensor networks using learning automaton and client puzzles. iet communications, vol. 13, no. 13, pp. 1988-1997, 2019.
[3] m. jamshidi, e. zangeneh, m. esnaashari and m. r. meybodi.
a lightweight algorithm for detecting mobile sybil nodes in mobile wireless sensor networks. computers and electrical engineering, vol. 64, pp. 220-232, 2017. [4] m. jamshidi, s.s.a. poor, n.n. qader, m. esnaashari and m.r. meybodi. a lightweight algorithm against replica node attack in mobile wireless sensor networks using learning agents. ieie transactions on smart processing and computing, vol. 8, no. 1, pp. 58-70, 2019. [5] m. jamshidi, e. zangeneh, m. esnaashari, a. m. darwesh and m. r. meybodi. a novel model of sybil attack in cluster-based wireless sensor networks and propose a distributed algorithm to defend it. wireless personal communications, vol. 105, no. 1, pp. 145-173, 2019. [6] m. jamshidi, m. ranjbari, m. esnaashari, a.m. darwesh and m.r. meybodi. a new algorithm to defend against sybil attack in static wireless sensor networks using mobile observer sensor nodes. adhoc and sensor wireless networks, vol. 43, pp. 213-238, 2019. [7] m. jamshidi, m. ranjbari, m. esnaashari, n. n. qader and m. r. meybodi. sybil node detection in mobile wireless sensor networks using observer nodes. joiv: international journal on informatics visualization, vol. 2, no. 3, pp. 159-165, 2018. [8] a. andalib, m. jamshidi, f. andalib and d. momeni. a lightweight algorithm for detecting sybil attack in mobile wireless sensor networks using sink nodes. international journal of computer applications technology and research, vol. 5, no. 7, pp. 433-438, 2016. [9] m. jamshidi, a.m. darwesh, a. lorenc, m. ranjbari and m.r. meybodi. a precise algorithm for detecting malicious sybil nodes in mobile wireless sensor networks. ieie transactions on smart processing and computing, vol. 7, no. 6, pp. 457-466, 2018. [10] c. karlof and d. wagner. secure routing in wireless sensor networks: attacks and countermeasures. ad hoc networks, vol. 1, pp. 299-302, 2003. [11] l. k. bysani and a. k. turuk. a survey on selective forwarding attack in wireless sensor networks. proceedings of the international conference on device and communications (icdecom), mesra, india, feb. 2011. [12] m. satyajayant, k. bhattarai and g. xue. bambi: blackhole attacks mitigation with multiple base stations in wireless sensor networks. in 2011 ieee international conference on communications (icc), ieee, 2011, pp. 1-5. [13] i. abasikeleş-turgut, m. n. aydin and k. tohma. a realistic modelling of the sinkhole and the black hole attacks in clusterbased wsns. international journal of electronics and electrical engineering, vol. 4, no. 1, pp. 74-78, feb. 2016. [14] d. sheela, v. r. srividhya, b. a. asma and g. m. chidanand. detecting black hole attacks in wireless sensor networks using mobile agent. in international conference on artificial intelligence and embedded systems (icaies, 2012, pp. 15-16. [15] g. nitesh and c. diwaker. detecting blackhole attack in wsn by check agent using multiple base stations. american international journal of research in science, technology, engineering and mathematics, vol. 3, no. 2, pp. 149-152, 2013. [16] v. deepali and p. gupta. adaptive exponential trust-based algorithm in wireless sensor network to detect black hole and gray hole attacks. in: emerging research in computing, information, communication and applications. springer, singapore, 2016, pp. 65-73. [17] g. baishali. a novel intrusion detection system for detecting blackhole nodes in manets. networks (graph-hoc), vol. 8, no. 2, pp. 1-13, 2016. [18] d. m. shila, y. cheng, t. anjali. mitigating selective forwarding attacks with a channel-aware approach in wmns. 
ieee transaction on wireless communications, vol. 9, no. 5, pp. 1661-1675, 2010. [19] g. li, x. liu and c. wang. a sequential mesh test based selective forwarding attack detection scheme in wireless sensor networks. in: proceeding of the international conference on networking, sensing and control (icnsc), 2010, pp. 554-558. [20] y. hu, y. wu and h. wang. detection of insider selective forwarding attack based on monitor node and trust mechanism in wsn. wireless sensor network, vol. 6, pp. 237-248, 2014. [21] b. yu and b. xiao. detecting selective forwarding attacks in wireless sensor networks. in: proceeding of the second international workshop on security in systems and networks (ipdps workshop), 2006. [22] b. xiao, b. yu and c. gao. chemas: identify suspect nodes in selective forwarding attacks. journal of parallel and distributed computing, vol. 67, no. 11, pp. 1218-1230, 2007. [23] w. k. seach, h. x. tan. multipath virtual sink architecture for underwater sensor networks. in: proceeding of oceans, 2006. [24] m. jamshidi, a. a. shaltooki, z. d. zadeh and a. m. darwesh. a dynamic id assignment mechanism to defend against node replication attack in static wireless sensor networks. joiv: international journal on informatics visualization, vol. 3, no. 1, pp. 13-17, 2019. [25] available from: https://www.monkeyproofsolutions.nl/how-tocalculate-the-shortest-distance-between-a-point-and-a-line. [last accessed on 2019 jul 10 july]. tx_1~abs:at/tx_2:abs~at 40 uhd journal of science and technology | jul 2020 | vol 4 | issue 2 1. introduction in recent years, enabling computers to read the text in natural images [1]-[3], [4] have been gaining increased attention. it can be used in optical character detection (optical character recognition), photography, and robot navigation. in this research, the two primary activities that concerned us were reading text and understanding text. our research is based primarily on text identification in natural images, which is far more complex than text detection on a well-maintained text file. in the previous text, detection works the majority of the works employs bottom-up pipelines, which often include the following steps: grouping or filtering of characteristics and the configuration of the line text. some common issues exist in all these processes. first, the effects of character detection using sliding window methods or related component-based approaches largely depend on their efficiency. mainly lowlevel characteristics (e.g., based on stroke width transform [5], maximally stable extremal regions [6], or histogram of oriented gradients [7]) are studied in these approaches. without background knowledge, it is difficult to define each stroke or character separately. at the same time, it is easy to result in a low recall where ambiguous characters are easily discarded, causing more difficulties for handling them in the following steps. second, there are several incremental phases with a bottom-up strategy, which makes the method very complex. these difficulties, therefore, reduce the power and efficiency of the program. deep learning technolog y has greatly increased the efficiency of target detection [1], [2], leading to advances text detection on images using region-based convolutional neural network hamsa d. majeed department of information technology, university of human development, sulaymaniyah, iraq a b s t r a c t in this paper, a new text detection algorithm that accurately locates picture text with complex backgrounds in natural images is applied. 
the approach is based primarily on the region-based convolutional neural network anchor system, which takes into account the unique features of the text area, compares it to other object detection tasks, and turns the text area detection task into an object sensing task. thus, the proposed text to be observed directly in the neural network’s convolutional characteristic map, and it can simultaneously predict the text/non-text score of the proposal and the coordinates of each proposal in the image. then, we proposed an algorithm for the construction of the text line, to increase the text detection model accuracy and consistency. we found that our text detection operates accurately, even in multiple language detection functions. we also discovered that it meets the 2012 and 2014 international conference on document analysis and recognition thresholds of 0.86 f-measure and 0.78 f-measure, which clearly shows the consistency of our model. our approach has been programmed and implemented using python programming language 3.8.3 for windows. index terms: text detection, region-based convolutional neural network, text images corresponding author’s e-mail: hamsa d. majeed, department of information technology, university of human development, sulaymaniyah, iraq. e-mail: hamsa.al-rubaie@uhd.edu.iq received: 02-05-2020 accepted: 27-07-2020 published: 02-08-2020 access this article online doi: 10.21928/uhdjst.v4n2y2020.pp40-45 e-issn: 2521-4217 p-issn: 2521-4209 copyright © 2020 majeed. this is an open access article distributed under the creative commons attribution non-commercial no derivatives license 4.0 (cc by-nc-nd 4.0) o r i g i n a l re se a rc h a rt i c l e uhd journal of science and technology hamsa d. majeed: text detection on images using r-cnn uhd journal of science and technology | jul 2020 | vol 4 | issue 2 41 in text detection. a variety of recent approaches has been used to create pixel predictions of text or non-text in fully convolutional networks (fcns). in addition, segmentation of text semantic can lead to greater skill in exploiting rich field context data to identify ambiguous content, which leads to less erroneous detections. two fully coevolutionary networks fell behind the paradigm to make the findings more robust. the second fcn produces word-level or character-level predictions on text region detected by the first fcn. all these steps just lead to a much more complex method. many techniques are being used to forecast the limits of text in natural images through sliding windows by coevolutionary features, such as the state-of-the-art regionbased convolutional neural network (r-cnn) technique where a region proposal network (rpn) is proposed to generate high-quality object proposals directly from convolutional feature maps. and then, these region proposals are fed into a faster r-cnn model for further classification. in object detection, each object has a well-defined closed boundary, while it is difficult to find one in-text, which makes it more challenging to predict the text line accurately. the region proposal network presented in [8] is extending in this paper to localize the text lines accurately. we have put into target detection the issue of text line detection. in the meantime, we use the benefits of profound convolution and computer networking systems. in fig. 1, the results are shown on our network architecture and text proposal identification. first, we break the function of text identification into a series of fine text proposals. 
to forecast the proposal position and text and non-text data together, we refine faster r-cnns anchor regression method. this can lead to better localization accuracy. second, the text line construction algorithm has been proposed to integrate with the fine-scale proposal into a text line area. the proposed method is to join and single process the multiscale and multilingual text. third, with using international conference on document analysis and recognition (icdar) 2012 and icdar 2014, our model showed reliable and accurate results with text detection. 2. related work the past text on image detections is primarily utilized bottomup methods [9]-[11] or sliding window methods [12]-[15] to detect characters or screen components. the methods used for sliding windows detect text propositions by glancing through an image in a multiscale window and cascading behind the device a classifier to locate text proposals using manually built software or recent cnn features [13], [16], [17]. the methods are based on the related components primarily implement a fast filter to differentiate text from non-text using low-level properties, such as gradient, stroke distance, and color [18].these bottom-up methods do not perform well in character detection and the following steps have produced cumulative false answers. in addition, the bottom-up method is complex and computationally expensive; particularly the sliding windows approach that needs a ranking on several sliding windows). the efficiency of text detection [1], [2], [4], [5], mostly sliding window system, has recently been advanced by deep learning technologies. they are all have enhanced, primarily using very deep cnn and sharing a coevolutionary mechanism [3], also used to minimize computational costs, to benefit from highlevel deep characteristics. many fcn-based methods were, therefore, provided and findings in text detection tasks were promising. ren [10] recently introduced a faster r-cnn object detection method that achieved state-of-the-art object detection tests. they proposed an area proposal network (rpn) that produces highly credible entity proposals directly from the coevolutionary functional maps, fast enough to exchange coevolutionary details. these works are inspiring for our model. 3. methodology there are two modules in our text detection system. in particular, the first module uses a very large, convolutional neural network to create fine-scale proposal regions. the second module is a text line construction that can complete text lines through the text proposal regions given in the fig. 1. the output of our model. hamsa d. majeed: text detection on images using r-cnn 42 uhd journal of science and technology | jul 2020 | vol 4 | issue 2 previous module. in the natural picture, the machine will correctly predict the text line. section a provides us with the concept of faster area cnn and how it fits well in the role of text detection. we propose our text line algorithm in section b. fig. 2 shows the block diagram of the proposed method. 3.1. fine-scale proposal network concerning object detection, due to the many cutting-edge tests on object detection benchmark, faster r-cnn has proved an effective and reliable object detection platform. its central segment is the area proposal network, which slides a small window in the coevolutionary software, which takes arbitrary formats as input and generates a series of rectangular proposals. 
faster r-cnn lets the network that predicts proposals and the classification network share a common set of convolutional layers. the r-cnn architecture is shown in fig. 3. in comparison to general sliding window techniques, the rpn (fig. 4) applies an effective anchor regression mechanism to detect multiscale objects with a single sliding window, which in turn reduces, to a certain degree, the computational cost of the whole network. the reason that a single window can predict objects of mixed scales is primarily that a window maps the multiscale objects in the original image into multiple anchors with different aspect ratios. we also apply this anchor mechanism in our model for text detection.

fig. 2. method block diagram.
fig. 3. the architecture of the region-based convolutional neural network.

the task of text detection is not quite the same as the task of object detection. there may be no visible closed boundary around the text region in the picture, as there usually is in object detection tasks, and text consists of multilevel elements, such as stroke, character, text line, and text area, that are not easy to discern. therefore, we optimize the anchor mechanism to predict the components of the text detection process at various stages. we note that a text line can be seen as a series of fine-scale text proposals, which can be treated, to some degree, as an object detection task. we assume that a series of text proposals from a text line will work with small-scale text detection and different aspect ratios. the anchor mechanism was strengthened as follows: each text proposal is assigned a fixed width of 16 pixels in the input image, matching the detector stride of the conv5 maps, the last convolutional feature map of vgg-16, which extracts deep features of the input image. moreover, k anchors are established to predict the height of the proposals (in our experiment, we use k = 10); they are all 16 pixels in width but have different heights, to correspond to text regions of different scales. an image of arbitrary size is input to the vgg-16 model, producing a w × h × c feature map on the conv5 convolution layer (where w × h is the spatial size and c is the number of channels). the detection then proceeds as follows: our model slides a 3 × 3 window over the conv5 feature maps, making a prediction at every window position. each prediction contains k proposals with their coordinates in the input image and their scores. all the proposals are compiled and screened, and proposals whose score does not exceed 0.7 are not used in the next phases.
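to make the fine-scale anchor scheme of section 3.1 concrete, the following python sketch enumerates fixed-width anchors over a conv5-sized grid; only the 16-pixel width and k = 10 come from the text above, while the particular height list, the stride value, and the function name are illustrative assumptions.

```python
import numpy as np

def fine_scale_anchors(feat_h, feat_w, stride=16,
                       heights=(11, 16, 23, 33, 48, 68, 97, 139, 198, 283)):
    """enumerate fixed-width text anchors: one set of k anchors (k = len(heights))
    centred on every position of the conv5 feature map. each anchor is
    (x1, y1, x2, y2) in input-image coordinates and is exactly `stride` (16) px wide."""
    anchors = []
    for fy in range(feat_h):
        for fx in range(feat_w):
            cx = fx * stride + stride / 2.0   # anchor centre in the input image
            cy = fy * stride + stride / 2.0
            for h in heights:
                anchors.append((cx - stride / 2.0, cy - h / 2.0,
                                cx + stride / 2.0, cy + h / 2.0))
    return np.array(anchors)

# e.g. a 600 x 1000 input gives roughly a 38 x 63 conv5 map with a vgg-16 backbone
a = fine_scale_anchors(38, 63)
print(a.shape)   # (38 * 63 * 10, 4) = (23940, 4)
```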
3.2. construct the text line
after the text detection network derived from faster r-cnn is applied, a series of fine text proposals is produced, as shown in fig. 5a. our primary goal is to group these scattered text proposals into the various text regions and to exclude proposals that do not belong to any text region, since some non-text objects with a composition similar to text patterns may generate false detections. the non-maximum suppression (nms) algorithm, which has recently been widely used in computer vision tasks, was proposed by alexander neubeck and luc van gool. the nms algorithm deletes non-maximum objects and can be viewed as a local maximum search; it is generally used to determine the top-scoring boxes, which normally indicate the most likely locations of the target object. based on this work, we use the nms algorithm to eliminate proposals with low scores and to form the distinct text regions for our text detection model, and the experimental findings supported this choice. the picture processed with nms is shown in fig. 5b.

after the above steps are completed, there are a number of candidate text areas, each composed of a series of adjacent fine text proposals. to recreate text lines in the input image from these adjacent text proposals, we propose a text line construction algorithm. the following rules are laid down for text lines: 1. a text line is defined to appear as a quadrilateral in the original image; 2. since vertical text written from top to bottom is not covered in our paper, it is required that the aspect ratio of a text line is not <2.1. the next task is to decide where the text region is located. the left and right sides of the quadrilateral text region are calculated first: a text line consists of a series of fine-scale text proposals, which are all small, vertical rectangular boxes, so the left boundary of the constructed text line is the left-most boundary of these rectangular boxes, and the right boundary is the right-most boundary of all the boxes that make up the text line. for the other two boundaries, we use linear regression. linear regression is commonly used to fit a regression curve that represents a sequence of discrete points as precisely as possible, that is, with as little error as possible. based on the statement above, we use the least-squares approach to fit regression lines that serve as the top and bottom limits of the text region. the text area is completed after this step. moreover, from all the text areas built as candidate text line blocks of the original image, we then keep all text areas with a score of at least 0.9.

fig. 4. loss function of the regressor.
fig. 5. (a) results without non-maximum suppression. (b) results with non-maximum suppression.

3.3. implementation details
the model is trained for final text detection with standard backpropagation and stochastic gradient descent. our training samples were obtained from the icdar 2014 multilingual text detection competition and the icdar 2012 text localization competition. before training, all training images are rescaled so that the longer side is not larger than 1200 pixels and the shorter side is not larger than 600 pixels. we did not use the full icdar 2014 training data. we initialize the new convolutional layers from a gaussian distribution with zero mean and 0.01 standard deviation, while the backbone uses the vgg-16 model pre-trained on imagenet. we use a momentum of 0.9 and a weight decay of 0.0005, and the learning rate is set to 0.00001 over 50 k training iterations. moreover, our model was implemented in the tensorflow framework.
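before moving to the results, the text line construction step of section 3.2 can be sketched in python as follows; this is a minimal illustration that assumes the fine-scale proposals have already been grouped into one candidate line and nms has been applied, and the threshold value and function names are assumptions rather than the paper's exact procedure.

```python
import numpy as np

def build_text_line(proposals, scores, keep_thresh=0.7):
    """group fine-scale proposals (each an (x1, y1, x2, y2) box of fixed 16-px width)
    into one quadrilateral text line, following the rules of section 3.2.
    assumes at least two proposals survive the score filtering."""
    boxes = np.asarray(proposals, float)
    boxes = boxes[np.asarray(scores) >= keep_thresh]       # drop low-scoring proposals

    # left / right borders: outermost borders of the member boxes
    left, right = boxes[:, 0].min(), boxes[:, 2].max()

    # top / bottom borders: least-squares regression lines fitted to the
    # top and bottom edges of the member boxes at their horizontal centres
    cx = (boxes[:, 0] + boxes[:, 2]) / 2.0
    k_top, b_top = np.polyfit(cx, boxes[:, 1], 1)           # y_top = k_top * x + b_top
    k_bot, b_bot = np.polyfit(cx, boxes[:, 3], 1)           # y_bot = k_bot * x + b_bot

    # quadrilateral corners of the constructed text line (clockwise from top-left)
    return [(left,  k_top * left  + b_top), (right, k_top * right + b_top),
            (right, k_bot * right + b_bot), (left,  k_bot * left  + b_bot)]
```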
4. results
we test our model on both icdar 2012 and icdar 2014 and compare it with the results of previous research; on both benchmarks our model achieves better results, as detailed below.

4.1. icdar 2012 experiments
the icdar 2012 dataset consists of 230 training images and 252 test images in its original version, mostly derived from born-digital images and real-scene photographs. in this article, we evaluate our model on the 2014 revision of this dataset, which was updated in 2014 to improve the ground-truth margins of the initial version. in this setting, our model reaches an f-measure of 0.86, which is a higher than expected result. as table 1 reveals, our approach has clear advantages over the others: our model supports a more reliable text line identification that improves performance on both icdar 2012 and icdar 2014. the results show that our proposed method and model achieved better recall and precision than the previous models.

table 1: experiments results of the icdar 2012
method | recall | precision
yao | 0.75 | 0.73
textflow | 0.76 | 0.78
ching | 0.71 | 0.83
huang | 0.66 | 0.84
lai | 0.68 | 0.81
the proposed method | 0.89 | 0.86
icdar: international conference on document analysis and recognition

4.2. icdar 2014 experiments
the icdar 2014 dataset has 242 training pictures and 251 evaluation pictures, mainly real-scene photographs, similar to icdar 2012. in table 2, we compare the performance of our model with other published results. the icdar 2014 benchmark was developed for the detection of near-horizontal text and cannot completely cover the actual text in slanted text images; even in this case, our model performs well, leading to improved results on inclined text detection tasks.

table 2: experiments results of the icdar 2014
method | recall | precision
yon | 0.61 | 0.81
fatext | 0.73 | 0.80
sdd | 0.66 | 0.84
neumen | 0.70 | 0.82
the proposed method | 0.79 | 0.88
icdar: international conference on document analysis and recognition

5. conclusion
we have presented an effective and reliable model for text detection that predicts bounding boxes of text lines in the original picture. using faster r-cnn with our anchor mechanism, we predict a series of fine text proposals. in our experiments, we use a very deep network to process the image and obtain its deep features, which boosts the prediction performance. then, from the fine-scale text proposals, we proposed a new text line construction algorithm with the aid of a linear regression method. these main strategies give our model a valuable ability to identify text lines, and it produced excellent results on both icdar 2012 and icdar 2014.

references
[1] w. tao, d. j. wu, a. coates and a. y. ng. “end-to-end text detection with convolutional neural networks”. pattern detection (icpr), 2012 21st international conference on ieee, 2012.
[2] j. max, a. vedaldi and a. zisserman. “deep features for text spotting”. in: european conference on computer vision. springer, cham, 2014.
[3] n. lukáš and j. matas. “efficient scene text localization and detection with local character refinement”. document analysis and detection (icdar), 2015 13th international conference on ieee, 2015.
[4] m. rodrigo, n. thome, m. cord, j. fabrizio and b. marcotegui. “snoopertext: a multiresolution system for text detection in complex visual scenes”. image processing (icip), 2010 17th ieee international conference on ieee, 2010.
[5] k. dimosthenis, f. shafait, s. uchida, m. iwamura, l. g. bigorda, s. r. mestre, j. mas, d. f. mota, j. a. almazan and l. p. de las heras. “icdar 2013 robust reading competition”.
document analysis and detection (icdar), 2013 12th international conference on ieee, 2013. [6] h. weilin, z. lin, j. yang and j. wang. “text localization in natural images using stroke feature transform and text covariance descriptors”. computer vision (iccv), 2013 ieee international conference on ieee, sydney, nsw, australia, 2013. [7] w. huang, y. qiao, and x. tang. “robust scene text detection with convolutional neural networks induced mser trees”. vol. 1. european conference on computer vision (eccv), 2014. [8] y. xu-cheng, x. yin, k. huang and h. w. hao. “robust text detection in natural scene images”. ieee transactions on pattern analysis and machine intelligence, vol. 36, no. 5, pp. 970-983, 2014. [9] e. boris, e. ofek and y. wexler. “detecting text in natural scenes with stroke width transform”. computer vision and pattern detection (cvpr), 2010 ieee conference on ieee, 2010. [10] t. zhi, w. huang, t. he, p. he and y. qiao. “detecting text in natural image with connectionist text proposal network”. in: european conference on computer vision. springer, cham, 2016. [11] t. shangxuan, y. pan, c. huang, s. lu, k. yu and c. l. tan. “text flow: a unified text detection system in natural scene images”. proceedings of the ieee international conference on computer vision, 2015. [12] z. zheng, c. zhang, w. shen, c. yao, w. liu and x. bai. “multioriented text detection with fully convolutional networks”. arxiv, 2016. [13] n. alexander and l. van gool. “efficient non-maximum suppression”. vol. 3. pattern detection. 18th international conference on ieee, 2006. [14] y. cong, x. bai1, w. liu, y. ma and z. tu. “detecting texts of arbitrary orientations in natural images”. computer vision and pattern detection (cvpr), 2012 ieee conference on ieee, 2012. [15] h. pan, w. huang, y. qiao, c. c. loy and x. tang. “reading scene text in deep convolutional sequences”. proceedings of the thirtieth aaai conference on artificial intelligence (aaai-16), 2016. [16] h. tong, w. huang, y. qiao and j. yao. “accurate text localization in natural image with cascaded convolutional text network”. arxiv, 2016. [17] l. minghui, b. shi, x. bai, x. wang and w. liu. “textboxes: a fast text detector with a single deep neural network”. proceedings of the thirtieth aaai conference on artificial intelligence, 2017. [18] r. shaoqing, k. he, r. girshick and j. sun. “faster r-cnn: towards real-time object detection with region proposal networks”. ieee transactions on pattern analysis and machine intelligence, vol. 39, no. 6, pp. 1137-1149, 2017. . uhd journal of science and technology | jul 2019 | vol 3 | issue 2 87 1. introduction there has been a growing interest in the study of medicinal plants and their traditional use in different parts of the world over the past few decades as this can lead to new discoveries about plant agents. therefore, traditional medicine remains as an integral part in the most arabic countries and the flora of these countries with their use somehow is similar [1]. hypericum species are herbaceous plants known to have medicinal properties and are widely used in phototherapy in many countries. the hypericum genus is known to contain a wealth of secondary metabolites and many of which are biologically active. the main constituents are naphthodianthrones (hypericin, pseudohypericin, protohypericin, and protopseudohypericin), phloroglucinols (hyperforin, adhyperforin, hyperfirin, and adhyperfirin), and a broad range of flavonoids (hyperoside and rutin) [2]. 
it has been used in traditional arab herbal medicine to treat various inflammatory diseases and as sedative, astringent, antispasmodic, for intestine and bile disorders and poisonous antioxidant, antiviral, antimicrobial, and antinociceptive activities have also been reported in the literature for hypericum triquetrifolium [3]. along with the medicinal plants, iraq also has a rich and diverse flora including a wide variety of plants with the potential to cause animal and human poisoning [2]. animal-plant poisoning is usually accidental and occurs most often due to unfavorable conditions when pasture is poor in drought, overstocking, and trampling of grazing. plant poisons consist of toxic compounds that can actually be fatal even in small doses while others may cause a protective effect of eruca sativa leaves extract on sperm abnormalities in mice exposed to hypericum triquetrifolium aqueous crude extract chro ghafoor raouf, mahmood othman ahmad department of biology, college of science, university of sulaimani, sulaimani, kurdistan, iraq a b s t r a c t this study was carried out to evaluate the effect of eruca sativa on the cytotoxic effect of hypericum triquetrifolium on sperm abnormalities in albino mice. leaves of e. sativa and aerial parts of h. triquetrifolium were dried in shade and grinded and their aqueous extracts were used for the treatments to study their effect on sperm morphology. treated groups were injected with a single dose of 38 mg/kg body weight (bw) of hypericum subcutaneously, while the eruca groups were orally administered with 250 mg/kg bw twice/week for 2 weeks. after the exposure of h. triquetrifolium, the frequency of abnormal sperms showed a highly significant induction of sperm abnormalities; separated head from tail sperms, swollen head, hookless, defective head, and hook. however, the eruca group showed no obvious abnormalities in sperm morphology, while in cotreatment with both eruca and hypericum (h/e) groups, there was an extremely significant decrease (p < 0.0001) in the abnormal sperms. in conclusion, it appeared that e. sativa could prevent or at least minimize the damages that hypericum toxins would make on the sperm morphology significantly. index terms: eruca sativa, hypericum triquetrifolium, sperm morphology access this article online doi: 10.21928/uhdjst.v3n2y2019.pp87-92 e-issn: 2521-4217 p-issn: 2521-4209 copyright © 2019 raouf and ahmad. this is an open access article distributed under the creative commons attribution non-commercial no derivatives license 4.0 (cc by-nc-nd 4.0) re v i e w a r t i c l e uhd journal of science and technology corresponding author’s e-mail:  mahmood.ahmed@univsul.edu.iq received: 04-09-2019 accepted: 12-11-2019 published: 14-11-2019 chro ghafoor raouf and mahmood othman ahmad: effect of eruca on sperm morphology 88 uhd journal of science and technology | jul 2019 | vol 3 | issue 2 reduction in performance as weight loss, weakness, diarrhea, or rapid pulse rate [3]. as previous studies have shown, hypericum is one of the most common poisonous plants in iraq and the genus has more than 400 species [3], but only sixteen is observed in iraq, and the most abundant types are h. triquetrifolium and hypericum perforatum [4]. h. triquetrifolium turra are a perennial herbaceous plant and one of the iraqi wild species of hypericaceae distributed in the north and northwest of the country, the local arabic name of the species is roja and the kurdish name is swrnatik [5]. 
it contains a mixture of poisonous pigments referred to as hypericin that is able to cause many deleterious effects in livestock including hyperthermia and acute photodermatitis when consumed by grazing animals [2]. animals that are most likely to be poisoned are sheep, goats, horses, cattle, and swine. symptoms that result of poisoning are first to appear in unpigmented or lightly pigmented areas of skin that damaged and may become necrotic, and never recover and regrowth of hair in those areas are uncommon [6]. then, lactating animals suffering from decline or shutoff milk completely, even in severe case animals loss appetites and die from starvation and dehydration [7]. many researchers conducted to review the effect of h. perforatum and compared its toxicity to h. triquetrifolium in each rabbit and sheep [8]-[10]. however, an investigation accomplished in 2010 in iraq/kurdistan, studied the cytotoxic and genotoxic effect of h. triquetrifolium tested on male albino mice to come up the result with that, indeed in a particular dose, hypericum is noxious to both sperm cells and chromosomes that influence sperm morphology and chromosomal aberrations [11]. a similar study performed in 2012 in iraq/kurdistan, using different doses and duration revealing the same result of cytotoxicity and genotoxicity of h. triquetrifolium on albino mice [12]. however, there is no study shows an effect of a medicinal plant in animals that have been exposed to h. triquetrifolium as a common toxic plant on sperm morphology; thus, this study is aimed to achieve exactly that goal to compare both results before and after the treating with both plants, toxic, and medicinal. eruca sativa known as jarjir, rocket, or arugula plant belongs to the brassicaceae family, is a minor oil crop and medicinal plant in several parts of middle east since ancient it has been used in traditional medications as remedies for different diseases [13]. hence, phytochemical composition and corresponding biological activities are crucial to apprehending the therapeutic potential of medicinal plants. numerous studies infer that pharmacological action of any medicinal plants is attributed to the presence of secondary metabolites; these generally consist of the phenolic compounds, alkaloids, tannins, saponins, carbohydrates, glycosides, flavonoids, steroids, etc. among the others, phenolic compounds are the universally discovered phytochemicals for the sake of therapeutic potential in a different medicinal plant [14]. all of these secondary metabolites and particularly phenolic compounds have been reported as scavengers of free radicals and also have been considered as good therapeutic candidates for free radical related pathologies as it has been reported that e. sativa seed extracts are potent antioxidants, exhibit diuretic effects, and provide renal protection [14]. the previous phytochemical studies of e. sativa showed that leaves and seeds contain glucosinolates. three new quercetins have been isolated and identified from e. sativa leaves [15]. however, there is not sufficient information in the form of scientific analysis about detailed phytochemical composition of e. sativa and their respective bioactivities [16]. 
on the other hand, many studies revealed the intense aphrodisiac effect of jarjir since ancient roman times [17], [18], in a way that the seed oil enhance increasing fertility and sexual activity through dilation of seminiferous tubules, proliferation of spermatogenic cells, increasing mitotic activity, number of sperms, epididymis weight, elevating level of testosterone, and hyperplasia of interstitial leydig cells have also been noticed [19]. in addition, barillari et al. [20] proposed that the presence of saponin and alkaloid extract is responsible for increasing sperm activity. 2. materials and methods the aerial parts of h. triquetrifolium were harvested by hand at moghagh/piramagrwn/sulaimanya during the early stage of flowering time in june since the plant shows its most toxicity at this stage while e. sativa was obtained from a local market in sulaimanya. the same procedure was used for both plants, as they aqueous extracted. the plants were air-dried indoors at room temperature for about 1 week, then ground to obtain a powder using an electric grinder. the powder suspended in distilled water for about 24 h at the rate 50 g/400 ml, and then the solution was filtered twice using whatman filter paper. the final crude extract was dried using an oven at 42°c temperature for about 12 h to obtain powder crude extract, then kept in dark bottles at 4°c until preparing the treatment doses [10]. chro ghafoor raouf and mahmood othman ahmad: effect of eruca on sperm morphology uhd journal of science and technology | jul 2019 | vol 3 | issue 2 89 2.1. solution preparation the powder of the plants crude extracts was used for preparing the solution as follows: • −35 mg/kg body weight (bw) of h. triquetrifolium were mixed with 1 ml of distilled water shaking until dissolving the powder completely and injected subcutaneously to the mice • −250 mg/kg bw of e. sativa powder was mixed with 1 ml of distilled water that administered orally using gavage needle. 2.2. experimental design twenty-five albino male mice were divided into five groups designated as c, t1, t2, t3, and t4. each group consisted of 5 mice and subjected to the following treatments: control: mice were treated with 1 ml of distilled water. • t1: mice were treated with a single dose of 1 ml hypericum extract subcutaneously at the dose 38 mg/kg bw • t2: mice were orally treated with 1 ml eruca extract using gavage needle at the dose 250 mg/kg bw twice a week for 2 weeks • t3: mice were treated with a single dose of 1 ml hypericum extract then injected with 1 ml of eruca by oral administration twice a week for 2 weeks • t4: mice were treated with 1 ml of eruca • then, injected with 1 ml of hypericum. 2.3. collection and preparation of sperms after sacrificing the animals with cervical dislocation, the sperm morphology of the treated mice was examined 2 weeks after the first treatment. vas deference of each mouse of the five groups was removed from the testes and put into a small petri dish filled with normal saline. using a scissor and disposable blades, vas deference was cut from the testes and sperm were transferred (semen extracted) on to a clean slide, preparing a smear for each. the smear was stained with hematoxylin for 15 min and washed, then stained with eosin and washed again. at the final step, the slides left to dry, then the results were read, counting about 100 sperm from each slide/animal (500 sperm for each treatment) to determine sperm morphology abnormalities [11]. 2.4. 
statistical analysis the values of the investigated parameters were analyzed using a statistic program graphpad (prism 2019). the experimental results were expressed as mean ± standard error of the mean. groups were compared by analysis of variance using one-way, two-way anova, and dunnett’s test for multiple comparisons test. p < 0.05 was regarded as statistically significant. 3. results and discussion table 1 summarized the results of sperm morphology observations among treated groups and the control group. the data obtained from 100 sperm/replication (500 sperm/ group) in semen samples collected from vas deferens of each mouse. the data show that there is a significant difference between control and hypericum treated group (t1). the most frequent aberrant shapes were sperm without a head, sperm without a tail, swollen head, hookless, defective head, and defective hook. while in the eruca treated groups (t2), there was no significant difference in the aberrant types comparing to control group. however, in the creating groups, (groups treated with both eruca and hypericum), in compare with t1 (hypericum treated group) and the results exhibited that all forms of sperm abnormalities were significantly lowered in t3 and t4 comparing to t1, and the total number of normal sperm increased significantly. while t3 and t4 comparing to each other were not significantly different, mentioning that the (hypericum-eruca) treated groups (t3 and t4), differed in the subsequent of the treatments, wherein t3 the injection started with hypericum and ended with eruca while in t4 the reverse was applied; hence, the results showed no significant difference in none of the parameters comparing to one another. fig. 1 showed the aberrations resulted in this study. the whole process of developing spermatogonia to haploid spermatids takes around 35 days in mice [13], [14] through a complex process is known as spermatogenesis. three crucial events are the major steps of this process: (i) mitosis which is the multiplication of spermatogonia; (ii) reducing chromosome number from diploid to haploid by meiosis and begins with the entry of type b spermatogonia into the prophase of the first meiotic division. these cells now called primary spermatocytes, divide to form secondary spermatocytes and next to form round spermatids; and (iii) the successful transformation of the round spermatid into the complex structure of the spermatozoon is called spermiogenesis. each of these steps represents a key element in the spermatogenic process, defects in any of them can fail in the entire process and lead to the production of defective spermatozoa or reduction in sperm production [17]. as previous researchers have confirmed its genotoxic effect on sperms and the male reproductive system in general, hypericum toxicity attributed to the active principles chro ghafoor raouf and mahmood othman ahmad: effect of eruca on sperm morphology 90 uhd journal of science and technology | jul 2019 | vol 3 | issue 2 (hypericin, hyperforin, quercetin, and flavonoids) [18]. this study attempted to repair or reduce its effect on germ cells. it is believed that any abnormalities in sperm morpholog y may be due to a change in the genetic component and these are classified as defects in the head, midpiece, and tail. the results in this study agreed with mohammed and kheravii [19] as they stated that the frequency of abnormal mice sperms examined for 35 days injected with h. 
triquetrifolium, but using different doses, revealed a significant induction of sperm abnormalities at all concentrations of the plant extract compared with untreated animals. the most frequent types of sperm abnormalities in the treated groups were irregular head defect, pseudodroplet defect, bent midpiece defect, and corkscrew midpiece defect. mohammed and mohammed [19] illustrated that the mixture of compounds found in the aqueous extract of hypericum caused cytotoxicity and induced different cytogenetic effects in both somatic and germ cells of male albino mice. in line with this conclusion, mohammed and ali [21] also verified that at a dose of 17 mg/kg bw the total number of abnormal spermatozoa increased compared with the control group, with the most frequent aberrations being bent midpiece and coiled tail, as sperm morphology has always been an indicator of toxicity and mutagenicity in mammals [21]. the results of this study revealed that eruca markedly reduced the toxicological effects of hypericum on sperm morphology, as is obvious in the t3 and t4 groups. this is in agreement with many studies; for example, [22] reported that treating rats exposed to cadmium chloride with e. sativa seed extracts improved the hormonal profile (serum testosterone, follicle-stimulating hormone, and luteinizing hormone (lh)) and increased the number of leydig cells, in parallel with alleviating the testicular toxicity induced by cadmium chloride [22]. they attributed this result to the antioxidant and free-radical scavenging activity [20] of many phytochemical compounds, including glucosinolates and flavonoids [23], against the oxidative damage induced by cd, thereby improving the pituitary-testes axis and fertility [24], [12]. e. sativa seed extract exhibited evidence of a stimulatory effect on the reproductive gonadal system through androgenic activities by increasing the number of leydig cells, which could be due to its free-radical scavenging ability through excluding fe3+ [15], or may be due to an increase in the number and/or the sensitivity of the receptors of the leydig cells to lh, which leads to increased testosterone biosynthesis [16].

table 1: cytogenetic effect of eruca sativa and hypericum triquetrifolium with their interaction on sperm
(values are mean ± standard error of the mean; all columns except "normal sperm" are the aberrant sperm categories)
treatments | normal sperm | sperm without tail | sperm without head | defective head | hookless | defective hook | swollen head | two tail
control | 87.00±3.162a | 3.200±0.4899a | 1.600±0.4000a | 0.2000±0.2000a | 0.4000±0.2449a | 0.200±0.200a | 0.4000±0.2449a | 0.00±0.00a
hypericum (t1) | 57.00±1.449b | 5.200±0.8602a | 9.400±2.293b | 4.800±0.7348b | 4.400±1.122b | 2.600±0.6782b | 4.800±1.393b | 0.6000±0.4000a
eruca (t2) | 90.80±1.241a | 3.600±0.6000a | 3.2000±0.5477a | 1.200±0.3742a | 1.000±0.4472a | 0.00±0.00a | 0.2000±0.2000a | 0.000±0.000a
h/e (t3) | 92.80±1.497a | 4.800±1.068a | 0.8000±0.5831a | 1.000±0.4472a | 0.4000±0.2449a | 0.000±0.000a | 0.2000±0.2000a | 0.000±0.000a
e/h (t4) | 93.60±0.509a | 2.600±0.5099a | 0.6000±0.4000a | 0.4000±0.2449a | 0.6000±0.2449a | 0.600±0.400a | 0.000±0.000a | 0.000±0.000a
using dunnett test analysis, we compared the morphological abnormalities that have been observed by microscopic examination among the treatment groups.
e: eruca, h: hypericum; the letters a and b represent a significant difference (p < 0.05); the same letter indicates no significant difference, and different letters indicate a significant difference.

these findings agree with the research of mona and nehal [25] and of [26], which concluded that e. sativa is capable of improving healthy sperm characteristics and fertility. the increase in abnormal sperm morphology and the decrease in viability may serve as a useful indicator of potential damage to the sperm caused by intubation of h2o2 [25], or may be due to impaired leydig cell functions that can lead to an altered testosterone synthesis [12], [27]. thus, in the present study, an improvement in male spermatogenesis in the cotreatment groups has been documented, and we suggest that eruca protects against and reduces the cytotoxic effects of hypericum.

4. conclusion
we concluded that eruca could significantly lower most of the sperm abnormalities and largely prevent the action of the hypericum toxins.

fig. 1. sperm abnormality types, depending on their morphology (×1000).

references
[1] n. k. b. robson. “studies in the genus hypericum l. (clusiaceae). 1. section 9. hypericum sensulato (part3): subsection 1. hypericum series 2. senanensia, subsection 2. erecta and section 9b.” graveolentia. systemic biodiversity, vol. 4, pp. 19-98, 2006.
[2] c. a. bourke. “sunlight associated hyperthermia as a consistent and rapidly developing clinical sign in sheep intoxicated by st john’s wort (hypericum perforatum).” australian veterinary journal, vol. 78, no. 7, pp. 483-488, 2000.
[3] h. j. jafar, m. j. mahmood, a. m. jawad, a. naji and a. al-naib. “phytochemical and biological screening of some iraqi plants.” fitoterapialix, vol. 18, p. 299, 1983.
[4] j. a. h. al-mukhtar. “hypericum plant. directorate plant.” bulletin no. 231. ministry of agriculture and agrarian reform, iraq, pp. 2-15, 1975.
[5] h. l. al-rawi. “medicinal plants of iraq”. chakravarty, ministry of agriculture and irrigation, state board for agricultural and water resources research, national herbarium of iraq, baghdad, vol. 54, pp. 67-78, 1988.
[6] g. mradu, s. saumyakanti, m. sohini and m. arup. “hplc profiles of standard phenolic compounds present in medicinal plants”. international journal of pharmacognosy and phytochemical research, vol. 4, pp. 162-167, 2012.
[7] s. alqasoumi, m. al-sohaibani, t. al-howiriny, m. al-yahya and s. rafatullah. “rocket “eruca sativa”: a salad herb with potential gastric anti-ulcer activity.” world journal of gastroenterology, vol. 15, pp. 1958-1965, 2009.
[8] m. i. a. al-farwachii. “aran (hypericum crispum) poisoning and its effect on the phagocytosis and antibodies in rabbits.” m.sc. thesis, university of mosul (arabic), 1997.
[9] m. d. kako, i. i. al-sultan and a. n. saleem. “studies of sheep experimentally poisoned with hypericum perforatum.” veterinary and human toxicology, vol. 35, pp. 298-300, 1993.
[10] a. pandey and s. tripathi. “concept of standardization, extraction and pre phytochemical screening strategies for herbal drug.” journal of pharmacognosy and phytochemistry, vol. 2, no.
5, pp. 115-119, 2014. [11] a. k. sharma and a. sharma. “chromosome techniques”. 3rd ed. butter worth, london, p. 486, 1980. [12] m. n. ansari, m. a. ganaie, t. h. khan and g. a. soliman. “protective role of eruca sativa extract against testicular damage in streptozotocin-diabetic rats.” international journal of biology, pharmacy and allied sciences, vol. 3, no. 7, pp. 1067-1083, 2014. [13] e. f. oakberg. “differential spermatogonial stem-cell survival and mutation frequency.” mutation research/fundamental and molecular mechanisms of mutagenesis, vol. 50, no. 3, pp. 327-340, 1978. [14] d. m. de kretser, k. l. loveland, a. meinhardt, d. simorangkir and n. wreford. “spermatogenesis.” human reproduction, vol. 13 suppl 1, pp. 1-8, 1998. [15] m. emtenan, e. m. hanafi, r. m. hegazy and h. a. riad. “bioprotective effect of eruca sativa seed oil against the hazardous effect of aflatoxin b1 in male rabbits.” international journal of academic research, vol. 2, pp. 67-74, 2010. [16] m. koubaa, d. driss, f. bouaziz, r. e. ghorbel and s. e. chaabouni. “antioxidant and antimicrobial activities of solven extract obtained from rocket (eruca sativa l.) flowers.” free radicals and antioxidants, vol. 5, no. 1, pp. 29-34, 2015. [17] p. j. chenoweth. “genetic sperm defects.” theriogenology, vol. 64, no. 3, pp. 457-468, 2005. [18] r. r. ondrizek, p. j. chan, w. c. patton and a. king. “an alternative medicine study of herbal effects on the penetration of zona-free hamster oocytes and the integrity of sperm deoxyribonucleic acid.” fertility and sterility, vol. 71, no. 3, pp. 517-522, 1999. [19] b. m. mohammed and s. k. kheravii. “evaluation of genotoxic potential of hypericum triquetrifolium extract in somatic and germ cells of male albino mice.” research opinions in animal and veterinary sciences, vol. 1, no. 4, 231-239, 2011. [20] j. barillari, d. canistro, m. paolini, f. ferroni, g. f. pedulli, r. iori and l. valgimigli. “direct antioxidant activity of purified glucoerucin, the dietary secondary metabolite contained in rocket (eruca sativa mill.) seeds and sprouts.” journal of agricultural and food chemistry, vol. 53, no. 7, pp. 2475-2482, 2005. [21] b. m. a. mohammed and j. a. h. ali. “synergistic effects of bleomycin and hypericum triquetrifolium extract on bone marrow and germ cells of albino mice.” journal of global pharma technology, vol. 10, no. 5, pp. 203-212, 2009. [22] b. n. al-okaily and z. m. al-shammari. “the impact of eruca sativa seeds on leydig’s cells number and hormonal profile in cadmium exposed rats.” kufa journal for veterinary medical sciences, vol. 7, no. 2, pp. 241-253, 2016. [23] j. jin, o. a. koroleva, t. gibson, j. swanston, j. magan, y. zhang, i. r. rowland and c. wagstaff. “analysis of phytochemical composition and chemoprotective capacity of rocket (eruca sativa and diplotaxis tenuifolia) leafy salad following cultivation in different environments.” journal of agricultural and food chemistry, vol. 57, no. 12, pp. 5227-5234, 2009. [24] a. j. nwofal. “the role of ethanolic extract of salad rockets (eruca sativa) leaves on the performance of male reproductive system in oxidative stressed rats.” phd thesis/college of veterinary medicine university, baghdad, 2014. [25] a. r. s. mona and a. m. nehal. “histological and quantitative study of the effect of eruca sativa seed oil on the testis of albino rat.” the egyptian journal of hospital medicine, vol. 2, pp. 148-162, 2001. [26] d. ates and o. erdogrul. 
“antimicrobial activates of various medicinal and commercial plant extracts.” turkish journal of biology, vol. 27, pp. 157-162, 2003. [27] z. f. hussein. “study the effect of eruca sativa leaves extract on male fertility in albino mice.” al-nahrain journal of science, vol. 6, no. 1, pp. 143-146, 2013. tx_1~abs:at/tx_2:abs~at 96 uhd journal of science and technology | jan 2020 | vol 4 | issue 1 1. introduction in recent years, indoor localization became very popular due to the extensive range of applications [1]. global navigation satellite systems and global positioning systems are used to enable accurate location, but they failed in indoor environments because of the low received signal power and satellite visibility in such places, underwater, inside building, caves, and tunnels [2]. these technologies need an open environment to work properly [3]. signal strengths of wireless fidelity (wi-fi) apps to fingerprint the location can be used in wi-fi fingerprint-based localization systems, these signals are collected in the location and become the mainstream results for indoor localization. wi-fi fingerprint-based localization methods have two main phases offline fingerprinting phase and online localization phase fig. 1. the offline phase is utilized in building a wi-fi fingerprint map by a site survey and save it at a database, and the online localization phase is used to locate the mobile devices by examining the received wi-fi signals with the fingerprint map [4]. recently, users utilize mobile devices (e.g., smartphones) to access wi-fi networks in indoor environments (e.g., shopping malls). the investigation of indoor localization methods utilizing signals has increased widely [5]. moreover, these methods are profitable because it does not require extra tools. one of the best advantages of location fingerprinting is capable of taking the benefits of multipath and non-line of sight problems in an indoor environment, as they truly assistant rss to be distinct at dissimilar points of the area [3]. while there are several valuable features in fingerprint-based enabling accurate indoor localization using a machine learning algorithm haidar abdulrahman abbas1*, kayhan zrar ghafoor2 1department of computer¸ college of science, university of sulaimani, sulaymaniyah, iraq, 2department of software engineering, university of salahaddin, erbil, iraq a b s t r a c t in this paper, fingerprint referencing methods based on wireless fidelity wi-fi received signal strength (rss) have used for indoor positioning. more precisely, naïve bayes, decision tree (dt), and support vector machine (svm) one-to-one multi-classes and error-correcting-output-codes classifier are to enable accurate indoor positioning. then, normalization is used to reduce positioning error by reducing the fluctuation and diverse distribution of the rss values. different devices are used in this experiment; the training dataset is not included in the main dataset. nonetheless, the learned model by the svm algorithm cannot be affected by the elimination of train datasets of the test device. the efficiency of dt is lower than the other machine learning algorithms, because it performs by boolean function, and it provides the low accuracy of prediction for dataset than the algorithms. naïve bayes technique based on bayes theorem is better than dt and close to svm for positioning approves that 1–1.5 m positioning accuracy for indoor environments can be achieved by the proposed approach which is an excellent result than traditional protocol. 
index terms: received signal strength, wireless access points, wireless fidelity fingerprinting, indoor localization, decision tree, naïve bayes, support vector machine corresponding author’s e-mail: haidar abdulrahman abbas, department of computer¸ college of science, university of sulaimani, sulaymaniyah, iraq. e-mail: haidar.abbas@univsul.edu.iq received: 08-05-2020 accepted: 18-06-2020 published: 27-06-2020 o r i g i n a l r e s e a r c h a r t i c l e uhd journal of science and technology access this article online doi: 10.21928/uhdjst.v4n1y2020.pp96-102 e-issn: 2521-4217 p-issn: 2521-4209 copyright © 2020 abbas and ghafoor. this is an open access article distributed under the creative commons attribution non-commercial no derivatives license 4.0 (cc by-nc-nd 4.0) haidar abdulrahman abbas and kayhan zrar ghafoor: indoor localization using a machine learning algorithm uhd journal of science and technology | jan 2020 | vol 4 | issue 1 97 localization, building fingerprint landmarks for a huge area requires an important amount of time and human resources. database fingerprint can be altered by environmental influences, time, and different devices, so it is necessary to update frequently. important studies have been dedicated to the online localization phase; decimeter-level localization efficiency can be obtained by utilizing advanced algorithms which are used to collect online wi-fi signals with the fingerprint map [6]. this field of study has been concerned by the researcher in both industry and academia. to collect rrs and fingerprint to assessment, the target location machine learning algorithms can be used such as deep learning and k-nearest-neighbor (k-nn) [7]. in this study, fingerprint methods utilizing the wi-fi strength signal is presented for indoor positioning. to decrease the positioning errors, naive bayes, decision tree (dt), and support vector machine (svm) one-to-one multi-classes and error-correcting output codes (ecoc) classifiers are proposed and the contrast among these methods. normalization is used to reduce errors positioning in the values of rss because of instability and diverse distributions values of rss. different devices are used in this experiment when the train data set is not involved in the main dataset. nonetheless, the learned model by the svm algorithm cannot be affected by the elimination of train datasets of the test device. the efficiency of dt is lower than the other machine learning algorithms, because it performs by boolean function, and it provides the low accuracy of prediction for dataset than the other algorithms. naïve bayes techniques based on bayes theorem is better than dt and close to svm for positioning accuracy. svm error positioning approves that 1–1.5 m positioning accuracy for indoor environments can be achieved by the proposed approach which is an excellent result than traditional protocol. 2. related work recently with the development of computing and the popularity of location-based services, many types of research have considered the improvement of the indoor localization fig. 1. constructing the wireless fidelity fingerprinting map [7]. haidar abdulrahman abbas and kayhan zrar ghafoor: indoor localization using a machine learning algorithm 98 uhd journal of science and technology | jan 2020 | vol 4 | issue 1 system. some of these researches focus on the designing system for specific applications that requires a high efficiency (e.g., in the order of centimeters) [8]. 
normally, developing these systems need devoted hardware with a huge application cost. contrarily, several kinds of research have focused on general location-based services where the necessity of accuracy in the form of meters. wi-fi strength signal is used by the fingerprinting method for indoor positioning. to decrease the positioning errors, the improved form of nearest neighbor algorithm is suggested which is called nk-nn, multipath, and rss variations created the new form of nk-nn, which are utilized the basic knn and it is variant. in the rss testing sample, the noise can be removed by compared each testing sample to each fingerprint and based on the minimum distance, the sample is chosen for the position’s calculation. after that, the process of classification is operated on the kth-nearest training sample of diverse reference points which assistance to trim the noise of rss training and preventing them from the localization. in the experimental outcome, the nk-nn method has better performance than other similar methods [7]. other studies used convolutional neural network (cnn)based wi-fi fingerprinting for indoor localization. it can be seen that the achievement in image classifications, the suggested method can be potent to minor changes of received signals as it uses the radio map topology as well as signal strengths. in the suggested method, based on the onedimensional wi-fi signals, the two-dimensional (2-d) virtual radio map is built (e.g., received signal strength indicator values) and later a cnn utilizing 2-d radio map is designed as inputs. consequently, the proposed method is learned the signal strengths as well as the topology radio map. to enhance the efficiency of the suggested method utilizing different improving techniques as feature scaling, dropout, data balancing, and ensemble [6]. to enhance the accuracy of positioning systems, many approaches have been studies and focused on long short-term memory (lstm) networks. a deep neural network is utilized to improve the efficiency of positioning methods which is acceptable for handling sequential datasets. therefore, lstm modes are used because they can recognize the dependency of long-term that is existing in the wi-fi data that can be seen from the deep recurrent model’s performance. the architecture of rnn and lstm can recognize the dependency of long-term and utilizing them for later prediction. it is good to examine the previous landmark position to an exact estimate of points on the radio map. the main aim of implementing rnn is to guarantee for providing better performance of recurrent networks on the wi-fi dataset. vanilla lstm is the primary model that has a good enhancement by 47.8% over the knn and 10.2 enhancements over rrn utilizing the complete dataset. the efficiency of vanilla lstm is even developed after updated to 3-stacked lstm. the improvement of 3-stacked lstm is 74.4% over the knn and 18.1% over vanilla lstm [9]. there is a rich theoretical basis that is prepared by the statistical learning theory for developing the model, starting a set of examples. in a specific wi-fi, the wireless has a signal strength measurement for standard functioning mode so that no particular hardware is desired. svm is designed and compared to other approaches examined in the scientific literature on the equivalent data set. experiments executed in the real-world environment illustrate that the outcomes are comparable, with the benefits of low algorithms complication in the standard functioning phase. 
further more, the algorithm performed better than the other techniques which are mainly appropriate for classification [10]. 3. methodology the localization algorithm is fundamentally used for making rss related radio maps in a designated indoor environment as well as converting localization problem into an optimization problem: obtaining rss value measurements of an undisclosed location, the function can help in estimating the location when used in reverse order. while using a fingerprinting technique in the online phase, identical smartphones should be used in both buildings, the rss dataset as well as testing. using different smartphones would worsen the accuracy of the calculated position. to eliminate this problem, we propose a dt, naive base, and svm model or adapt the nature of the calculated rss values among multi-smartphones. the model is directed at various types of smartphone measurements by adopting a machine learning algorithm. first, gathered rss values at all the identified close positions are normalized, the normalization is achieved by subtracting the mean value of the gadgets engaged in training the model and then dividing the results by the standard deviation of the aforementioned gadgets. before normalizing the rss of wireless access points (ap), the error positioning high because of fluctuation and heterogeneous distribution of the rss values and applying normalization to decrease the variation of the value and rescale the rss value within a haidar abdulrahman abbas and kayhan zrar ghafoor: indoor localization using a machine learning algorithm uhd journal of science and technology | jan 2020 | vol 4 | issue 1 99 uniform distribution. rss vector will be filled with zeroes for those aps. the normalized rss values can then be applied to train and test the algorithms. 3.1. dt dts are a non-parametric method that belongs to the supervised learning algorithms family. it is for classification and regression [11]. in this algorithm, a binary dt is developed from the training data set. in the beginning, basic decision rules derived from the data features are learned. it operates in three nodes; root node, internal node, and a leaf or terminal node. the terminal node has a single receiving edge and zeroes an outgoing edge. the internal has two edges, one for incoming and one for outgoing, while the root node can have zero or more outgoing edges but does not have any incoming edges. each leaf node is given a class label. each node is related to a decision performed on the inputs. next, the node is split into new subsets, one for each of the node’s sub-trees, in such a way that the same target location is in the same subsets [11]. the algorithm halts upon finding a pure decision meaning each node’s data subset has a single target location and when uncertainty is inefficient. 3.2. naïve bayes the crux of this theorem is derived from the bayes theorem. according to the naïve bayes theorem, all features are independent of each other. while this assumption is usually not true in real-world applications, yet natives bayes have had positive results in certain scenarios, mostly when there is a small number of training samples [12]. with an end goal to accomplish high accuracy while decreasing pre-deployment trials, we select this strategy for processing the probabilities of the locations’ given measurements. the event with the most elevated likelihood is considered as the candidate. 
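as a concrete illustration of this highest-posterior selection, the following sketch (python, with hypothetical array names and toy numbers, not the authors' implementation) assumes the rss readings of each reference location follow independent gaussians and returns the location whose posterior is largest:

```python
import numpy as np

# a minimal sketch of highest-posterior location selection with a gaussian
# naive bayes model; names and numbers are illustrative only.

def fit_gaussian_nb(fingerprints):
    """fingerprints: dict mapping location label -> (n_samples, n_aps) rss array."""
    stats = {}
    for loc, rss in fingerprints.items():
        mu = rss.mean(axis=0)
        sigma = rss.std(axis=0) + 1e-6      # avoid zero variance
        stats[loc] = (mu, sigma)
    return stats

def log_likelihood(x, mu, sigma):
    # sum of per-access-point gaussian log densities (independence assumption)
    return np.sum(-0.5 * np.log(2 * np.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2))

def locate(stats, x):
    """return the reference location with the highest posterior for rss vector x
    (uniform prior over locations, so the posterior is proportional to the likelihood)."""
    scores = {loc: log_likelihood(x, mu, sigma) for loc, (mu, sigma) in stats.items()}
    return max(scores, key=scores.get)

# toy usage with made-up rss values
rng = np.random.default_rng(0)
fingerprints = {loc: rng.normal(mean, 3.0, size=(20, 4))
                for loc, mean in {"p1": -45.0, "p2": -60.0, "p3": -75.0}.items()}
model = fit_gaussian_nb(fingerprints)
print(locate(model, rng.normal(-60.0, 3.0, size=4)))   # most likely "p2"
```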
naive bayes classifier depends on two fundamental assumptions: (1) the features do not affect each other and (2) the prominence of all the features is equal [13]. 3.3. svm svms [14], [15] are non-parametric supervised learning models with related learning algorithms that analyze data used for pattern recognition problems. svms are applied in the localization system by training the support vectors on a radio map that consists of grid points. svms study the association between the trained fingerprints and their grid points by taking into account each grid point as a class. this method can be expanded to multiple class classification rather than just two classes. in our training dataset, we have 105 classes, so we used the ecoc one-to-one svm which is used to classification when classes are more than two after representing the training data by mapping the data to the feature space. the svm algorithms identify hyperplane, which separates the support vector trained with a distance [14]. 4. performance evaluation in this section, designing radio maps of rss in the studied indoor environment and positioning the difficulty as an optimization problem is presented which is the main idea behind the localization algorithm: providing rss value measurements of a new position, the function is reversed to determine the evaluated position. localization fingerprinting methods utilized two main phases (online positioning phase and offline training phase). afterward, the fingerprint is collected and builds our dataset. the dataset consists of the true location of pre-selected positions and equivalent rss of nearby ap [16]. the approach that we proposed illustrates the decision tree, naïve bayes, and svn model and compares those models, normalization was applied to fingerprint landmarks. 4.1. dataset and simulation setup our rss data are collected from kios research center which is a 560 m2 office environment. this center has many open cubicle-style and special offices, labs, and conference rooms. wireless lan standard has been used to install nine local apps and offer full coverage all over the floor. we utilize five diverse mobiles to collect our data including, hp ipaq hw6915 personal digital assistant with windows mobile, an asus eeepc t101mt laptop running windows 7, and htc flyer android tablet and two other android smartphones (htc desire, samsung nexus s). we use fingerprinting for our training data, documenting these fingerprints have rss measurement from entire existing aps, at 105 separate reference positions by carrying all five devices concurrently. we utilize each device to collect fingerprints, 2100 training fingerprints are available, equivalent to 20 fingerprints per reference position. for building device-specific radio maps, these data are utilized by computing the measure of mean values rss that analogous to each reference position. we indicate that the device-specific radio maps are only required to estimate purposes. after 2 weeks, we utilized a predefined router to gather more test data by walking forward to the router. the router contains two segments and 96 positions; most of them do not concur with the mention position. each router is tested 10 times using all devices concurrently, while one fingerprint was documented at all test positions [17]. matlab toolbox is used to estimate the performance of models. 
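to make this workflow concrete, the following sketch approximates it with scikit-learn in python rather than the matlab toolbox used in the paper; the synthetic fingerprints, the global (rather than per-device) normalization, the parameter values, and the use of scikit-learn's output-code meta-classifier to stand in for the ecoc svm are all assumptions for illustration:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.multiclass import OutputCodeClassifier
from sklearn.model_selection import train_test_split

# synthetic stand-in for the fingerprint set: 105 reference positions,
# 20 fingerprints per position, 9 access points (these numbers only
# mirror the shape of the kios data described above).
rng = np.random.default_rng(1)
centers = rng.uniform(-90.0, -30.0, size=(105, 9))
X = np.vstack([c + rng.normal(0.0, 2.5, size=(20, 9)) for c in centers])
y = np.repeat(np.arange(105), 20)

# z-score normalization (subtract mean, divide by standard deviation),
# computed globally for brevity; the paper normalizes per training device.
X = (X - X.mean()) / X.std()

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "naive bayes": GaussianNB(),
    # svc is one-versus-one for multi-class by default; the ecoc variant is
    # approximated here with scikit-learn's output-code meta-classifier.
    "svm (ecoc)": OutputCodeClassifier(SVC(kernel="rbf", C=10.0), code_size=2, random_state=0),
}
for name, clf in models.items():
    clf.fit(X_tr, y_tr)
    print(name, "classification accuracy:", round(clf.score(X_te, y_te), 3))
```

note that the printed score here is classification accuracy over reference positions; the tables below report positioning error in meters between the estimated and true locations.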
in the first scenario, we tested dt, naïve bayes, and svm as matching algorithms without applying the rss normalization (mean and standard deviation) parameters; when a device was tested, the training dataset of that device was excluded. the second scenario used the same machine learning algorithms with the rss normalization applied; as in the previous scenario, the training dataset of the tested device was excluded. 4.2. simulation results we used the root mean square error between the estimated and the true locations to evaluate the localization accuracy of the dt, naïve bayes, and svm algorithms, although there are many other ways to evaluate accuracy. the average positioning accuracy of the first scenario is shown in table 1, which covers all the devices tested. the dt is essentially represented by a boolean function and gives lower prediction accuracy on this dataset compared with the other machine learning algorithms. the svm has better positioning accuracy than the other algorithms (fig. 2). the positioning accuracy of all algorithms improves in the second scenario, after normalization is applied to the dataset, compared with before normalization; the difference is caused by the fluctuation and heterogeneous distribution of the raw rss values. the findings are listed in table 2. comparing the svm with dt and naïve bayes, we can see that the svm exhibits higher positioning accuracy. svm has elegant mathematics behind it and uses the kernel trick in the dual problem. naïve bayes ranks second in positioning accuracy in each scenario (fig. 3). phone 4's positioning accuracy is weaker than that of most phones; this is due to phone 4's rss values: phone 4 reads signal strengths from −11 to −90 db, and most readings are between −11 and −40 db, unlike the other phones. there is a big gap between the rss values of phone 4 and the other devices, so its positioning accuracy is worse and the error is large.
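for reference, the root mean square error between estimated and true locations used above can be computed with a small helper such as the following (the coordinates are made up, purely for illustration):

```python
import numpy as np

def positioning_rmse(true_xy, est_xy):
    """root mean square of the euclidean distances between true and estimated
    2-d positions; both inputs are (n_tests, 2) arrays of coordinates in meters."""
    d = np.linalg.norm(np.asarray(true_xy) - np.asarray(est_xy), axis=1)
    return float(np.sqrt(np.mean(d ** 2)))

# toy check with made-up coordinates
true_xy = [[0.0, 0.0], [5.0, 3.0], [10.0, 7.5]]
est_xy = [[1.0, 0.5], [4.0, 4.2], [11.5, 6.0]]
print(round(positioning_rmse(true_xy, est_xy), 2))   # ~1.65
```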
table 1: positioning accuracy of dt, naïve bayes, and svm algorithms when normalization is not applied

decision tree positioning accuracy when normalization not applied
          phone 1   phone 2   phone 3   phone 4   phone 5
phone 1   2.203     2.8795    2.5406    4.4574    2.4796
phone 2   3.4706    2.1509    2.7549    3.7918    2.9944
phone 3   2.8008    2.5236    1.9669    3.9633    2.3818
phone 4   8.4164    7.2392    7.9662    2.4191    8.0881
phone 5   2.5555    2.6016    2.479     5.1455    2.6042

naïve bayes positioning accuracy when normalization not applied
          phone 1   phone 2   phone 3   phone 4   phone 5
phone 1   1.8598    2.6695    2.0898    4.9985    1.8697
phone 2   2.7418    1.8852    2.1934    5.0275    2.4532
phone 3   1.7627    2.2526    1.5297    5.6689    1.8885
phone 4   7.4514    5.1601    6.2018    1.8415    6.4267
phone 5   2.0015    2.1527    1.9874    4.3407    1.7289

svm positioning accuracy when normalization not applied
          phone 1   phone 2   phone 3   phone 4   phone 5
phone 1   1.748     2.3896    1.9668    4.4945    2.0451
phone 2   2.3941    1.7739    1.9912    3.2393    2.083
phone 3   1.7156    2.3555    1.4941    4.344     1.9079
phone 4   5.0985    4.2347    4.6483    1.8507    4.7268
phone 5   1.791     2.0599    1.9353    4.1604    1.6278

dt: decision tree, svm: support vector machine

table 2: positioning accuracy of dt, naïve bayes, and svm algorithms when normalization is applied

decision tree positioning accuracy when normalization applied
          phone 1   phone 2   phone 3   phone 4   phone 5
phone 1   2.2143    2.6063    2.3258    2.5884    2.2332
phone 2   2.0923    2.1043    2.1062    2.3519    2.2207
phone 3   2.1431    2.2638    1.9752    2.4954    2.213
phone 4   2.2618    2.5557    2.4014    2.5941    2.4816
phone 5   2.2688    2.5763    2.3769    2.6545    2.5061

naïve bayes positioning accuracy when normalization applied
          phone 1   phone 2   phone 3   phone 4   phone 5
phone 1   1.9256    2.1813    1.787     2.3594    1.8646
phone 2   1.9088    1.8985    1.834     2.2155    1.9758
phone 3   1.7655    1.9025    1.5323    2.0569    1.7883
phone 4   1.8462    2.009     2.0197    1.8725    1.7213
phone 5   1.7523    1.9018    1.9706    2.0426    1.7213

svm positioning accuracy when normalization applied
          phone 1   phone 2   phone 3   phone 4   phone 5
phone 1   1.7554    1.982     1.8261    2.2321    1.9163
phone 2   1.7485    1.6998    1.6029    1.8001    1.7596
phone 3   1.4364    1.853     1.4621    2.2092    1.7774
phone 4   1.8266    1.9311    1.8866    1.8432    1.6975
phone 5   1.6622    1.963     2.0058    2.0025    1.6215

dt: decision tree, svm: support vector machine

fig. 2. decision tree, naïve bayes, support vector machine positioning accuracy when normalization not applied.

each of the algorithms has strengths and weaknesses. comparing dt with the other algorithms, it requires less data pre-processing and is not affected by missing values in the dataset, but any change in the dataset affects its structure, which leads to instability. nb needs a smaller amount of training data to evaluate the test data, and its implementation is easy; however, the main problem of nb is the independence assumption. the svm algorithm is more effective when the number of dimensions is greater than the number of samples, and it is comparatively memory efficient. svm also has disadvantages: it is not the best option for large datasets and does not perform well on datasets with a lot of noise. in general, our proposed approach has many advantages; it does not need extra hardware to be installed, and high performance was achieved. the disadvantage of our proposed concept is that it requires more computational work (especially for svm) compared with other systems. comparing our results with other works, this system achieves higher positioning accuracy, which is a very different outcome from the conventional protocol. 5.
conclusion in this research article, rss fingerprint-based wi-fi localization was assessed in regards to the in-operation infrastructure of an indoor environment. we review the modern resolutions for very accurate localization in indoor schemes. next, we outline the rise in positioning error when dissimilar platform-devices are used in the fingerprinting technique for training and testing the dataset. in addition, rss measurements produce different values for the same position and time when dissimilar platfor m-devices are used. we implement the most popular and reliable machine learning algorithms, namely, dts, naïve bayes, and svm learning algorithms. examine ensemble estimators that apply multiple algorithms to fig. 3. decision tree, naïve bayes, support vector machine positioning accuracy when normalization applied. estimate the position and then we choose a combination the leads to the most efficient performance. svm error positioning shows that 1–1.5 m positioning accuracy for indoor environments can be accomplished by the presented technique which is an obvious improvement compared to existing approaches. thus, fingerprinting localizations can utilize rss data to minimize the notable amount of time and energy. references [1] j. xiao, z. zhou, y. yi and l. m. ni. “a survey on wireless indoor localization from the device perspective,” acm computing surveys, vol. 49, no. 2, p. 2933232, 2016. [2] a. s. paul and e. a. wan. “rssi-based indoor localization and tracking using sigma-point kalman smoothers,” ieee journal of selected topics in signal processing, vol. 3, no. 5, pp. 860-873, 2009. [3] n. alikhani, s. amirinanloo, v. moghtadaiee and s. a. ghorashi. “fast fingerprinting based indoor localization by wi-fi signals,” 2017 7th international conference on computer and knowledge engineering, vol. 2017 janua, pp. 241-246, 2017. [4] s. dai, l. he and x. zhang. “autonomous wifi fingerprinting for indoor localization”. in: 2020 acm/ieee 11th international conference on cyber-physical systems (iccps), pp. 141-150, 2020. [5] f. li, m. liu, y. zhang and w. shen.“a two-level wifi fingerprintbased indoor localization method for dangerous area monitoring,” sensors (basel), vol. 19, no. 19, p. 4243, 2019. [6] j. w. jang and s. n. hong. “indoor localization with wifi fingerprinting using convolutional neural network,” international conference on ubiquitous and future networks, vol. 2018, pp. 753-758, 2018. [7] m. alfakih and m. keche. “an enhanced indoor positioning method based on wi-fi rss fingerprinting,” the journal of communications software and systems, vol. 15, no. 1, pp. 18-25, 2019. [8] c. chen, y. chen, y. han, h. q. lai and k. j. r. liu, “achieving centimeter-accuracy indoor localization on wifi platforms: a frequency hopping approach,” ieee internet of things journal, vol. 4, no. 1, pp. 111-121, 2017. [9] a. sahar and d. han. “an lstm-based indoor positioning method using wi-fi signals,” acm’s international conference proceedings, 2018. [10] m. brunato and r. battiti. “statistical learning theory for location fingerprinting in wireless lans,” computer networks, vol. 47, no. 6, pp. 825-845, 2005. [11] y. li. “predicting materials properties and behavior using classification and regression trees,” materials science and engineering a, vol. 433, no. 1-2, pp. 261-268, 2006. [12] n. gutierrez, c. belmonte, j. hanvey, r. espejo and z. dong. “indoor localization for mobile devices,” proceeding. 11th ieee international conference on sensing control, pp. 173-178, 2014. [13] z. wu, q. xu, j. 
li, c. fu, q. xuan and y. xiang. “passive indoor localization based on csi and naive bayes classification,” ieee transactions on systems, man, and cybernetics systems, vol. 48, no. 9, pp. 1566-1577, 2018. [14] b. schölkopf. “slides learning with kernels,” journal of the electrochemical society, vol. 129, p. 2865, 2002. haidar abdulrahman abbas and kayhan zrar ghafoor: indoor localization using a machine learning algorithm 102 uhd journal of science and technology | jan 2020 | vol 4 | issue 1 [15] t. joachims. “transductive inference for text classification using support vector machines,” proceeding 20th international conference on machine learning, 2000. [16] z. zhong, z. tang, x. li, t. yuan, y. yang, m. wei, y. zhang, r. sheng and n. grant. “xjtluindoorloc: a new fingerprinting database for indoor localization and trajectory estimation based on wi-fi rss and geomagnetic field,” proceeding 2018 6th internationl symposium computer netwwork, pp. 228-234, 2018. [17] a. h. salamah, m. tamazin, m. a. sharkas and m. khedr. “an enhanced wifi indoor localization system based on machine learning,” 2016 international conference indoor position indoor navigation, pp. 4-7, 2016. . uhd journal of science and technology | may 2018 | vol 2 | issue 2 7 1. introduction monitoring the mechanical and electrical dynamics of the heart is essential to fully characterize and understand cardiac functions [1], [2]. most of the attention has focused on the evaluation of the biophysical properties of the components of the heart through the use of conventional measurement approaches such as the ecg [3], [4]. the electrocardiogram (ecg) is usually used to obtain measurements for different cardiac parameters [5], [6]. it is usually used in a procedure that facilitates the recording of the electrical activity of the heart muscle during a specific time interval [7], [8]. in this procedure, several probes are placed in certain positions to define places in a bare chest [9], [10]. these probes generate electrical current as a result of measuring the electrical activity of each heartbeat on the surface of the chest [11], [12]. an ecg is widely used in medicine to control small electrical changes in the skin of a patient’s body that arise from the activities of the human heart [13], [14]. this simple and non-invasive measure easily indicates a variety of heart conditions [15], [16]. the medical industry builds dedicated device that helps with diagnosis [17], [18]. this device requires high-resolution oscilloscope to get the waveform on its screen [19], [20]. this approach tries to design an efficient ecg waveform classification based on the p-qrs-t wave recognition. this approach will be concentrated on recognizing the ecg waveform that related to heart functionality. 2. ecg signal an ecg signal consists of three main parts such as p-wave, qrs wave, and t-wave [21]. these waveforms are electrocardiogram waveform classification based on p-qrs-t wave recognition muzhir shaban al-ani department of information technology, college of science and technology, university of human development, sulaimani, kurdistan region, iraq a b s t r a c t electrocardiogram (ecg) is a periodic signal reflects the activity of the heart. ecg waveform is an important issue to define the heart function, so it is helpful to recognize the type of heart diseases. ecg graph generates a lot of information that is converted into an electrical signal with standard values of amplitude and duration. 
the main problem raised in this measurement is the mixing between normal and abnormal, in addition, sometimes, there are overlapping between the p-qrs-t waveform. this research aims to offer an efficient approach to measure all parts of p-qrs-t waveform in order to give a correct decision of heart functionality. the implemented approach including many steps as follows: preprocessing, baseline process, feature extraction, and diagnosis. the obtained result indicated an adequate recognition rate to verify the heart functionality. the proposed approach depends mainly on the classifier process that based mainly on the extracted ecg waveform features that achieved from exact baseline detection. index terms: electrocardiogram, electrocardiogram signal, feature extraction, qrs wave corresponding author’s e-mail: muzhir shaban al-ani, department of information technology, college of science and technology, university of human development, sulaimani, kurdistan region, iraq. e-mail: muzhir.al-ani@uhd.edu.iq received: 12-05-2018 accepted: 11-06-2018 published: 25-07-2018 access this article online doi: 10.21928/uhdjst.v2n2y2018.pp7-14 e-issn: 2521-4217 p-issn: 2521-4209 copyright © 2018 al-ani. this is an open access article distributed under the creative commons attribution non-commercial no derivatives license 4.0 (cc by-nc-nd 4.0) o r i g i n a l r e s e a r c h a r t i c l e uhd journal of science and technology muzhir shaban al-ani: ecg waveform classification based on p-qrs-t wave recognition 8 uhd journal of science and technology | may 2018 | vol 2 | issue 2 corresponding to the electrical activities of various parts of the human heart [22]. during the analysis of the ecg signal, the data includes the positions or magnitudes of the qrs, pr, qt and st intervals, pr and st segments, and so on [23]. ecg waveform can be divided into the following parts (fig. 1) [24]: • the p-wave is a small deflection wave that represents atrial depolarization [25]. • the pr interval is the time between the first deflection of the p-wave and the first deflection of the qrs complex [26]. • the three waves of the qrs complex represent ventricular depolarization [27]. • the small q-waves correspond to depolarization of the interventricular septum. • the r-wave reflects depolarization of the main mass of the ventricles. • the s-wave signifies the final depolarization of the ventricles • the st segment is the time between the end of the qrs complex and the start of the t-wave [28]. it reflects the period of zero potential between ventricular depolarization and repolarization [29]. • the t-waves represent ventricular repolarization [30]. the normal ecg waveform contains many standard signals p, r, q, and t waves [31], [32]. the amplitudes of these signals are shown in table 1 [33]. in addition, these waves have duration time as shown in table 2 [34]. 3. literature review many literatures are published related to the subject recognition of ecg waveform. below some of these updated works: almahamdy et al. described the ecg as an important tool for measuring health and disease detection. due to many sources of noise, this signal must be eliminated and presented in a clear waveform. the noise sources can be the interference of the electric line, the external electromagnetic fields, the random corporal movements, or the breathing. this research implemented five common noise elimination methods and applied to real ecg signals contaminated by different noise levels. 
these algorithms are as follows: discrete wavelet transformation (universal and local threshold), adaptive filters (lms and rls), and savitzky–golay filtering. its noise elimination performance was implemented, compared, and analyzed using matlab environment [35]. yapici et al. (2015) proposed electrode by immersing a nylon fabric in a reduced graphene oxide solution, followed by a subsequent heat treatment to allow the conformal coating of conductive graphene layers around the fabric. the application of the electrode has been demonstrated by successful measurements of the ecg. the performance of the textile-based electrodes was compared with conventional silver/silver chloride (ag/agcl) electrodes in terms of cutaneous electrode impedance, ecg signal quality, and noise levels. an excellent compliance and a 97% cross-correlation were obtained between the measured signals with the new graphene-coated textile electrodes and the conventional electrodes [36]. table 1 amplitude of waves of normal ecg wave type amplitude p-wave 0.25 mv r-wave 1.6 mv q-wave 25% of r wave t-wave 0.1–0.5 mv ecg: electrocardiogram table 2 timing values of waves of ecg wave interval amplitude pr wave 0.12–0.2 s qt wave 0.35–0.44 s st wave 0.05–0.15 s p-wave 0.11 s ecg: electrocardiogram fig. 1. electrocardiogram wave. muzhir shaban al-ani: ecg waveform classification based on p-qrs-t wave recognition uhd journal of science and technology | may 2018 | vol 2 | issue 2 9 wang et al. designed an ecg noise elimination method based on adaptive fourier decomposition (afd). the afd decomposes a signal according to its energy distribution, which makes this algorithm capable of separating the pure ecg signal and noise with overlapping frequency ranges but different energy distributions. a stopping criterion for the iterative decomposition process in the afd is calculated based on the estimated signal-to-noise ratio of the noisy signal. the proposed afd method is validated with the synthetic ecg signal using an ecg model, as well as the actual ecg signals from the mit-bih arrhythmia database with additive white gaussian noise. results of the simulation of the proposed method showed a better performance in qrs detection and noising compared to the main ecg schemes of noise elimination based on the wavelet transform, transform the empirical mode of stockwell decomposition, and empirical decomposition mode [37]. zou et al. (2017) performed a method to detect the entire qrs complex and eliminates noise between two qrs complexes while recovering the p and t waves. as verified in the simulated noise ecg signal tests, the qrsmr outputs with severely contaminated ecg signals have an increase in the correlation with its original cleaning signals from 40% to almost 80%, demonstrating the improved qrsmr noise elimination capability. in addition, in the tests of the real ecg signals measured in volunteers with a flexible ecg control device developed at fudan university, qrsmr is able to recover p and t waves from the contaminated signal, which shows its improved performance in the reduction of artifacts comparing with the adaptive filtering method and other methods based only on empirical decomposition [38]. yu et al. introduced a method called peak-to-peak entropy, the entropy of the r-r interval, correlation coefficient, and heart rate (prch) for automatic identification. 
this method defines four types of characteristics, which include the amplitude, the instantaneous heart rate (hr), the morphology, and the average hr, to characterize a signal and determine certain decision parameters through automatic learning. experiments and comparisons were given with the other three existing methods. taking the f1 metric for the evaluation, it showed that the proposed prch method has the highest accuracy of identification and generalization capacity [39]. most of the ecg waveform recognition methods are concentrated statistical measures. this approach will be concentrated on recognizing the ecg waveform through a correct measure of the baseline detection that can be considered as cross detection approach compared with the amplitude of ecg waveform. 4. implemented approach the methodology of this research can be concentrated on the design and implementation of the procedure steps to achieve the overall approach. the first step is preparing ecg data with different types to be ready for the system implementation. the implemented approach for ecg signal classification passed into four main steps as follows (fig. 2): preprocessing, baseline process, feature extraction, and diagnosis. 4.1. preprocessing this is an important step that covers the following: selecting the ecg region of interest, converting ecg image into grayscale, converting ecg image into binary image, noise reduction, and ecg thinning. • ecg region of interest, in which deter mine the rectangular with both vertical and horizontal directions to restrict ecg waveform. • converting ecg image into grayscale, in which converting the color ecg image into grayscale image in order to avoid the interference of colors. • converting ecg image into binary image, in which converting ecg image into black and white (ecg waveform is black, and the background is white) depending on a certain threshold. • noise reduction, in which remove the unwanted noise as possible in order to achieve the real ecg waveform. this operation is implemented through a simple median filter. • thinning process, in which is implemented in order to eliminate the redundant unwanted data. this process fig. 2. implemented approach for electrocardiogram. muzhir shaban al-ani: ecg waveform classification based on p-qrs-t wave recognition 10 uhd journal of science and technology | may 2018 | vol 2 | issue 2 is performed through convolution process of ecg waveform image with the mask. 4.2. baseline process the baseline voltage of the ecg waveform is known as the isoelectric line. this process can be implemented through two steps as follows: • baseline detection, in which the baseline is detected depending on the horizontal line that contains more than the number of black points in ecg image. then, draw the baseline in another color in order to distinct it from ecg waveform. • baseline adjustment, in which baseline is modified and connect the waves. the baseline adjustment divide waveform image into blocks and compare each block with the baseline to make a decision for shift up or shift down or keep it on the baseline. 4.3. feature extraction this process deals with the extraction of features from ecg waveform. this step aims to find the smallest set of features that enable acceptable diagnosis rate to be achieved. calculating the width and height of each rectangular established on the baseline in order to match these values to ecg waveform that indicated the values of pqrst. 
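the baseline detection and amplitude measurement steps above can be sketched as follows; this is a simplified illustration working on a binary ecg image, and the function names, thresholds, and the crude peak rule are assumptions rather than the exact procedure of this paper:

```python
import numpy as np

def detect_baseline_row(binary_img):
    """baseline (isoelectric line) = the image row containing the largest
    number of black trace pixels, as described in the baseline process."""
    return int(np.argmax(binary_img.sum(axis=1)))

def trace_amplitudes(binary_img, baseline_row):
    """per-column height of the trace above the baseline, in pixels
    (image rows grow downwards, so the topmost black pixel is the peak)."""
    top = np.where(binary_img.any(axis=0),
                   np.argmax(binary_img, axis=0), baseline_row)
    return baseline_row - top

def find_r_peaks(amplitudes, min_height, min_spacing):
    """crude r-wave detector: columns whose amplitude is a local maximum,
    exceeds min_height, and lies at least min_spacing columns past the last peak."""
    peaks = []
    for c in range(1, len(amplitudes) - 1):
        if (amplitudes[c] >= min_height
                and amplitudes[c] >= amplitudes[c - 1]
                and amplitudes[c] > amplitudes[c + 1]
                and (not peaks or c - peaks[-1] >= min_spacing)):
            peaks.append(c)
    return peaks

# usage on a binary image `img` (true = black trace pixel) would look like:
# base = detect_baseline_row(img)
# amps = trace_amplitudes(img, base)
# r_columns = find_r_peaks(amps, min_height=40, min_spacing=80)
```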
calculate ecg regular or irregular rhythm that indicated the hr by the help of human expert.the regular rhythms can be quickly determined by counting the number of large graph boxes between two r-waves that number is divided into 300 to calculate beats per minute. 4.4. diagnosis this process depends directly on the human experts (doctors) whom have the knowledge in order to help the user to take a decision. the implemented diagnosis process depends on the expert knowledge collected from doctors to identify the disease according to the obtained ecg data. the implemented diagnosis process is performed through applying four steps as follows: ecg waveform as an input, factors observed from doctors expertise about ecg waveform characteristics, designed model comparing the received data, and the ecg database that contains the ecg waveform characteristics. 5. implementation and discussion the implemented approach for ecg waveform classification passed into four main steps as follows: preprocessing, baseline process, feature extraction, and diagnosis, in addition, each step divided into other substeps. this section will demonstrate the shape and effect of each step on the ecg waveform. the implementation of this approach is done by programming matlab package version 2016. preprocessing step starts with ecg region of interest, converting ecg image into grayscale, and converting ecg image into binary image; these are covered in fig. 3. in general, ecg waveform has different type of noise, which may affect the shape of the waveform. median filter is applied in this case to eliminate the noise as possible this waveform is illustrated in fig. 4. ecg waveform may have some thickness according to the output device. the thinning process is performed through skeleton operation in which eliminates the redundant of data as shown in fig. 5. the baseline voltage of the ecg is the continuous part of the t-wave tracing and preceding to the next p-wave. this baseline level detection is required because ecg amplitude at different locations in the beat is measured relative to this level. the output of this process is shown in fig. 6. modify the baseline and connecting wave is performs by waving. this process divided the image into blocks and each block compares it with the baseline in order to generate the blocks as shown in fig. 7. detecting the type of wave begins by calculating the maximum peak height of the waveform. the maximum top amplitude value referred to r-wave. any detecting of r-wave leading to calculate the q-wave that placed before r-wave. this operation is illustrated in fig. 8. after detecting of waveform, starts the drawing stage to create rectangle around each waveform in the image. this waveform is illustrated in fig. 9. the decision step start after ending all above steps, in this step, compares the obtained waveform with the stored ecg data. fig. 10 indicated that the tested ecg waveform having sinus normal rhythm. three types of heart diseases are tested using this approach in order to evaluate the matching rate. fig. 11 shown ecg signal having sinus tachycardia heart that was recognized correctly. muzhir shaban al-ani: ecg waveform classification based on p-qrs-t wave recognition uhd journal of science and technology | may 2018 | vol 2 | issue 2 11 fig. 3. electrocardiogram gray scale binary image. fig. 4. electrocardiogram noise removal. fig. 5. electrocardiogram thinning. fig. 6. electrocardiogram baseline detection fig. 7. electrocardiogram baseline adjustment. fig. 8. 
electrocardiogram crossing detecting another test implemented for ecg waveform that having sinus bradycardia heart as shown in fig. 12, and it was matched correctly. last test implemented for ecg waveform that having sinus arrhythmia heart as shown in fig. 13, and it was matched correctly. muzhir shaban al-ani: ecg waveform classification based on p-qrs-t wave recognition 12 uhd journal of science and technology | may 2018 | vol 2 | issue 2 6. conclusions ecg waveform gives very important information about heart diseases patients. combing the ecg experts for both doctors and users expert in order to generate an efficient approach used for ecg waveform analysis and diagnosis. this research provides support for medical diagnosis based on the ecg information retrieved from the patients. in order to recognize the ecg waveform, it passed through many steps such as follows; preprocessing, baseline process, feature extraction, and diagnosis. the obtained result indicated a good recognition for detection all parts of p-qrs-t waveform. all the tested ecg waveform such as; sinus normal rhythm, sinus tachycardia, sinus bradycardia, and sinus arrhythmia are high accuracy matched. this approach can be applied in clinical center to help the ecg reader to take a correct decision. the main finding in this research is through applying a simple method of baseline detection in which a comparison of ecg waveform amplitude values can be achieved and measure correctly. references [1] m.s. al-ani and a.s. abdulbaqi. “the role of m-healthcare and its impact on healthcare environment”. international journal of business and ict, vol. 2, no. 3-4, dec. 2016. fig. 9. electrocardiogram drawing blocks. fig. 10. normal electrocardiogram diagnosis. fig. 11. electrocardiogram sinus tachycardia fig. 12. electrocardiogram sinus bradycardia. fig. 13. electrocardiogram sinus arrhythmia muzhir shaban al-ani: ecg waveform classification based on p-qrs-t wave recognition uhd journal of science and technology | may 2018 | vol 2 | issue 2 13 [2] m.s. al-ani. “efficient architecture for digital image processing based on epld”. iosr journal of electrical and electronics engineering (iosr-jeee), vol. 12, no. 6, pp. 1-7, 2017. [3] s.k. berkaya, a.k. uysal, e.s. gunal, s. ergin and m.b. gulmezoglu. a survey on ecg analysis. biomedical signal processing and control, vol. 43, pp. 216-235, may. 2018. [4] h. sharma and k.k. sharma. ecg-derived respiration using hermite expansion. biomedical signal processing and control, vol. 39, pp. 312-326, jan. 2018. [5] r.l. lux. basis and ecg measurement of global ventricular repolarization. journal of electrocardiology, vol. 50, no. 6, pp. 792797, dec. 2017. [6] l. mesin. heartbeat monitoring from adaptively down-sampled electrocardiogram. computers in biology and medicine, vol. 84, 217-225, may. 2017. [7] d. poulikakos and m. malik. challenges of ecg monitoring and ecg interpretation in dialysis units. journal of electrocardiology, vol. 49, no. 6, pp. 855-859, dec. 2016. [8] g. zhang, t. wu, z. wan, z. song and f. chen. a new method to detect ventricular fibrillation from cpr artifact-corrupted ecg based on the ecg alone. biomedical signal processing and control, vol. 29, pp. 67-75, aug. 2016. [9] p.w. macfarlane, s.m. lloyd, d. singh, s. hamde and v. kumar. normal limits of the electrocardiogram in indians. journal of electrocardiology, vol. 48, no. 4, pp. 652-668, aug. 2015. [10] f. gargiulo, a. fratini, m. sansone and c. sansone. 
subject identification via ecg fiducial-based systems: influence of the type of qt interval correction. computer methods and programs in biomedicine, vol. 121, no. 3, pp. 127-136, oct. 2015. [11] m. r. homaeinezhad, m.e. moshiri-nejad and h. naseri. a correlation analysis-based detection and delineation of ecg characteristic events using template waveforms extracted by ensemble averaging of clustered heart cycles. computers in biology and medicine, vol. 44, pp. 66-75, jan. 2014. [12] r.j. martis, u.r. acharya and h. adeli. current methods in electrocardiogram characterization. computers in biology and medicine, vol. 48, pp. 133-149, may. 2014. [13] m.s. al-ani and a.a. rawi. “ecg beat diagnosis approach for ecg printout based on expert system”. international journal of emerging technology and advanced engineering, vol. 3, no. 4, apr. 2013. [14] k.n.v.p.s. rajesh and r. dhuli. classification of imbalanced ecg beats using re-sampling techniques and ada boost ensemble classifier. biomedical signal processing and control, vol. 41, pp. 242-254, mar. 2018. [15] m.s. al-ani and k.m.a. alheeti. “precision statistical analysis of images based on brightness distribution”. advances in science, technology and engineering systems journal, vol. 2, no. 4, 99104, 2017. [16] a.k. dohare, v. kumar and r. kumar. detection of myocardial infarction in 12 lead ecg using support vector machine. applied soft computing, vol. 64, pp. 138-147, mar. 2018. [17] x. dong, c. wang and w. si. ecg beat classification via deterministic learning. neurocomputing, vol. 240, pp. 1-12, may. 2017. [18] p. xiong, h. wang, m. liu, s. zhou and x. liu. ecg signal enhancement based on improved denoising auto-encoder. engineering applications of artificial intelligence, vol. 52, pp. 194202, jun. 2016. [19] c.g. raj, v.s. harsha, b.s. gowthami and r. sunitha. virtual instrumentation based fetal ecg extraction. procedia computer science, vol. 70, pp. 289-295, 2015. [20] a. ebrahimzadeh, b. shakiba and a. khazaee. detection of electrocardiogram signals using an efficient method. applied soft computing, vol. 22, pp. 108-117, sep. 2014. [21] m.s. al-ani and a.a. rawi. “a rule-based expert system for automated ecg diagnosis”. international journal of advances in engineering and technology (ijaet), vol. 6, no. 4, pp. 1480-1493, 2013. [22] m.s. al-ani. study the characteristics of finite impulse response filter based on modified kaiser window. uhd journal of science and technology, vol. 1, no. 2, pp. 1-6, aug. 2017. [23] k.k. patro and p.r. kumar. effective feature extraction of ecg for biometric application. procedia computer science, vol. 115, pp. 296-306, 2017. [24] r.e. gregg, s.h. zhou and a.m. dubin. automated detection of ventriculr pre-excitation in pediatric 12-lead ecg. journal of electrocardiology, vol. 49, no. 1, pp. 37-41, jan. 2016. [25] s. yazdani and j.m. vesin. extraction of qrs fiducial points from the ecg using adaptive mathematical morphology. digital signal processing, vol. 56, 100-109, sep. 2016. [26] a. r. verma and y. singh. adaptive tunable notch filter for ecg signal enhancement. procedia computer science, vol. 57, pp. 332337, 2015. [27] r. rodríguez, a. mexicano, j. bila, s. cervantes and r. ponce. feature extraction of electrocardiogram signals by applying adaptive threshold and principal component analysis. journal of applied research and technology, vol. 13, no. 2, pp. 261-269, apr. 2015. [28] r. salas-boni, y. bai, p.r.e. harris, b.j. drew and x. hu. 
false ventricular tachycardia alarm suppression in the icu based on the discrete wavelet transform in the ecg signal. journal of electrocardiology, vol. 47, no. 6, pp. 775-780, dec. 2014. [29] a. awal, s.s. mostafa, m. ahmad and m.a. rashid. an adaptive level dependent wavelet thresholding for ecg denoising. biocybernetics and biomedical engineering, vol. 34, no. 4, pp. 238249. 2014. [30] s.s. al-zaiti, j.a. fallavollita, y.w.b. wu, m.r. tomita and m.g. carey. electrocardiogram-based predictors of clinical outcomes: a meta-analysis of the prognostic value of ventricular repolarization. heart and lung: the journal of acute and critical care, vol. 43, no. 6, pp. 516-526, dec. 2014. [31] a. cipriani, g. d’amico, g. brunello, m.p. marra and a. zorzi. the electrocardiographic “triangular qrs-st-t waveform” pattern in patients with st-segment elevation myocardial infarction: incidence, pathophysiology and clinical implications. journal of electrocardiology, vol. 51, no. 1, pp. 8-14, jan. 2018. [32] s. kota, c. b. swisher, t. al-shargabi, n. andescavage and r. b. govindan identification of qrs complex in non-stationary electrocardiogram of sick infants. computers in biology and medicine, vol. 87, pp. 211-216, aug. 2017. [33] p.r.e. harris. the normal electrocardiogram: resting 12-lead and electrocardiogram monitoring in the hospital. critical care nursing clinics of north america, vol. 28, no. 3, pp. 281-296, sep. 2016. [34] d. saini, a.f. grober, d. hadley and v. froelicher. normal computerized q wave measurements in healthy young athletes. journal of electrocardiology, vol. 50, no. 3, pp. 316-322, may-jun. 2017. [35] m. almahamdy and h.b. riley. performance study of different muzhir shaban al-ani: ecg waveform classification based on p-qrs-t wave recognition 14 uhd journal of science and technology | may 2018 | vol 2 | issue 2 denoising methods for ecg signals. procedia computer science, vol. 37, pp. 325-332, 2014. [36] m.k. yapici, t. alkhidir, y.a. samad and k. liao. graphene-clad textile electrodes for electrocardiogram monitoring. sensors and actuators b: chemical, vol. 221, pp. 1469-1474, dec. 2015. [37] z. wang, f. wan, c.m. wong and l. zhang. adaptive fourier decomposition based ecg denoising. computers in biology and medicine, vol. 77, pp. 195-205, oct. 2016. [38] c. zou, y. qin, c. sun, w. li and w. chen. motion artifact removal based on periodical property for ecg monitoring with wearable systems. pervasive and mobile computing, vol. 40, pp. 267-278, sep. 2017. [39] q. yu, h. yan, l. song, w. guo and y. zhao. automatic identifying of maternal ecg source when applying ica in fetal ecg extraction. biocybernetics and biomedical engineering, vol. 38, no. 3, pp. 448455, 2018. tx_1~abs:at/tx_2:abs~at 66 uhd journal of science and technology | jan 2021 | vol 5 | issue 1 1. introduction the coronavirus disease (covid-19) is the family of viruses including sars, ards. w.h.o declared this outbreak as a public health emergency [1] and mentioned the following; the virus is being transmitted through the respiratory tract when a healthy person comes in contact with the infected person. in december 2019, wuhan, hubei region, china, has been accounted for as the focal point of the covid-19 episode [2]. a quarter of a year later, that outbreak was pronounced as a worldwide pandemic by the world health organization (who) [3]. more than 54.40 million confirmed covid-19 cases and more than 1.32 deaths worldwide have been officially reported in 16 november, 2020. 
therefore, it has been considered as the most critical universal crisis since the world war-ii [4]. the coronavirus has spread in kurdistan – iraq like all the country in the world, and it has expanded fast in sulaymaniyah city. the mortality of this disease expands day by day and this infection becomes as a major danger to the mankind of whole world. alongside the clinical explores, the examination of related information will support the humanity. recent studies identified that machine learning (ml) and artificial intelligence (ai) are promising technology employed by various health-care providers as they result in better scale-up, speed-up processing power, reliable, and even outperform human in specific health-care tasks [5]. in this paper, we established three ml algorithm for the prediction of coronaviruses’ diseased patients’ mortality. the models forecast when covid-19 infected patients would be death or recovered. the proposed algorithms are designed with the dataset found from sulaymaniyah city for coronavirus and dataset cases of the death and recovery records of the infected coronavirus’s pandemic. ml algorithm which includes decision tree (dt), support vector prediction of covid-19 mortality in iraqkurdistan by using machine learning brzu t. muhammed1, ardalan h. awlla2, sherko h. murad3, sabah n. ahmad4 1department of computer science, kurdistan technical institute, sulaymaniyah, iraq, 2department information technology, university of human development, sulaymaniyah 0778-6, iraq, 3department of computer science, kurdistan technical institute, sulaymaniyah, iraq, 4general manager of health in sulaymaniyah, iraq a b s t r a c t this research analyzed different aspects of coronavirus disease (covid-19) for patients who have coronavirus, for find out which aspects have an effect to patient death. first, a literature has been made with the previous research that has been done on the analysis dataset of coronavirus using machine learning (ml) algorithm. second, data analytics is applied on a dataset of sulaymaniyah, iraq, to find factors that affect the mortality rate of coronavirus patients. third, classification algorithms are used on a dataset of 1365 samples provided by hospitals in sulaymaniyah, iraq to diagnose covid-19. using ml algorithm provided us to find mortality rate of this disease, and detect which factor has major effect to patient death. it is shown here that support vector machine (svm), decision tree (dt), and naive bayes algorithms can classify covid-19 patients, and dt is best one among them at an accuracy (96.7 %). index terms: coronavirus disease, coronavirus, forecasting, machine learning, kurdistan-iraq corresponding author’s e-mail: brzu t. muhammed, department of computer science, kurdistan technical institute, sulaymaniyah, iraq. e-mail: brzu.tahir@kti.edu.krd received: 22-11-2020 accepted: 19-05-2021 published: 23-05-2021 o r i g i n a l r e s e a r c h a r t i c l e uhd journal of science and technology access this article online doi: 10.21928/uhdjst.v5n1y2021.pp66-70 e-issn: 2521-4217 p-issn: 2521-4209 copyright © 2021 muhammed, et al. this is an open access article distributed under the creative commons attribution non-commercial no derivatives license 4.0 (cc by-nc-nd 4.0) uhd journal of science and technology | jan 2021 | vol 5 | issue 1 67 machine (svm), and naive bayes (nb) was implemented directly on the dataset using weka tool which is a data mining tool. 2. literature review development of ai changed the world in all fields. 
ml a subset of ai causes the human to discover answers for exceptionally complex issues and furthermore assumes an imperative part in making human life refined. the application zones of ml incorporate business applications, clever robots, medical services, atmosphere demonstrating, picture handling, natural language preparing, and gaming [6]. according to al sadig et al. [7], depend on the dataset as given by the various site developed by digital science in cooperation with over 100 leading research organizations all over the world. create a model using j48 algorithm to predict the most common symptoms causing death is acute kidney injury and coronary heart disease. arun and iyer [8] propose some of the ml techniques such as rough set (svm), bayesian ridge and polynomial regression, sir model, and rnn to examination of the transmission of covid 19 disease and predict the scale of the pandemic, the recovery rate as well as the fatality rate. according to bullock et al. [9], ml and deep learning can replace humans by giving an accurate diagnosis. the perfect diagnosis can save radiologists’ time and can be cost-effective than standard tests for covid-19. x-rays and computed tomography scans can be used for training the ml model. wang and wong [10] created covid-net, which is a profound convolutional neural network, which can analyze covid-19 from chest radiography pictures. alibaba cloud 2020 [11] exploit ml to set up an adjusted susceptible exposed infectious recovered model to anticipate the commonness of covid-19 and evaluate the expanded danger of defilement in a particular territory. kemenkes [12] finding diabetes utilizing ai and ml methods result demonstrated that ensemble technique guaranteed exactness of 98.60%. these reasons can be advantageous to analyze and foresee covid-19. according to muhammad et al. [13] use several ml algorithms which includes dt, svm, nb, lr, rf, and k-nn are applied directly on the dataset which include covid-19 infected patients’ recovery, the model invented with dt algorithm was discovered to be the most precise with 99.85% exactness which has all the earmarks of being the most noteworthy among others. 3. data preparation collection of data is the vital step to induce data over corona virus. the information was collected from the distinction health care center in sulaymaniyah city in the kurdistan region of iraq. the dataset comprises 1376 patients which have appeared side effects of crown infection. the data collection comprises seven factors (gender, age, status, o 2 , ventilate, day of hospitalization, and death patient). the informational index contained data about hospitalized patients with covid-19. after informational index start another stage data preprocessing. information preparing is a significant cycle being developed of ml model. the information gathered is frequently approximately controlled with out-of-range esteems, missing values; and so, on such information can deceive the consequence of the examination. weka, one of the expansively utilized data mining computer program, is utilized for the classification. the processing of data preparation illustrated in figure 1. fig. 1. the workflow diagram for coronavirus disease. 
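a minimal python sketch of the kind of preparation outlined in this section (and detailed in the attribute list below) is given here; the csv file name, the encodings, and the 80/20 split used later in the experiments are assumptions, and only the attribute names follow table 1 of this paper:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# illustrative preprocessing sketch, not the authors' exact weka workflow
df = pd.read_csv("covid19_sulaymaniyah.csv")          # hypothetical file name

# class label: "death" (yes/no) -> 1/0
y = df["death"].map({"yes": 1, "no": 0})

# one-hot encode the categorical attributes, keep age numeric
X = pd.get_dummies(df[["gender", "age", "status", "o2", "ventilate"]],
                   columns=["gender", "status", "o2", "ventilate"])

# normalize to reduce the effect of scaling, then hold out 20% for testing
X[X.columns] = MinMaxScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
print(X_train.shape, X_test.shape)
```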
after the preprocessing stage of the dataset, the collected records are divided into two classes: "death," assigned as "yes," and "no death," assigned as "no." the selected data samples are transferred to a spreadsheet file for further processing so that they are suitable for data mining approaches. the dataset was normalized to minimize the effect of scaling on the data and saved in the comma-separated value file format. in table 1, all the attributes are explained with their descriptions.

4. methodology
recently, ml techniques have been applied to medical prediction; there are different types of ml algorithms that can be applied to different types of applications in various fields [14]. much research has demonstrated that ml algorithms give better support for clinical decision-making on the basis of patient information. in the medical services field, disease predictive analysis is one of the most valuable and powerful uses of ml forecasting algorithms. this paper proposes machine-learning algorithms to analyze an unusual covid-19 disease dataset. in this paper, the rate of death across the region is analyzed based on the factors explained in table 1.

4.1. svm
svm is a supervised classification algorithm which is commonly utilized for linear classification and regression problems; svm can solve both linear and nonlinear problems. svm provides a unique and optimal solution, and the kernel function is selected based on the position of the variables relative to the hyperplane. the best separating hyperplane can be written as w·x + b = 0, where w is a weight vector, x refers to the attribute values, and b is a scalar often referred to as the bias [15].

4.2. dt
dt is a supervised learning algorithm that can be utilized for both classification and regression problems; however, it is generally preferred for classification problems. the structure of this classifier is divided into three parts: internal nodes, which represent features of the dataset; branches, which represent decision rules; and leaves, each of which represents an outcome.

4.3. naïve bayes
the nb classifier is a simple and powerful supervised machine-learning algorithm used for predictive modeling. it considers that all variables contribute toward the classification and that they are equally related [16]. the algorithm is based on bayes' theorem, is used when the dimensionality of the inputs is high, and assumes that the features are statistically independent.

5. experiment results
for the experiment, the weka tool has been used, and the collected dataset was used to train the above algorithms with it. in this paper, the dataset is divided into two parts for the classification algorithms: the first part, 80%, is used for training the classification algorithms, and the second part, 20%, is used as a test set; the results are illustrated in table 2. the achievement of every algorithm was assessed at each phase of the training: every algorithm was trained with record sets of 1100 records. this examination is carried out to determine which algorithm can be the most appropriate for the prediction of covid-19. accuracy is the metric exploited in almost all related research work as one of the regular measures when working on a forecasting algorithm.
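the paper carries out the training and evaluation in weka; a minimal python sketch of the same workflow (80/20 hold-out split, dt, svm, and nb classifiers, plus the accuracy, kappa, mean absolute error, and root mean square error metrics reported in tables 2 and 3) is shown below, assuming the cleaned dataframe from the earlier sketch. it is an illustration under stated assumptions, not the authors' weka configuration.

# illustrative scikit-learn counterpart of the weka experiment described above;
# assumes the load_and_clean() helper sketched earlier in this section.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import (accuracy_score, cohen_kappa_score,
                             mean_absolute_error, mean_squared_error)

def evaluate(df, target="death"):
    X = df.drop(columns=[target])
    y = df[target]
    # 80% training / 20% testing hold-out split, as described in section 5
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

    models = {
        "decision tree": DecisionTreeClassifier(),
        "support vector machine": SVC(),
        "naive bayes": GaussianNB(),
    }
    for name, model in models.items():
        model.fit(X_tr, y_tr)
        pred = model.predict(X_te)
        rmse = mean_squared_error(y_te, pred) ** 0.5
        print(f"{name}: accuracy={accuracy_score(y_te, pred):.4f}, "
              f"kappa={cohen_kappa_score(y_te, pred):.2f}, "
              f"mae={mean_absolute_error(y_te, pred):.2f}, rmse={rmse:.2f}")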
table 1: attribute descriptions used for predicting whether the patient recovered or died
variables | description | possible values
gender | social definition of men and women | male, female
age | patient age | date
status | situation of the patient's condition | bad, severe, good, critical
o2 | indicates whether the patient needs oxygen or not | yes, no
ventilate | a ventilator uses pressure to blow air, or air with extra oxygen, into the lungs | yes, no
date of hospital admission | day of hospitalization | date
death | did the patient die or recover? | yes, no

table 2: accuracy of the classification algorithms
s. no. | classification algorithm | accuracy (%)
1 | decision tree | 96.07
2 | support vector machine | 95.27
3 | naive bayes | 94.47

table 3: error metrics for the classification algorithms
s. no. | algorithm | kappa statistic | mean absolute error | root mean square error
1 | decision tree | 0.29 | 0.07 | 0.19
2 | support vector machine | 0.42 | 0.10 | 0.21
3 | naive bayes | 0.32 | 0.12 | 0.22

fig. 3. a decision tree generated by the c4.5 algorithm for predicting covid-19.

in this paper, the accuracy forecast is whether the patient recovered or died while infected by covid-19, based on the algorithms mentioned in table 2 and in fig. 2. each classification algorithm has a different prediction accuracy depending on its hyperparameters. table 3 describes the performance error measurements for each algorithm; the error metrics, which are the kappa statistic, mean absolute error, and root mean square error, are assessed for each algorithm. as shown in table 3, the decision tree has the lowest error rates compared to the other algorithms. according to fig. 3, which is the visualization of the tree built by the dt algorithm, and fig. 4, the main factor, the ventilator, has the maximum effect on patients, which made it the root of the tree; this factor has the greatest effect on whether a patient recovers or not. if the patient has not recovered, the outcome depends on the second most important attribute, which is status, and the rest of the attributes shown in the dt each have some impact on whether the patient recovered or died. however, the main factor attributes are ventilator and status, as shown in fig. 3. this means that if a patient's ventilator attribute is yes and the status is bad, the patient died; otherwise, for the status values severe and critical, the patient mostly recovered. in addition, an interesting status value is good, for which the outcome depends on the patient's date of hospital admission, as shown in fig. 3; it means some patients were in a good status but still died. because the covid-19 epidemic has more influence in cold weather, as shown in fig. 5, weather conditions can increase the deaths caused by covid-19 in the winter season.

fig. 2. accuracy of the classification algorithms.

6. conclusion
the covid-19 pandemic has put medical care systems across the entire world into a difficult situation. computer algorithms and ml can help humanity find the best solutions to overcome the coronavirus epidemic. in this paper, data mining techniques were used for the prediction of the outcome of coronavirus-infected patients using a dataset of coronavirus patients from the iraqi-kurdistan region. dt, support vector machine, and nb were applied directly to the dataset using the weka ml tool. to identify the accuracy of the suggested algorithms, the accuracy of each algorithm has been calculated based on the dataset features that have been used.
the experiment result showed that the dt has the highest percentage of accuracy which is 96.7% followed by support vector machine which is 95.27 accuracy and naïve bayes which is 94.47% accuracy. the experiment result showed that the most effective reason for the patient to recover or not is ventilator and other factors have their effect on the patients to recover or not. in addition, the weather condition means with the coming of the cold weather the virus’s effects will increase. references [1] medscape medical news. the who declares public health emergency for novel coronavirus, 2020. available from: https:// www.medscape.com/viewarticle/924596. [2] m. c. collivignarelli, c. collivignarelli, m. carnevale miino, a. abbà, r. pedrazzani and g. bertanza. “sars-cov-2 in sewer systems and connected facilities”. process safety and environmental protection, vol. 143, pp. 196-203, 2020. [3] p. shi, y. dong, h. yan, c. zhao, x. li, w. liu, m. he, s. tang and s. xi. “impact of temperature on the dynamics of the covid-19 outbreak in china”. science of the total environment, vol. 728, p. 138890, 2020. [4] s. boccaletti, w. ditto, g. mindlin and a. atangana, a. “modeling and forecasting of epidemic spreading: the case of covid-19 and beyond”. chaos solitons fractals, vol. 135, p. 109794, 2020. [5] t. davenport and r. kalakota. “the potential for artificial intelligence in healthcare”. future healthcare journal, vol. 6, no. 2, pp. 94-98, 2019. [6] p. theerthagiri, i. j. jacob, a. u. ruby and y. vamsidhar. an investigation of machine learning algorithms on covid-19 dataset, 2020. [7] m. al sadig and k. n. abdul sattar. “developing a prediction model using j48 algorithm to predict symptoms of covid-19 causing death”. international journal of computer science and network security, vol. 20, no. 8, p. 80, 2020. [8] s. s. arun and g. n. iyer. “on the analysis of covid19-novel corona viral disease pandemic spread data using machine learning techniques. 4th international conference on intelligent computing and control systems, pp. 1222-1227, 2020. [9] j. bullock, a. luccioni, k. h. pham, c. s. n. lam and m. luengooroz. mapping the landscape of artificial intelligence applications against covid-19”. journal of artificial intelligence research, vol. 69, pp. 807-845, 2020. [10] l. wang and a. wong. “covid-net: a tailored deep convolutional neural network design for detection of covid-19 cases from chest radiography images”. scientific reports, vol. 10, p. 19549, 2020. [11] alibaba cloud e-magazine. “alibaba cloud helps fight covid-19 through technology”. alibaba cloud, 2020. [12] kementerian kesehatan ri. pedoman pencegahan dan pengendalian coronavirus disease (covid-19). in: l. aziza, a. aqmarina and m. ihsan (eds.), revisi ke4. kementerian kesehatan ri, direktorat jenderal pencegahan dan pengendalian penyakit (p2p), 2020. available from: https://www.infeksiemerging.kemkes. go.id. [13] l. j. muhammad, m. m. islam, u. s. sharif and s. i. ayon. “predictive data mining models for novel coronavirus (covid-19) infected patients recovery”. sn computer science, vol. 1, no. 4, p. 206, 2020. [14] sirwan. m. aziz and ardalan. h. awlla. “performance to build effective student using data mining techniques”. uhd journal of science and technology, vol. 3, no. 2, p. 10, 2019. [15] r. sukanya and k. prabha. “comparative analysis for prediction of rainfall using data mining techniques with artificial neural network”. international journal of computational science and engineering, vol. 5, pp. 1-5, 2017. [16] s. d. jadhav and h. 
channe. “comparative study of k-nn, naive bayes and decision tree classification techniques”. international journal of science and research, vol. 5, pp. 1842-1845, 2016. fig. 4.: factors with a significant effect on patient mortality. fig. 5. the role of weather condition on transmission rates of the coronavirus. tx_1~abs:at/tx_2:abs~at uhd journal of science and technology | jan 2020 | vol 4 | issue 1 81 1. introduction the term pollution was defined as exposing to the harmful pollutants or products in the environment that appeared to have a measurable effect on the man or other animal health as well as on vegetation or other materials [1], [ 2]. there are five types of pollutants that are hydrocarbon, carbon monoxide, particulate matter, nitrogen dioxide, and sulfur oxides. these tend to be the worst quality content found in iraqi fuel, which are emitted from the combustion of sulfur containing fossil fuels such as coal, metal smelting, motor vehicle operations, and other industrial processes. urban air pollution is a significant cause of global mortality, pre-mature deaths, which are the causes of seven hundred thousand deaths worldwide according to data from the who [3], [4]. several reports have indicated that exposure to aromatic hydrocarbons such as benzene, toluene, and styrene-butadiene has significant alterations in different hematologic parameters. the noticeable effects include a decline in circulating erythrocytes, hemoglobin (hgb), platelets, total white blood cells (wbcs), and absolute numbers of lymphocytes, as well as neutrophils [4]-[7]. the adverse effects may be on bone marrow and stem cells at both production and differentiation levels. moreover, it may have effects on hepcidin’s sustained and chronic upregulation that is an iron regulatory protein, which may lead to hgb and red blood cell (rbc) production diminishing. consequently, anemia can occur [8]-[12]. leukopenia, thrombocytopenia, and reduction in bone marrow-derived mesenchymal stem cells also may be common side effects [13]. hematological impacts of the traffic emissions in sulaymaniyah governorate dunya hars bapir1, salih ahmed hama1,2 1department of biology, college of science, university of sulaimani, kurdistan region, sulaymaniyah, iraq, 2department of medical laboratory sciences, college of health sciences, university of human development, kurdistan region, sulaymaniyah, iraq a b s t r a c t the current study was achieved to evaluate the essential hematologic impacts of traffic emission. ninety-six cases were studied that included both exposures and controls. the focal point was on raparin district in sulaymaniyah governorates. a questioner form was depended for collecting the information about each case. fresh venous blood (5 ml) was collected aseptically from both exposures and controls. hematologic autoanalyzer (coulter-automated counter) was used for hematologic investigations. it was appeared that the mean leukocyte counts were higher among exposures in comparison to controls; the period of exposure and smoking was significantly effective on total white cells. lymphocyte counts were significantly declined among exposures. it was appeared that the distance from the emission gas sources, smoking, and period of exposure was significantly effective on the total lymphocyte counts (p < 0.05). no valuable effects of traffic emission were noticed on granulocytes in general (p > 0.05), although the neutrophil counts were significantly higher among exposure. 
moreover, the study revealed that there were noticeable effects of traffic emission, on the total platelet counts between exposures and controls. finally, the distance from the emission sources was significantly effective on platelet counts among exposures themselves (p < 0.05). index terms: traffic emission, hematology, lymphocytes, complete blood count corresponding author’s e-mail: dunyabiologist064@gmail.com/dunya.bapir94@gmail.com received: 07-03-2020 accepted: 23-04-2020 published: 25-04-2020 access this article online doi: 10.21928/uhdjst.v4n1y2020.pp81-86 e-issn: 2521-4217 p-issn: 2521-4209 copyright © 2020 dunya hars bapir, salih ahmed hama. this is an open access article distributed under the creative commons attribution non-commercial no derivatives license 4.0 (cc by-nc-nd 4.0) o r i g i n a l re se a rc h a rt i c l e uhd journal of science and technology dunya hars bapir and salih ahmed hama: hematological impacts of the traffic emissions in sulaymaniyah governorate 82 uhd journal of science and technology | jan 2020 | vol 4 | issue 1 the aims of the current study were to investigate the significant traffic emission impacts of various hematological parameters and to study the effect of some risk factors and their relations to traffic emissions hematologic consequences. 2. materials and methods ninety-six persons (males, and females) were studied included (48 exposes and 48 control cases), both sexes involved. the laboratory investigations were done from may 15, 2019, to september 25, 2019. the hematological tests were performed in azadi laboratory in ranya city. five fresh venous blood were collected from all cases and directly transferred to the lab for investigations. the hematologic examinations were included in the study; leukocyte profiles (total wbc, granulocyte, neutrophil, and lymphocyte) counting. red cell profiles (rbc, red cell distribution width [rdw], hematocrit [hct], hgb, mean cell hgb mch, mean cell hgb concentration mean corpuscular hemoglobin concentration [mchc], and mean cell volume [mcv]) counting, as well as the platelet profiles (platelet [plt], mean platelet volume [mpv], platelet distribution width [pdw], plateletcrit [pct], and large platelet cell ratio [lpcr]) counting. three factors were studied for their relation with the emission impacts on the studied cases, which are exposure period (short-term – <10 years – and long-term – more than 10 years); distance from the emission sources (<500 m and more than 500 m); and finally smoking (smokers and non-smokers). an automated hematologic analyzer (coulter; kt6200, of oem) was depended in achieving the above tests. the obtained data were tabulated and statistical analyses were done using graphpad prism 6 software (mann–whitney t-test). 3. results and discussion from the current study, it was appeared that different hematologic parameters were affected negatively by exposure to the chemical compounds produced from the traffic emissions. 3.1. wbc measurement wbc studying total wbc counts revealed that the mean value of wbc counts of exposures was 7100 cells/µl ±1.5, whereas the mean value among control cases was (6780 cells/µl ±1.8), which was lower than that of exposures (fig. 1). statistical analysis showed that there were no significant differences between the total wbc counts from exposures and controls (p > 0.05). 
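the comparisons reported in this section, and those that follow, were made with the mann–whitney test mentioned in the materials and methods (graphpad prism 6). for readers who wish to reproduce this kind of exposures-versus-controls comparison programmatically, a minimal python sketch with scipy is shown below; the generated numbers are invented placeholders, not the study's raw measurements.

# hedged sketch of an exposures-vs-controls comparison with the mann-whitney u test;
# the simulated values below are placeholders, not the study's data.
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
wbc_exposed = rng.normal(7100, 1500, size=48)   # cells/ul, 48 exposed cases (placeholder)
wbc_control = rng.normal(6780, 1800, size=48)   # cells/ul, 48 controls (placeholder)

stat, p_value = mannwhitneyu(wbc_exposed, wbc_control, alternative="two-sided")
print(f"mann-whitney u = {stat:.1f}, p = {p_value:.3f}")
print("significant at 0.05" if p_value < 0.05 else "not significant at 0.05")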
due to further analysis, there were significant differences between those with (5–10 years) exposure histor y (mean=6700 cells/µl ±1), and those with more prolonged exposure (10–20 years), (mean=7400 cells/µl ±1.9) (p < 0.05). furthermore, significant differences were observed among smoker exposures (mean=7500 cells/µl ±2.1) and non-smoker exposures (mean=6800 cell/µl ±1.2) (p < 0.05). moreover, it was reported by the current study that the total numbers of lymphocytes were also affected by exposure to traffic emissions. it was noticed that the mean value of the lymphocyte counts among exposures was lower (1100 cells/µl ±3.7), when compared with controls (1900 cells/µl ±3.9). there was a significant difference between exposures and controls regarding the mean value of lymphocyte count levels (p < 0.05) (fig. 1). the mean value of lymphocytes among exposures whose home distances about 100–200 m far from traffic contamination sources were 1052 cells/µl ±3.6, while it was slightly higher (1107 cells/µl ±3.7) among exposures whose home distance was far from the first group (500–1000 m) from the sources of traffic gases. it was appeared that the distance from the emission fig. 1. leukocyte measurements for traffic emission exposures and controls. dunya hars bapir and salih ahmed hama: hematological impacts of the traffic emissions in sulaymaniyah governorate uhd journal of science and technology | jan 2020 | vol 4 | issue 1 83 gas sources has a significant effect on the mean values of lymphocytes among exposures themselves (p < 0.05). the total lymphocyte counts among exposed smokers were 1002 cells/µl ±3.5, whereas among non-smoker exposures were higher (1161 cell/µl ±3.8), the statistical analyses indicated that there were valuable effects of smoking on the lymphocyte counts especially when integrated with traffic emission gases (p < 0.05). in addition, the effects of the duration of emission exposure on lymphocytes counts showed that the mean value for exposures with about 5–10 years of exposing history was 1137 cells/µl ±3.4. for those with more prolonged exposure history (10–20 years) lymphocytes were relatively lower (1056 cells/μl ±3.7), which indicated that the exposure duration plays a significant effect on the total lymphocyte counts (p < 0.05). the lower levels of lymphocyte count among exposures may be due to the toxic effects of the chemical contents of the traffic emissions. similar observations were recorded by other investigators who found that the mean value of lymphocyte counts was reduced as a result of exposure to chemicals raised from fuel-burning [9]. integration of the smoking effects with emission gases among exposures confirmed the impact of traffic emissions of the wbcs in general and on lymphocyte numbers, especially the mean value of lymphocytes was declined among non-smokers and significantly different from controls. the observations reported by the current study were parallel to the results mentioned by other investigators [10] who noticed a decline in the total numbers of white cells and lymphocytes among mice, which were exposed to the traffic emissions. changes in granulocyte counts also were studied. the mean value of granulocytes was 5100 cells/µl ±7.8 among exposures, and 5200 cells/µl ±9.7 among controls. no significant difference was seen between exposures and controls considering granulocytes (p > 0.05) (fig. 1). 
no valuable effects of smoking and exposure duration were reported (p > 0.05), which may indicate that any decline in the granulocyte numbers was not due to the smoking effects, as in the case of lymphocytes. unlike the above observations, the mean value of neutrophil counts was significantly higher among exposed cases (5100 cells/µl ±1) when compared to that of control cases (4000 cells/µl ±9.4) (fig. 1). smoking and exposure periods showed no noticeable effects among exposures themselves (p > 0.05). the current observations relatively confirm the impact of chemical products of the traffic emission, especially when the effects of smoking were adverse, as other investigators talked about the negative effects of smoking on blood parameters, including granulocyte. different investigators reported that the neutrophil count was raised among emission exposures when they compared their observations to the control groups [5]. moreover, the results of the current study were parallel to the observations recorded by other study that showed higher neutrophil counts exposures when compared to controls [6]. 3.2. rbcs measurement rbc in general, the total numbers of rbcs were almost similar between exposures (5700 × 102 cells/µl ±0.83) and controls (5600 × 102 cells/µl ±1), and no valuable variations were seen between them (p > 0.05) (fig. 2). the rbc counts for those whose home was (100–200 m) far from the sources of traffic emissions was (5400 × 102 cells/µl ±0.5), while it was higher for those whose home was more far (500–1000 m) that was (5900 × 102 cells/µl 0.9). smoker exposures showed lower rbc counts (5500 × 102 cells/µl ±0.7) when compared with non-smoker exposures (5800 × 102 cells/µl ±1.2), although it was not significant (p > 0.05). it was noticed that rbc count for exposures with (5–10 years) exposure history was higher (5600 × 102 cells/µl ±1.1), in compared fig. 2. mean values of red blood cells for traffic emission exposures and controls. dunya hars bapir and salih ahmed hama: hematological impacts of the traffic emissions in sulaymaniyah governorate 84 uhd journal of science and technology | jan 2020 | vol 4 | issue 1 to those with more prolonged exposure history (10–20 years) (5400 × 102 cells/µl ±0.81). however, the differences were not valuable (p > 0.05). among the factor that may explain the above observation may be due to the sample collection season (summer), where the traffic gases may be less effective on exposures compare to cold and dry weather. however, other scientific works reported significant effects of traffic gases on rbc counts and showed elevated rbc counts among exposures does not agree with the current observations [8]. moreover, another factor may play a role, which is the presence of relatively low levels of pm 2.5 in the iraq fuel, as previously noticed that the high pm 2.5 may be responsible for elevations in rbc counts [8], [9], [15]. the rdw for exposures was lower (11.7%, ±0.74) when compared with that of controls (12.3%, ±0.94), although it was not significant (p > 0.05) (fig. 2). it was noticed that smoking, home distance, and exposure duration have no significant effects on rdw for exposures themselves (p > 0.05). the low levels of rdw in the current study may be due to the slight reductions in rbc counts, especially the rdw can be considered as a marker for rbc counts and sizes. the results of the current study not agreed with the previous reported by other investigators who showed decline rbc counts significantly [22]. 
furthermore, it was reported that the hct for exposures (49%, ±0.98) was not different significantly from that of controls (48%, ±0.76) (p > 0.05) (fig. 2). statistical analysis indicated that smoking, home distance, and exposure duration have no significant effect on hct (p > 0.05). the changes in the hct among exposures may be due to the limited effects of traffic emission, as mentioned earlier, especially hct that can reflect alterations in red cell count and functions. the current observations were not agreed with the results reported by other studies that showed the elevated hct among traffic emission exposures [8], [15]. moreover, results showed that emission exposure has no significant effects on hb of exposures (14.4 g% ±4) compared to controls (14.8 g% ±2.9) (p > 0.05) (fig. 2). smoking, home distance, and exposure duration showed no valuable effects on hb (p > 0.05). the current results indicated that due to non-valuable decline in hb, the value of hct was not changed significantly (p > 0.05). the above results were not agreed with the studies that reported by other researchers in the past that observed the pollutants could lead to anemic conditions, which consequently cause a reduction in hct [6], [11], [23]. although the results of the current study were supported to the observations that reported by some investigators who studied the effects of emissions on traffic polices and in pakistan, and claimed that the traffic emission has no significant effects of hgb hb. while they reported that smoking was effective on hb, which was not agreeing with the current observation [16], [17]. mch also was among the hematologic parameters which were not significantly varied between exposures and controls (p > 0.05). furthermore, it was noticed that smoking, home distance, and exposure duration have no valuable effects on mch and mcv (p > 0.05). similarly, the observations were recorded for mchc (p > 0.05) (fig. 2), which was agreed with results obtained in a study done on mice in the past considering mchc, where the level was reduced [14]. as our statement, the lack of effects of the traffic of emission on the vast majority of rbc profiles may be due to the saturated environment with o 2 , especially the study area is rich in forest fig. 3. mean value of platelet profile for traffic emission exposures and controls. dunya hars bapir and salih ahmed hama: hematological impacts of the traffic emissions in sulaymaniyah governorate uhd journal of science and technology | jan 2020 | vol 4 | issue 1 85 and green spaces are at an excellent level. low levels of o 2 can negatively affect rbc profiles; especially there is a strong relation between rbc and oxygen transportation. 3.3. platelet measurement plt the mean values of platelet count among exposures were higher (1800 cells/µl ±6.5) than that of controls (1690 cells/µl ±3.9). when the results analyzed, it has appeared that significant differences were found among exposures and controls regarding platelet counts (p < 0.05) (fig. 3). moreover, it was concluded that home distance and exposure duration have significant effects on platelet counts, respectively (p < 0.05). smoking showed no valuable effects, which may confirm that all outcomes are due to the long-term exposure to the chemical components of traffic emission, not to the smoking contents. other researcher found similar results on experimental animals and humans [18], [19], [20], [21]. they suggested elevation in platelet counts concerning emission air pollutants. 
the current study revealed that there were no noticeable differences between exposures and controls regarding mpv (p > 0.05). furthermore, it has appeared that home distance, smoking, and exposure duration have no significant effects of mpv (p > 0.05) (fig. 3). similarly, no valuable variations were observed between exposures and controls regarding pdw smoking, home distance, and exposure duration showed no noticeable effects on pdw (p > 0.05) (fig. 3), which might be due to the relations of changes in both mpv and pdw [24]. in addition, it was concluded from the current study that traffic emission has no significant effect on pct and lpcr (p > 0.05). this study revealed that smoking, home distance, and exposure duration showed no valuable effects on each of pct and lpcr (p > 0.05) (fig. 3). in a study, it was noticed that the lpcr effects due to chemical exposure have a significant role in the discrimination between hyperdestructive and hypo-productive thrombocytopenia [25]. however, the pct levels were fewer, especially among traffic emission exposures; however, it may increase in acute cholecystitis patients with pdw and lowered mpv [24]. 4. conclusion traffic emission gases showed no significant effects on the vast majority of the hematologic parameters, although, valuable elevation has been seen in neutrophils and platelets due to the traffic emission. the results of the current study suggested links between inflammatory and cardiovascular diseases among emission exposures. future researches must be considered to investigate these relations. references [1] m. franchini and p. m. mannucci. “thrombogenicity and cardiovascular effects of ambient air pollution”. journal of blood, vol. 118, no. 9, p. 2405, 2011. [2] r. khan and a. agarwal. “modulatory effect of vitamine e and c on nitrogen dioxide induced hematotoxicity in both the sexes of wistar rats”. international journal of interdisciplinary research, vol. 3, no. 3, pp. 46-50, 2016. [3] n. boussettaa, s. abedelmalekc, h. mallekd, k. alouie and n. souissiaa. “effect of air pollution and time of day onperformance, heart rate hematologicalparameters and blood gases, following theyyirt-1 in smoker and non-smoker soccerplayers”. science and sports, vol. 33, no. 6, pp. 1-14, 2018. [4] p. ahlawat. “effect of sulphur dioxide exposure on haematological parameters in albino rats”. journal of scientific and engineering research, vol. 3, no. 6, pp. 58-60, 2016. [5] c. tan, y. wang, m. lin, z. wang, l. he, z. li, li, y and k. xu. “longterm high air pollution exposure induced metabolic adaptations in traffic policemen”. environmental toxicology and pharmacology, vol. 58, no. 16, pp. 156-162, 2018. [6] r. m. kartheek and m. david. “modulations in haematological aspects of wistar rats exposed to sublethal doses of fipronil under subchronic duration”. journal of pharmaceutical, chemical and biological sciences, vol. 5, no. 3, pp. 187-194, 2017. [7] c. jephcote and a. mah. “regional inequalities in benzene exposures across the european petrochemical industry: a bayesian multilevel modelling approach”. environment international, vol. 132, no. 104812, pp. 1-17, 2019. [8] bahaoddini and m. saadat. “hematological changesdue to chronic exposure to natural gasleakage in polluted areasof masjid-i-sulaiman (khozestan province, iran)”. ecotoxicology and environmental safety, vol. 58, no. 2, pp. 273-276, 2004. [9] kamal, a. cincinelli, t. martellini and r. n. malik. 
“linking mobile source-pahs and biological effects in traffic police officers and drivers in rawal pindi (pakistan)”. ecotoxicology and environmental safety, vol. 127, pp. 135-143, 2016. [10] g. m. farris, s. n. robinson, b. a. wong, v. a. wong, w. p. hahn and r. shah. “effects of benzene on splenic, thymic, and femoral lymphocytes in mice”. toxicology, vol. 118, no. 2-3, pp. 137-148, 1997. [11] t. honda, c. v. puna, j. manjourides and h. suhb. “anemia prevalence and hemoglobin levels are associated with long-term exposure to air pollution in an older population”. environment international, vol. 101, no. 4, pp. 125-132, 2017. [12] a. masih, a. lall, a. taneja and r. singhvi. “exposure profiles, seasonal variation and health risk assessment of btex in indoor air of homes at different microenvironments of a terai province of northern india”. chemosphere, vol. 176, no. 2, pp. 8-17, 2017. [13] m. abu-elmagd, m. alghamdi, m. shamy, m. khoder, m. costa, m. assidi, r. kadam, h. alsehli, m. gari, p. n. pushparaj, g. kalamegam and m. h. al-qahtani. “evaluation of the effects of airborne particulate matter on bone marrow-mesenchymal stem cells (bm-mscs): cellular, molecular and systems biological approaches”. international journal of environmental research and public health, vol. 14, no. 4, p. 440, 2017. [14] j. reisa and s. martel. “acute exposure guideline levels for selected airborne. in: acute exposure guideline levels, national academy of sciences/national research council (us) committee, washington dc, usa, p. 178, 2014. dunya hars bapir and salih ahmed hama: hematological impacts of the traffic emissions in sulaymaniyah governorate 86 uhd journal of science and technology | jan 2020 | vol 4 | issue 1 [15] l. ton. “platelet neutrophil interactions as drivers of inflammatory and thrombotic disease”. cell and tissue research, vol. 371, pp. 567-576, 2018. [16] e. wigenstama, l. elfsmarka, a. buchta and s. jonassona. “inhaled sulfur dioxide causes pulmonary and systemic inflammation leading tochemical-induced lung injury”. toxicology, vol. 368-369, no. 4, pp. 28-36, 2016. [17] ö. etlik and a. tomur. “the oxidant effects of hyperbaric oxygenation and air pollution in erythrocyte membranes (hyperbaric oxygenation in air pollution)”. european journal of general medicine, vol. 3, no. 1, pp. 21-28, 2006. [18] p. poursafa, r. kelishadi, a. amini, amini, a. m. amin, m. lahijanzadeh and m. modaresi. “association of air pollution and hematologic parameters in children and adolescents”. jornal de pediatria, vol. 87, no. 4, pp. 350-356, 2011. [19] gorriz, s. llacuna, m. riera and j. nadal. “effects of air pollution on hematological and plasma parameters in apodemus sylvaticus and mus musculus”. archievs of environmental contamination and toxicology, vol. 31, no. 1, pp. 153-158, 1996. [20] g. l. walter. “effects of carbon dioxide inhalation on hematology, coagulation, and serum clinical chemistry values in rats”. toxicologic pathology, vol. 27, no. 2, pp. 217-225, 1999. [21] q. sun, x. hong and l. e. wold. “cardiovascular effects of ambient particulate air”. circulation journal, vol. 121, no. 25, pp. 27552765, 2010. [22] m. kargarfard, a. shariat, b. shaw, i. shaw, t. lam, a. kheiri, a. eatemadyboroujeni and s. m. tamrin. “effects of polluted air on cardiovascular and hematological parameters after progressive maximal aerobic ex”. lung journal, vol. 193, no. 2, pp. 275-281, 2015. [23] m. nikolić, d. nikić and a. stanković. “effects of air pollution on red blood cells in children”. 
polish journal of environmental study, vol. 17, no. 2, pp. 267-271, 2008. [24] m. zain and s. aitte. “study of changes in blood parameters and calculation of pct, mpv and dpw for the platelets of laboratory females and males of albino mice during exposure to doses of pyrethriodpesticide (alphacypermethrin)”. iosr journal of pharmacy and biological sciences, vol. 14, no. 2, pp. 71-78, 2019. [25] y. budak, m. polat and k. huysal. “the use of platelet indices, plateletcrit, mean platelet volume and platelet distribution width in emergency non-traumatic abdominal surgery: a systematic review”. biochemical medicine (zagreb), vol. 26, no. 2, pp. 178193, 2016. . 18 uhd journal of science and technology | jan 2020 | vol 4 | issue 1 1. introduction nowadays, the use of the internet has become inseparable from our daily routines. social media networks such as facebook and twitter have also been developed to give a right to people to easily share their viewpoints about any big data sentimental analysis using document to vector and optimized support vector machine sozan abdulla mahmood, qani qabil qasim department of computer science, university of sulaimani, sulaymaniyah, iraq a b s t r a c t with the rapid evolution of the internet, using social media networks such as twitter, facebook, and tumblr, is becoming so common that they have made a great impact on every aspect of human life. twitter is one of the most popular micro-blogging social media that allow people to share their emotions in short text about variety of topics such as company’s products, people, politics, and services. analyzing sentiment could be possible as emotions and reviews on different topics are shared every second, which makes social media to become a useful source of information in different fields such as business, politics, applications, and services. twitter application programming interface (twitter-api), which is an interface between developers and twitter, allows them to search for tweets based on the desired keyword using some secret keys and tokens. in this work, twitter-api used to download the most recent tweets about four keywords, namely, (trump, bitcoin, iot, and toyota) with a different number of tweets. “vader” that is a lexicon rulebased method used to categorize downloaded tweets into “positive” and “negative” based on their polarity, then the tweets were protected in mongo database for the next processes. after pre-processing, the hold-out technique was used to split each dataset to 80% as “training-set” and rest 20% “testing-set.” after that, a deep learning-based document to vector model was used for feature extraction. to perform the classification task, radial bias function kernel-based support vector machine (svm) has been used. the accuracy of (rbf-svm) mainly depends on the value of hyperplane “soft margin” penalty “c” and γ “gamma” parameters. the main goal of this work is to select best values for those parameters in order to improve the accuracy of rbf-svm classifier. the objective of this study is to show the impacts of using four meta-heuristic optimizer algorithms, namely, particle swarm optimizer (pso), modified pso (mpso), grey wolf optimizer (gwo), and hybrid of pso-gwo in improving svm classification accuracy by selecting the best values for those parameters. to the best of our knowledge, hybrid pso-gwo has never been used in svm optimization. the results show that these optimizers have a significant impact on increasing svm accuracy. 
the best accuracy of the model with traditional svm was 87.885%. after optimization, the highest accuracy obtained with gwo is 91.053% while pso, hybrid pso-gwo, and mpso best accuracies are 90.736%, 90.657%, and 90.557%, respectively. index terms: document to vector, grey wolf optimizer, particle swarm optimizer, hybrid particle swarm optimizer_grey wolf optimizer, opinion mining, radial bias function kernel-based support vector machine, sentiment analysis, support vector machine optimization, twitter application programming interface o r i g i n a l re se a rc h a rt i c l e uhd journal of science and technology corresponding author’s e-mail:  qani qabil qasim, department of computer science, university of sulaimani, sulaymaniyah, iraq. e-mail: qani.qabil@gmail.com received: 23-10-2019 accepted: 06-01-2020 publishing: 13-02-2020 access this article online doi: 10.21928/uhdjst.v4n1y2020.pp18-28 e-issn: 2521-4217 p-issn: 2521-4209 copyright © 2020 mahmood and qasim. this is an open access article distributed under the creative commons attribution noncommercial no derivatives license 4.0 (cc by-nc-nd 4.0) sozan abdulla mahmood and qani qabil qasim: twitter sentiment analysis uhd journal of science and technology | jan 2020 | vol 4 | issue 1 19 product or service in the form of short text. this makes them to be rich sources of data that can be valuable for various organizations and companies to find their fans’ or customers’ opinions about their products and services. in spite of companies, well-known people such as politicians and athletes may need to exploit those opinions and attitudes as well as to help them for making better decision-making in the future. however, data diversity and sparsity make it impossible for human to be able to analyze it. here, the role of machine learning and automation can take a part to solve the problem of big data. sentiment analysis (sa) or opinion mining techniques could be used [1]. sa refers to the task of finding the opinions of authors about specific entities that expressed in a written text [2]. in recent years, twitter has become one of the most popular social media and microblogging platform where it is a convenient way for users to write and share their thoughts about anything within 280-characters length (called tweets). twitter is used extensively as a microblogging service worldwide. tweets consist of misspellings, slangs, and symbolic forms of words, which poses a major challenge for the conventional natural language processing or machine learning systems to be used on tweets [3]. sentiment analyzer model can be built in three main approaches – lexicon-based approach, machine learningbased approach, and hybrid of both lexicon-based and machine learning approach. the machine learning approach is one of the most popular techniques that are widely used to build an automated classification model with the help of algorithms such as support vector machine (svm), naïve bayes (nb), and so on. this is due to their ability to handle a large amount of data [4]. in this study, we propose a technique to promote svm performance for sa by implementing four different metaheuristic optimizers, namely, particle swarm optimizer (pso), modified pso (mpso), grey wolf optimizer (gwo), and hybrid of pso-gwo. the sentiment classification goes through four phases: data collection, data pre-processing, feature extraction, and classification. 
in the first phase, twitter application programming interface (twitter-api) enables developers to collect tweets about any keyword they desire and then followed by preprocessing phase to remove least informative data such as url, hashtags, numbers, and so on. in the third phase, document to vector (doc2vec) approaches were used for vectorizing cleaned text, which is the numerical representation of text. pso, gwo, and hybrid pso-gwo are used to select the best parameters for the classifier (svm) to classify generated features from the previous step. the rest of the paper is structured as follows: in section 2, some previous related works in this field that has been conducted before being discussed, section 3 describes the material and methods used in this work, section 4 describes the problem statement, section 5 illustrates the proposed system model and methodology of analyzing the datasets, section 6 shows the results obtained from the model and discussed in detail, and finally, the conclusion and future work are stated in section 7. 2. related work many researches and works have been developed in the field of sa. researchers have proposed different solutions to different issues of sa in terms of improving performance of classification models, enhancing topic specific corpus, reducing feature-set size to shrink execution time of algorithms and space usage using different techniques. das et al. [5] review basic stages to be considered in sa, such as pre-processing, feature extraction/selection, and representation along with some data-driven techniques in this field such as svm and nb as well as to demonstrate how they work and the measuring metrics such as (precision, recall, f1-score, and accuracy) to evaluate the model efficiency. they concluded that all the sa tasks are challenging and need different techniques to deal with each stage. naz et al. [6] illustrate the impact of different weighting feature schemes such as term frequency (tf), tf-inverse document frequency (tf-idf), and binary occurrence (bo) to extract features from tweets along with different n-gram ranges such as unigram, bigram, trigram, and their combination, followed by feeding extracted feature from semeval2016 dataset to train svm. the best result they achieved is 79.6% for tf-idf with unigram range. they also used the sentiment score vector package to calculate the score of tweets into positive and negative forms to improve the performance of svm, along with different weighting schemes and n-gram range, the highest accuracy achieved with svc is 81.0% for bo with unigram range. seth et al. [7] proposed a hybrid technique for improving the efficiency and reliability of their model by merging svm with the decision tree. the model performs a classification sozan abdulla mahmood and qani qabil qasim: twitter sentiment analysis 20 uhd journal of science and technology | jan 2020 | vol 4 | issue 1 of tweets on the basis of svm and adaboost decision tree individually. then, a hybrid technique will be applied by feeding the outputs obtained from the two above mentioned algorithms as the input to the decision tree. finally, they compared traditional techniques to the proposed model and obtained the accuracy of 84%, while prior accuracies were 82% and 67%. sharma and kumari [8] applied svm to find the polarity of four smartphone product review texts, whether positive or negative. before applying svm, they used part of speech (pos) tagging with tokens, then used clustering for tf-idf features to find more appropriate centroids. 
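as an aside, several of the works surveyed above (e.g., the weighting-scheme comparison in [6]) build an svm on top of tf, tf-idf, or binary n-gram features. a minimal sketch of such a baseline with scikit-learn is shown below; the tiny example corpus and its labels are invented placeholders, not data from any cited paper.

# illustrative tf-idf + unigram/bigram + linear svm baseline of the kind cited above;
# the example tweets are placeholders only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

tweets = ["i love this product, works great", "worst service ever, very disappointed",
          "absolutely fantastic experience", "terrible quality, do not buy"]
labels = ["positive", "negative", "positive", "negative"]

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), lowercase=True),  # unigram + bigram tf-idf features
    LinearSVC(),
)
model.fit(tweets, labels)
print(model.predict(["great quality, i am happy", "this is disappointing"]))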
the accuracy of the model was evaluated based on (precision, recall, f-score, and accuracy) metrics, compared to previous studies on the same datasets where no pos and no clustering were performed. they obtained the accuracy of 90.99% while the best previous study accuracy was 88.5%. rajput and dubey [9] made a comparative study between two supervised classification algorithms, namely, nb and svm for making binary classification of customers review about six indian stock market. the results show that svm provides better accuracy, which was 81.647%, while nb accuracy was 78.469%. rane and kumar [10] worked on a six major us airline datasets for performing a multi-class (positive, negative, and neutral) sa. doc-2vec deep learning approach has been used for representing these tweets as vectors to do a phrase-level analysis – along with seven (7) supervised machine learning algorithms (decision tree, random forest, svm, k-nearest neighbors, logistic regression, gaussian nb and adaboost). each classifier was trained with 80% of the data and tested using the remaining 20% data. accuracy of all classifiers was calculated based on (precision, recall, f1-score) metrics. they concluded that the classification techniques used include ensemble approaches, such as adaboost, which combines several other classifiers to form one strong classifier which performs much better. the maximum achieved accuracy was 84.5%. shuai et al. [11], these authors carry out a binary sa on chinese hotel reviews by using doc2vec feature extraction technique and svm, logistic regression and nb as a classifier. after making a performance comparison between classification algorithms based on the precision, recall rate, and f-measure metrics, svm achieved the best accuracy in their experiment as follows: 79.5%, 87.92%, and 81.16% for all three metrics. bindal and chatterjee [3] described two-step method (lexiconbased sentiment scoring in conjunction with svm, point-wise mutual information utilized to calculate sentiment of tweets. they also discussed the efficacy of several linguistic features, such as pos tags and higher-order n-grams (uni + bi gram, uni + bi + tri gram) in sentiment mining. their proposed scheme had better “f-score” average than commonly used one-step methods such as lexicon, nb, maximum entropy, and svm classifier, i.e., for unigram range lexicon-svm outperforms other classification methods with f-score of 84.39% while other methods f-score is 82.44%, 81.85%, 80.18%, and 83.56%, respectively. mukwazvure and supreethi [12] used a hybrid technique which involves lexicon-based approach for detecting “news comments” polarity in (technology, politics, and business) domains. then, the outcome of lexicon-based is then fed to train two supervised machine learning algorithms: svm and k-nearest neighbor (knn) classifiers. investigational results revealed that svm performed better than knn which were 73.6, 61.38, and 58.00 while knn results were 74.24%, 56.27%, and 55.58%. flores et al. [13] made a comparative analysis of svm algorithm-sequential minimal optimization with synthetic minority over-sampling technique (smote) and naive bayes multinomial (nbm) algorithm with smote for classification of two sa datasets gathered by students of university of san carlos. the outcomes have shown that with 10-folds cross-validation sa for their datasets could perform better compared to 70:30 split. performance of nbm with smote was 72.33% and 78.02% and svm with smote were 83.16% and 82.22% in the term of accuracy. 3. 
materials and methods

3.1. vader
vader stands for valence aware dictionary and sentiment reasoner. it is a lexicon and rule-based sa tool developed by hutto and gilbert [14] in 2014 and is specifically attuned to calculating the sentiment scores of texts expressed on social media. vader uses a sentiment lexicon, that is, a list of lexical features (e.g., words) which are generally labeled according to their semantic orientation as either positive or negative, in combination with a set of rules. vader not only reports the positivity and negativity of a text but also tells us how positive or negative a sentiment is. vader produces four sentiment metrics from these word ratings: the first three, positive, neutral, and negative, represent the proportion of the text that falls into those categories, and the final metric is the compound score, which is computed by summing the valence scores of each word in the lexicon, adjusted according to the rules, and then normalized to lie between -1 (most extreme negative) and +1 (most extreme positive). according to the authors' experiments, it is more effective than other existing lexicon-based approaches, for example, sentiwordnet.

3.2. word embedding and doc2vec
word embedding, also known as word2vec, is a technique for a unique vector representation of each word that takes the semantic meaning of the word into consideration. bag of words, one of the most common techniques for the numerical representation of text, converts a document to a fixed-length feature vector but has some shortcomings: it does not consider the ordering of the words and ignores their semantics, so, for example, "powerful," "strong," and "paris" are treated as equally distant from one another; it also generates a high-dimensional feature set, so it needs a lot of memory space [15]. in the word2vec approach, each word is mapped to a vector in a predefined vector space. these vectors are learned using neural networks; the learning process can be done with a neural network model or by an unsupervised process involving document statistics. word2vec can be implemented in two different architectures: the first is continuous bag of words (cbow), as shown in fig. 1, which is designed to predict the current word from its surrounding history and future words, and the second is skip-gram (sg), which maximizes the probability of the surrounding words given the current word [15], [16]. doc2vec, also called paragraph vector (pv), is a word2vec-based learning approach that converts an entire paragraph to a unique vector, represented by a column in matrix d, while every word is mapped to a unique vector in matrix w; the word and paragraph vectors are then concatenated to predict the next word. the cbow and sg methods have been adapted for doc2vec into two methods, namely, the distributed bag of words version of pv (pv-dbow) and the distributed memory version of pv (pv-dm) [10], as shown in figs. 2 and 3. the dbow model ignores the context words in the input and forces the model to predict words randomly sampled from the paragraph in the output [15]. in the dm model, to predict the next word in a context, the paragraph and word vectors are either averaged, which is called dm mean (dmm), or concatenated, which is called dm concatenation (dmc) [15].
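a minimal sketch of how the two components just described could be combined in python is given below, using the vaderSentiment and gensim libraries: vader assigns a positive/negative label from the compound score, and a pv-dbow doc2vec model turns each tweet into a fixed-length vector. the threshold of 0 on the compound score, the small example corpus, and the training settings are illustrative assumptions, not the authors' exact configuration.

# hedged sketch: vader labeling + pv-dbow doc2vec features; the tweets are placeholders.
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

tweets = ["bitcoin is soaring today, great news",
          "terrible crash, i lost so much money",
          "toyota builds reliable cars"]

# 1) label each tweet positive/negative from the vader compound score (threshold assumed)
analyzer = SentimentIntensityAnalyzer()
labels = ["positive" if analyzer.polarity_scores(t)["compound"] >= 0 else "negative"
          for t in tweets]

# 2) train a pv-dbow doc2vec model (dm=0) and infer one vector per tweet
tagged = [TaggedDocument(words=t.lower().split(), tags=[i]) for i, t in enumerate(tweets)]
d2v = Doc2Vec(tagged, dm=0, vector_size=100, min_count=1, epochs=40)
vectors = [d2v.infer_vector(t.lower().split()) for t in tweets]

print(list(zip(labels, [v[:3] for v in vectors])))  # labels plus first 3 vector components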
3.3. pso algorithm
pso is a type of meta-heuristic algorithm developed by kennedy and eberhart in 1995 to optimize numeric problems iteratively. pso simulates the behavior of animal groups searching for food, especially bird flocking or fish schooling. pso starts with a randomly distributed group of agents called particles in a search space; every particle has its own velocity [17]. each particle keeps two "best" achieved positions: the first is its own best (local best) position, referred to as "pbest," and the second is the global best position, referred to as "gbest." at each step the particles move toward "pbest" and "gbest" based on a new velocity, some constant coefficient parameters such as c1, c2, and w (inertia weight), and two random numbers.

fig. 1. continuous bag of words and skip-gram.

in d-dimensional space, the pso algorithm can be described as follows: $x_i = (x_{i1}, x_{i2}, x_{i3}, \ldots, x_{iD})$ represents the current position of the particle, $v_i = (v_{i1}, v_{i2}, v_{i3}, \ldots, v_{iD})$ refers to its velocity, the local best position is denoted as $pbest_i = (p_{i1}, p_{i2}, p_{i3}, \ldots, p_{iD})$, and the global best position of all particles is $gbest = (p_{g1}, p_{g2}, p_{g3}, \ldots, p_{gD})$. at every iteration, each particle changes its position according to the new velocity:

$v_i^{t+1} = w\,v_i^t + c_1 r_1\,(pbest_i^t - x_i^t) + c_2 r_2\,(gbest^t - x_i^t)$   (1)

in this study, instead of multiplying w by the current velocity only, we multiply w by the whole sum after adding the particle-best and group-best terms. the formulated equation is:

$v_i^{t+1} = w\left[v_i^t + c_1 r_1\,(pbest_i^t - x_i^t) + c_2 r_2\,(gbest^t - x_i^t)\right]$   (2)

$x_i^{t+1} = x_i^t + v_i^{t+1}$   (3)

where i refers to a particle, $pbest_i$ and $gbest$ are the best particle position and the best group position, and the parameters w, $c_1$, and $c_2$ are called inertia weights. $r_1$ and $r_2$ are two random numbers in the range (0, 1), $v_i^t$ is the current velocity, and $v_i^{t+1}$ is the new velocity at the next time step or iteration. furthermore, $x_i^t$ is the current particle position and $x_i^{t+1}$ is the new particle position.

the pseudocode of pso is:
initialize the number of particles (n_particle), d, n_iterations, c1, c2, and w
for each particle i in n_particle
    initialize x_i, v_i
end for
repeat
    for each particle i in n_particle do
        if f(x_i) < f(pbest_i)
            pbest_i = x_i
        end if
        if f(pbest_i) < f(gbest)
            gbest = pbest_i
        end if
    end for
    for each particle i in n_particle do
        for each dimension d in D
            update velocity according to equation (1) for pso and equation (2) for mpso
            update position according to equation (3)
        end for
    end for
    iteration = iteration + 1
until iteration > n_iterations
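a compact python sketch of the pso/mpso update rules in equations (1)-(3) is given below; the sphere function is used as a stand-in objective, and all parameter values are illustrative assumptions rather than the paper's settings.

# minimal pso sketch implementing equations (1)-(3); sphere function as a placeholder objective.
import numpy as np

def pso(objective, dim=2, n_particles=20, n_iterations=50,
        w=0.7, c1=1.5, c2=1.5, modified=False, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.uniform(-5, 5, (n_particles, dim))      # particle positions
    v = np.zeros((n_particles, dim))                # particle velocities
    pbest = x.copy()
    pbest_val = np.apply_along_axis(objective, 1, x)
    gbest = pbest[pbest_val.argmin()].copy()

    for _ in range(n_iterations):
        r1 = rng.random((n_particles, dim))
        r2 = rng.random((n_particles, dim))
        cognitive = c1 * r1 * (pbest - x)
        social = c2 * r2 * (gbest - x)
        if modified:                                 # equation (2): w multiplies the whole sum
            v = w * (v + cognitive + social)
        else:                                        # equation (1): w multiplies velocity only
            v = w * v + cognitive + social
        x = x + v                                    # equation (3)

        vals = np.apply_along_axis(objective, 1, x)
        better = vals < pbest_val
        pbest[better], pbest_val[better] = x[better], vals[better]
        gbest = pbest[pbest_val.argmin()].copy()
    return gbest, pbest_val.min()

if __name__ == "__main__":
    print(pso(lambda p: float(np.sum(p ** 2)), modified=True))  # mpso on the sphere function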
3.4. gwo algorithm
the gwo algorithm is another type of swarm intelligence algorithm, proposed by mirjalili et al. in 2014 [18], that mimics the leadership hierarchy and hunting mechanism of grey wolves in nature. four types of grey wolves, namely alpha, beta, delta, and omega, are employed for simulating the leadership hierarchy. furthermore, the three main steps of hunting, which are searching for prey, encircling prey, and attacking prey, are implemented.

fig. 2. distributed bag of words version of paragraph vector.
fig. 3. distributed memory version of paragraph vector.

3.4.1. social hierarchy
the social hierarchy in this algorithm consists of four groups of wolves, namely alpha (α), beta (β), delta (δ), and a fourth group called omega (ω). in the gwo algorithm, the hunting (optimization) process is guided by α, β, and δ; the ω wolves follow these three wolves [18].

3.4.2. encircling prey
encircling prey means that grey wolves surround the prey during the hunt. the following mathematical equations model the encircling behavior [18]:

$\vec{D} = \left| \vec{C} \cdot \vec{X}_p(t) - \vec{X}(t) \right|$   (4)

$\vec{X}(t+1) = \vec{X}_p(t) - \vec{A} \cdot \vec{D}$   (5)

where t indicates the current iteration, $\vec{A}$ and $\vec{C}$ are coefficient vectors, $\vec{X}_p$ is the position vector of the prey, and $\vec{X}$ indicates the position vector of a grey wolf. the vectors $\vec{A}$ and $\vec{C}$ are calculated as:

$\vec{A} = 2\vec{a} \cdot \vec{r}_1 - \vec{a}$   (6)

$\vec{C} = 2\,\vec{r}_2$   (7)

where $\vec{a}$ is linearly decreased from 2 to 0 throughout the iterations and $\vec{r}_1$, $\vec{r}_2$ are two random vectors in the range (0, 1).

3.4.3. hunting
grey wolves can identify the location of prey and encircle it. the hunt is usually guided by the alpha; sometimes the beta and delta might also get involved. alpha (the best candidate solution), beta, and delta have better knowledge about the potential location of prey. thus, the first three best solutions are kept, and the positions of the search agents are updated according to the positions of these best search agents, based on the following mathematical formulas [18]:

$\vec{D}_\alpha = \left| \vec{C}_1 \cdot \vec{X}_\alpha - \vec{X} \right|$,  $\vec{D}_\beta = \left| \vec{C}_2 \cdot \vec{X}_\beta - \vec{X} \right|$,  $\vec{D}_\delta = \left| \vec{C}_3 \cdot \vec{X}_\delta - \vec{X} \right|$   (8)

$\vec{X}_1 = \vec{X}_\alpha - \vec{A}_1 \cdot \vec{D}_\alpha$,  $\vec{X}_2 = \vec{X}_\beta - \vec{A}_2 \cdot \vec{D}_\beta$,  $\vec{X}_3 = \vec{X}_\delta - \vec{A}_3 \cdot \vec{D}_\delta$   (9)

$\vec{X}(t+1) = \dfrac{\vec{X}_1 + \vec{X}_2 + \vec{X}_3}{3}$   (10)

the pseudocode of gwo is as follows:
initialize the grey wolf population x_i (i = 1, 2, ..., n)
initialize a, A, and C
calculate the fitness of each search agent
x_α = the first best search agent
x_β = the second best search agent
x_δ = the third best search agent
while (t < max number of iterations)
    for each search agent
        update the position of the current search agent according to equations (8)-(10)
    end for
    a = 2 − t * (2 ⁄ max_iteration)
    calculate A and C using equations (6) and (7)
    calculate the fitness of each search agent
    update x_α, x_β, and x_δ
    t = t + 1
end while
return x_α
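the same kind of sketch for the gwo position update in equations (4)-(10) is shown below, again with the sphere function as a placeholder objective and illustrative parameter values that are not taken from the paper.

# minimal grey wolf optimizer sketch following equations (6)-(10); placeholder objective.
import numpy as np

def gwo(objective, dim=2, n_agents=20, n_iterations=50, lb=-5.0, ub=5.0, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.uniform(lb, ub, (n_agents, dim))

    for t in range(n_iterations):
        fitness = np.apply_along_axis(objective, 1, x)
        order = fitness.argsort()
        alpha, beta, delta = x[order[0]].copy(), x[order[1]].copy(), x[order[2]].copy()

        a = 2 - t * (2 / n_iterations)               # 'a' decreases linearly from 2 to 0
        for i in range(n_agents):
            x_new = np.zeros(dim)
            for leader in (alpha, beta, delta):
                r1, r2 = rng.random(dim), rng.random(dim)
                A = 2 * a * r1 - a                   # equation (6)
                C = 2 * r2                           # equation (7)
                D = np.abs(C * leader - x[i])        # equation (8)
                x_new += leader - A * D              # equation (9)
            x[i] = np.clip(x_new / 3.0, lb, ub)      # equation (10), kept inside the bounds
    fitness = np.apply_along_axis(objective, 1, x)
    return x[fitness.argmin()], fitness.min()

if __name__ == "__main__":
    print(gwo(lambda p: float(np.sum(p ** 2))))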
dim), postion=dot (random (search agents no, dim), (ub−lb)) + lb while (t <max_iteration) for each search agent a=2−t*(2⁄max_iteration) calculate a 1 , a 2 , and a 3 according to equation (6) calculate the fitness of each search agent using equations (9) and (11) update velocity and position of the current search agent according to equations (12) and (13) end for t=t + 1 end while. 4. problem statement as described in section 1, machine learning techniques are popular ways of sentiment classification. in this work, to perform sentiment classification, radial bias function kernel-based svm (rbf-svm) has been used. the accuracy and performance of this type of svm mainly depend on the value of two parameters, namely, penalty “c” and “gamma” which known as hyperplane “soft margin” parameters. hence, selecting optimal value for those parameters is a challenge to boost the classification model accuracy. to solve this problem, four meta-heuristic optimizer algorithms: pso, mpso, gwo, and hybrid of pso-gwo have been implemented to select the best values for those parameters. 5. proposed system model in this study, four meta-heuristic optimizer algorithms have been implemented for selecting the best value to “soft margin” penalty “c” and “gamma” parameters to improve the accuracy of the rbf-svm classifier. the work implemented on dell latitude e6540, intel(r) core(tm) i7-4610m cpu at 3.00ghz, 8-gb ram, 64-bit windows-7 operating system. fig. 4 is a flow diagram that displays basic architecture and steps of the proposed sentiment classification model. 5.1. tweet collection to access twitter and reading tweets from it, you have to make a twitter developer account that known as twitter-api. twitter-api is an interface between the developers and twitter that enables them to search for tweets based on their desired keyword through some secret key and tokens. in this work a twitter-api is created called “twitter-sentiment-analysis-20,” to collect the most recent tweets according to some keyword such as trump, bitcoin, iot, and toyota using python code and categorizing to “positive” and “negative” using “vader” [14] lexicon rule-based method then persist in mongo database collection or table. table 1 shows the details about each keyword dataset size, and fig. 5 shows a sample of data. 5.2. pre-processing pre-processing means cleaning the text from the least important data. the datasets will go through the following steps for pre-processing task: removing duplicate tweets, convert the words to the lowercase, and replace emoticons symbols with a positive or negative opinion, according to table 2. the next step is removing urls, slang correction (omg → oh my god), expand contraction (can’t → cannot), stripping punctuation marks, special character and numbers, as well as multiple spaces, clearing from stop words, tokenizing, and table 1: dataset size description keyword positive negative total trump 1339 1626 2965 bitcoin 4923 2341 7264 iot 10,700 1929 12,629 toyota 14,332 6594 20,926 total 31,294 12,490 43,784 table 2: emoticons and their meaning :-), :-d, :-j, =p, :], :3 positive :(, :[, ^o), :^), :@, =/ negative sozan abdulla mahmood and qani qabil qasim: twitter sentiment analysis uhd journal of science and technology | jan 2020 | vol 4 | issue 1 25 fig. 5. sample of collected tweets. fig. 4. flow diagram of the proposed model. lemmatizing and finally, dropping duplicate tweets after preprocessing and protecting them in another mongo database collection. 5.3. 
feature extraction
feature extraction is the most important phase. the purpose of this phase is to normalize the data by converting the words into vectors for the classification process. gensim's deep learning library has been utilized for the numerical representation of each document. doc2vec is a document embedding method in which each document is mapped to a vector in space. doc2vec is gensim's extension of word2vec, which is used to find vector representations for each word [15]. doc2vec was proposed in two models, namely, dbow and dm. dm is divided into two sub-models, namely, dmc and dmm. after pre-processing, the cleaned tweets are split into two parts: a training set composed of 80% of the tweets and a test set composed of 20% of the tweets. the doc2vec models and their combinations dbow + dmc and dbow + dmm are then used to extract features from the pre-processed training and test sets.

5.4. classification
to perform the classification task, the rbf-svm has been used. svm is one of the well-known supervised machine learning methods, broadly used in classification and regression tasks due to its ability to work with large amounts of data. in the first approach, the traditional svm with the default values "1" for c and "scale" for gamma is used to classify the tweets. in the second approach, at each iteration, the rbf-svm's "c" and "gamma" parameters take the position value of each agent. after the last iteration, the best accuracy, together with the corresponding best c and gamma values, is reported. finally, the accuracy of the two classification approaches is compared.

6. results and discussion
figs. 6-9 show an accuracy comparison between the traditional svm and the optimized svm with the different doc2vec feature extraction models.

fig. 6. results of trump dataset.
fig. 7. results of bitcoin dataset.
fig. 8. results of iot dataset.

as shown in fig. 6, all optimizers provide a better result for all doc2vec feature extraction methods. the hybrid of pso-gwo provides better results in the dbow and dmc models. furthermore, mpso-svm outperforms the original pso-svm for the dbow, dmc, and dbow + dmc models, respectively.
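to make the interaction between the optimizers and the classifier concrete, the following is a minimal sketch (not the authors' implementation) of the second approach described in section 5.4: each agent's position supplies the penalty c and gamma of the rbf-svm, and the resulting accuracy on the held-out split acts as the fitness value. it assumes scikit-learn's svm.SVC; the synthetic feature matrix merely stands in for the doc2vec vectors, and all variable and function names are illustrative.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# stand-in for the doc2vec feature matrix (the real features come from gensim)
x, y = make_classification(n_samples=400, n_features=100, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

def fitness(position):
    # an agent's position encodes the two rbf-svm hyperparameters (c, gamma)
    c, gamma = position
    clf = SVC(kernel="rbf", C=c, gamma=gamma)
    clf.fit(x_train, y_train)
    # classification accuracy on the held-out 20% split is used as the fitness value
    return accuracy_score(y_test, clf.predict(x_test))

# first approach: traditional svm with the default values c = 1 and gamma = "scale"
baseline = SVC(kernel="rbf", C=1.0, gamma="scale").fit(x_train, y_train)
print("default svm:", accuracy_score(y_test, baseline.predict(x_test)))

# second approach: every optimizer agent proposes a (c, gamma) pair at each
# iteration; a few hand-picked positions stand in for the swarm here
for position in [(0.5, 0.01), (10.0, 0.001), (100.0, 0.0001)]:
    print(position, fitness(position))

in pso, mpso, gwo, or the hybrid, this fitness function would simply be called for every agent at every iteration, and the best (c, gamma) pair found over all iterations would be the one reported.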
by looking at the “bitcoin” dataset results, for dbow, dmc, and dmm models, all optimizers provide a remarkable accuracy compared to traditional svm, except for hybrid pso-gwo that could not get expectable result for dbow + dmc and dbow + dmm models. mpso-svm provides better results than original pso-svm for both doc2vec dmm and dbow + dmc models. the results show that the model accuracy remarkably increased for all optimizers with different doc2vec models and their combinations, especially gwo that achieves the highest accuracy result that is 91.093% in dbow + dmm, followed by mpso and pso. in dbow, dmc, and dmm models hybrid of pso-gwo provides a better result than pso and mpso, but in dbow + dmc and dbow + dmm combinations, it increased the model accuracy by <1%. finally, fig. 9 illustrates that all optimizers outperform svm when used alone, like “iot” dataset, gwo-svm sozan abdulla mahmood and qani qabil qasim: twitter sentiment analysis uhd journal of science and technology | jan 2020 | vol 4 | issue 1 27 outperforms other optimizers in all doc2vec models. except for pso-gwo with svm that could not grant the expected result for dbow + dmc and dbow + dmm the same as the “bitcoin” dataset. 7. conclusion and future work in this work, we have carried out a comparative analysis between classification with traditional rbf-svm and optimized rbf-svm using four meta-heuristic optimizers, namely, pso, mpso, gwo, and hybrid of pso and gwo. these optimizers are implemented for selecting the best values for hyperplane “soft margin” penalty “c” and gamma parameters of the rbf-svm classifier. after testing our model on each dataset and with different doc2vec feature extraction methods. we came to the point that these optimizers have an important role in enhancing the accuracy of the classifier. the results show that with a small dataset, mpso provides a better result than the original pso. in contrast, with increasing the dataset size, svm with gwo achieves better accuracy compared to the rest optimizers. hybrid of pso-gwo is effective in improving svm accuracy in doc2vec dbow, dmc, and dmm models, but it is not work well for combinations of dbow + dmc and dbow + dmm because of feature set nature was generated by merging these two models. in future works, we will try to use these optimizers for parameter optimizing of some deep learning algorithms, i.e., rectified neural network weights to examine whether it performs better results than existing rbf-svm model or not. references [1] a. go, r. bhayani and l. huang. “twitter sentiment classification using distant supervision”. technical report, stanford university. p. 6, 2009. [2] r. feldman. “techniques and applications for sentiment analysis: the main applications and challenges of one of the hottest research areas in computer science”. communication of the acm, vol. 56, no. 4, pp. 82-89, 2013. [3] n. bindal and n. chatterjee. “a two-step method for sentiment analysis of tweets.” in: 15th international conference information technology 2016, bhubaneswar, pp. 218-224, 2017. [4] s. k. jain and p. singh. “systematic survey on sentiment analysis”. in: 2018-1st international conference on secure cyber computing and communication, jalandhar, pp. 561-565, 2019. [5] m. k. das, b. padhy and b. k. mishra. “opinion mining and sentiment classification: a review”. in: proceedings of the international conference on inventive systems and control 2017, malaysia, pp. 4-6, 2017. [6] s. naz, a. sharan and n. malik. 
“sentiment classification on twitter data using support vector machine”. 2018 ieee/wic/acm international conference on web intelligence, santiago, pp. 676679, 2019. [7] p. seth, a. sharma and r. vidhya. “sentiment analysis of tweets using machine learning approach”. international journal of engineering and technology, vol. 7, no. 3.12, p. 434, 2018. [8] a. k. sharma. and d. s. u. kumari. “sentiment analysis of smart phone product review using svm classification technique”. international conference on energy, communication, data analytics and soft computing, chennai, india, pp.1469-1474, 2017. [9] v. s. rajput and s. m. dubey. “stock market sentiment analysis based on machine learning”. in: 2016 2nd international conference on next generation computing technologies (ngct), dehradun, pp. 506-510, 2017. [10] a. rane and a. kumar. “sentiment classification system of twitter data for us airline service analysis”. international computer software and applications conference, vol. 1, pp. 769-773, 2018. [11] q. shuai, y. huang, l. jin and l. pang. “sentiment analysis on chinese hotel reviews with doc2vec and classifiers”. in: 018 ieee 3rd advanced information technology, electronic and automation control conference (iaeac), pp. 1171-1174, 2018. [12] a. mukwazvure and k. p. supreethi. “a hybrid approach to 80 .5 78 80 .2 19 80 .2 43 81 .2 7 80 .9 682 .1 3 82 .7 28 82 .4 17 83 .0 14 83 .0 62 82 .6 32 82 .1 54 82 .2 5 78 .6 66 78 .3 56 81 .5 57 81 .5 33 82 .2 74 82 .3 45 82 .2 98 81 .6 29 81 .8 2 81 .9 39 82 .6 08 82 .0 59 d b o w d m c d m m d b o w + d m c d b o w + d m m svm gwo-svm hpsogwo-svm modified pso-svm pso-svm fig. 9. results of toyota dataset. sozan abdulla mahmood and qani qabil qasim: twitter sentiment analysis 28 uhd journal of science and technology | jan 2020 | vol 4 | issue 1 sentiment analysis of news comments”. in: 2015 4th international conference on reliability, infocom technologies and optimization, noida, 2015. [13] a. c. flores, r. i. icoy, c. f. pena and k. d. gorro. “an evaluation of svm and naive bayes with smote on sentiment analysis data set”. in: 2018-4th international conference on engineering, applied sciences, and technology, explor innovative smart solutions social, phuket, pp. 1-4, 2018. [14] j. hutto and e. e. gilbert. “vader: a parsimonious rulebased model for sentiment analysis of social media text”. in: 8th international conference on weblogs and social media, michigan, 2014. [15] q. le and t. mikolov. “distributed representations of sentences and documents”. 31st international conference on machine learning, vol. 4, pp. 2931-2939, 2014. [16] m. bilgin and i̇. f. şentürk. “sentiment analysis on twitter data with semi-supervised doc2vec”. in: 2nd international conference on computer science and engineering ubmk 2017, turkish, pp. 661666, 2017. [17] r. eberhart and j. kennedy. “new optimizer using particle swarm theory”. in: proceedings international symposium on micro machine and human science, new york, pp. 39-43, 1995. [18] s. mirjalili, s. m. mirjalili and a. lewis. “grey wolf optimizer”. advances engineering software, vol. 69, pp. 46-61, 2014. [19] n. singh and s. b. singh. “hybrid algorithm of particle swarm optimization and grey wolf optimizer for improving convergence performance”. journal of applied mathematics, vol. 2017, pp. 15, 2017. tx_1~abs:at/tx_2:abs~at 18 uhd journal of science and technology | jul 2020 | vol 4 | issue 2 1. 
introduction drilling fluid which is simply called mud is one of the most important systems of drilling the borehole to extract hydrocarbon [1], [2]. this kind of fluid is used for providing several functions including bottom hole cleaning, providing a balance between borehole pressures, and controlling the wellbore stability [3]. to achieve these applications, the drilling fluids must be prepared with practicable properties that have a significant role, such as rheological and filtration properties [4]. for this purpose, many chemical additives are used including clays, solvents, and polymers. polymers in different types are widely used with the water-based muds to improve the rheological properties of the mud, and the mud is sometime called polymer mud [5]-[7]. at present, researchers and companies are more focusing on the application of biopolymers which are definitely not poisonous and cheaper while having significantly less influence on formation damage [8], [9]. natural gum polymers are widely used in the industry and studied by many researchers at the modern drilling process due to its low cost and high performance. gums are comparison between the effect of local katira gum and xanthan gum on the rheological properties of water-based drilling fluids bayan qadir sofy hussein1*, khalid mahmood ismael sharbazheri1, nabil adiel tayeb ubaid2 1department of engineering, kurdistan institution for strategic study and scientific research, sulaimani polytechnic university, sulaimani, iraq, 2department of petroleum and energy engineering, faculty of engineering, sulaimani polytechnic university, sulaimani, iraq a b s t r a c t the rheological properties of drilling fluids have an important role in providing a stable wellbore and eliminating the borehole problems. several materials including polymers (xanthan gum) can be used to improve these properties. in this study, the effect of the local katira, as a new polymer, on the rheological properties of the drilling fluids prepared as the bentonite-water-based mud has been investigated in comparison with the conventional xanthan gum. experimental work was done to study of rheological properties of several gums such as, local katira gum, and xanthan gum bentonite drilling mud. different samples of drilling fluids are prepared adding the xanthan gum and local katira to the base drilling fluid at different concentrations using hamilton beach mixer. the prepared samples are passed through rheological property tests including the apparent viscosity, plastic viscosity, and yield point (yp) under different temperature conditions. the obtained results show that the viscosity is increased from 5 to 8.5 cp and yp is increased from 18.5 to 30.5 lb/100 ft2, with increasing the concentration of the xanthan gum from 0.1 to 0.4. however, the effect of the local katira in increasing the viscosity and yp is lower compared with the xanthan gum, which are ranged between 5–6 cp and 18.5–20.5 cp. index terms: drilling mud, polymer, rheological properties, xanthan gum, and katira gum corresponding author’s e-mail: bayan qadir sofy hussein, department of engineering, kurdistan institution for strategic study and scientific research, sulaimani polytechnic university, sulaimani, iraq. e-mail: bayan.sofy@kissr.edu.krd received: 18-04-2020 accepted: 14-07-2020 published: 19-07-2020 access this article online doi: 10.21928/uhdjst.v4n2y2020.pp18-27 e-issn: 2521-4217 p-issn: 2521-4209 copyright © 2020 hussein, et al. 
this is an open access article distributed under the creative commons attribution non-commercial no derivatives license 4.0 (cc by-nc-nd 4.0) o r i g i n a l r e s e a r c h a r t i c l e uhd journal of science and technology bayan qadir sofy hussein, et al.: effect of polymer on drilling fluid properties uhd journal of science and technology | jul 2020 | vol 4 | issue 2 19 generally used to provide a good influence on the rheological characteristics of several water-based drilling fluids that prepared from the mixture of water and bentonite [10], [11]. several types of gum polymers that are used in drilling fluid application are suggested in the literature by researchers, such as tragacanth gum [8], xanthan gum [12], [13], tamarind tree gum [14], carboxymethyl cellulose [15], [16], and modified natural gum including diutan gum [17]. among the gum polymers, guar and xanthan gums are most commonly used to modify the rheological and filtration properties of the waterbased drilling fluids [11]. however, the identification and investigation of new types of the natural gums are necessary. in 2018, weikey et al. [9] investigated the effect of different gums including babul, dhawda, katira, and semal gums of the perfor mance of the water-based mud. their results showed that babul, semal, and katira improved the rheological properties of the mud, however, dhawda gum showed the highest performance. the comparison between the influences of the tamarind gum and polyanionic cellulose on the bentonite-water suspensions is investigated experimentally by mahto and sharma [14]; consequently, the tamarind gum produced more favorable rheological properties, optimal fluid loss, and lower formation damage at very low concentrations compared with the polyanionic cellulose. in addition, benyounes et al. [15] investigated the effect of carboxymethyl cellulose and xanthan gum on rheological properties of the water-based mud. their laboratory results revealed that the viscosity and yield point (yp) of bentonite suspension are increased and the flow index is decreased with increasing the xanthan concentration. however, carboxymethyl cellulose caused to increase viscosity along with decreasing the yp. on the other hand, benmounah et al. [16] confirmed that the presence of carboxymethyl cellulose in the bentonite suspension helps to remove the yield stress and increase the viscosity of the mixture, and the xanthan gum induces an increase in the yield stress and in viscosity of the drilling muds. dewangan and sinha [18] developed water-bentonite suspension from the babul tree gum and carboxymethyl cellulose and studied using marsh funnel. the results revealed that better rheological properties of mud can be achieved with using carboxymethyl cellulose compared with the babul tree gum. however, they claimed that the babul gum can be used to improve the rheological properties of mud to a favorite level. more recently in 2019, ali et al. [19] enabled to greatly improve the rheological and filtration properties of the water-based mud by combining the xanthan gum with a nanocomposite of sio 2 /zno. in their work, the viscosity and gel strength are increased from 9 cp and 20 lb/100ft2 to 22 cp and 49 lb/100ft2, respectively. while, the fluid loss and filter cake are reduced by 54% and 92.5%, respectively. the ultimate goal of this work is to investigate the effect of two polymers (xanthan and natural katira gum) on the rheological behaviors of bentonite-water-based drilling fluid. 
for this purpose, different samples of drilling fluids are prepared at different concentrations of polymer gums. the rheological viscometer is used to study the viscosity and yp of the prepared drilling fluids under the effect of types of gum, concentration of gum, and test temperature. 2. materials and methodology 2.1. materials the drilling mud mixture utilized in this research are freshwater, bentonite, natural katira gum, and xanthan gum. the freshwater is used as the base fluid and for fluid conditioning, the bentonite along with other additives utilized in the drilling fluid formulations. the purpose of adding biopolymers, natural katira gum, and xanthan gum is used to increase viscosity and fluid loss control. the bentonite which is composed mostly of montmorillonite obtained from one of the kurdistan’s drilling companies. the additives utilized are natural katira gum that is purchased from the local market and xanthan gum is supplied from petroleum engineering department of sulaimani polytechnic university. 2.2. xanthan gum xanthan gum is the polysaccharide anionic item within a commercial level gained from the aerobic agitation of xanthomonas campestris bacteria. xanthan gum has a strong stability to the different conditions of heat, salinity, and ph [20]. one of the most attractive characteristics of xanthan is its ability to increase the viscosity for the drilling fluid with very low concentration [21]. xanthan gum has a simple chemical structure as shown in fig. 1 that comes with a typical molecular weight of 3×105-7.5×106 g/mole [22]. 2.3. local katira gum katira gum is mainly exist in rocky locations of kurdistan mountain. high-quality gum was harvested and dried during the spring season from april to june. it has mostly acetylated complex polysaccharide and includes about 80% acetyl groups as well as about 37% uronic acid remains with acidic quantity varying from 17.4 to 22.7. according to weikey et al. [9], the molecular weight of katira is about 9.5×106 g/mol. at low concentrations, katira are capable to create a high gluey gum with a good swelling behavior due to the existence of the acetyl groups [9]. bayan qadir sofy hussein, et al.: effect of polymer on drilling fluid properties 20 uhd journal of science and technology | jul 2020 | vol 4 | issue 2 katira gum is the sap of the thorny plant which mainly presents in middle eastern and west asian countries. this particular gum is odor free, flavorless, and water-soluble mixing extracted from their drain for the plant. the gum through the plant will be gained obviously through the root and through cuts produced in their stem then dehydrated to be able to creates small pieces (fig. 2). 2.4. preparation of drilling fluids the drilling fluids are prepared using 3-speed type hmd200 hamilton beach mixer. the base drilling fluid is prepared from mixing 22.5 g bentonite within 350 ml distilled water within the mixing cup. afterward, to identify the effect of the xanthan and katira polymer gums on the rheological properties of the drilling mud, both polymers are separately added into the prepared base drilling fluids at different concentrations. for preparing the xanthanbased drilling mud, four samples of muds are prepared at different concentrations of 0.1, 0.2, 0.3, and 0.4 g, and four katira drilling fluids are prepared at 20, 40, 60, and 80 mg concentrations (table 1). 2.5. 
rheological measurements for the prepared drilling fluids shown in table 1, the six-speed rotary viscometer model znn-d6 is used to measure the rheological properties including apparent viscosity, yp, plastic viscosity (pv), and yp/pv proportion dependent on the shear table 1: formulation of the prepared drilling fluids mud sample polymer gum concentration additives, g s0 ------bentonite, 22.5 xs1 xanthan 0.1 g xs2 0.2 g xs3 0.3 g xs4 0.4 g ks1 katira 20 mg ks2 40 mg ks3 60 mg ks4 80 mg fig. 1. xanthan gum chemical structure [13]. fig. 2. the natural katira gum on the tree, after harvesting and as a powder. bayan qadir sofy hussein, et al.: effect of polymer on drilling fluid properties uhd journal of science and technology | jul 2020 | vol 4 | issue 2 21 stress and shear rate. these measurements are performed at different temperature conditions of 25, 45, and 75°c. in general, the following procedural steps in accordance with [23] application programming interface specification 13a are applied: • the mud mixture prepared using 22.5±0.01 g of bentonite added into 350 cm3 of fresh water while stirring in the mixer putting the gums added at various concentrations. • immediately after mixing 5±0.5 min, the container taken from mixer and cleaned using spatula to remove bentonite on container walls, making sure bentonite adhering into the spatula is actually included in the mixture. • replacing the container in the blender following an additional 5 min as the edges scraped to free any clay adhering to container and blending again for 10 min, complete mixing duration will equate to 20±1 min. • put the mixture into viscometer cup supplied with the direct-indicating viscometer. the dial readings at 600 r/min as well as 300 r/min rotor speed settings of the viscometer will be recorded whenever a constant value for every r/min is achieved. the test temperature was 25±1°c. • the pv, yp, and yp/pv ratio are generally calculated using the following equations: pv=r600-r300 (1) yp=r300-pv (2) c � yp pv = (3) av r = 600 2 (4) where: pv: plastic viscosity in cp, centipoises. yp: yield point in ib/100 ft2. c: yp/pv ratio. r600: the viscometer dial reading at 600 rpm/min. r300: the viscometer dial reading at 300 rpm/min. 3. results and discussion 3.1. effect of xanthan gum on rheological properties by adding different concentrations of xanthan gum into the base drilling fluid, four samples (xs1, xs2, xs3, and xs4) are prepared. fig. 3 shows the values of the reordered viscometer readings at different speeds starting from the low shear stresses 3 and 6 rpms to moderate 100 and 200 rpms and high shear stresses of 300 and 600 rpms. with increase the shear stress and the rotation speed per minute, the recorded value of shear rate is increasing. the same effect can be seen with increasing the concentration of the xanthan gum. the measured results of the rheological properties including the apparent viscosity, pv, yp, and pv to yp ratio under the influence of the xanthan gum are shown in table 2. the measurements are carried out under different temperature conditions of 25, 45, and 75°c. fig. 4 illustrates the values of the apparent viscosity measured for the prepared drilling fluids containing the xanthan gum at different concentrations of 0.1, 0.2, 0.3, and 0.4 g. as can be seen, the apparent viscosity of the base mud is increased from 14.25 to 21.25 cp with adding 0.1 g of the xanthan gum and continued to increase to 28.25 cp with increasing the concentration of xanthan to 0.4 g. 
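to make relations (1)-(4) concrete, the following is a minimal python sketch (not part of the experimental procedure) that computes pv, yp, the yp/pv ratio, and the apparent viscosity from a pair of dial readings; the readings used are back-calculated from the reported base-mud (s0) properties purely for illustration.

def rheology_from_dial_readings(r600, r300):
    # equations (1)-(4): viscometer dial readings at 600 and 300 rpm
    pv = r600 - r300   # plastic viscosity, cp
    yp = r300 - pv     # yield point, lb/100 ft2
    c = yp / pv        # yp/pv ratio
    av = r600 / 2      # apparent viscosity, cp
    return pv, yp, c, av

# illustrative readings consistent with the base mud s0
pv, yp, c, av = rheology_from_dial_readings(r600=28.5, r300=23.5)
print(pv, yp, c, av)   # 5.0 cp, 18.5 lb/100 ft2, 3.7, 14.25 cp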
while, the apparent viscosity with all concentrations of xanthan gum is decreased with increasing the temperature of the experiment from 25 to 45 and 75°c. in general, the apparent viscosity of xs1, xs2, and xs3 samples is reduced by about 9% with increasing temperature from 25 to 75°c. the pv behavior of the mud samples developed from different concentrations of xanthan gum is shown in fig. 5. as can be seen, the pv of the base fluid is increased by 40% from 5 to 8.5 cp by adding 0.4 g of xanthan gum under 25°c. however, the impact of the xanthan in improving the pv is lower when the temperature of the experiment is increased. this is true with benyounes et al. [15], where they demonstrate that the rheological behavior of the polymer solution predominates compared to their clay suspension system. in this situation, the macroscopic behavior for the bentonite-xanthan mixtures is influenced by xanthan. 0 10 20 30 40 50 60 0 100 200 300 400 500 600 vi sc om et er r ea di ng ( cp ) dial reading (rpm) s0 xs1 xs2 xs3 xs4 fig. 3. shear rate versus shear stress at various concentrations of xanthan gum and room temperature. bayan qadir sofy hussein, et al.: effect of polymer on drilling fluid properties 22 uhd journal of science and technology | jul 2020 | vol 4 | issue 2 table 2: measured rheological properties of drilling fluids under the influences of xanthan gum at 25, 45, and 75°c mud sample av (cp) pv (cp) yp (lb/100 ft2) yp/pv 25°c 45°c 75°c 25°c 45°c 75°c 25°c 45°c 75°c 25°c 45°c 75°c s0 14.2 14.2 14.2 5.0 5.0 5.0 18.5 18.5 18.5 3.7 3.7 3.7 xs1 21.2 20.5 18.5 6.5 5.0 4.0 29.5 31.0 29.0 4.5 6.2 7.3 xs2 25.5 24.5 23.0 7.0 5.5 5.0 37.0 38.0 36.0 5.3 6.9 7.2 xs3 27.0 26.0 24.5 8.0 6.5 6.0 38.0 39.0 37.0 4.8 6.0 6.2 xs4 28.2 26.5 25.7 8.5 7.0 6.5 39.5 39.0 38.5 4.6 5.6 5.9 14 .2 5 21 .2 5 25 .5 27 2 8. 25 14 .2 5 20 .5 24 .5 26 2 6. 5 14 .2 5 18 .5 23 24 .5 25 .7 5 s0 xs1 xs2 xs3 xs4 mud samples 25 c 45 c 75 c fig. 4. apparent viscosity measured at different concentrations of xanthan gum at 25, 45, and 75°c. 5 6. 5 7 8 8 .5 5 5 5. 5 6. 5 7 5 4 5 6 6. 5 s0 xs1 xs2 xs3 xs4 mud samples 25 c 45 c 75 c fig. 5. plastic viscosity measured at different concentrations of xanthan gum at 25, 45, and 75°c. bayan qadir sofy hussein, et al.: effect of polymer on drilling fluid properties uhd journal of science and technology | jul 2020 | vol 4 | issue 2 23 in addition, the values of the measured yps for four prepared mud samples of xanthan gum are shown in fig. 6. the used xanthan gum left the highest impact on the yp compared with other properties of the drilling fluids which is about 53%. the value of the yp is increased from 18.5 to 29.5 lb/100 ft2 with adding 0.1 g xanthan, continued to increase to 39.5 lb/100 ft2. furthermore, the ratio of the yp to pv calculated from the measured properties of the drilling fluid samples under the influence of the xanthan gum and different temperature is shown in fig. 7. as is obvious, the ratio between two considered parameters is varied dependents on the concentration of the xanthan gum and temperature. the high ratios between two parameters with xs1 and xs2 samples are obtained, especially at high temperatures. 3.2. effect of natural katira gum on rheological properties to identify the impact of the katira gum on the rheological properties of the drilling mud, the prepared katira in powder is added into the base drilling mud at different concentrations. 
for the concentrations of 20 mg, 40 mg, 60 mg, and 80 mg, four samples of katira gum (ks1, ks2, ks3, and ks4) are prepared. fig. 8 shows the values of the reordered viscometer readings at different speeds starting from the low shear stresses 3 and 6 rpms to moderate 100 and 200 rpms and high shear stresses of 300 and 600 rpms. with increase the shear stress and the rotation speed per minute, the recorded value of shear rate is increasing. the same effect can be seen with increasing the 18 .5 29 .5 37 3 8 39 .5 18 .5 31 38 3 9 39 18 .5 29 36 3 7 38 .5 s0 xs1 xs2 xs3 xs4 mud samples 25 c 45 c 75 c fig. 6. yield point measured at different concentrations of xanthan gum at 25, 45, and 75°c. 3. 7 4. 5 5. 3 4. 8 4. 6 3. 7 6. 2 6. 9 6. 0 5. 6 3. 7 7. 3 7. 2 6. 2 5. 9 s0 xs1 xs2 xs3 xs4 mud samples 25 c 45 c 75 c fig. 7. ratio of the yield point to plastic viscosity at different concentrations of xanthan gum at 25, 45, and 75°c. bayan qadir sofy hussein, et al.: effect of polymer on drilling fluid properties 24 uhd journal of science and technology | jul 2020 | vol 4 | issue 2 table 3: measured rheological properties of drilling fluids under the influences of katira gum at 25, 45, and 75°c mud sample av (cp) pv (cp) yp (lb/100 ft2) yp/pv 25°c 45°c 75°c 25°c 45°c 75°c 25°c 45°c 75°c 25°c 45°c 75°c s0 14.2 14.2 14.2 5.0 5.0 5.0 18.5 18.5 18.5 3.7 3.7 3.7 ks1 17.0 16.0 15.5 6.5 3.0 2.0 21.0 26.0 27.0 3.2 8.7 13.5 ks2 15.7 15.5 14.7 5.5 4.5 3.5 20.5 22.0 22.5 3.7 4.9 6.4 ks3 15.5 15.0 16.0 5.5 2.0 2.0 20.0 26.0 28.0 3.6 13.0 14.0 ks4 16.2 15.5 15.5 6.0 1.0 1.5 20.5 29.0 28.0 3.4 29.0 18.7 concentration of the katira gum. the measured results of the rheological properties including the apparent viscosity, pv, yp, and pv to yp ratio under the influence of the katira gum are shown in table 3. the measurements are carried out under different temperature conditions of 25, 45, and 75°c. fig. 9 illustrates the values of the apparent viscosity measured for the prepared drilling fluids containing the katira gum at different concentrations of 20 mg, 40 mg, 60 mg, and 80 mg. as can be seen, the apparent viscosity of the base mud is increased from 14.25 to 17 cp with adding 20 mg of the katira gum and continued to increase to 17.75 cp with increasing the concentration of katira to 40 mg. while, the apparent viscosity with all concentrations of xanthan gum is decreased with increasing the temperature of the experiment from 25 to 45 and 75°c. in general, the impact of the katira gum on the apparent viscosity under all concentrations and temperature conditions is low. only small variation in the value of the apparent viscosity is noticed by changing the experiment conditions. the pv behavior of the mud samples developed from different concentrations of katira gum is shown in fig. 10. as can be seen, the pv of the base fluid is increased by 23% from 5 to 6.5 cp by adding 20 mg katira gum under 25°c. however, the impact of the katira in improving 10 15 20 25 30 35 0 100 200 300 400 500 600 v is co m et er r ea di ng (c p ) dial reading (rpm) s0 xs1 xs2 xs3 xs4 fig. 8. shear rate versus shear stress at different concentrations of natural katira gum at 25°c. 14 .2 5 17 15 .7 5 15 .5 16 .2 5 14 .2 5 16 15 .5 15 15 .5 14 .2 5 15 .5 14 .7 5 16 15 .5 s0 mud samples 25 c 45 c 75 c ks1 ks2 ks3 ks4 fig. 9. apparent viscosity measured at different concentrations of katira gum at 25, 45, and 75°c. 
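a note on the percentage figures quoted in this results section: they appear to be expressed relative to the gum-added (higher) value rather than to the base-mud value. a quick arithmetic check of that reading, as a minimal python sketch (the katira percentages reported below follow the same convention):

def increase_rel_to_final(base, final):
    # relative increase expressed against the higher (gum-added) value
    return (final - base) / final * 100

print(round(increase_rel_to_final(5.0, 8.5)))    # 41, quoted above as "40%" for the xanthan pv
print(round(increase_rel_to_final(18.5, 39.5)))  # 53, quoted above as "about 53%" for the xanthan yp
# relative to the base values instead, the same changes would be 70% and about 114%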
bayan qadir sofy hussein, et al.: effect of polymer on drilling fluid properties uhd journal of science and technology | jul 2020 | vol 4 | issue 2 25 the pv is lower when the temperature of the experiment and gum concentration are increased. this is true with benyounes et al. [15], where they demonstrate that the rheological behavior of the polymer solution predominates compared to their clay suspension system. in this situation, the macroscopic behavior for the bentonite-katira mixtures is influenced by the gum. in addition, the values of the measured yps for four prepared mud samples of katira gum are shown in fig. 11. the used katira gum left the highest impact on the yp compared with other properties of the drilling fluids which is about 36%. the value of the yp is increased from 18.5 to 27 lb/100 ft2 with adding 20 mg katira, continued to increase to 29 lb/100 ft2. furthermore, the ratio of the yp to pv calculated from the measured properties of the drilling fluid samples under the influence of the katira gum and different temperature is shown in fig. 12. as is obvious, the ratio between two considered parameters is varied dependents on the concentration of the katira gum and temperature. 5 6. 5 5. 5 5. 5 6 5 3 4. 5 2 1 5 2 3. 5 2 1. 5 s0 mud samples 25 c 45 c 75 c ks1 ks2 ks3 ks4 fig. 10. plastic viscosity measured at different concentrations of katira gum at 25, 45, and 75°c. 18 .5 2 1 20 .5 20 2 0. 5 18 .5 26 22 26 29 18 .5 27 22 .5 28 28 s0 mud samples 25 c 45 c 75 c ks1 ks2 ks3 ks4 fig. 11. yield point measured at different concentrations of katira gum at 25, 45, and 75°c. bayan qadir sofy hussein, et al.: effect of polymer on drilling fluid properties 26 uhd journal of science and technology | jul 2020 | vol 4 | issue 2 the high ratio between two parameters with ks4 samples are obtained, especially at high temperatures. 4. conclusion in this study, the rheological properties of the drilling mud under the influences of two polymer gums including xanthan and local katira gums are investigated. in general, both types of gums are used at low concentrations and they left an important impact on the rheological behaviors of the drilling mud. however, xanthan gum presented a better performance in improving all studied rheological properties of the mud; apparent viscosity is increased from 14.25 to 28.35 cp, pv increased from 5 to 8.5 cp, and yp (18.5 to 39.5 lb/100 ft2). while, the effect of the katira gum is lower on apparent viscosity, pv, and yp under all different concentrations and temperatures. references [1] k. l. goyal. “a review of: drilling and drilling fluids, by g. v. chilingarian and p. vorabutr; published in 1981 by elsevier scientific publishing co., p.o. box 211, 1000 ae amsterdam, the netherlands; distributed by el-sevier/north-holland, inc., 52 vanderbilt avenue, new york, n. y. 10017; 767 pp., photos, illustrations, appendices, glossary; $136.50, u.s.: 1228.50, rs”. energy sources, vol. 7, no. 2, pp. 178-179, 1983. [2] i. imuentinyan and e. s. adewole. “feasibility study of the use of local clay as mud material in oil well drilling in nigeria”. in: africa’s energy corridor oppor. oil gas value maximization through integration and global approach, victoria island, lagos, vol. 2, pp. 1476-1491, 2014. [3] a. w. a. al-ajeelt and s. n. mahdi. “sodium activation of iraqi high grade montmorillonite clay stone by dry method”. iraqi bulletin of geology and mining, vol. 9, no. 1, pp. 65-73, 2013. [4] e. s. al-homadhi. 
“improving local bentonite performance for drilling fluids applications”. journal of king saud university engineering sciences, vol. 21, no. 1, pp. 45-52, 2007. [5] k. a. galindo, w. zha, h. zhou and j. p. deville. “clay-free high performance water-based drilling fluid for extreme high temperature wells”. spe/iadc drilling conference and exhibition, 17-19 march, london, england, uk, pp. 179-188, 2015. [6] s. d. strickland. “polymer drilling fluids”. journal of petroleum technology, vol. 46, no. 8, pp. 691-714, 1994. [7] m. a. tehrani, a. popplestone, a. guarneri and s. carminati. “water-based drilling fluid for hp/ht applications”. international symposium on oilfield chemistry, 28 february-2 march, houston, texas, usa, pp. 83-92, 2007. [8] v. mahto and v. p. sharma. “tragacanth gum: an effective oil well drilling fluid additive”. energy sources, vol. 27, no. 3, pp. 299-308, 2005. [9] y. weikey, s. l. sinha, and s. k. dewangan. “effect of different gums on rheological properties of slurry”. iop conference series materials science and engineering, vol. 310, no. 1, p. 012068, 2018. [10] c. vipulanandan and a. s. mohammed. “hyperbolic rheological model with shear stress limit for acrylamide polymer modified bentonite drilling muds”. journal of petroleum science and engineering., vol. 122, pp. 38-47, 2014. [11] e. c. m. vermolen, m. j. t. van haasterecht, s. k. masalmeh, m. j. faber, d. m. boersma and m. gruenenfelder. “pushing the envelope for polymer flooding towards high-temperature and highsalinity reservoirs with polyacrylamide based ter-polymers”. spe middle east oil and gas show and conference, 25-28 september, manama, bahrain., vol. 2, pp. 1001-1009, 2011. [12] f. garcía-ochoa, v. e. santos, j. a. casas and e. gómez. “xanthan gum: production, recovery, and properties”. biotechnology advances, vol. 18, no. 7, pp. 549–579, 2000. [13] g. f. sancet, m. goldman, j. m. buciak, o. varela, n. d’accorso, m. fascio, v. manzano, m. luong. “molecular 3. 7 3. 2 3. 7 3. 6 3. 4 3. 7 8. 7 4. 9 13 .0 29 .0 3. 7 13 .5 6. 4 14 .0 18 .7 s0 mud samples 25 c 45 c 75 c ks1 ks2 ks3 ks4 fig. 12. ratio of the yield point to plastic viscosity at different concentrations of katira gum at 25, 45, and 75°c. bayan qadir sofy hussein, et al.: effect of polymer on drilling fluid properties uhd journal of science and technology | jul 2020 | vol 4 | issue 2 27 structure characterization and interaction of a polymer blend of xanthan gum-polyacrylamide to improve mobility-control on a mature polymer flood. spe eor conference at oil and gas west asia, 26-28 march, muscat, oman, 2018. [14] v. mahto and v. p. sharma. “rheological study of a water based oil well drilling fluid”. journal of petroleum science and engineering, vol. 45, no. 1-2, pp. 123-128, 2004. [15] k. benyounes, a. mellak and a. benchabane. “the effect of carboxymethylcellulose and xanthan on the rheology of bentonite suspensions”. energy sources, part a: recovery, utilization, and environmental effects, vol. 32, no. 17, pp. 1634-1643, 2010. [16] a. benmounah, k. benyounes, k. chalah and d. eddine. “effect of xanthan gum and sodium carboxymethylcellulose on the rheological properties and zeta potential of bentonite suspensions. 23rd congrès français de mécanique, 2017. [17] e. u. akpan, g. c. enyi and g. g. nasr. “enhancing the performance of xanthan gum in water-based mud systems using an environmentally friendly biopolymer”. journal of petroleum exploration and production technology, vol. 10, pp. 1933-1948, 2020. [18] s. k. dewangan and s. l. sinha. 
“effect of additives on the rheological properties of drilling fluid suspension formulated by bentonite with water”. international journal of fluid mechanics research, vol. 44, no. 3, pp. 195-214, 2017. [19] j. a. ali, k. kolo, s. m. sajadi, k. h. hamad, r. salman, m. wanli and s. m. hama. “modification of rheological and filtration characteristics of water-based mud for drilling oil and gas wells using green sio2@zno@xanthan nanocomposite”. iet nano biotechnology, vol. 13, no.7, pp. 748-755, 2019. [20] w. xie and j. lecourtier. “xanthan behaviour in water-based drilling fluids”. polym. degrad. stab., vol. 38, no. 2, pp. 155-164, 1992. [21] z. rončević et al., ‘effect of carbon sources on xanthan production by xanthomonas spp. isolated from pepper leaves”. food and feed research, vol. 46, no. 1, pp. 11-21, 2019. [22] p. f. d. cano-barrita and f. m. león-martínez. “biopolymers with viscosity-enhancing properties for concrete”. in: biopolymers and biotech admixtures for eco-efficient construction materials. elsevier. amsterdam, netherlands, 2016. [23] american petroleum institute. “13a specification for drilling fluid materials”. 15th ed. american petroleum institute, washington, dc, united states, 1993. . uhd journal of science and technology | jan 2020 | vol 4 | issue 1 9 1. introduction the image processing techniques have been developed in the various of areas such as pattern recognition, biometrics, image inpainting, medical image processing, image compression, information hiding [1], and multimedia security [2]. medical image processing is a collection of techniques that help the clinician in the diagnosis of different diseases from medical images such as x-ray, magnetic resonance imaging, computed tomography, and ultrasound and microscopic images. thus, various types of cancer can be detected based on medical image processing from medical images such as breast cancer, brain tumor, lung cancer, skin cancer, and blood cancer (leukemia). the advantage of creating a system based on medical image processing techniques is extracting the targeted diseases in higher accuracy, with reducing time consumption as well as decreasing cost, otherwise, the manual processing is taken a lot of time and it also needs a professional staff for detecting of diseases [3]. in general, processing medical images include four main steps which are: pre-processing, segmentation, feature extraction, and classification. this paper is mainly focused on the segmentation step in processing microscopic blood images which help the clinician in identifying various diseases in human’s blood such as blood cancer (leukemia), and anemia. segmentation of white blood cells (wbcs) is the most important step in identifying leukemia. leukemia is a type of cancer that affects blood, lymphocyte system, and bone marrow. thus, the correct segmentation of wbcs has an impact on further steps, such as feature extraction and classification, to access this article online doi: 10.21928/uhdjst.v4n1y2020.pp9-17 e-issn: 2521-4217 p-issn: 2521-4209 copyright © 2020 mohammed and abdulla. 
this is an open access article distributed under the creative commons attribution noncommercial no derivatives license 4.0 (cc by-nc-nd 4.0) thresholding-based white blood cells segmentation from microscopic blood images zhana fidakar mohammed1, alan anwer abdulla2,3 1department of information technology, technical college of informatics, sulaimani polytechnic university, sulaimani, iraq, 2department of information technology, college of commerce, university of sulaimani, sulaimani, iraq, 3department of information technology, university college of goizha, sulaimani, iraq o r i g i n a l a rt i c l e uhd journal of science and technology a b s t r a c t digital image processing has a significant role in different research areas, including medical image processing, object detection, biometrics, information hiding, and image compression. image segmentation, which is one of the most important steps in processing medical image, makes the objects inside images more meaningful. for example, from microscopic images, blood cancer can be identified which is known as leukemia; for this purpose at first, the white blood cells (wbcs) need to be segmented. this paper focuses on developing a segmentation technique for segmenting wbcs from microscopic blood images based on thresholding segmentation technique and it compares with the most commonly used segmentation technique which is known as color-k-means clustering. the comparison is done based on three well-known measurements, used for evaluating segmentation techniques which are probability random index, variance of information, and global consistency error. experimental results demonstrate that the proposed thresholding-based segmentation technique provides better results compared to color-k-means clustering technique for segmenting wbcs as well as the time consumption of the proposed technique is less than the color-k-means which are 70.8144 ms and 204.7188 ms, respectively. index terms: medical image processing, segmentation techniques, thresholding, white blood cells corresponding author’s e-mail: alan anwer abdulla, department of information technology, college of commerce, university of sulaimani and university college of goizha, sulaimani, iraq. e-mail: alan.abdulla@univsul.edu.iq received: 12-01-2020 accepted: 09-02-2020 published: 13-02-2020 mohammed and abdulla: white blood cells segmentation 10 uhd journal of science and technology | jan 2020 | vol 4 | issue 1 obtain results more accurately [3]-[5]. in general, the main components of blood are four types as follows: • red blood cells (rbcs): are responsible for delivering oxygen from the lungs to all the parts of the human’s body [6] • wbcs: are responsible for body’s immune system. the wbcs are larger in size and fewer in number compared to rbcs [6] • platelets: are responsible for blood clotting [6] • plasma: is responsible for carrying nutrients that are required by the cells [3]. it makes up about half of the blood volume. most blood cells are produced in the bone marrow, and there are cells known as stem cells (lymphoid and myeloid) that are responsible for producing different blood cells [7] in general, there are different types of wbcs which are basophil, eosinophil, neutrophil, lymphocyte, and monocyte, as illustrated in fig. 1. in this paper, a segmentation technique based on thresholding is proposed to segment wbcs from the microscopic blood images using public and well-known dataset acute lymphoblastic leukemia-image database (all-idb). 
the proposed technique includes the following steps: pre-processing step to enhance the image quality, segmentation step to separate wbcs from other blood components, and image cleaning step to remove the unwanted objects inside the segmented image. consequently, the proposed segmentation technique is compared with other technique, which is known as color-k-means clustering, in terms of time consumption as well as performance of the segmentation technique based on probability random index (pri), variance of information (voi), and global consistency error (gce) which are measurements that using for evaluating a segmentation techniques. 1.1. problem statement the manual processing of medical images is time consuming and it requires a professional staff which mainly depends on personal skills, and sometimes it may produce inaccurate results. 1.2. objective of the research the main objective of the research is to segment wbcs from microscopic blood images accurately and quickly because accurate segmentation of wbcs makes other processes (such as feature extraction) easer consequently producing an accurate result of classification. thus, the overall process could help the clinician in identifying different diseases accurately in a faster way, which further helps them to provide treatment for the patients sooner. the rest of the paper is organized as follows: in section 2, the literature review is presented. section 3 presents the proposed approach. the experimental results are illustrated in section 4. finally, our conclusions are given in section 5. 2. literature review nowadays, processing medical images have a crucial role for early identification of diseases. thus, segmenting wbcs from microscopic blood images are the most important step in identifying leukemia. this section focuses on reviewing the most important existing works on segmenting wbcs. in 2009, sadeghian et al. [8] proposed a segmentation technique to segment wbcs as well as their nuclei and a segmentation technique to separate the cytoplasm of the cell. the red, green, and blue (rgb) image is converted into gray image; then canny edge detection is applied followed by gradient vector flow to connect the boundary of the nucleus. consequently, the hole filling technique is applied to get the nucleus. furthermore, zack algorithm is applied into the gray image to get the binary image to extract the fig. 1. bone marrow and blood components. mohammed and abdulla: white blood cells segmentation uhd journal of science and technology | jan 2020 | vol 4 | issue 1 11 cytoplasm of the cell by subtracting the binary image from the gray image. finally, their proposed work is obtained 92% accuracy for nucleus segmentation and 70% accuracy for cytoplasm segmentation. in 2015, marzukia et al. [9] proposed a system to segment the nuclei of wbcs based on the active contour technique. the rgb image is converted to a gray image and then the active contour is applied to the resulted image to get the segmented nucleus. continuously, the obtained image is converted to a binary image to find the roundness value to determine the grouped and individual wbcs. as they claimed, their proposed system can accurately extract the boundary of wbc nuclei. continuously, in 2015 madhloom et al. [10] proposed an approach to segment wbcs and their nucleus. first, it segments the wbcs based on the color transformation as well as mathematical morphology. moreover, the marker-control watershed technique is applied to separate overlapped cells. 
furthermore, the seeded region growing technique is used to segment the nucleus of the cells. finally, the performance of their approach is evaluated using relative ultimate measurement accuracy and misclassification error to measure the accuracy and it achieves 96% for wbcs segmentation and 94% for nucleus segmentation. in 2016, sobhy et al. [11] proposed two segmentation techniques for segmenting wbcs. first, color correction is applied to extract the mean intensity from the histogram of each rgb channel of the original image. moreover, the image is converted to hue saturation intensity color space, and then the two tested segmentation techniques are applied to the s component of the image. the first technique was otsu’s thresholding and the second technique was marker-control watershed segmentation. furthermore, the exoskeleton algorithm is used to separate the adjusted wbcs. finally, they compared their work with another study which is manually counted the wbcs based on 30 images. as they claimed, their proposed is achieved an accuracy of 93.3% for otsu’s segmentation and 99.3% for marker-control watershed segmentation. in 2017, gowda and kumar [12] proposed a system to segment wbcs based on k-means clustering and gramschmidt orthogonalization. the proposed system is started by pre-processing step which includes: (1) converting image from rgb to gray image, (2) median filter is implemented to remove noise, and (3) image normalization is applied for contrast stretching. consequently, k-means clustering segmentation technique is applied to segment wbc and its subtypes such as: basophil, eosinophil, neutrophil, lymphocyte, and monocyte. furthermore, gram-schmidt orthogonalization scheme is applied to segment the nucleus from the cell. finally, in 2017, nain et al. [13] proposed a system to differentiate individual and overlapped wbcs based on the watershed segmentation scheme to segment the cells. the edge map extraction technique, which is a circle fitted on each cell, is used to identify both individual and overlapped cells. to evaluate the performance of the proposed work, two evaluation measurements which are detection rate (dr) and false alarm rate (far), are used. as they claimed, the proposed system is achieved 97.10% and 7.80%, respectively, for dr and far, the lower value of far and higher value of dr means better segmentation of wbcs. 3. proposed approach a microscopic image of the blood sample contains three components, as discussed before, which are rbcs, wbcs, and background (platelets and plasma). the proposed approach focuses on segmenting wbcs from other components (which are rbcs and background). the segmentation is done based on the thresholding technique. this section concentrates to explain the steps of the proposed work, as illustrated in fig. 2. 3.1. pre-processing the first step of the proposed work is pre-processing, which is an important step to provide a better quality of the input image and make the next step (segmentation step) easier. in this step, the three channels of the input image, which are rgb are separated, and then the median filter (to remove noise) and histogram equalization (to contrast enhancement) are applied on the only green channel of the input image to enhance the image quality, fig. 3. the median filter is a non-liner filter, which is used for removing noise. 
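a minimal sketch of these two pre-processing operations applied to the green channel, assuming opencv's python bindings (cv2) as the image library and an illustrative file name:

import cv2

# read a microscopic blood image and keep only the green channel,
# which holds the most contrast information for the wbcs
image = cv2.imread("all_idb1_sample.jpg")   # illustrative file name
green = image[:, :, 1]                      # opencv stores channels as b, g, r

# median filter to remove noise, then histogram equalization
# to improve the contrast of the green channel
green = cv2.medianBlur(green, 3)
green = cv2.equalizeHist(green)

cv2.imwrite("preprocessed_green.png", green)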
while, the histogram equalization is a technique that uses image’s histogram to improve low or high contrast of the image which makes better distribution of intensities of the image’s histogram [14]. 3.2. segmentation to segment the wbcs from the other components of the microscopic image, i.e., rbcs and background, the thresholding segmentation technique is applied to the image. the thresholding-based segmentation technique is the simplest and useful segmentation technique that segments an image based on the intensity of the pixels value [15]. this technique is mohammed and abdulla: white blood cells segmentation 12 uhd journal of science and technology | jan 2020 | vol 4 | issue 1 fig. 2. steps of the proposed approach. mohammed and abdulla: white blood cells segmentation uhd journal of science and technology | jan 2020 | vol 4 | issue 1 13 produce a binary image from a gray image (the value of 0, which represents background and it is black and the value of one which represents object inside the image and it is white color). this process has the advantage of reducing complex information and simplifies the classification and recognition processes [16]. the threshold value of this technique can be selected manually or automatically based on the information from the features of the image [17]. there are types of thresholding which are: 3.2.1. global thresholding (single thresholding) this technique use only one threshold (t) to segment the whole image into objects and background; this technique is more appropriate for those images that have bimodal histogram. the global thresholding can be defined as equation (s1) [16], [17]: suppose a sample image of f (x, y) has the following diagram (fig. 4): then, the binary image g (x, y) of the f (x, y) can be defined as following equation: g (x, y) = 1 if f (x, y) >t otherwise 0 if f (x, y) ≤t (1) where t is the threshold value. 3.2.2. local thresholding (multiple thresholding) in this thresholding technique, the image segments based on multiple threshold values and the image partitions into multiple region of interests (rois) and backgrounds [15]. moreover, it segments an image by partitioning an image into (n × n) sub-images and then selects a threshold tij for every sub-image [16], as illustrated in fig. 5. this technique is more appropriate for images that are contained disparate illuminations, this technique can be defined as follows [16]: g (x, y) = 0 if f (x, y) <t (x, y) otherwise 1 if f (x, y) ≥t (x, y) (2) where f (x, y) is an input image and g (x, y) is a binary image produces depending on multiple threshold value t (x, y). in our proposed approach, we segment the microscopic images into roi (which are wbcs) and background based on global threshold value with (t = 50) as clarified in the diagram of the work (fig. 2.). we apply this technique in the resulted image obtained after the pre-processing step is done (i.e., image d from fig. 3). moreover, we obtain a binary image that contains wbcs which are larger in size in the image and some remaining as unwanted objects (rbcs and platelets), as illustrated in fig. 6. fig. 4. histograms of a sample image. fig. 5. segmentation based on local thresholding. fig. 3. pre-processing: (a) input image, (b) green channel, (c) resulted in image after the median filter is applied, (d) resulted image after histogram equalization is applied. a b c d mohammed and abdulla: white blood cells segmentation 14 uhd journal of science and technology | jan 2020 | vol 4 | issue 1 3.3. 
image cleaning the image obtained in the previous step, as illustrated in fig. 6, includes some unwanted objects such as rbcs or platelets. to overcome this drawback, the morphological opening operator is applied to the result of the segmentation step. the opening operator is a morphological operation that is used to eliminate imperfections in images that affect their texture and shape. morphological operations have a crucial role in image processing, especially in the image segmentation process, because they deal with shape extraction within the image and describe the structure of the image [18]. the most significant morphological operations are dilation and erosion. there are also two other operations, opening and closing, which are built from combinations of dilation and erosion [18], [19]. the opening operator is a combination of both erosion and dilation: it first erodes an image with a structuring element and then dilates the result by the same structuring element. opening smoothens the boundary of an object and deletes small unwanted objects inside the image. it can be defined as the following equation [18], [19]: $a \circ b = (a \ominus b) \oplus b$ (3) where a is an image and b is a structuring element ($\ominus$ denotes erosion and $\oplus$ denotes dilation). moreover, some of the wbcs are located at the edge of the image; thus, the border cleaning scheme is used to remove those cells, because they have a negative impact on the accuracy rate in further steps such as feature extraction and classification. to make only complete wbcs remain in the image, border cleaning is applied to the result of the opening operator, as illustrated in fig. 7. border cleaning is a morphological technique used to remove objects that touch the edge of the image [11]. 3.4. wbcs cropping to crop each wbc as an individual image, we used the bounding box technique. this technique can be defined as the smallest rectangle that surrounds an object of interest; it extracts the minimum area of the box, which can be represented as follows [20]: area = major axis length × minor axis length (4) where the major axis represents the longest line that can be drawn between two points in the object, and the minor axis is the longest line between two points of the object that can be drawn perpendicularly to the major axis, as illustrated in fig. 8 [20]. we applied the bounding box technique on the original (input) image based on the location of the wbcs in the segmented and cleaned images obtained in fig. 7, and the cropped cells can further be used for more processing such as feature extraction and classification. fig. 6. the resulted image after thresholding segmentation is applied. fig. 7. image cleaning: (a) resulted image after opening operator is applied, (b) resulted image after border cleaning is applied. fig. 8. the major axis and the minor axis. the result of the bounding box is represented in fig. 9, and fig. 10 illustrates the cropped wbcs as individual images. 4. experimental results the main purpose of the proposed approach is to segment the wbcs from microscopic blood images. in this section, to evaluate the performance of the proposed approach, experiments are conducted on the dataset (explained in 4.1). 4.1.
dataset the input images are taken from a public and commonly used dataset known as all-idb, which consists of two groups of images: • the first one is all-idb1, which was designed for testing segmentation techniques; it contains 108 microscopic blood images (49 abnormal images of leukemia patients and 59 normal images) [21] • the second one is called all-idb2, which was designed for testing classifier techniques; it consists of 260 cropped wbc images (50% abnormal cells and 50% normal cells) taken from all-idb1 [21]. as this study is mainly focused on the segmentation of wbcs, the first group of the dataset is used (i.e., all-idb1), because, as mentioned above, this group of the dataset is dedicated to testing segmentation techniques. 4.2. results as mentioned in section 3, in the thresholding segmentation only the green channel of the input microscopic image was used for the segmentation process, because the green channel gives better results in terms of segmentation compared to the other two channels (red and blue), since it holds more contrast information regarding the wbcs [22]. in the segmentation step, the thresholding technique was used for segmenting wbcs with t = 50; in our experiments, different threshold values were first tested on all three channels of normal and abnormal images, and the experimental results demonstrate that the best result is given by the green channel with t = 50. as illustrated in fig. 11, the wbcs have higher contrast in the green channel of the image compared to the two other channels; thus, it makes the segmentation process more accurate. fig. 9. applying bounding box technique on the original image based on the segmented image, as presented in fig. 5. fig. 10. example of cropped white blood cells. fig. 11. a blood microscopic image of leukemia patient: (a) original image, (b) red channel, (c) green channel, (d) blue channel. to evaluate the performance of the proposed approach, we compare it with the commonly used segmentation technique known as color-k-means clustering, which was used in many existing works such as mishra et al., kumar and vasuki, ferdosi et al., and sarrafzadeh and dehnavi [23]-[26]. we also applied the color-k-means clustering on the same dataset (all-idb1) to evaluate both segmentation techniques; the comparison was done in terms of time consumption as well as the performance of the segmentation techniques based on the pri, gce, and voi. the pri is a nonparametric technique used to measure the performance of a segmentation technique, and it can be calculated as follows [22], [27]: $pri(s_{test},\{g_k\})=\frac{1}{\binom{n}{2}}\sum_{i<j}\left[c_{ij}\,p_{ij}+(1-c_{ij})(1-p_{ij})\right]$ (5) moreover, the gce is a region-based segmentation consistency measure, which quantifies the consistency between image segmentations of differing granularities, and it can be calculated as follows [28]: $gce(s_1,s_2)=\frac{1}{n}\min\left\{\sum_{i}e(s_1,s_2,x_i),\ \sum_{i}e(s_2,s_1,x_i)\right\}$ (6) while the voi measures the distance between two segmentations, providing information about the participation of pixels in different clusters and measuring the knowledge lost or gained between the two clusterings; it can be calculated as follows [28]: $voi = h(x)+h(y)-2i(x,y)$ (7) the experimental results are as follows: 4.2.1.
execution time the time consumption has been calculated for both segmentation techniques which implemented on the all-idb1, as illustrated in table 1. the systems were implemented using matlab on lenovo computer with 8 gb of ram and 64-bit operating system\windows 10. table 1 demonstrates that the proposed thresholding-based technique needs less time compared to the color-k-means technique for segmenting the wbcs from microscopic images which applies to 49 abnormal images and 59 normal images. 4.2.2. performance of the segmentation techniques three well-known segmentation measurements such as pri, gce, and voi are calculated for both proposed technique and color-k-means clustering and table 2 illustrates the performance of each technique. the lower value of pri, gce, and voi means the better/higher performance of the segmentation technique [27]. table 2 demonstrates that the proposed technique has a lower value of pri, gce, and voi. thus, it is better for segmenting wbcs from microscopic images. 5. conclusion the progression of using techniques of image processing in medical fields provides better accuracy for identifying different diseases from medical images. this paper focuses on the segmentation step, which is the most important step in medical image processing for segmenting wbcs from microscopic blood images, depending on the thresholdbased technique. experimental results prove that the proposed segmentation technique can extract the wbcs from images better than color-k-means clustering in terms of segmentation evaluation as well as time consuming. references [1] h. sellahewa, s.a. jassim and a.a. abdulla. stego quality enhancement by message size reduction and fibonacci bitplane mapping. united kingdom, london, 2014, pp. 151-166. [2] a.a. abdulla. exploiting similarities between secret and cover images for improved embedding efficiency and security in digital steganography, department of applied computing, university of buckingham, phd thesis, 2015. available from: http://www.bear. buckingham.ac.uk/149. [3] h. b. kekre, b. archana and h. r. galiyal. “segmentation of blast table 1: time consumptions. segmentation techniques execution time\ms proposed 70.8144 color-k-means clustering 204.7188 table 2: segmentation evaluation. measurements proposed technique color-k-means clustering technique pri 0.126347 0.515437 gce 0.04317 0.159923 voi 5.419551 5.164454 pri: probability random index, gce: global consistency error, voi: variance of information mohammed and abdulla: white blood cells segmentation uhd journal of science and technology | jan 2020 | vol 4 | issue 1 17 using vector quantization technique”. international journal of computer applications, vol. 72, pp. 20-23, 2013. [4] m. a. bennet, g. diana, u. pooja and n. ramya. “texture metric driven acute lymphoid leukemia classification using artificial neural networks”. international journal of recent technology and engineering, vol. 7, no. 6s3, pp. 152-156, 2019. [5] k. a. eldahshan, m. i. youssef, e. h. masameer and m. a. mustafa. “an efficient implementation of acute lymphoblastic leukemia images segmwntation on fpga”. advances in image and vedio prpcessing, vol. 3, no. 3, pp. 8-17, 2015. [6] v. venmathi, k. n. shobana, a. kumar and d. g. waran. “leukemia detection using image processing”. international journal for scientific research and development, vol. 5, no. 1, pp. 804808, 2017. [7] s. c. neoh, w. srisukkham, l. zhang, s. todryk, b. greystoke, c. p. lim, m. a. hossain and n. aslam. 
“an intelligent decision support system for leukaemia diagnosis using microscopic blood images”. scientific reports, vol. 5, p. 14938, 2015. [8] f. sadeghian, z. seman, a. r. ramli, b. h. a. kahar and m. saripan. “a framework for white blood cell segmentation in microscopic blood images using digital image processing”. biological procedures online, vol. 11, pp. 196-206, 2009. [9] n. i. c. marzukia, n. h. mahmoodb and m. a. a. razakb. “segmentation of white blood cell nucleus using active contour”. jurnal teknologi, vol. 74, pp. 115-118, 2015. [10] h. t. madhloom, s. a. kareem and h. ariffin. “computer-aided acute leukemia blast cells segmentation in peripheral blood images”. journal of vibroengineering, vol. 17, pp. 4517-4532, 2015. [11] n. m. sobhy, n. m. salem and m. el dosoky. “a comparative study of white blood cells segmentation using otsu threshold and watershed transformation”. journal of biomedical engineering and medical imaging, vol. 3, no. 3, pp. 15-24, 2016. [12] j. p. gowda and s. c. p. kumar. “segmentation of white blood cell using k-means and gram-schmidt orthogonalization”. indian journal of science and technology, vol. 10, pp. 1-6, 2017. [13] k. n. sukhia, m. m. riaz, a. ghafoor and n. iltaf. “overlapping white blood cells detection based on watershed transform and circle fitting”. radioengineering, vol. 24, pp. 1177-1181, 2017. [14] s. shafique and s. tehsin. “computer-aided diagnosis of acute lymphoblastic leukaemia”. computational and mathematical methodsin medicine, vol. 2018, p. 6125289, 2018. [15] s. yuheng and y. hao. “image segmentation algorithms overview”. arxiv preprint, vol. 2017, pp. 1-7, 2017. [16] k. bhargavi and s. jyothi. “a survey on threshold based segmentation technique in image processing”. international journal of innovative research and development, vol. 3, pp. 234239, 2014. [17] d. kaur and y. kaur. “various image segmentation techniques: a review”. international journal of computer science and mobile computing, vol. 3, no. 5, pp. 809-814, 2014. [18] s. ravi and a. m. khan. “morphological operations for image processing: understanding and its applications”. in: ncvscoms-13 conference proceedings, 2013. [19] s. singh and s. k. grewal. “role of mathematical morphology in digital image processing: a review”. international journal of scientific engineering and research, vol. 2, no. 4, 2014. [20] a. e. huque. “shape analysis and measurement for the hela cell classification of cultured cells in high throughput screening”. university of skövde, skövde, sweden, 2006. [21] r. d. labati, v. piuri and f. scotti. “all-idb: the acute lymphoblastic leukemia image database for image processing”. in: 18th ieee international conference on image processing, 2011. [22] s. kumar, s. mishra, p. asthana and pragya. “automated detection of acute leukemia using k-mean clustering algorithm”. in: advances in computer and computational sciences, proceedings of iccccs, pp. 655-671, 2018. [23] s. mishra, l. sharma, b. majhi and p. kumar sa. “microscopic image classification using dct for the detection of acute lymphoblastic leukemia (all)”. proceedings of international conference on computer vision and image processing, pp. 171180, 2017. [24] p. s. kumar and s. vasuki. “automated diagnosis of acute lymphocytic leukemia and acute myeloid leukemia using multi-sv”. journal of biomedical imaging and bioengineering, vol. 1, pp. 2024, 2017. [25] b. j. ferdosi, s. nowshin, f. a. sabera and habiba. 
“white blood cell detection and segmentation from fluorescent images with an improved algorithm using k-means clustering and morphological operators”. in: 4th international conference on electrical engineering and information and communication technology (iceeict), 2018. [26] o. sarrafzadeh and a. m. dehnavi. “nucleus and cytoplasm segmentation in microscopic images using k-means clustering and region growing”. advanced biomedical research, vol. 4, p. 174. 2015. [27] r. kumar and a. m. arthanariee. “performance evaluation and comparative analysis of proposed image segmentation algorithm”. indian journal of science and technology, vol. 7, pp. 39-47, 2014. [28] r. sardana. “comparitive analysis of image segmentation techniques”. international journal of advanced research in computer engineering and technology, vol. 2, no. 9, pp. 26152619, 2013. tx_1~abs:at/tx_2:abs~at 56 uhd journal of science and technology | jul 2020 | vol 4 | issue 2 1. introduction there are two essential reasons why electron-nucleus scattering is such a successful apparatus used for studying nuclear structure. the primary one belongs to the reality that the main interaction occurring between the electron and the nucleus is well known [1]. the origin of the second reason that makes electron scattering is a valuable method in examining the properties of nuclear structure comes from its ability to identify the excited states, spins, and parities, through the calculations of the reduced matrix elements of nuclear transitions. basically, in the electron scattering with a relatively weak interaction, the interactions of the electron with charge and the nuclear current density occur where they described by the theory of quantum electrodynamics [2]. one of the great (standard) effective interactions for light nuclei is the cohen-kurath [3], for 1p-shell (1p 1/2, 1p 3/2 ) nuclei with core 2 he4. in addition, different macroscopic and microscopic theories have been used to analyze excitation states in be nucleus. the form factor calculations were done by utilizing the model space (ms) wave functions alone which were not sufficient for duplicating the experimental data of the electron-nucleus scattering [4]. therefore, the electron scattering coulomb form factors in the p-shell nucleus (be9) have been investigated by taking into account higher energy configurations outside the p-shell ms which are named core polarization effects [5]. many research studies have focused their efforts on the improvement and development of the electron scattering. starting with hofstadter who was the primary to utilize highenergy electron beams given by the stanford linear electron accelerator to discover electron scattering and an old work of sir nevill mott which was used electrons against point nuclei in his experiment as the relativistic scattering of dirac particles. then, he established a series formulation for the cross-section of the elastic scattering, also he allowed to estimating formula [6], [7]. 
elastic and inelastic electron-nucleus scattering form factors for be9 nucleus hawar muhamad dlshad, aziz hama-raheem fatah, adil mohammed hussain department of physics, college of science, university of sulaimani, sulaymaniyah, kurdistan region, iraq a b s t r a c t the computations of the elastic and inelastic coulomb form factors for the electron-nucleus scattering of beryllium nucleus be9 have performed with core polarization (cp) effects including the realistic michigan sum of three range yukawa (m3y) interaction and the other residual interaction which is modified surface delta interaction (msdi). in the calculations, root mean square charge density and charge radii include for the ground states. the perturbation theory is adopted to compute the cp using the harmonic oscillators potential to calculate single-particle radial wave functions. the comparison has been done between the theoretical calculations of coulomb form factors by msdi interaction, realistic m3y interaction, and the experimental results that measured by other workers, it noticed that the coulomb form factors for the (m3y) interaction gave a reasonable depiction of the measured data. index terms: beryllium nucleus, core polarization effect, electron scattering, form factor, harmonic oscillator. corresponding author’s e-mail: hawar muhamad dlshad, department of physics, college of science, university of sulaimani, sulaymaniyah, kurdistan region, iraq. e-mail: hawar.muhamaddlshad@gmail.com received: 05-04-2020 accepted: 13-08-2020 published: 17-08-2020 access this article online doi: 10.21928/uhdjst.v4n2y2020.pp56-62 e-issn: 2521-4217 p-issn: 2521-4209 copyright © 2020 dlshad, et al. this is an open access article distributed under the creative commons attribution non-commercial no derivatives license 4.0 (cc by-nc-nd 4.0) o r i g i n a l r e s e a r c h a r t i c l e uhd journal of science and technology hawar muhamad dlshad, et al.: electron scattering form factors for be9 uhd journal of science and technology | jul 2020 | vol 4 | issue 2 57 elastic and inelastic electron scattering for the light nuclei using born approximation had performed by uberall and ugincius [8]. in the last decades, the single-particle quadrupole transitions of coulomb electron-nucleus scattering form factors studied in the b10 which is the p-shell nucleus by majeed [9], whereas the studies included a microscopic theory in the core polarization (cp) effects for the excitation states up to 2ℏω by employing the modified surface delta interaction (msdi). the charge density distributions and charge radii of the nucleus were distinguished from the investigation of elastic electron scattering data [10]. however, sharrad et al. [11] have used the charge density distributions of the ground state for determining the coulomb form factors using the approximation rule which is the plane wave born approximation with the two-body short range correlation. consequently, radhi et al. [12] presented inelastic coulomb and electromagnetic form factors for f19 in each positive parity and negative parity states by applying the single-particle states shell model and hartree–fock method. at present, raheem et al. [13] have been calculated the elastic coulomb c 0 form factors for a few sd-shell nuclei using nucleon-nucleon effective interaction, which is two-body (michigan sum of three range yukawa [m3y]) as residual interactions with considering the cp matrix elements. 
this work is devoted to calculate the theoretical coulomb electron scattering form factors for be9 by considering the role of the ms besides the cp effects using msdi and the realistic interaction named m3y including root mean square charge density along with charge radii for the ground states. the harmonic oscillator (ho) wave function will be adopted as a single particle wave function. to do this, first needed to use shell model code (oxbash) to calculate the one-body density matrix (obdm) elements [14], [15]. finally, the theoretical calculations of coulomb form factors by msdi, m3y interactions are compared with the experimental results. 2. theory 2.1. coulomb form factor the coulomb electron scattering form factors of a given multipolarity (j) is a function of transfer momentum (q) and it can be described in term of reduced matrix elements (in spin state) of the transition operator [16]: ( ) ( ) ( )2 4 | |   2 1j f j ii f q j t q j z j  = + (1) where, ji and jf are, respectively, the initial and final total angular momentum, while z is the number of proton (atomic number), tj(q) is the multipole operator of electron scattering, and j t q jf j i| |( ) is the reduced many body matrix element. the best description of the experimental form factors requires to correct the form factor in equation (1) corresponding to the center of mass correction and the finite size correction of the nucleon [17]: ( ) ( ) ( ) ( )2 2 0.43 4 2 4 | |     2 1 q b a a j f j i i f q j t q j e z j  − = + (2) here, b is the ho size parameter that obtained from the experiment, a is the nuclear mass number, and the final term in the above equation is the correction coefficient. the reduced matrix element in equation (1) can be expressed in two terms, the first one is ms term and the other is cp term [18]. j t q j j t q j j t q jf j i f j i ms f j i cpz z z | | | | | |τ τ τδ( ) = ( ) + ( ) (3) the ms reduced matrix element in the spin and isospin spaces of the transition operator tj is performed as the sum of the product of the (obdm) elements which are in neutron-proton formalism obdm(β, α ,j, τ z , i, f) multiplied by the single-particle reduced matrix elements as follow [18]: j t q j obdm j i f tf j i ms z jz zτ β α τβ α τ β α( ) = ( )∑ � , , , , , , � | | (4) in addition, the cp reduced matrix element can be represented as: � , , , , , � � , f j i cp z j t q j j obdm j i f t z z δ β α τ β δ α τ β α τ ( ) = ( )∑ (5) where, α and β are, respectively, the initial and final singleparticle states for the ms when isospin included, the index hawar muhamad dlshad, et al.: electron scattering form factors for be9 58 uhd journal of science and technology | jul 2020 | vol 4 | issue 2 τz is the third component of nucleons pauli isospin which used to identify the nucleons with τz = 1, −1 for protons and neutrons, respectively, and the obdm determined macroscopically for the elastic scattering by the initial and final nuclear wave functions, while it obtained from oxbash code for inelastic scattering. the single-particle matrix element is determined from: β ατ β τ α β β α α| | | |t j j n l j qr n lj j jz z= ( )υ , , (6) where, 〈nβ, lβ│jj (qr)│nα lα〉 is the radial part matrix element of the spherical bessel function jj (qr) which is calculated in [19, equation (23)] of our published article; and j jj zβ τ αυ represents the reduced matrix element of the spherical harmonics υ j z . 
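because the displayed equations do not survive this extraction cleanly, the three central relations of this subsection — the squared coulomb form factor of equation (1), its splitting into model-space and core-polarization parts in equation (3), and the obdm expansion of equation (4) — are restated below in conventional notation; nothing is added here beyond the definitions already given in the text.

\[
\bigl|F_J(q)\bigr|^{2}=\frac{4\pi}{Z^{2}\,(2J_i+1)}\,\bigl|\langle J_f\,\|\,\hat T_J(q)\,\|\,J_i\rangle\bigr|^{2}
\]
\[
\langle J_f\|\hat T_{J,\tau_z}(q)\|J_i\rangle=\langle J_f\|\hat T_{J,\tau_z}(q)\|J_i\rangle_{MS}+\langle J_f\|\delta\hat T_{J,\tau_z}(q)\|J_i\rangle_{CP}
\]
\[
\langle J_f\|\hat T_{J,\tau_z}(q)\|J_i\rangle_{MS}=\sum_{\alpha\beta}\mathrm{obdm}(\beta,\alpha,J,\tau_z,i,f)\;\langle\beta\|\hat T_{J,\tau_z}(q)\|\alpha\rangle
\]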
the single-particle matrix element can represent according to the first-order perturbation theory as [20]: β δ α β α β α | | | | | | t v q e h t t q e h v cp a o b o λ λ λ = − + − � (7) the single-particle matrix element β δ ατ| |tj z in the above equation obtained from the particle hole excitation with the first-order perturbation including residual interaction (v) for the msdi and m3y interaction [20]. β α βα αα α α α α α β α α α | | | | | | v q e h o v o e e e e b o− = − + − −( )∑ λ γ γ λ , ,1 2 1 2 2 1 1 2 1 ++ + +( ) × +( ) +( )     + α δ δ β α α α 2 2 1 1 1 1 2 γ γ λ γbp ah �terms�with�exchaanged� �and� �by�an�overall�minus�sign�α α1 2 (8) here, ho is the unperturbed hamiltonian, ea, eb are the initial and final states of energy, the q operator projects the outside space of the ms, both the indices α1 and α2, are, respectively, run over particle and hole states, β α α α λ γ1 2       �is the six-j symbol and e is the single-particle energy. every matrix element in the equation (7) is obtained in iso-scalar (t = 0) and isovector (t = 1) formalism with λ = jt and γ=j’ t’. the single-particle energies are calculated by [21]: e n l l f r for j l l f r nlj nl = + −    + − +( ) ( ) = − ( 2 1 2 1 2 1 1 2 1 2  � � � � �, � � )) = +       nl for j l� � �, 1 2 (9) where f r a a a nl ( ) ≈ − = − − − −20 45 25 2 3 1 3 2 3� �, � � � �/ /� � . 2.2. ground-state form factor and the charge density it is clear that the electron scattering is one of the most powerful tools for analyzing the charge density distributions of the nucleus. since the charge density is a measurable quantity, subsequently, it is another way of calculating the form factor. moreover, the elastic form factor is occurring when j = 0 (zero spin) and is obtainable from the simple form of the fourier transform as [22]: f q z r j qr r dr0 0 0 0 24( ) = ( ) ( ) ∞ ∫ π ρ� � � � � (10) where, f 0 (q) is the ground-state form factor, r is the radius of the nucleus, and ρ 0 (r) is the charge density. the entirety of all protons point charge is the representation of operator of transition charge density ( )ˆ  jm r  of a nucleus [23]. ( ) ( ) ( )2 1    ˆ õ ù   z k jm jm k kk r r r r   = − =∑    (11) where, j is the multipolarity of the operator, m is the projection quantum number takes 2j + 1 values, −j ≤ m ≤ j, υjm (ωk) represents the spherical harmonic, and    r rk−( ) is a dirac delta function. the matrix element in the reduced form of the operator ( )ˆ jm r  is gotten when the transition happens from initial nuclear spin ji to the final nuclear spin jf and complying the inequality ji ≤ j ≤ jf , from equation (4) where ( )ˆ ,zj jt r ≡  then it is given by: ( ) ( )    ,   ( , , , , , ) ˆ   ˆf j i z jj r j obdm j i f r         = ∑ (12) hawar muhamad dlshad, et al.: electron scattering form factors for be9 uhd journal of science and technology | jul 2020 | vol 4 | issue 2 59 for the ground state (j = 0) as mentioned before. moreover, ji= jf, and the charge density j p r( )  define of the nucleus is getting from this matrix element [23]. ( ) ( ) ( )   , 1     4 ( 2 1) 1   ( , , , , , )    ( 2 ) ˆ ˆ 4 1 p j f j i i z j i r j r j j obdm j i f r j             = + = + ∑ (13) the single-particle matrix element in equation (4) can be re presented by the radial wave functions of ho  n l n lr rα α β β( ) ( )    and 〈lβ jβ║υj (ωr) ║lα jα〉 which is the spherical harmonic reduced matrix element. 
( ) ( ) ( ) ( )| |    ˆ | |j n l n l j rr r r j j        = υ ω    (14) after putting equation (14) in equation (13), the nuclear charge density becomes [24]: ρ π α β τ α β α α β β j p i z n l n l r j obdm j i f r r j ( ) = + ( ) ( ) ∑� ( ) ( , , , , , ) � , 1 4 2 1   ββ α| |υ ωj r j( ) (15) for the ground-state nucleus, (τz= 1, −1), it makes j = 0 and j j jj r j jβ α αδ πα β| |υ ω( ) = +( ), /2 1 4 . after utilizing the delta-kronecker in equation (15) and putting τz= 1 for protons, the equation is rewritten as: ρ π α α α α α α 0 2 1 4 2 1 0 1 2 1 p i n l r j obdm i f j r r ( ) = +( ) ( ) +( ) ( ) ∑ � , , , , , � (16) the index α≡nα lα jα used for all closed shells for the ground state. the ground-state charge radii and charge distribution considered as two great determinable quantities experimentally, meanwhile, they can be calculated theoretically. the mean square radius for the nucleus gets from the charge density integration in equation (16) [22]. r r r dr ch p2 0 2= ( )∫   � � � (17) under the effect of the point-proton folded charge density distribution, in equation (10), the charge density needs to be corrected by the folding factor [25]. ρ π fo r r ar r a e a fm     −( ) = =− −( ) ' � � � �,� . � � ' 1 0 6532 3 6 2 2 (18) now, for the normalized charge density with the target nucleus atomic number z, the root-mean-square in equation (17) gives [24]: r z r r r r drch p fo 2 0 21= ( ) −( )∫      � � �' (19) 3. results and discussion the beryllium nucleus has 12 known isotopes, but only one of these isotopes (be9) is stable and a primordial nuclide. the microscopic structure of the stable nucleus be9 imagined as being composed of a tightly bound core he4 plus five loosely bound nucleons outside the core divided over the p-shell (1 p3/2, 1 p1/2). on the other way, it consists of four protons and five neutrons. in this paper, the cp effects are calculated according to equation (7) which include m3y and msdi interactions. the potential parameters of m3y which known as three range potential contain spin orbit, central, and tensor interactions are obtained from bertsch et al. [26]. besides, the msdi strength parameters that used in the calculations of the cp effects are a t , b, and c. where t is defined as the isospin (1, 0). they have taken the values as a 0 =a 1 =b=25/a and c=0 [20], where a is the mass number of beryllium nucleus, b and c are the correction parameters. it has the ho length parameter b = 1.791 fm [27]. fortran 2008 used as a computer program for calculating cpm3y and msdi in the elastic and inelastic form factors. in addition, the obdm elements calculated with the shell model code oxbash for excitation states but is obtained from the occupation numbers for the closed-shell orbits (ground state). beryllium nucleus has a ground state, whereas its value is ( �j ti i  = 3/2 −1/2) e = 0.0 mev. here, two transitions are under investigation representing c 2 with e = 2.43 mev where the transition occurs to the excited state (jf tf = 5/2 −1/2) and the other excited state is (jf tf = 7/2 −1/2) e = 6.38 hawar muhamad dlshad, et al.: electron scattering form factors for be9 60 uhd journal of science and technology | jul 2020 | vol 4 | issue 2 mev o[28]. in all graphs, the form factors with ms and cp effects including the realistic (m3y) interactions representing as the red lines, the form factors with msdi interaction performs as blue lines and the small filled circles represent the experimental values for the electron scattering form factors. 3.1. 
elastic coulomb form factor for 3/2 −1/2 state for elastic electron scattering, the scattered electron leaves the nucleus in the ground-state configuration. the ground state has (jπ t= 3/2 −1/2) with e = 0.0 mev. the multipoles entering the elastic scattering are j = 0, 2 with the corresponding coulomb transition c 0 and c 2 , respectively. the calculated form factor of sum c 0 + c 2 is shown in fig. 1. the obtained obdm elements are shown in table 1. the calculated root-mean-square (charge radii) for the ground state without folding is 2.629 fm but with folding is 2.505 fm, while experimentally is equal to 2.519 fm [29]. the results with m3y interaction have a great agreement with measured data in the transfer momentum domain of 1.1 ≤ q ≤2.5 fm−1. on the contrary, the calculations with msdi interaction have a bad deal with the experimental data excepting the area of 1.5 ≤ q ≤ 2 fm−1 where they have similarities with each other. 3.2. inelastic coulomb form factor for 5/2 −1/2 state the c 2 transition for coulomb scattering is taking place between the ground state of (jπ t = 3/2 −1/2) and the first excited state (jπ t = 5/2 −1/2) with excitation energy of e = 2.43 mev. the computed and measured coulomb form factors of inelastic electron scattering for the be9 nucleus are shown in fig. 2. the obdm elements which calculated with oxbash code are listed in table 2. in this transition, the calculations with msdi are not able to denote an adequate description of the experimental data for the region of transfer momenta (q = 0.8 fm−1) and (q = 1.8 fm−1), but once the cp effect with m3y interaction is applied, making the results of the total theoretical form factors fitting the experimental data along with all regions of transfer momenta. 3.3. inelastic coulomb form factor for 7/2 −1/2 state the squared inelastic scattering of coulomb form factors for be9 is displayed in fig. 3. the symbol of this transition (coulomb transition) c j = c 2 , it occurs between the ground state (jπ t= 3/2 −1/2) and the second excited state (jπ t= 7/2 −1/2) with transition energy e = 6.38 mev. the obdm elements are tabulated in table 3. fig. 3 shows the plot of measured and calculated data for the squared inelastic coulomb scattering form factors. the ratio of agreement between the results of both m3y and msdi interactions for the be9 form factors and the measured data are quite strong between (q = 1 fm−1) and (q = 2.5 fm−1). taking into consideration that the form factors for the second excited state (jπ t = 7/2 −1/2) are not substantially different from fig. 1. elastic coulomb c 0 +c 2 form factors for be9. the experimental data were taken from reference [28]. 
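for reference, the elastic c 0 multipole entering fig. 1 is obtained from the ground-state charge density through equation (10), which in conventional notation reads

\[
F_{0}(q)=\frac{4\pi}{Z}\int_{0}^{\infty}\rho_{0}(r)\,j_{0}(qr)\,r^{2}\,dr
\]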
table 1: the calculated obdm elements for the coulomb c0+c2 transition of be 9 j1 j2 obdm (n) obdm (p) c0 1s1/2 1s1/2 2.8284 2.8284 1p1/2 1p1/2 0.5305 0.6072 1p3/2 1p3/2 1.6249 2.5707 c2 1p1/2 1p3/2 0.2502 0.1623 1p3/2 1p1/2 −0.2502 −0.1623 1p3/2 1p3/2 −0.4610 −0.2982 table 2: the calculated obdm elements for the coulomb c2 transition of be 9 j1 j2 obdm (n) obdm (p) 1p1/2 1p3/2 −0.4820 −0.8767 1p3/2 1p1/2 0.4187 0.5365 1p3/2 1p3/2 0.6372 0.1167 table 3: the calculated obdm elements for the coulomb c2 transition of be 9 j1 j2 obdm (n) obdm (p) 1p1/2 1p3/2 0.2724 0.2442 1p3/2 1p1/2 −0.1264 −0.1242 1p3/2 1p3/2 −0.7953 −0.1735 hawar muhamad dlshad, et al.: electron scattering form factors for be9 uhd journal of science and technology | jul 2020 | vol 4 | issue 2 61 that of the first excited state (jπ t = 5/2 −1/2), whence a noticeable change is observed in the computation of form factors by m3y and msdi interactions. the msdi decreased faster than the m3y during the increase of momentum transfer, especially at the point of (q = 3 fm−1). 4. conclusion in the present work, it is possible to consider the following conclusions: • the basic calculations include the coulomb form factors for the ground state and other excitation states. • the ground-state coulomb form factors (c 0 transitions) for the m3y interaction and the ground-state charge radii with folding effect give the best fit with the experimental data for beryllium (be9) nucleus. • for be9 nucleus which is under consideration, the quality of similarity between the computed coulomb form factors f c (q) and those of the measured data become even better in the using of the cp effects with including m3y residual interaction for all coulomb c 2 transitions because m3y interaction is more realistic nucleonnucleon interaction that adopted for the cp calculation. • the calculation of coulomb form factors fc(q) with (msdi) interaction is dealing with the surface nucleons only. therefore, it has a limited agreement with the experimental results. • the ho succeeded to describe the wave functions completely. 5. conflicts of interest there are no conflicts of interest. references [1] t. w. donnelly and j. d. walecka. “electron scattering and nuclear structure”. annual review of nuclear science, vol. 25, no. 1, pp. 329-405, 1975. [2] d. j. millener, d. i. sober, h. crannell, j. t. o’brien, l. w. fagg and l. lapikas. “inelastic electron scattering from 13c”. physical reviews, vol. 39, no. 1, pp.14-46, 1989. [3] s. cohen and d. kurath. “effective interactions for the 1p shell”. nuclear physics, vol. 73, no. 1, pp. 1-24, 1965. [4] d. salman, s. a. al-ramahi and m. h. oleiwi. “inelastic electronnucleus scattering form factors for 64, 66, 68zn isotopes”. vol. 2144. in: aip conference proceedings, p. 030029, 2019. [5] k. s. jassim. “longitudinal form factor for some sd-shell nuclei using large scale model space”. international journal of science and technology, vol. 1, no. 3, pp. 140-143, 2011. [6] n. f. mott. “sir nevill mott: 65 years in physics”. vol. 12. world scientific, singapore, 1995. [7] z. czyżewski, d. o. n. maccallum, a. romig and d. c. joy. “calculations of mott scattering cross section”. journal of applied physics, vol. 68, no. 7, pp. 3066-3072, 1990. [8] h. uberall. “electron scattering from complex nuclei v36a”. academic press, new york, london, 2012. [9] f. a. majeed. “the effect of core polarization on longitudinal form fig. 2. inelastic coulomb c 2 form factors for 5/2 −1/2 state of be9 with e = 2.43 mev. 
the experimental data were taken from reference [28]. fig. 3. inelastic coulomb c 2 form factors for 7/2 −1/2 state of be9 with e = 6.38 mev. the experimental data were taken from reference [28]. hawar muhamad dlshad, et al.: electron scattering form factors for be9 62 uhd journal of science and technology | jul 2020 | vol 4 | issue 2 factors in 10b”. physica scripta, vol. 85, no. 6, p. 065201, 2012. [10] g. fricke, c. bernhardt, k. heilig, l. a. schaller, l. schellenberg, e. b. shera and c. w. dejager. “nuclear ground state charge radii from electromagnetic interactions”. atomic data and nuclear data tables, vol. 60, no. 2, pp. 177-285, 1995. [11] f. i. sharrad, a. k. hamoudi, r. a. radhi, h. y. abdullah, a. a. okhunov and h. a. kassim. “elastic electron scattering from some light nuclei”. chinese journal of physics, vol. 51, no. 3, pp. 452465, 2013. [12] r. a. radhi, a. a. alzubadi and e. m. rashed. “shell model calculations of inelastic electron scattering for positive and negative parity states in 19f”. nuclear physics a, vol. 947, pp. 12-25, 2016. [13] e. m. raheem, r. o. kadhim and n. a. salman. “the effects of core polarization on some even-even sd-shell nuclei using michigan three-range yukawa and modified surface delta interactions”. pramana, vol. 92, no. 3, p. 39, 2019. [14] b. a. brown, a. etchegoyen, n. s. godwin, w. d. m. rae, w. a. richter, w. e. ormand and c. h. zimmerman. “oxbash for windows pc”. in: msu-nscl report, pp. 1289, 2004. [15] s. mohammadi, b. n. giv and n. s. shakib. “energy levels calculations of 24al and 25al isotopes”. nuclear science, vol. 2, no. 1, pp. 1-4, 2017. [16] t. de forest and j. d. walecka. “electron scattering and nuclear structure”. advances in physics, vol. 15, no. 57, pp. 1-109, 1966. [17] l. j. tassie and f. c. barker. “application to electron scattering of center-of-mass effects in the nuclear shell model”. physical review, vol. 111, no. 3, p. 940, 1958. [18] k. s. jassim, a. a. al-sammarrae, f. i. sharrad and h. a. kassim. “elastic and inelastic electron-nucleus scattering form factors of some light nuclei: na 23, mg 25, al 27, and ca 41”. physical review c, vol. 89, no. 1, p. 014304, 2014. [19] h. fatah, r. a. radhi and n. r. abdullah. “analytical derivations of single-particle matrix elements in nuclear shell model”. communications in theoretical physics, vol. 66, no. 1, p. 104, 2016. [20] p. j. brussaard and w. m. glaudemans. “shell-model application in nuclear spectroscopy”. north-holland, amsterdam, 1977. [21] d. salman, d. r. kadhim. “longitudinal electron scattering form factors for 54, 56 fe”. international journal of modern physics e, vol. 23, no. 10, p. 1450054, 2014. [22] f. i. sharrad, a. k. hamoudi, r. a. radhi, y. abdullah, a. a. okhunov and h. a. kassim. “elastic electron scattering from some light nuclei”. chinese journal of physics, vol. 51, no. 3, pp. 452-465, 2013. [23] b. a. brown, r. radhi and b. h. wildenthal. “electric quadrupole and hexadecupole nuclear excitations from the perspectives of electron scattering and modern shell-model theory”. physics reports, vol. 101, no. 5, pp. 313-358, 1983. [24] h. m. dlshad and a. h. r. fatah. “using msdi and m3y core polarization for the coulomb electron scattering for some ground state nuclei”. jzs (part-a), vol. 21, no. 2, pp. 11-20, 2019. [25] g. s. anagnostatos, a. n. antonov, p. ginis, j. giapitzakis and m. k. gaidarov. “on the central depression in density of”. journal of physics g: nuclear and particle physics, vol. 25, no. 1, p. 69, 1999. [26] g. bertsch, j. borysowicz, h. 
mcmanus and w. g. love. “interactions for inelastic scattering derived from realistic potentials”. nuclear physics a, vol. 284, no. 3, pp. 399-419, 1977. [27] f. a. majeed. “longitudinal and transverse form factors from 12c”. physica scripta, vol. 76, no. 4, p. 332, 2007. [28] j. p. glickman, w. bertozzi, t. n. buti, s. dixit, f. w. hersman, c. e. hyde-wright and b. l. berman. “electron scattering from sup 9 be”. physical review c (nuclear physics) (usa), vol. 43, no. 4, pp. 1740-1757, 1991. [29] angeli and k. p. marinova. “table of experimental nuclear ground state charge radii: an update”. atomic data and nuclear data tables, vol. 99, no. 1, pp. 69-95, 2013. tx_1~abs:at/tx_2:abs~at uhd journal of science and technology | jan 2020 | vol 4 | issue 1 87 1. introduction computing technology is seeing significant progress and significant interest, especially when the computation outsourcing has been outsourced to a third party as the cloud is the most frequently used form [1]. that is why many companies no longer trust to store their sensitive data in the cloud, which uses traditional unsecured encryption systems [2]. from this, the need to use homomorphic encryption for banking data is coming, which is a new approach that can help banks to increase data security and management [3]. there are two types of homomorphic cryptosystems: partially homomorphic systems and fully homomorphic systems [4]. partially homomorphic schemes support one of the additions or multiplication operations, these systems are divided into two parts according to the process that supports like the rsa, where it only supports the multiplication process and does not support the addition process, for example, if we have two numbers m1, m2 and they are encrypted by the rsa, then its value becomes c1, c2 and on obtaining the product of multiplying the two encrypted values c1 * c2 = c3 and then we decrypt the encrypted output c3, we will get a result similar to m1 * m2 = m3, but if we add the two values c1 + c2 = c4 and when decrypting the result c4 we do not get a result similar to m1 + m2 = m4. on the contrary, when the two values are encrypted using paillier, we find that only the result of c1+c2=c5 is similar to m1+m2=m3 and c1*c2=c6 do not equal to m1+m2=m4. therefore, we say that the two algorithms (rsa and paillier) are not a fully homomorphic systems [5], [6]. the first fhe was given in 2009 by craig gentry [7]. researchers first researched a (fhe) system in the late last century, specifically at the end of the seventies, and soon after, in 1987, rsa was published, the rsa algorithm became a leading approach by many researchers because at that time there was no idea of the public key cipher a proposed fully homomorphic for securing cloud banking data at rest zana thalage omar1,2*, fadhil salman abed1,2 1university of human development, college of science and technology, department of computer science, sulaymaniyah, kurdistan region of iraq, iraq, 2university of sulaimani, college of science, computer department, sulaymaniyah, kurdistan region of iraq, iraq a b s t r a c t fully homomorphic encryption (fhe) reaped the importance and amazement of most researchers and followers in data encryption issues, as programs are allowed to perform arithmetic operations on encrypted data without decrypting it and obtain results similar to the effects of arithmetic operations on unencrypted data. 
the first (fhe) model was introduced by craig gentry in 2009, and it was just theoretical research, but later significant progress was made on it, this research offers fhe system based on directly of factoring big prime numbers which consider open problem now, the proposed scheme offers a fully homomorphic system for data encryption and stores it in encrypted form on the cloud based on a new algorithm that has been tried on a local cloud and compared with two previous encryption systems (rsa and paillier) and shows us that this algorithm reduces the time of encryption and decryption by 5 times compared to other systems. index terms: cloud computing security, encryption, decryption, cloud storage, homomorphic encryption corresponding author’s e-mail: zana thalage omar, university of human development, college of science and technology, department of computer science, sulaymaniyah, kurdistan region of iraq, iraq, university of sulaimani, college of science, computer department, sulaymaniyah, kurdistan region of iraq, iraq. e-mail: zana.th.omar@gmail.com received: 13-03-2020 accepted: 10-05-2020 published: 12-05-2020 access this article online doi: 10.21928/uhdjst.v4n1y2020.pp87-95 e-issn: 2521-4217 p-issn: 2521-4209 copyright © 2020 omar and abed. this is an open access article distributed under the creative commons attribution non-commercial no derivatives license 4.0 (cc by-nc-nd 4.0) o r i g i n a l re se a rc h a rt i c l e uhd journal of science and technology mailto:zana.th.omar@gmail.com omar and abed: cloud banking data security 88 uhd journal of science and technology | jan 2020 | vol 4 | issue 1 that was presented during the rsa scheme for the first time [5]. because this kind of encryption allows the key to decrypt the encrypted data, and thus one can read and know all the data, and for this reason, if one does not have the secret key, the data become useless. therefore, a question and an issue were asked: can mathematical operations apply to encrypted data without decrypting it, and from this, the idea of using fully homomorphic systems (fhe) was raised. after that, several attempts were made to develop these systems, but most of the research did not succeed as they received partially homomorphic schemes such as rsa and goldwasser-micali [8]. the algorithm that achieves the addition and multiplication properties can be considered as fhe, as it is regarded as a special algorithm that contains the feature of performing mathematical operations (addition and multiplication) on data without decrypting it and obtaining correct results [9]. fhe is an encryption technology that allows calculations to be performed on encrypted data without decrypting it, and this results in an encrypted result where when this result is decrypted we get a result similar to the result of the calculations on the data without encrypting it [9]. the world of computing is in constant progress, and the main challenge is to create a guarantee and trust among customers when storing their sensitive data on the cloud to ensure and respect their privacy. this is a new approach that cloud providers follow to encrypt users’ data, upload it to the cloud, and perform operations on it without decrypting it to ensure the integrity of customer data [10]. 
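the partial homomorphism of rsa described in this introduction can be checked in a few lines; the sketch below uses textbook rsa with deliberately tiny, illustrative parameters (it is not the scheme proposed in this paper).

```python
# textbook rsa with tiny illustrative parameters (not the proposed scheme)
p, q = 61, 53
n, phi = p * q, (p - 1) * (q - 1)        # n = 3233, phi = 3120
e, d = 17, 2753                          # e * d ≡ 1 (mod phi)

encrypt = lambda m: pow(m, e, n)
decrypt = lambda c: pow(c, d, n)

m1, m2 = 7, 11
c1, c2 = encrypt(m1), encrypt(m2)
print(decrypt((c1 * c2) % n) == m1 * m2)   # true: multiplication is preserved
print(decrypt((c1 + c2) % n) == m1 + m2)   # false: addition is not
```

paillier behaves the opposite way (sums of ciphertexts decrypt to sums of messages, while products do not), which is exactly the limitation that motivates a fully homomorphic construction.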
this paper presents a fully homomorphic system (the correct numbers and texts) based on a new algorithm that will be explained later in this paper as this scheme relies on data encryption and operations performed on it without decrypting and reducing computational complications and the time used to encrypt and decrypt data and reduce energy consumption. most of the previous research in this field deals with data when encrypting after converting it to the binary system and this means more time. as for our current research, data operations are encrypted without the need to convert them to binary representation and this reduces mathematical operations and there is a reduction in the time of encryption and encryption solution, as well as a mathematical model has been suggested that deals with the inverse calculation and the process of raising to the exponential and increases the complexity of attacking the new system. 2. literature review c. gentry et al. (2012), this paper introduces contrast/ orientation techniques to transfer the elements of plain text across these vectors very efficiently so that they are able to perform general calculations in a batch way without the need to decrypt the text and also make some improvements that can accelerate the normal homomorphic, where you can make homogeneous evaluation of arithmetic operations using multi-arithmetic head only [11]. j. fan et al. (2012), this paper concludes two copies of the redefinition that lead to a quick calculation of homogeneous processes using the parameter transformation trick, as this paper conveys brakerski’s fully homomorphic scheme based on the learning with errors (lwe) problem to the ring-lwe [12]. z. brakerski et al. (2012), this paper introduces squash and bootstrapping techniques to convert a somewhat symmetric encryption scheme into an integrated symmetric encryption scheme [13]. x. cao et al. (2014), this paper presents a completely symmetric encryption scheme using only a basic unit calculation as it relies on the technique of using multiplication and addition instead of using ideal clamps on a polynomial loop [14]. c. xiang et al. (2014), this paper presents an entirely symmetric encryption scheme on integers, as it reduces the size of the public key using the square model encryption method in public key elements instead of using a linear model based on a stronger variant of the approximate-gcd problem [15]. m. m. potey et al. (2016), this paper presents a completely symmetric encryption system where it focuses on storing customer data on the cloud in an encrypted form so that customer data remain safe and when any data modification is made the system loads data on the customer’s device and modifies it and then stores it again on the cloud in encrypted form [16]. k. gai et al. (2018), this paper proposes a new solution for mixing real numbers on a novel tensor-based fhe solution that uses tensor laws to reduce the risk of unencrypted data storage [17]. s. s. hamad et al. (2018), these heirs offer a completely symmetric encryption system, as it relies on the principle of encryption a number from the plain text with another number using a secret key without converting to binary format and then comparing the result with a dghv and sdc system [18]. s. s. 
hamad et al (2018), this paper presents a fully homomorphic encryption system based on euler’s theory and the time complexity has been calculated and compared with other systems with an encrypt key size up to 512 bits while the size of the key in our proposed scheme reaches more omar and abed: cloud banking data security uhd journal of science and technology | jan 2020 | vol 4 | issue 1 89 than 2048 bits and the encrypting process is done through more complex and powerful mathematical equations [19]. v. kumar et al (2018), this paper presents fully homomorphic encryption system with probabilistic encrypting and relies on euler’s theory. the encrypting process is done through the following mathematical equation (c=mk* 𝜇 (n) +1 mod x) while in our proposed scheme a more complex and difficult mathematical algorithm is used which helps to stand more against hacker attacks and deter them [20]. r. f. hassan et al. (2019), this paper proposes a blueprint for building asymmetric cloud-based architecture to save user data in the form of unusual text. this pattern uses the elliptic curve to create the secret key for data encryption. this pattern is a new algorithm that reduces processing time and storage space [21]. 3. statement of the problem cloud providers provide many services, including applications and storage many companies and users do not trust the providers of these services due to security concerns. where the user does not upload his personal data to the cloud because the cloud providers are able to read and modify every bit loaded on the cloud and use it for personal purposes, and this thing does not comply with respecting the user’s privacy. furthermore, some cloud providers still use traditional security techniques that are not secure with low-security level to protect user privacy. some of the cloud providers have started to use high-level technologies to protect the privacy of users and the security of their data, but there remains a problem that the provider of the cloud itself is still able to access user data, and this is not safe for users. this problem can be solved when following fhe systems when storing data on the cloud where these systems can encrypt the data and store it in the cloud in an encrypted form and thus the cloud provider or others cannot see the data and use it, so the privacy of users and the security of their data are protected. 4. proposed fhe system the proposed scheme works as follows: generating the encryption key and then encrypting the numbers and texts and storing them in encrypted form on the cloud. in our work, we use a local cloud and experiment with the proposed scheme on it. the purpose of this process is to save the data encrypted on the cloud so that no one can view the data and use it for personal purposes therefore, when the data owner needs to perform an amendment of the encrypted data on the cloud, an encrypted request is sent to the server, and the server performs mathematical operations on the encrypted data and returns an encrypted result where this encrypted result can only be decrypted through the private encryption key which is with the owner of data only so that he can decrypt the encrypted result and see his data. in this way, we have been able to maintain the privacy and security of the data when stored in the cloud. these procedures go through three stages. generation the encryption key stage, the encryption stage, and the decryption stage. the model of the proposed scheme is given in fig. 
1, and the flowchart of the proposed scheme is given in fig. 2. the proposed scheme performs several random examples with multiplication and addition as follows: a. key generation: 1. generate two large prime number p and q 2. compute n = p*q 3. calculate l=((p−1 mod q)*p)+((q−1 mod p)*q) 4. select r: where r is a big random integer b. b. messages encryption the conditions: (m 1 &m 2 ), (m 1 +m 2 ), and (m 1 *m 2 ) < n where m1 and m 2 are the messages. pk sk storage key generation encryption request decryption set to storage get from storage fig. 1. model of proposed fhe scheme. omar and abed: cloud banking data security 90 uhd journal of science and technology | jan 2020 | vol 4 | issue 1 the schema of message encryption is: c = l * mr µ (p) +1 mod n. (1) where 𝜇 (p) = (p-1), euler function and c, the ciphers text. c. message decryption the schema of cipher decryption is: m=c mod p (2) where m is the number or text that will be encrypt and c is the result of the encrypted number or text we named it cipher text d. euler’s theorem all of us know that euler’s theorem contains two-part they are: 1. m 𝜇 (p) ≡ 1(mod p), when p and m are prime to each other. 2. m r* 𝜇 (n) +1 ≡ m (mod n), when r is an integer, m<n and n=p*q where q and p are two primes number. e. a simple example of how to make an amendment to encrypted data we have two values m1 = 3, m2 = 5. we encrypt them through a simple encryption equation that is multiplied by each value, so we get c1 = m1 * 2 and c2 = m2 * 2, so c1 = 6, c2 = 10 when we add the two values c1 + c2 = c3 so c3 = 16 we decrypt c3 so we get the result 16/2 = 8 which is the same result when we add m1 + m2 = m3 where 3 + 5 = 8 as shown in fig. 3. 5. the prove of our schema is fhe we choose two numbers m 1 , m 2 and encrypt them to get two encrypted or (ciphers) c 1 and c 2 , respectively, and then we combine c 1 + c 2 to get a new ciphered result we name it c 3 then we decrypt c 3 and compare the result with m 3 which is the result of combine m 1 + m 2 we also multiply c 1 * c 2 to get c 4 and compare it to m 4 , which is the result of m 1 * m 2 . a. the prove of additive homomorphic if the following condition is fulfilled, it becomes clear to us that the proposed scheme additive homomorphic: m 1 +m 2 =dec [enc (m 1 ) + enc (m 2 )] (4) where dec is the decryption function and enc is the encryption function proof: c 1 =l*(m 1 r µ (p) +1 mod n). c 2 =l*(m 2 r µ (p) +1 mod n). c 1 +c 2 = l*(m 1 r µ (p) +1 mod n) + l*(m 2 r µ (p) +1 mod n). dec (c 1 +c 2 ) = (c 1 +c 2 ) mod p = [l*(m 1 r µ (p) +1 mod n) + l*(m 2 r µ (p) +1 mod n)] mod p = [(l mod p) + ((m 1 r µ (p) +1 mod n) mod p) + (l mod p) + ((m 2 r µ (p) +1 mod n) mod p)] = [(m 1 r µ (p) +1 mod p) mod n + (m 2 r µ (p) +1 mod p) mod n] we know m 1 r µ (p) +1 mod p = m 1 and m 2 r µ (p) +1 mod p = m 2 by euler’s theorem so fig. 2. flowchart of proposed fhe scheme. fig. 3. a simple example of how modify encrypted data. omar and abed: cloud banking data security uhd journal of science and technology | jan 2020 | vol 4 | issue 1 91 = (m 1 mod n) + (m 2 mod n) = (m 1 +m 2 ) mod n because m 1 +m 2 less than < (n) = m 1 +m 2 dec (c 1 +c 2 ) = m 1 +m 2 so the condition is fulfilled b. the prove of multiplicative homomorphic if the following condition is fulfilled, it becomes clear to us that the proposed scheme multiplicative homomorphic: m 1 *m 2 =dec [enc (m 1 ) * enc (m 2 )] (5) where dec is the decryption function and enc is the encryption function proof: c 1 =l*(m 1 r µ (p) +1 mod n). c 2 =l*(m 2 r µ (p) +1 mod n). 
c1 * c2 = (l*(m1^(r*μ(p)+1) mod n)) * (l*(m2^(r*μ(p)+1) mod n))
dec(c1 * c2) = (c1 * c2) mod p
= [(l*(m1^(r*μ(p)+1) mod n)) * (l*(m2^(r*μ(p)+1) mod n))] mod p
= [(l mod p) * (m1^(r*μ(p)+1) mod p) * (l mod p) * (m2^(r*μ(p)+1) mod p)] mod p, since p divides n
= [m1^(r*μ(p)+1) * m2^(r*μ(p)+1)] mod p, because l = n + 1 ≡ 1 (mod p)
we know m1^(r*μ(p)+1) mod p = m1 and m2^(r*μ(p)+1) mod p = m2 by euler's theorem, so
= (m1 * m2) mod p
= m1 * m2, because m1 * m2 is smaller than p
so dec(c1 * c2) = m1 * m2 and the condition is fulfilled.

6. real example
let us choose two different numbers, m1 = 10 and m2 = 40, select two prime numbers p = 523 and q = 617, select the random number r = 100, and compute n and l, where n = p*q and l = ((p^-1 mod q)*p) + ((q^-1 mod p)*q), as in fig. 1, so n = 322691 and l = 322692. now we compute c1 and c2, as shown in fig. 4, where
c1 = l*(m1^(r*μ(p)+1) mod n) = 322692 * (10^(100*522+1) mod 322691) = 84555952836
c2 = l*(m2^(r*μ(p)+1) mod n) = 322692 * (40^(100*522+1) mod 322691) = 70220360736

a. check the additive homomorphism
as shown in fig. 5, let c3 be the result of c1 + c2:
c3 = c1 + c2 = 84555952836 + 70220360736 = 154776313572
m3 = c3 mod p = 154776313572 mod 523 = 50, which is the same as m1 + m2 = 10 + 40 = 50

b. check the multiplication homomorphism
as shown in fig. 6, let c4 be the result of c1 * c2:
c4 = c1 * c2 = 84555952836 * 70220360736 = 5937549510520122247296
m4 = c4 mod p = 5937549510520122247296 mod 523 = 400, which is the same as m1 * m2 = 10 * 40 = 400

7. results
our proposed method has been applied in the java language on a laptop that has these characteristics: intel (r) core (tm)
fig. 4. verification that the proposed scheme supports multiplicative homomorphic system.
8110011485075986472668526287650474047702785552627 7139605888244856948751484971356945255025501977455 2242450237953291087748249243514503675865792656993 93729838051811451
q=1428710476741675056983014575571840561823434645 4308656139478728164897664839386811178089812726550 5708304317799826040708010919513590630000335514123 0450272776428691918542268804417638286844165397915 2025714562637864986328594073222682892055143305181 2687655466062902039677657892147970940904308293447 443717248390239707
r=365786518636912644776042065063644900157123708 343207168729026337822168443513411848955281137920 158798190812229338500124674534784650074194884515 2271827981943
n=2423607304574691011057220168339429745775510876 8568255735015912152074767485603430716960118055838 9243234428533681091865082478949426971733704964018 7576258686356978941473841479446251805425174083399
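the scheme and the real example of section 6 can be checked directly with python's arbitrary-precision integers; this is only an illustrative sketch (the authors' implementation uses java's BigInteger), with function names chosen here and the key formula read with p^-1 and q^-1 as modular inverses, which is consistent with the reported value l = n + 1.

```python
# minimal python check of the scheme and of the section 6 example
# (the paper's implementation is in java); function names are illustrative.

def keygen(p, q):
    """n = p*q and l = (p^-1 mod q)*p + (q^-1 mod p)*q, which equals n + 1."""
    n = p * q
    l = pow(p, -1, q) * p + pow(q, -1, p) * q   # modular inverses (python 3.8+)
    return n, l

def enc(m, p, n, l, r):
    """eq. (1): c = l * (m^(r*mu(p)+1) mod n), with mu(p) = p - 1."""
    return l * pow(m, r * (p - 1) + 1, n)

def dec(c, p):
    """eq. (2): m = c mod p."""
    return c % p

# section 6 example: p = 523, q = 617, r = 100, m1 = 10, m2 = 40
p, q, r = 523, 617, 100
n, l = keygen(p, q)                  # n = 322691, l = 322692
m1, m2 = 10, 40
c1, c2 = enc(m1, p, n, l, r), enc(m2, p, n, l, r)
print(c1, c2)                        # the paper reports c1 = 84555952836 and c2 = 70220360736
print(dec(c1 + c2, p))               # 50  = m1 + m2  (additive homomorphism)
print(dec(c1 * c2, p))               # 400 = m1 * m2  (multiplicative homomorphism)
```

for large keys the same code runs unchanged, since python integers, like java's BigInteger, are arbitrary precision.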
the message was: the language considered at the university is english ەییزیلگنیئ ینامز ەدنەسەپ ەب ادۆكناز ەل ەك یەنامز وەئ ةيزيلكنالا يه ةعماجلا يف ةربتعملا ةغللا 大學考慮的語言是英語 p=16963600001744112018845147215300525136637986969 1606741862700184835051235683163815632346364513963 5084178686272336909107427580252972488626713138540 fig. 5. verification that the proposed scheme supports additive homomorphic system. fig. 6. a real example of generating an encryption key and encrypting two different numbers. omar and abed: cloud banking data security uhd journal of science and technology | jan 2020 | vol 4 | issue 1 93 0 50000 100000 150000 200000 250000 300000 64 bit 128 bit 256 bit 512 bit 1024 bit 2048 bit e xc ut io n ti m e in m s size of key proposed method rsa pailler fig. 7. computation encryption time of various schema. fig. 8. computation decryption time of various schema. 0 50000 100000 150000 200000 250000 300000 64 bit 128 bit 256 bit 512 bit 1024 bit 2048 bit e xc ut io n ti m e in m s size of key proposed method rsa pailler 0064335766935428553190201256143433654205627755229 6773381542471828455818813480172653876483980783338 1523057539397742755088141082360135822895062302531 9405062251415063552873019444449238666440140085803 2829153319755489679960430558612883401366594381416 5468112883656495673094721811758386521739451237520 5070768701405826931878983152614067930454176175622 2924904444160392437762620644204922911348434700560 07271825256265091103199457484857 l=24236073045746910110572201683394297457755108 7685682557350159121520747674856034307169601180 5583892432344285336810918650824789494269717337 0496401875762586863569789414738414794462518054 2517408339900643357669354285531902012561434336 5420562775522967733815424718284558188134801726 5387648398078333815230575393977427550881410823 6013582289506230253194050622514150635528730194 4444923866644014008580328291533197554896799604 3055861288340136659438141654681128836564956730 9472181175838652173945123752050707687014058269 3187898315261406793045417617562229249044441603 9243776262064420492291134843470056007271825256 265091103199457484858 the message after encryption (cipher text) 541476558809409702391337896786206578891371305860 691083047075666735305343010133163455201330600329 818224876828649953919527566237735831578747495518 661815006015094327373599324058140501637682523981 366081263444402953878958225004102881404987245214 085192130463968623162036132714218988345866733882 828903027959438577677193858956252126893602243322 002345822997903630750182808060329693726890973821 429052022147058264305295245097017754099269475380 968046201854139181624798301373478600684536391994 135042539217304792283425928429438405414943114956 731879603950076538717093967938918097476473355425 283428257417215267662967218064104960563636218183 044111151212122457871341575675158274986166996526 006578968820402465601212584511978294298514268554 125554995603375526132322574633145472359908234720 133081143881121000520379674740198817341417761860 826872691325817210768306765600237104658826101240 831563114649492567258100255788974674414548062825 table 1: computation encryption time of various schema key size proposed method rsa pailler 64 bit 88 ms 59 ms 103 ms 128 bit 139 ms 100 ms 182 ms 256 bit 218 ms 149 ms 727 ms 512 bit 1141 ms 397 ms 4212 ms 1024 bit 6058 ms 2185 ms 55139 ms 2048 bit 65876 ms 29820 ms 263303 ms average 12253 ms 5451 ms 53944 ms table 2: computation decryption time of various schema key size proposed method rsa pailler 64 bit 31 ms 72 ms 193 ms 128 bit 43 ms 116 ms 330 ms 256 bit 50 ms 
315 ms 1429 ms 512 bit 60 ms 1450 ms 11441 ms 1024 bit 131 ms 7829 ms 122628 ms 2048 bit 260 ms 79609 ms 976289 ms average 95 ms 14898 ms 185385 ms omar and abed: cloud banking data security 94 uhd journal of science and technology | jan 2020 | vol 4 | issue 1 289283723466283906129211351747505124028302482285 703983215028876622455291720747481357534432866979 561161051794214495384555564764420022771673904608 026921423801092986234452794686905395614250428215 39788299132797410163169266322544228501653889014 41231774699848323143262753973952981234912026166 08092930905934341723828737491536602463718665109 04460259954270328428583255617601007624795283333 12842498083186883479202672771030666642010819171 03871200835323397770355533934916679402150831209 71613731326031096116661094157439907155297740345 56154583520372539433462549143084673593882815487 33624326112691124298132589125000613618859548392 80194895402855066065235834892981371608451492075 898392646836063832420875791614210127746840887222 061576759203922203224378888374677613916469740136 215937279995273878941455335546570056409881117615 612427776918414604124368172979351492484034377939 232910419697167267189883148981938503891449737345 277644170563374412805408898899652315897930433017 221778569673211415882347553987827098592640370937 720688618264473231853293964905495556724277624311 697945653171210371750503583126470426057905397533 244577146375498719004689422402622745765224202206 710655778164805330785789281954819858081405410264 417267795724923069668706099902071….etc. due to the length of the encrypted text (cipher text), which reaches 67 pages, it has been truncated. where the encrypted time was 1572 ms, and the decrypted time was 31 ms. we have also tested it on text with 8kb in its size and several different keys in terms of size, and we compared the results with the planners from. in terms of velocity, we obtained the following results: as shown in tables 1 and 2, respectively, and the graph in figs 7 and 8. 8. conclusion our scheme relies on fhe on whole numbers, texts, and supports all languages such as english, arabic, kurdi, and chinese and others. very large prime numbers (up to 617 digits, 2048 bit) represent the strength for the attack of our scheme because the proposed system depends on the problem of factorization to the primary factors, which are considered mathematical problems under discussion at the present time when taking the time. we have come to the conclusion that our scheme is very effective in relation to the time when encrypt and decrypt numbers and texts in comparison with other techniques and approaches that are circulated and used at the present time. references [1] l. a. tawalbeh and g. saldamli. “reconsidering big data security and privacy in cloud and mobile cloud systems”. journal of king saud university computer, vol. 40, pp. 1-7, 2019. [2] j. domingo-ferrer, o. farràs, j. ribes-gonzález and d. sánchez. “privacy-preserving cloud computing on sensitive data: a survey of methods, products and challenges”. computer and communications, vol. 140-141, no. 2018, pp. 38-60, 2019. [3] s. sakharkar, s. karnuke, s. doifode and v. deshmukh. “a research homomorphic encryption scheme to secure data mining in cloud computing for banking system”. international journal for innovative research in multidisciplinary field, vol. 4, no. 4, pp. 276-280, 2018. [4] j. h. cheon, a. kim, m. kim and y. song. “homomorphic encryption for arithmetic of approximate numbers”. in: lecture notes in computer science. vol. 10624. 
springer science+business media, berlin, germany, pp. 409-437, 2017. [5] p. sha and z. zhu. “the modification of rsa algorithm to adapt fully homomorphic encryption algorithm in cloud computing”. in: proceeding 2016 4th ieee international conference cloud computing and intelligence systems. pp. 388-392, 2016. [6] l. chen and z. zhang. “bootstrapping fully homomorphic encryption with ring plaintexts within polynomial noise. vol. 2. conference paper, pp. 285-304, 2017. [7] c. gentry. “a fully homomorphic encryption scheme”. dissertation, p. 169, 2009. [8] v. kumar and n. srivastava. “chinese remainder theorem based fully homomorphic encryption over integers”. international journal of applied engineering research, vol. 14, no. 2, pp. 203208, 2019. [9] m. a. mohammed and f. s. abed. “a symmetric-based framework for securing cloud data at rest”. turkish journal of electrical engineering and computer sciences, vol. 1, pp. 347-361, 2019. [10] k. gai, m. qiu, y. li and x. y. liu. “advanced fully homomorphic encryption scheme over real numbers”. in: proceeding 4th ieee international conference cyber secur cloud computing cscloud 2017 3rd ieee intertnational conference scalable smart cloud, ssc 2017, pp. 64-69, 2017. [11] c. gentry, s. halevi and n. p. smart. “fully homomorphic encryption with polylog overhead”. in: lecture notes in computer science. vol. 7237. springer science+business media, berlin, germany, pp. 465-482, 2012. [12] j. fan and f. vercauteren. “somewhat practical fully homomorphic encryption”. in: proceeding 15th international conference practice theory public key cryptogr, pp. 1-16, 2012. [13] z. brakerski and v. vaikuntanathan. “fully homomorphic encryption from ring-lwe and security for key dependent messages”. in: lecture notes in computer science. vol. 6841. springer, berlin, germany, pp. 505-524, 2011. [14] x. cao, c. moore, m. o’neill, n. hanley and e. o’sullivan. “highspeed fully homomorphic encryption over the integers”. in: lecture notes in computer science. vol. 8438. springer, berlin, germany, pp. 169-180, 2014. [15] c. xiang and c. m. tang. “improved fully homomorphic encryption over the integers with shorter public keys”. international journal of security and its applications, vol. 8, no. 6, pp. 365-374, 2014. [16] m. m. potey, c. a. dhote and d. h. sharma. “homomorphic encryption for security of cloud data”. procedia computer science, omar and abed: cloud banking data security uhd journal of science and technology | jan 2020 | vol 4 | issue 1 95 vol. 79, pp. 175-181, 2016. [17] k. gai and m. qiu. “blend arithmetic operations on tensorbased fully homomorphic encryption over real numbers”. ieee transactions on industrial informatics, vol. 14, no. 8, pp. 35903598, 2018. [18] s. s. hamad and a. m. sagheer. “design of fully homomorphic encryption by prime modular operation”. telfor journal, vol. 10, no. 2, pp. 118-122, 2018. [19] s. s. hamad and a. m. sagheer. “fully homomorphic encryption based on euler’s theorem”. the international journal of information security, vol. 9, no. 3, p. 83, 2018. [20] v. kumar, r. kumar, s. k. pandey and m. alam. “fully homomorphic encryption scheme with probabilistic encryption based on euler’s theorem and application in cloud computing”. in: advances in intelligent systems and computing. vol. 654. springer, berlin, germany, pp. 605-611, 2018. [21] r. f. hassan and a. m. sagheer. “a proposed secure cloud environment based on homomorphic encryption”. international advanced research journal in science, engineering and technology, vol. 6, no. 
5, pp. 166-175, 2019. . uhd journal of science and technology | jan 2020 | vol 4 | issue 1 51 1. introduction the interest is shown by medical professionals in deepening their knowledge of internal anatomy plays an essential part in the importance of medical images that are used in both treatment and diagnosis [1]. numerous methods of diagnostic medical imaging have been created dependent on different types of electromagnetic band imaging which includes gamma-ray and x-ray imaging, cross-sectional pictures, such as computed tomography (ct), single-photon emission ct, positron emission tomography, magnetic resonance imaging (mri), or ultrasound [1,2]. these various applications and techniques in medical image processing rely on different ranges of electromagnetic spectrum bands [2]. improvements in technology caused to increase size and volume of medical imaging, also these developments raise demand on automated diagnosis with computer technology developments, also it decreases the cost and time [3]. image segmentation can be described as; trying to find homogenous limits inside an image and after that the classification of them, and can be considered as the most significant field of medical image processing, also it allows images to be divided into relevant areas according to homogeneity or heterogeneity criteria, and it is an automatic or semi-automatic process used for separating the region of interest (roi), also there are numerous medical applications that use to differentiate in the segmentation of body organs and tissues, these include cardiology image analysis, breast tumor detection, autoclassification in hematology field, brain cognitive development, mass segmentation, mass detection, digital medical image segmentation using fuzzy c-means clustering bakhtyar ahmed mohammed1, muzhir shaban al-ani2 1department of computer science, university of human development, college of science and technology, sulaymaniyah, krg, iraq, 2department of information technology, university of human development, college of science and technology, sulaymaniyah, krg, iraq o r i g i n a l re se a rc h a rt i c l e a b s t r a c t in the modern globe, digital medical image processing is a major branch to study in the fields of medical and information technology. every medical field relies on digital medical imaging in diagnosis for most of their cases. one of the major components of medical image analysis is medical image segmentation. medical image segmentation participates in the diagnosis process, and it aids the processes of other medical image components to increase the accuracy. in unsupervised methods, fuzzy c-means (fcm) clustering is the most accurate method for image segmentation, and it can be smooth and bear desirable outcomes. the intention of this study is to establish a strong systematic way to segment complicate medical image cases depend on the proposed method to share in the decision-making process. this study mentions medical image modalities and illustrates the steps of the fcm clustering method mathematically with example. it segments magnetic resonance imaging (mri) of the brain to separate tumor inside the brain mri according to four statuses. index terms: medical image, medical image modality, segmentation, fuzzy c-means clustering corresponding author’s e-mail: bakhtyar ahmed mohammed, department of computer science, university of human development, college of science and technology, sulaymaniyah, krg, iraq. 
e-mail: bakhtyar.mohammed@uhd.edu.iq received: 30-12-2019 accepted: 23-02-2020 publishing: 27-02-2020 access this article online doi: 10.21928/uhdjst.v4n1y2020.pp51-58 e-issn: 2521-4217 p-issn: 2521-4209 copyright © 2020 mohammed and al-ani. this is an open access article distributed under the creative commons attribution noncommercial no derivatives license 4.0 (cc by-nc-nd 4.0) uhd journal of science and technology mohammed and al-ani: digital medical image segmentation using fcm clustering 52 uhd journal of science and technology | jan 2020 | vol 4 | issue 1 surgery simulations, plan of surgery, and detection of vessel boundary in coronary angiograms [1,4]. techniques of image segmentation can be characterized according to these essential terminologies, such as oriented of pixel, color, region, model, and hybrid [4]. intelligent decision support systems commonly use the prevailing image segmentation to accurately organize image pixels [5]. the procedure divides the picture into systematic and defined sectors depending on their similarities [5]. image segmentation is one main component of analysis processes which use in these techniques; remote sensing, computer vision, medical image processing, and geographical information system [6]. image segmentation plays a key role in automated object recognition systems in the process of computer vision and medical imaging for the analysis of details, image segmentation enables greater ease in detecting and quantifying abnormalities in anatomical structures, such as the brain and lung [5]. however, the processing of the image segmentation can be affected by improper illumination, noise disturbances, environmental factors, and blurring of images, an important phase toward the automatic segmentation of images is region segmentation since this is the step taken to determine and segment the area of interest [7]. because of ease to apply fuzzy c-mean (fcm) and its high accuracy, it has become one of the best way for image segmentation [8]. nevertheless, fcm has inadequacies in confusion acknowledgment; various undertakings are practiced for covering this deficiency, with using the objective work fcm and the use of neighbor pixels despite the pixel and besides pixel division have been used, fcm methodology is used for improving the accuracy in picture division, enlistment work is changed [8]. the best criterion to find the optimum solution for these issues is the method known as fcm clustering which is a clustering method whereby points of data can be designated to more than one group each based on shared correlations, and then tries to identify parallels and relationships within each set, least-squares solutions are employed to identify the ideal location for any point of data which may lie in a space of probability bounded by two or more clusters, also there should be as higher level as possible in the likeness of clusters and as lower-level as possible in differences and fuzzy boundaries are easier to develop from a computational point of view [9]. the purpose of writing this paper is to indicate the importance of the segmentation process in image processing and the mechanism of brain segmenting image using fcms clustering and how it can find the optimum solution for segmenting images and diagnosis using medical image techniques. furthermore, there are four major steps in the medical imaging field, which consist of capturing the image, digitalizing it, processing for segmentation and finally extracting important information [9,10]. 2. 
literature review in 2008, ahmed and mohamad explored that fuzzy clustering is very important in image segmentation, using parallel length to calculate fuzzy weights [11]. in 2010, naz et al. found that the reason digital image processing developers are innovating this method is the best and most accurate way of diagnosing medical imaging can be found, improvements have been made rapidly with many methodologies being put forward supported by a wide range of literature on taking information from a picture and dividing it into defined areas. however, constraints are presented in regards to intricacy, time, and precision as a result of unclear cluster borders shown in images, fuzzy techniques, on the other hand, are largely free of such problems and provide much better results in comparison to other segmented image methods [6]. in 2010, padmavathi et al. stated that quality of underwater image is different from the quality of an image which capture in air, because some factors have impact of it such as; water medium, atmosphere, pressure, and temperature which means that image segmentation is necessary for digital image processing that implies image demonstrations is the need of picture segmentation, which separates a picture into portions that have strong correlations with objects to mirror the genuine data gathered from this present reality, picture segmentation is the most down to earth approach among practically all robotized picture acknowledgment frameworks, also clustering of numerical information shapes the premise of numerous arrangement and framework displaying calculations [7]. in 2011, quintanilla-dominguezab et al. tested fcm for the early stages of breast cancer detection using mammography technique [12]. in 2013, jiang et al. utilized the fuzzy science strategy, and fuzzy grouping examined isolates the differentiate things and arranges them [13]. furthermore, in 2013, yambal and gupta showed that segmentation process is unsupervised classification technique and an important step in advance image analysis process mohammed and al-ani: digital medical image segmentation using fcm clustering uhd journal of science and technology | jan 2020 | vol 4 | issue 1 53 used as assistance to some other processes like detection for mri brain tumor, generally original works of clustering are detecting anomalies, identifying salient features, classifying data, and compressing data, using conventional fcms algorithm depend on hierarchical self-organized map, the aim of image segmentation process is to effective segmentation of noisy images [14]. in 2014, khalid et al. illustrated that fcms can diagnose some special diseases, such as glaucoma which is an ailment characterized by expanded weight inside the eyeball, making extreme harm the optic nerve, it is the most astounding reason for visual deficiency and irreversible, with early revelation early and appropriate treatment which it could continue for a long time [15]. furthermore, in 2014, norouzi et al. showed that mechanism of clustering algorithms is same as the classification technique without training of data, these methods have unsupervised learning algorithm and individual authority to calculate the similar features in the image and retain some things, same as; keys to recognize other features that have same attributes, this method is compatible with most of the data mining algorithms considering unsupervised methods they do not train data because its process is not time consuming in segmenting [1]. in 2017, kumar et al. 
tested correlative distance by adding the process of eliminating, clustering, and merging to compute fuzzy weights using large initial prototypes and gaussian weights. in standard fcm spatial fcms methods incorporated the spatial information and altering of every cluster membership weights after considering the cluster distribution in the neighborhood [3]. in 2018, ali et al. denoted that segmentation is an essential step to the sensitive analysis of human tissue lesions with aim of improving the partition of different clusters of images rely on similar features [16]. 3. medical image modalities imperative topic in medical imaging is medical image modalities or techniques. it is used to anatomical vision of body organs. there are some modalities which use in digital medical image processing. 3.1. x-ray nowadays, x-ray imaging system uses in the branch of diagnostic radiology, however, there are many types of x-ray imaging systems, but the most famous one is conventional planar radiography which is too beneficial because of its low cost and low dose with side effect of interacting with useless beams. its problem is overlapping anatomical image details and finding a lot of clusters [17]. in telemedicine uses as an integral part to automate x-ray segmentation which use broadly in remote areas which usually accidents happen there which helps the medical staffs to analyze the emergency cases, in this situation x-ray analysis in two manners; first one is segmenting the bone region from the surrounding flesh of the bone then extraction their features, the other one is determining the situation of the bone which usually happen this case, also it automatically selecting fractions aim doctors and medical staffs to analyze the level of injury to select the suitable medical treatment [17]. 3.2. ct to avoid overlapping effect, ct has been evolved. it can reshape the picture from all sides and angles until it can recreate the image of the body organ. high radiation dose is side effect of ct, also side effect of using that big amount of radiation increased day after day until it has detected that using ct cause of cancer higher than other types [17]. 3.3. digital tomosynthesis this technique has been invented to solve the problems of overlapping and high dose of radiation. it has moderate performance between these two disadvantages of x-ray and ct. its angle was <360, but in digital theater systems, resolution depth increased with lower radiation compared to ct, because of this advantage, little angles accurately reshape images. two major reshaping methods have been evolved: analytical reconstruction and iterative reconstruction. in different algorithms of analytical reconstruction, filtered back-projection was the most famous one because of its smallest deforming and high precision. however, this method needs high pass filtering, to prevent this distortion maximum likelihood expectation maximization (mlem) evolved which essentially involved iteration number with edge-preserving regularization. another algorithm invented based on mlem known as chest digital tomosynthesis (cdt) system to project data using a real cdt system [17]. 3.4. mri using of mri images more common than other modalities in brain diagnosis and segmentation process. 
among unsupervised learning methods fcms clustering for medical image segmentation methods, fcm clustering is the most accurate one for segmentation process compare to other segmentation methods more appropriate for sensitivity mohammed and al-ani: digital medical image segmentation using fcm clustering 54 uhd journal of science and technology | jan 2020 | vol 4 | issue 1 noisy with intensity inhomogeneity mri which can properly decrease the noises and supply superior segmentation results [18]. mri is a wide medical modality that takes the image of the internal body and organs without contacting the skin. the characteristics of mri are non-linear. a lot of things have influences of mri accuracy, such as partial volumes effects which implies a pixel consists of more than one tissues, also, the volume is dependable thing in the segmentation process so as to determine the size of organs and strange things [19]. 3.5. high-resolution ct the segmentation is an indispensable step, as in many medical image analysis applications. accurate segmentation using high-resolution ct (hrct) images and quantification of the lungs have an important role in the early diagnosis of lung diseases. especially detection of nodules in the lung region, airway and vessels diameter, lung volume is the key component of diagnosing lung diseases. the definition of each lung region from the ct images is the first step for the computer-aided diagnosis algorithm of lung. to extract each lung tissue, fcms clustering algorithm has been applied to the segmentation of lung region in two-dimensions hrct images [5]. 4. methodology many methods developed during several past years to enhance medical segmentation with demanding to obtain an accurate diagnosis, but fcm was the most accurate one in unsupervised learning methods. in supervised learning, when you feed the images or specified dataset to the method, it can extract similar features and cluster them according to observe of patterns. fcm distribute one piece of data to two or more clusters. however, the con of these methods is cannot label similar groups. for instance, in the tested image, fcm can determine the tumor without labeling it. fcm clustering is the most accurate and widely method in the segmentation process of digital medical image processing compared to other methods. it acts as preprocessing, which aids classification and detection processes and sometimes use as a diagnosis process. mentioning of fuzzy k-means (fkm) clustering is important because it used before fcms clustering. nowadays, using fcm is wider than fkm clustering, because it could solve these problems that had been happened in fkm clustering. fkm creates segmentations hierarchically, which involve each data point that can only be specified in one cluster, but fcm permits data points to be allocated into more than one cluster in which each data point has a degree of the endurance of belonging to each cluster as in fuzzy logic. the mechanism of the process pass through these steps, as shown in fig. 1; the first step is importing an mri image that captured according to the specific technique of electromagnetic band imaging after that is converting the image in analog to same digital image size inside data acquisition step. then preprocessing image begin by preparing images to the segmentation process, which involves image denoising and restoration. the process used median filter, as shown in fig. 2. then, fcm clustering implement and segment images into four parts. 
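the acquisition and pre-processing stages just described can be sketched in a few lines; this is a minimal sketch assuming the mri slice is available as a grayscale image file, where the file name, filter size, and cluster count are placeholders (the paper's own experiments use matlab).

```python
# sketch of the data acquisition and pre-processing steps described above; the file name,
# filter size, and cluster count are illustrative placeholders (the paper works in matlab).
import numpy as np
from PIL import Image
from scipy.ndimage import median_filter

img = np.asarray(Image.open("brain_mri.png").convert("L"), dtype=float)  # acquisition: grayscale mri slice
img = median_filter(img, size=3)          # denoising/restoration with a median filter (fig. 2)
pixels = (img / 255.0).reshape(-1, 1)     # flatten intensities into one feature per pixel
n_clusters = 4                            # the study segments the image into four clusters
# "pixels" is then handed to the fcm clustering step; a numpy sketch of that step is
# given after the mathematical model in section 4.1 below.
```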
after that, the process converts the segmented result back to an analog image of the same size. finally, the segmented image is exported, as shown in fig. 3.
fig. 1. processes of fcm clustering.
fig. 2. original image after the filtration process.

4.1. mathematical model of fcms clustering
in general, the core of the fcm method is a mathematical model that runs through five steps until it clusters similar features together inside one region. because the first step takes the membership matrix values randomly, this example relies on imaginary values for a two-dimensional image to illustrate the mathematical process of fcm clustering. here m is the fuzzification parameter, whose range is between 1.25 and 2; in this example m = 2, i is the data point index, j is the cluster index, and c is the number of clusters.

4.1.1. first step
randomly initialize the values of the membership matrix (um) for the original image according to equation 1; the number of objects is 8, the number of clusters is 4, and the fuzziness parameter m lies between 1.2 and 2, as shown in table 1. using this equation:

Σ_{j=1}^{c} μ_j(x_i) = 1, for i = 1, 2, 3, 4, ..., k (1) [20]

4.1.2. second step
after that, the model finds the centroid of every cluster according to equation 2:

c_j = Σ_i [μ_j(x_i)]^m * x_i / Σ_i [μ_j(x_i)]^m (2) [20]

4.1.3. third step
after that, the model finds the distance (d_i) of every data point from every cluster centroid, for instance, in the current example, from centroid (1), centroid (2), centroid (3), and centroid (4), using the euclidean rule of equation 3:

d_i = sqrt((x_2 − x_1)^2 + (y_2 − y_1)^2) (3) [20]

the resulting distances for every cluster are shown in table 2.

4.1.4. fourth step
update the membership values according to equation 4:

μ_j(x_i) = (1/d_ji)^(1/(m−1)) / Σ_{k=1}^{c} (1/d_ki)^(1/(m−1)) (4) [20]

in this case, m = 2, i is the first data point and j is the first cluster. it is important to compute the new cluster values and put them in a new table so they can be compared with the earlier cluster values to get the new membership values, as shown in table 3.

4.1.5. final step
this process iterates until it arrives at the correct centroids.

4.2. fcms algorithm
• initialize the membership matrix with randomized values (fuzzy partition)
• calculate the centroid vectors and distances using the following equations

c_j = Σ_i [μ_j(x_i)]^m * x_i / Σ_i [μ_j(x_i)]^m (5)

and

d_i = sqrt((x_2 − x_1)^2 + (y_2 − y_1)^2) (6)

• update the partition matrix for the new elements according to this equation

μ_j(x_i) = (1/d_ji)^(1/(m−1)) / Σ_{k=1}^{c} (1/d_ki)^(1/(m−1)) (7)

• repeat this process, checking convergence, until the centroids match the manually obtained result; otherwise the loop returns back to step 2. a numpy sketch of these steps, applied to the table 1 example, follows tables 2 and 3 below.

table 1: x and y values of eight objects with initial membership values in the four clusters
x  y  c1   c2   c3   c4
2  8  0.1  0.2  0.3  0.4
4  6  0.3  0.4  0.6  0.8
6  4  0.5  0.6  0.9  0.2
8  2  0.7  0.8  0.2  0.6
1  7  0.2  0.1  0.5  0.1
3  5  0.4  0.3  0.8  0.3
5  3  0.6  0.5  0.1  0.5
7  1  0.8  0.7  0.4  0.7

fig. 3. all processes together.

5. results and discussion
the imperative thing in the procedure of digital medical image segmentation is the sensor or camera which is used to capture images.
this steps the primary step in the process and it has direct interaction with the physical organic things because the natural form of every signal is an analog signal. in medical imaging, it varies according to techniques of electromagnetic band imaging because any technique uses in the specified range to take an image. this study relied on using mri because it is the most effective technique to diagnose brain tumor. most of the time converting analog to digital happen inside the sensor. data acquisition is the first and important process in methodology for the training and testing process, but fcm clustering method is unsupervised learning. testing on this method did not depend on the specified dataset because it is only segmentation process and tested on five ready mri images taken from the internet that properly applied on the proposed method, which could clustered images well, as shown in fig. 4. after that, preprocessing step begins by image restoration using median filter and resizing because the segmentation process needs to find regions accurately and median filter does same process without damaging any edges. the next step is implementing fcms algorithm which use to segmenting medical images accurately. the proposed method dispatches the images to some clusters according to areas and how much iteration necessary to execute the method until it can show the accurate results. the backbone of the issue is the results of the algorithm that is applied on and original gray scale image of 4 years child head mri image in the left side. the target of this process is diagnosing an abnormal mass in the brain, as shown in fig. 4. the best method to solve these cases is fcm clustering because it simplifies the process of feature extraction. it separates different attributes according to these clusters that determine optionally. for instance, in the current method determined four clusters. the fcn algorithm depends on the fcn() method inside fuzzy logic toolbox inside the matlab tool. every mathematical step automatically happens inside the ready method according to the called image, but it iterates inside the method automatically 100 times to find the values of fcn objects and cluster them. matrix laboratory known as matlab is a practical robust tool used to conduct table 2: data point and distance for all clusters cluster 1 cluster 2 cluster 3 cluster 4 data point distance data point distance data point distance data point distance (2,8) 6.79 (2,8) 6.764 (2,8) 3.84 (2,8) 5.385 (4,6) 4 (4,6) 3.954 (4,6) 1.065 (4,6) 2.59 (6,4) 1.36 (6,4) 1.23 (6,4) 1.87 (6,4) 0.66 (8,2) 1.93 (8,2) 1.84 (8,2) 4.675 (8,2) 3.187 (1,7) 6.76 (1,7) 6.79 (1,7) 4.223 (1,7) 5.416 (3,5) 3.95 (3,5) 4 (3,5) 2.05 (3,5) 2.656 (5,3) 1.23 (5,3) 1.36 (5,3) 2.56 (5,3) 0.88 (7,1) 1.84 (7,1) 1.93 (7,1) 4.99 (7,1) 3.24 table 3: manually solved values of clusters x y cluster 1 cluster 2 cluster 3 cluster 4 2 8 0.198689327 0.199453065 0.351328263 0.250529346 4 6 0.136763286 0.138354361 0.513664923 0.21121743 6 4 0.204349796 0.225947742 0.148618034 0.421084428 8 2 0.326016177 0.34196262 0.134590635 0.197430568 1 7 0.206419945 0.205507927 0.330428327 0.257643801 3 5 0.185132797 0.182818637 0.356719292 0.275329273 5 3 0.26436788 0.23909742 0.127020505 0.369514195 7 1 0.346019973 0.329884326 0.127590531 0.19650517 fig. 4. original image. mohammed and al-ani: digital medical image segmentation using fcm clustering uhd journal of science and technology | jan 2020 | vol 4 | issue 1 57 procedures within the processing of digital images. 
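to make the section 4.1 update rules and the tables 1-3 worked example concrete, the following numpy sketch runs the same iteration on the table 1 data with m = 2; the two-iteration loop and the rounding are illustrative choices.

```python
# numpy sketch of the section 4.1 iteration, applied to the table 1 points and
# initial memberships with m = 2; the loop length here is illustrative.
import numpy as np

x = np.array([[2, 8], [4, 6], [6, 4], [8, 2],
              [1, 7], [3, 5], [5, 3], [7, 1]], dtype=float)    # eight objects (table 1)
u = np.array([[0.1, 0.2, 0.3, 0.4], [0.3, 0.4, 0.6, 0.8],
              [0.5, 0.6, 0.9, 0.2], [0.7, 0.8, 0.2, 0.6],
              [0.2, 0.1, 0.5, 0.1], [0.4, 0.3, 0.8, 0.3],
              [0.6, 0.5, 0.1, 0.5], [0.8, 0.7, 0.4, 0.7]])     # initial memberships (table 1)
m = 2.0                                                        # fuzzification parameter

for it in range(2):
    w = u ** m
    centroids = (w.T @ x) / w.sum(axis=0)[:, None]                      # eq. (2)/(5): centroids
    d = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)   # eq. (3)/(6): distances
    d = np.maximum(d, 1e-12)                                            # guard against a zero distance
    u = (1.0 / d) ** (1.0 / (m - 1))                                    # eq. (4)/(7): membership update
    u /= u.sum(axis=1, keepdims=True)
    print("iteration", it + 1)
    print("distances:\n", np.round(d, 3))    # iteration 1 matches table 2 up to rounding (e.g. 6.79 for (2,8), c1)
    print("memberships:\n", np.round(u, 3))  # iteration 1 matches table 3 up to rounding
```

matlab's fuzzy logic toolbox routine mentioned above automates this kind of iteration; matlab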
it used to carry out mathematical computations with matrices and vectors, which is very straightforward to operate as it incorporates computation, visualization, and programming inside the system [21]. many methods used to prevent the noise problem in digital image processing. however, the best filtering technique in this case is median filtering. in addition, different types of filters are applied to remove different types of noise. there are many types of filtering, such as median, mean (average), and gaussian. after that, implementing the fcm clustering method starts by applying preprocessed images to the fcm method. it involves grouping the similar images data and distributing it into n clusters with all data points inside the images. different clusters exhibit different visions, as shown in fig. 5. this method can illuminate some properties of the image. it makes clusters over that rule. fig. 6 shows the mass inside the head and it aids the medical professionals to decide accurately. fig. 7 shows the image separated into n clusters. fig. 8 shows the background of the image. in general, all of these processes mixed in this method, as shown in fig. 3. the result of fcm is the most accurate compared to other unsupervised learning methods. fcm is used in a wide range of applications compared to other types, especially fig. 5. spaces inside the head magnetic resonance imaging. fig. 6. segmented tumor image. fig. 7. segmentation according to n clusters. fig. 8. background of the image. mohammed and al-ani: digital medical image segmentation using fcm clustering 58 uhd journal of science and technology | jan 2020 | vol 4 | issue 1 in diagnosing. recently, it uses in most of the segmentation cases relate to medical images. 6. conclusions new image segmentation process using fcm clustering is very important to get every desired feature and making clusters to extract their patterns. in addition, the results of every step are very important for manually working to find what are the weak points capable to change and improve. comparison between final values to initial values is done to realize the difference between them. the fcm clustering method performs this process automatically, but changing parameters are so beneficial criterion to get the best accurate segmented images. fcm method is the most effective segmentation approach in unsupervised learning methods which used in medical image segmentation processes. the mathematical procedure of this approach is illustrated step by step. fcm clustering is an unsupervised learning method, which is an accurate segmentation process. it can rely in on diagnosis process, such as brain tumor, which tested. then, the important role in finding the boundary of components of medical images and how it can find roi and segment the organs and abnormal shapes. this approach supports medical professionals to take the correct decision. references [1] a. norouzi, m. s. m. rahim, a. altameem, t. saba, a. e. rad, a. rehman and m. uddin. “medical image segmentation methods, algorithms, and applications”. iete technical review, vol. 31, no. 3, pp. 199-213, 2014. [2] r. c. gonzalez and r. e. woods. “digital image processing”. 4th ed. pearson education, new york, 2018. [3] r. kumar, g. satheesh and b. nisha. “mri brain image segmentation using fuzzy c means cluster algorithm for tumor area measurement”. international journal of engineering technology science and research, vol. 4, no. 9, pp. 929-935, 2017. [4] t. saikumar, p. yugander, p. s. murthy and b. smitha. 
“colour based image segmentation using fuzzy c-means clustering”. in: international conference on computer and software modeling, singapore, 2011. [5] e. doğanay, s. kara, h. k. özçelik and l. kart. “a hybrid lung segmentation algorithm based on histogram-based fuzzy c-means clustering”. journal computer methods in biomechanics and biomedical engineering: imaging and visualization, vol. 6, no. 6, pp. 638-648, 2014. [6] s. naz, h. majeed and h. irshad. “image segmentation using fuzzy clustering: a survey”. in: 6th international conference on emerging technologies, islamabad, pakistan, 2010. [7] g. padmavathi, m. m. kumar and s. k. thakur. “nonlinear image segmentation using fuzzy c means clustering method with thresholding for underwater images”. ijcsi international journal of computer science issues, vol. 7, no. 3, pp. 35-40, 2010. [8] o. jamshidi and a. h. pilevar. “automatic segmentation of medical images using fuzzy c-means and the genetic algorithm”. journal computational medicine, vol. 2013, p. 972970, 2013. [9] g. stephanie. “fuzzy clustering definition”, 2016. available from: https://www.statisticshowto.datasciencecentral.com/fuzzyclustering. [last accessed on 2019 oct 01]. [10] l. ma, h. chen, k. meng and d. liu. “medical image segmentation based on improved fuzzy c-means clustering”. in: international conference on smart grid and electrical automation, changsha, china, 2017. [11] m. m. ahmed and d. b. mohamad. “anisotropic diffusion model segmentation of brain mr images for tumor extraction by combining kmeans clustering and perona-malik”. international journal of image processing, vol. 2, no. 1, pp. 27-34, 2008. [12] j. quintanilla-dominguezab, b. ojeda-magañaac, m. g. cortinajanuchsab, r. ruelasc, a. vega-coronab and d. andinaa. “image segmentation by fuzzy and possibilistic clustering algorithms for the identification of microcalcifications”. scientia iranica, vol. 18, no. 3, pp. 580-589, 2011. [13] h. jiang, y. liu, f. ye, h. xi and m. zhu. “study of clustering algorithm based on fuzzy c-means and immunological partheno genetic”. journal of software, vol. 8, no. 1, p. 134, 2013. [14] m. yambal and h. gupta. “image segmentation using fuzzy c means clustering: a survey”. international journal of advanced research in computer and communication engineering, vol. 2, no. 7, pp. l-5, 2013. [15] n. e. a. khalid, n. m. noor and n. ariff. “fuzzy c-means (fcm) for optic cup and disc segmentation with morphological operation”. procedia computer science, vol. 42, pp. 255-262, 2014. [16] n. a. ali, b. cherradi, a. e. abbassi, o. bouattane, m. youssfi. “gpu fuzzy c-means algorithm implementations: performance analysis on medical image segmentation”. multimedia tools and applications, vol. 77, no. 16, pp. 21221-21243, 2018. [17] o. bandyopadhyaya, a. biswasa and b. b. bhattacharyaba. “long-bone fracture detection in digital x-ray images based on digital-geometric techniques”. computer methods and programs in biomedicine, vol. 123, pp. 2-14, 2016. [18] s. k. adhikari, j. k. sing, d. kumar, b. m. nasipuri. “conditional spatial fuzzy c-means clustering algorithm for segmentation of mri images”. applied soft computing, vol. 34, pp. 758-769, 2015. [19] g. s. chuwdhury, m. khaliluzzaman and m. rashed-al-mahfuz. “mri segmentation using fuzzy c-means clustering and bidimensional empirical mode decomposition”. in: international conference on computer, communication, chemical, materials and electronic engineering, 2016. [20] e. n. sathishkumar. “fuzzy c means manual work”. 
lecturer at periyar university, salem, 2015. [21] r. c. gonzalez, r. e. woods and s. l. eddins. “digital image processing using matlab”. pearson education, london, united kingdom, 2004. tx_1~abs:at/tx_2:abs~at uhd journal of science and technology | jul 2020 | vol 4 | issue 2 117 1. introduction hepatitis b virus (hbv) is a small dna virus with a spherical structure and lipid coat membrane, a hepadnaviridae family member. cause acute liver inflammation in acute and chronic forms is followed by liver failure, cirrhosis, and hepatocellular carcinoma (hcc). more than 90% of patients with acute hbv infection recover although they have severe symptoms, whereas in chronic forms (can be asymptomatic); patients are unable to clear the virus [1]. several transmission routes are now known, including tattooing, piercing, exposure to the infected blood and body fluids, saliva, vaginal discharge, and semen, perinatal transmission. however, it was noticed that hbv infection in infancy and early childhood could lead to chronic hepatitis in about 95% of cases [2]. it was reported that the sexually transmitted hbv infection also could lead to chronic hepatitis in 5% of unvaccinated men and women. moreover, studies revealed that hbv could serological and molecular detection of hepatitis b virus among patients referred to kurdistan center for hepatology and gastroenterology in sulaimani city/kurdistan region of iraq raz sirwan abdullah1, salih ahmed hama1,2 1department of biology, college of science, university of sulaimani, kurdistan region, sulaymaniyah, iraq, 2department of medical laboratory science, college of health sciences, university of human development, kurdistan region, sulaymaniyah, iraq a b s t r a c t hepatitis b virus infection is caused by the hepatitis b virus, a major global health problem. this infection can lead to chronic conditions, followed by cirrhosis and hepatocellular carcinoma (hcc). the current study was aimed to detect hbv using serological and molecular techniques. during 2019, 300 blood samples were collected from kurdistan center for hepatology and gastroenterology in sulaimani city. enzyme-linked immunosorbent assay (elisa) and real-time polymerase chain reaction (rt-pcr) techniques were used for the detection of hbsag and hbv dna, respectively. obtained results were revealed that 92 out of 300 tested patients (30.66%) seropositive for hbsag. among 92 seropositive patients, 53 were shown positive results for hbv dna by rt-pcr. dental clinic visiting and dialysis were among the important risk factors for hbv transmission. the vast majority of positive results were among males. smokers showed relatively high rates of positive results. one-third of the referred patients who had liver complaints were positive for hbsag. more than half of the seropositive patients showed rt-pcr positive results. it was concluded that the molecular method (rt-pcr) is more sensitive and gives a more accurate result than serology (elisa). therefore, it can be used as a diagnostic tool for hbv detection. index terms: hepatitis b virus, hepatitis b surface antigen, enzyme-linked immunosorbent assay, gold in-tube, realtime polymerase chain reaction corresponding author’s e-mail:  raz sirwan abdullah, department of biology, college of science, university of sulaimani, kurdistan region, sulaymaniyah, iraq. 
e-mail: razbio89@gmail.com received: 12-10-2020 accepted: 08-11-2020 published: 26-11-2020 access this article online doi: 10.21928/uhdjst.v4n2y2020.pp117-122 e-issn: 2521-4217 p-issn: 2521-4209 copyright © 2020 raz. this is an open access article distributed under the creative commons attribution non-commercial no derivatives license 4.0 (cc by-nc-nd 4.0) o r i g i n a l r e s e a r c h a r t i c l e uhd journal of science and technology abdullah and hama: serological and molecular detection of hepatitis b virus among patients referred to kurdistan center for hepatology and gastroenterology in sulaimani city/kurdistan region of iraq 118 uhd journal of science and technology | jul 2020 | vol 4 | issue 2 survive outside the body for at least seven days [2], [3]. it was confirmed that hbv infection is a global health problem, and more than 400 million people worldwide suffering from chronic hbv infections [4]. hepatitis b surface antigen (hbsag) is one of the essential antigens of the virus, located on the lipid membrane of the virus [1], [4]. it was reported that hepatitis b e antigen (hbeag) negative and anti-hbe positive results in the laboratory could be considered as evidence of chronic hbv infection [4]. other investigators concluded that hbsag and anti-hbsag antibodies are more likely indicators of detecting hbv infection [5]. the absence of hbsag and detection of anti-hbsag among hbv patients can be considered a sign of recovery from the disease, whereas if hbv dna is still detected during this stage, this could be an indicator for chronic infection [4]. the vast majority of chronic cases may lead to hcc [6]. it was reported that the seropositive patients for hbsag, hbeag, and hbv dna were more likely suspected with the chronic infection of hbv. several mechanisms were known to lead to the conversion of chronic hbv infection to hcc [6]. researches showed that if the immune system fails to clear hbv, the cycles of necrosis, inflammation, and reconstruction will repeat, causing hepatocytes may suffer from potentially epigenetic changes and carcinogenic mutations [7]. as a subsequence, the hbv genome may find in almost liver tumor cells, which may alter liver cell function. this can lead to changes in carcinogen-related genes, including cyclin a, telomerase reverse transcriptase, platelet-derived growth factor beta-receptor, and mitogenactivated protein kinase 1 [6], [7]. it was good understood that smoking and alcohol are risk factors for hbv infection [8]. hbv replication is a relatively complex mechanism. the hbv genome is converted to a relaxed circular dna or a double-stranded linear dna during replication. each of them can be converted to covalently closed circular dna (cccdna) [9]. rna pre-genome and hbv mrnas are produced by cccdna. the rna pre-genome acts as the template for the synthesis of the negative-sense dna strand and the positive-sense dna strand is finally made based on the dna negative-strand [9], [10]. as laboratory markers, the hbsag, anti-hbs, hbeag, anti-hbe, hepatitis b core antigen hbcag, and anti-hbc igm/igg are the most critical serological markers for the diagnosis of hbv [11]. hbsag is the most significant marker for detecting hbv by elisa [5], [6], [9], [10], [11]. molecular methods for identifying hbv dna according to the who standards include the quantitative polymerase chain reaction (qpcr) and the real-time polymerase chain reaction (rt-pcr) [10], [11]. 
the current study aims to determine the percentage rates of hbsag seropositivity and hbv nucleic acid detection by rt-pcr among patients (who are enormously suffered from liver complaints) referred to gastroenterology center in sulaimani city. 2. materials and methods 2.1. study population the study’s population included people visiting the kurdistan center for hepatology and gastroenterology in sulaimani city from june to november 2019. all patients had liver problems, and they had experienced specific symptoms included fever, chills, abdominal pain, nausea, diarrhea, dark urine, loss of appetite, jaundice, and fatigue. the sample size of the current study was 300 patients included 160 males and 140 females. 2.2. sample collection fresh venous blood samples were collected and transferred to the laboratory using cool boxes for the desired lab. investigation. serum samples were kept until hbsag and hbv dna detection by elisa rt-pcr. 2.3. hbsag detection by elisa sandwich-elisa method was depended to detect hbsag using a special elisa kit (elab-science, china) with a precoated microtiter plate well with recombinant hbsab. all preserved sera samples were transferred to room temperature for about 30 min. 100 μl of each sample, standard, and blank were added to the desired wells. the plate was covered with sealer and incubated for 90 min at 37oc. the wells were aspirated and washed by an elisa washer as directed by the supplied company. 100 μl of biotinylated detection ab working solution were added to each well, and the plate was covered and incubated for 60 min at 37oc. the plate was washed. to each well, 100 μl of hrp conjugate working solution was added except for the blank control and incubated q for 30 min at 37oc. the plate was washed after that, 90 μl of substrate solution was added to each well and mixed, then incubated at 37 oc for 15 min. 50 μl of stop solution were added for the wells and mixed thoroughly. the optical density was measured at 450 nm using elisa microplate reader. 2.4. viral dna extraction and amplification high molecular weight genomic viral dna was extracted by a fully automated magnetic beat nucleic acid extraction system (zinexts, taiwan). the viral nucleic acid extraction kit (zinexts) was used to separate viral genomic viral dna using 400 μl of serum and 100 μl elute volume. the abdullah and hama: serological and molecular detection of hepatitis b virus among patients referred to kurdistan center for hepatology and gastroenterology in sulaimani city/kurdistan region of iraq uhd journal of science and technology | jul 2020 | vol 4 | issue 2 119 extracted viral dna was stored at -80oc until the day of examination by rt-pcr using an rt-pcr machine (line gene 9600 plus -bioer, china-). a special kit (fluorion hbv qnp 2-0 -inotek, turkiye) was used for the detection of hbv dna. 2.5. pcr reaction a total volume of pcr mix, internal control, ah2o, and extracted viral dna (25 μl) was prepared as directed by the supplied company as summarized below: items volume pcr mix 12.5 μl detection mix 1 1.4 μl detection mix 2 1.1 μl internal control 1 μl dh2o 4.0 μl extracted viral dna/standard/ negative/positive control 5 μl total volume 25.0 μl the process of pcr programming for detecting hbv nucleic acid was performed starting with the denaturation step, renaturation, annealing, elongation, and data collection as summarized in the below table. 
step temperature (°c) duration cycle initial denaturation 95 15 min 1 denaturation 95 30 s 50 annealing, elongation, and data collection 54 1:30 min infinite hold 22 ∞ 2.6. statistical analysis the results were analyzed using chi-square and mann–whitney u-tests through spss v. 25 software (spss inc., chicago, il, united states). statistically, the p-value was considered to be <0.05 (p < 0.05) significant. 3. results the patients who participated in this study who suffered from health complaints were distributed on males (160) and females (140). the mean age of them was 38.16 ± 15.24 years. all patients who were referred to the git center were clinically suspected of having hepatitis and liver complaints. several symptoms were depended and recorded among the tested patients who were submitted for serologic and molecular detection of the hepatitis b virus. pre-diagnosed liver caser patients, blood transfusion also was included in addition to their residency (table 1). table 1: distribution of the tested patients according to the different variables variable number (%) gender males 160 (53.33) females 140 (46.67) smoking smokers 180 (60.00) non-smokers 120 (40.00) age mean mean±sd 38.16±15.24 clinical symptoms fever 260 (86.67) chills 249 (83.00) abdominal pain 171 (57.00) nausea 133 (44.33) diarrhea 120 (10.00) loss of appetite 110 (36.67) jaundice 73 (24.33) dark urine 35 (11.67) fatigue 30 (10.00) liver cancer (hcc) yes 6 (2.00) no 294 (98.00) residency urban 130 (43.33) rural 170 (56.67) blood transfusion yes 13 (4.33) no 287 (96.67) serologic tests by elisa revealed that out of 300 tested patients, 92 (30.66%) were seropositive for hbsag. the vast majority of positive results were among males (50 patients 54.34%), whereas the percentage rates of seropositive cases were lower among females (42 patients 45.66%). when the results were analyzed statistically, it was noticed that gender has a valuable effect on the seropositive results. the majority of the seropositive cases were smokers (67.4%). there was a significant difference between smokers and non-smokers (p < 0.05), whereas it has appeared that the marital status has no significant effects on the hbsag seropositive results (p > 0.05). studying several risk factors for hbv transmission indicated that visiting dental clinics and dental surgery (927.17%), followed by dialysis (21.73%), were among the highest risk factors. statistical analysis showed that dental visiting and dialysis significantly affect the seropositive results (p < 0.05). although relatively a large number of the seropositive patients were within uncertain transmission routes, 39 patients out of 92 (42.4%). finally, it has appeared that seropositive results were higher (58.7%) outside sulaimani city center (rural) while the percentage was lower (41.3%) in the city center (urban). it was concluded that the patient’s residency has valuable effects on the obtained results (p < 0.05) (table 2). depending on hbv dna detection by rt-pcr, it was noticed that out of the 92 seropositive patients, 53 (57.6%) were positive for hbv dna (table 3). positive results abdullah and hama: serological and molecular detection of hepatitis b virus among patients referred to kurdistan center for hepatology and gastroenterology in sulaimani city/kurdistan region of iraq 120 uhd journal of science and technology | jul 2020 | vol 4 | issue 2 city center (rural) (58.5%) comparing to those from the city center (41.5%) (urban). 
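the chi-square comparisons described in section 2.6 can be set up along the following lines; the 2 × 2 table below is assembled from the gender counts in tables 1 and 2, but how the contingency tables behind the reported p-values were arranged in spss is not stated, so this layout is only an illustration.

```python
# illustrative chi-square test of hbsag seropositivity by gender, assembled from the
# counts reported in tables 1 and 2 (160 males / 140 females tested; 50 / 42 seropositive);
# the exact contingency tables used for the reported p-values are not given in the paper.
from scipy.stats import chi2_contingency

observed = [[50, 160 - 50],    # males:   hbsag positive, hbsag negative
            [42, 140 - 42]]    # females: hbsag positive, hbsag negative
chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, p = {p_value:.3f}, dof = {dof}")
```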
statistical analysis indicated that gender, smoking, marital status, and liver cancer have significant effects on hbv dna detection results (p < 0.05) (table 3). studying the modes of transmission indicated that dental clinics and dental visiting have a significant effect on the positive results of rt-pcr (p < 0.05) and can be considered as an important risk factor for hbv transmission, followed by dialysis, which is also can be considered as a risk factor after dental clinic visiting (table 3). the positive results were relatively higher among males for hbsag and dna detection (54% and 62.26%), respectively, when compared to female patients who showed lower positive results (46% and 37.74%), respectively. it was noticed that relatively a large number of patients were seropositive for hbsag while they were negative for hbv dna detection by rt-pcr (fig. 1). 4. discussion life-threatening infection by hbv stills a global cause of health concern. it was estimated worldwide that over 257 million people are under the risk of liver cirrhosis and hcc due to chronic hepatitis b virus (hbv) infection [12]. the high seropositivity rates of hbv infection among studied cases in the current study can be explained where all patients submitted to this research were with a chronic history of liver problems. all of them were previously referred by their specialist physicians to the gastroenterology center for checking and laboratory investigations. they were suspected of having hepatitis. several studies and investigators reported a lower prevalence of hbv infections than our observation. in a previous study, it was reported that the prevalence of hbsag seropositivity among a population of 345 tested cases was relatively lower [13]. in the current study, blood transfusion was considered as an important were higher among males (62.26%) when compared with females (37.74%). similarly, dna detection among smokers was relatively higher (64.15%) comparing to non-smokers (35.85%). the positive results among married patients were elevated (56.6%) if compared with single patients (43.4%). half of the patients with liver cancer were seropositive for hbsag, whereas two-third of them was hbv dna positive. moreover, patient’s residency showed significant effects on the positive results where the percentage rates of positive results were higher among patients outside sulaimani 54 46 62.26 37.74 0 10 20 30 40 50 60 70 80 90 100 males females elisa real-time pcr p er ce nt ag e (% ) fig. 1. comparison between positive results of hbv using elisa (seropositivity) and rt-pcr techniques. 
risk factor for hbv transmission, which was also common among thalassemic patients. these observations agree with observations reported by others [14]. in another study done in sulaimani, it was found that occupation, education level, history of jaundice, smoking, and alcohol drinking had a significant effect on viral hepatitis infections, especially hcv infection [15]. our observations were parallel with their conclusion considering some factors, including smoking. the results of the current study differ from the results reported by an iranian research group, who reported a relatively lower prevalence of hbv infection (7.4%) [16]. moreover, our conclusions agree with the results of a study in switzerland that reported a relatively high hbv prevalence (32.4%) [17]. ott et al. in 2012 found that the global hbeag prevalence varied between 20 and 50% [18]; the current observations are in this range and agree with their results. similar to our results, a high prevalence of hbv infections (30.4%) was reported in spain [19]. studies done in italy reported higher prevalence rates (52.7%) [20] than our observations. moreover, studies done in australia reported a lower hbv prevalence in comparison to the current results, although it was relatively high compared with other related studies in other countries [21]. the high prevalence rates of hbv seropositivity among males in our study may be because they engage in a range of parenteral exposures (sharp-object sharing), especially in barbershops, which expose them to hbv infection. our results agree with conclusions reported by other investigators in 2018 [9]. the risk of hbv infection among males also agrees with other epidemiological studies conducted in different areas among different groups and populations, which observed that hbv infection is strongly associated with increased age among males [22], [23], [24], [25].

table 2: different risk factors for hbsag seropositivity by elisa
variable | number (%) | p value
gender: males | 50 (54.34) | <0.05
gender: females | 42 (45.66) |
smoking: smoker | 62 (67.4) | <0.05
smoking: non-smoker | 30 (32.6) |
marital status: married | 48 (52.17) | >0.05
marital status: single | 44 (47.83) |
liver cancer: yes | 3 (50) | >0.05
liver cancer: no | 3 (50) |
mode of transmission: dental visiting | 25 (27.17) | <0.05
mode of transmission: surgery | 1 (1.08) |
mode of transmission: dialysis | 20 (21.73) |
mode of transmission: blood transfusion | 2 (2.16) |
mode of transmission: sexual route | 2 (2.16) |
mode of transmission: animal modes | 1 (1.08) |
mode of transmission: barbershop | 1 (1.08) |
mode of transmission: familial history | 1 (1.08) |
mode of transmission: uncertain | 39 (42.4) |
residency: urban | 38 (41.3) | <0.05
residency: rural | 54 (58.7) |

table 3: risk factors for hbv dna detection by rt-pcr
variable | explanation | number (%) | p value
gender | male | 33 (62.26) | <0.05
gender | female | 20 (37.74) |
smoking | smokers | 34 (64.15) | <0.05
smoking | non-smokers | 19 (35.85) |
marital state | married | 30 (56.6) | <0.05
marital state | single | 23 (43.4) |
liver cancer | no | 1 (33.33) | <0.05
liver cancer | yes | 2 (66.67) |
modes of transmission | dental visiting | 16 (30.18) | <0.05
modes of transmission | dialysis | 7 (13.2) |
modes of transmission | surgery | 2 (3.77) |
modes of transmission | blood transfusion | 2 (3.77) |
modes of transmission | sexual | 1 (1.88) |
modes of transmission | animals | 1 (1.88) |
modes of transmission | barbershop | 1 (1.88) |
modes of transmission | familial | 1 (1.88) |
modes of transmission | unknown | 22 (41.5) |
residency | in city center | 22 (41.5) | <0.05
residency | outside the city center | 31 (58.5) |
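the p-values in tables 2 and 3 above come from chi-square tests of association run in spss v. 25 on the patient-level data. as a minimal, illustrative sketch (not the authors' analysis), the code below applies scipy's chi-square test to the reported gender counts; the published p-values come from the full patient-level data, so this only demonstrates the procedure, not the exact values.

```python
# minimal sketch of the chi-square association test used for tables 1-3,
# applied to the reported gender counts; illustrative only, not the SPSS output.
from scipy.stats import chi2_contingency

# rows: males, females; columns: HBsAg-positive, HBsAg-negative
# counts taken from table 1 (160 males / 140 females) and table 2 (50 / 42 positive)
table = [[50, 160 - 50],
         [42, 140 - 42]]

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.3f}, dof = {dof}, p = {p:.3f}")
```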
results reported in the current study sug gested that occupational transmission of hbv infections in dental settings, which sometimes found frequently, and high prevalence of hbv seropositivity among dental clinics and patients who referred to these clinics may be due to inadequate sterilization of the surgical tools used in dental clinics, and sharing some tools between the patients [26]. similarly, inadequate sterilization and cleaning of materials and machines used in hemodialysis might be directly contributed to the prevalence of hbv infection among patients with hemodialysis. the current study showed that not all hbsag seropositive patients were positive for hbv dna detection by rt-pcr, which was agreed with observations reported by a study done in nairobi (kenya) in 2017 by mathai et al. [27]. they reported that the seropositive results of hbsag and hbv dna detection are totally different, and none of elisa and rt-pcr cannot be alternative for each other [28]. unlike conclusions of the current study, other researchers found that hbsag detection and quantification by elisa is more sensitive and gives more accurate results in detection of hbv infection, and they found that elisa can be considered as an acceptable or adequate method in the diagnosis of hbv infection and hbsag detection [29]. 5. conclusion the percentage rates of hbsag were relatively high (30.66%) among patients referred to the gastroenterology center in sulaimani who have suffered from liver complaints. more than half of the seropositive patients showed rtpcr positive results for hbv nucleic acid detection. the gender, smoking, and marital status, and living places were significantly increased the incidence rate of hbv infection. visiting dental clinics, dental surgery, and dialysis were among significant risk factors that facilitate the transmission of hbv infection. 6. acknowledgments our thanks, due to all patients involved in this study, we want to thank specialist physicians in the sulaimani gastroenterology center. references [1] t. j. liang. “hepatitis b: the virus and disease”. hepatology, vol. 49, no. 5 suppl, pp. s13-s21, 2009. [2] world health organization. “hepatitis b: fact sheet”, world health organization, geneva, 2020. [3] s. lanini, v. puro, f. n. lauria, f. m. fusco, c. nisii and g. ippolito. “patient to patient transmission of hepatitis b virus: a systematic review of reports on outbreaks between 1992 and 2007”. bmc medicine, vol. 7, p. 15, 2009. [4] g. fattovich. “natural history of hepatitis b”. the journal of hepatology, vol. 39, no. suppl 1, pp. s50-s58, 2003. [5] g. raimondo, t. pollicino and g. squadrito. “clinical virology of hepatitis b virus infection”. the journal of hepatology, vol. 39, no. suppl 1, pp. s26-s30, 2003. [6] m. g. peters. “hepatitis b virus infection: what is current and new”. topics in antiviral medicine, vol. 26, no. 4, pp. 112-116, 2019. [7] j. e. song and d. y. kim. “diagnosis of hepatitis b”. annals of translational medicine, vol. 4, no. 18, p. 338, 2016. [8] s. k. k. mani and o. andrisani. “hepatitis b virus-associated abdullah and hama: serological and molecular detection of hepatitis b virus among patients referred to kurdistan center for hepatology and gastroenterology in sulaimani city/kurdistan region of iraq 122 uhd journal of science and technology | jul 2020 | vol 4 | issue 2 hepatocellular carcinoma and hepatic cancer stem cells”. genes (basel), vol. 9, no. 3, p. 137, 2018. [9] x. liu, a. baecker, m. wu, j. y. zhou, j. yang, r. q. han, p. h. wang, z. y. jin, a. 
m. liu, x. gu, x. f. zhang, x. s. wang, m. su, x. hu, z. sun, g. li, l. mu, n. he, l. li, j. k. zhao and z. f. zhang. “interaction between tobacco smoking and hepatitis b virus infection on the risk of liver cancer in a chinese population”. international journal of cancer, vol. 142, no. 8, pp. 1560-1567, 2018. [10] c. seeger and w. s. mason. “molecular biology of hepatitis b virus infection”. virology, vol. 479-480, pp. 672-686, 2015. [11] j. e. song and d. y. kim. “diagnosis of hepatitis b”. the annals of translational medicines, vol. 4, no. 18, p. 338, 2016. [12] world health organization. “hepatitis b”. world health organization, geneva, 2017. [13] m. asad, f. ahmed, h. zafar, s. farman. “frequency and determinants of hepatitis b and c virus in general population of farash town, islamabad”. pakistan journal of medical sciences, vol. 31, no. 6, pp. 1394-1398, 2015. [14] s. a. hama and m. i. sawa. “prevalence of hepatitis b, c, and d among thalassemia patients in sulaimani governorate”. kurdistan journal of applied research, vol. 2, no. 2. pp. 137-142, 2017. [15] s. a. hama. “hepatitis c seropositivity and rna detection among type2 diabetic patients in sulaimani governorate-iraqi kurdistan region”. journal of zankoy sulaimani a, vol. 18, no. 4, pp. 1-8, 2016. [16] z. nokhodian, m. r. yazdani, m. yaran, p. shoaei, m. mirian, b. ataei, a. babak, m. ataie. “prevalence and risk factors of hiv, syphilis, hepatitis b and c among female prisoners in isfahan, iran”. hepatitis monthly, vol. 12, pp. 442-447, 2012. [17] l. gétaz, a. casillas, c. a. siegrist, f. chappuis, g. togni, n. t. tran, s. baggio, f. negro, j. m. gaspoz and h. wolff. “hepatitis b prevalence, risk factors, infection awareness and disease knowledge among inmates: a cross-sectionalstudy in switzerland’s largest pre-trial prison”. journal of global health, vol. 8, no. 2, pp. 020407, 2018. [18] j. j. ott, g. a. stevens, s. t. wiersma. “the risk of perinatal hepatitis b virus transmission: hepatitis b e antigen (hbeag) prevalence estimates for all world regions. bmc infectious diseases, vol. 12, p. 131, 2012. [19] p. s. hoya, a. marco, j. garcía-guerrero, a. rivera. “hepatitis c and b prevalence in spanish prisons”. european journal of clinical microbiology and infectious diseases, vol. 30, pp. pp. 857-862, 2011. [20] s. babudieri, b. longo, l. sarmati, et al. “correlates of hiv, hbv, and hcv infections in a prison inmate population: results from a multicentre study in italy”. journal of medical virology, vol. 76, pp. 311-317, 2005. [21] j. m. reekie, m. h. levy, a. h. richards, et al. “trends in prevalence of hiv infection, hepatitis b and hepatitis c among australian prisoners 2004, 2007, 2010”. mja, vol. 200, pp. 277280, 2014. [22] a. c. stief, r. m. martins, s. m. andrade, m. a. pompilio, s. m. fernandes, p. g. murat, g. j. mousquer, s. a. teles, g. r. camolez, r. b. l. francisco and a. r. c. motta-castro. “seroprevalence of hepatitis b virus infection and associated factors among prison inmates in state of mato grosso do sul. brazil”. revista da sociedade brasileira de medicina tropical, vol. 43, no. 5, pp. 512-515. [23] f. j. souto. “distribution of hepatitis b infection in brazil: the epidemiological situation at the beginning of the 21st century”. revista da sociedade brasileira de medicina tropical, vol. 49, pp. 11-23, 2016. [24] j. paoli, a. c. wortmann, m. g. klein, v. r. z. pereira, a. m. cirolini, b. a. de godoy, n. j. r. fagundes, j. m. wolf, v. r. lunge and d. simon. 
“hbv epidemiology and genetic diversity in an area of high prevalence of hepatitis b in southern brazil”. the brazilian journal of infectious diseases, vol. 22, no. 4, pp. 294-304, 2018. [25] p. f. belaunzaran-zamudio, j. l. mosqueda-gomez, a. maciashernandez, s. rodríguez-ramírez, j. sierra-madero and c. beyrer. burden of hiv, syphilis, and hepatitis b and c among inmates in a prison state system in mexico. aids research and human retroviruses, vol. 33, no. 6, pp. 524-533, 2017. [26] m. a. a. al kasem, m. al-kebsi abbas, m. m. ebtihal and a. a. s. hassan. “hepatitis b virus among dental clinic workers and the risk factors contributing for its infection”. online journal of dentistry and oral health, vol. 1, no. 2, pp. 1-6, 2018. [27] f. mathai, m. o. ngayo, s. m. karanja, a. kalebi and r. lihana. “correlation of quantitative assay of hbsag and hepatitis b virus dna levels among chronic hbv patients attending pathologist lancet laboratory in nairobi, kenya”. archives of clinical infectious diseases, vol. 12, no. 4, pp. e13306, 2017. [28] m. gencay, a. seffner, s. pabinger, j. gautier, p. gohl, m. weizenegger, d. neofytos, r. batrla, a. woeste, h. s. kim, g. westergaard, c. reinsch, e. brill, p. t. t. thuy, b. h. hoang, m. sonderup, c. w. spearman, g. brancaccio, m. fasano, g. b. gaeta, t. santantonio and w. e. kaminski. “detection of in vivo hepatitis b virus surface antigen mutations a comparison of four routine screening assays”. journal of viral hepatitis, vol. 25, no. 10, pp. 1132-1138, 2018. [29] h. j. lee, s. y. kim, s. m. lee, j. heo, h. h. kim and c. l. chang. “elecsys hepatitis b surface antigen quantitative assay: performance evaluation and correlation with hepatitis b virus dna during 96 weeks of follow-up in chronic hepatitis b patients”. annals of laboratory medicine, vol. 32, no. 6, pp. 420-425, 2012. tx_1~abs:at/tx_2:abs~at uhd journal of science and technology | jul 2020 | vol 4 | issue 2 91 1. introduction these days, we can see a quick increment in people using wearable devices for more different purposes and reasons in their lives. according to the previous studies, the employment of connected wearable detector devices is expected to extend from 325 million in 2016 to 1105 million in 2022 [1]. in this way, where an unmeasurable number of wearable devices is connected together will produce a huge amount of information consistently, at that time, we face or interaction with a critical test of storing, handling, and processing that information’s to produce usable information and to make a keen world. distinguishing regularities and abnormalities or inconsistencies in gushing information from that measure of usable information, we previously produced subsequent to putting away and preparing it can give us an encounter and can possibly give experiences, and is useful in human services, money, security, web-based social networking, and numerous applications [2], [3]. over the most recent few years, totally different types of methodologies and techniques have been proposed for identifying regularities and irregularities pattern in streaming data for fall detection, which may be a wearable tool based, ambiance sensor-based, and vision-based [4]. above all, wearable devices typically take some different benefit of embedded sensors to observe the movement and placement of the body, such as measuring system, accelerometer, magnetometer, and gyroscope [5], [6]. 
as well as the value of wearable tool based any methodologies or techniques are very low, also because the installation and operation are not complex, and the task is not difficult for the elderly [7], [8]. if from now the physician could not specifically monitoring and identify falls, then there is no chance to save or protect you in future to preventing an accident. for such reasons, fall identification and anticipation have turned into a significant issue then must be find and propose a good way to solve fall detection using neural network based on internet of things streaming data zana azeez kakarash1,2, sarkhel h. taher karim3,4, mokhtar mohammadi5 1department of engineering, faculty of engineering and computer science, qaiwan international university, sulaymaniyah, iraq, 2department of computer engineering and information technology, faculty of engineering, razi university, kermanshah, iran, 3department of computer science, college of science, university of halabja, halabja, iraq, 4department of computer network, technical college of informatics, sulaimani polytechnic university, sulaymaniyah, iraq, 5department of information technology, lebanese french university, erbil, kurdistan region, iraq a b s t r a c t fall event has become a critical health problem among elderly people. we propose a fall detection system that analyzes real-time streaming data from the internet of things (iot) to detect irregular patterns related to fall. we train a deep neural network model using accelerometer data from an online physical activity monitoring dataset named, mobiact. an ibm cloud-based iot data processing framework is used to manage streaming data. about 96.71% of accuracy is achieved in assessing the performance of the proposed model. index terms: fall detection, internet of things, artificial neural networks, machine learning corresponding author’s e-mail: zana azeez kakarash, department of software engineering faculty of engineering and computer science, qaiwan international university sulaimani, iraq. e-mail: zana.azeez@kti.edu.krd received: 10-03-2020 accepted: 28-09-2020 published: 25-10-2020 access this article online doi: 10.21928/uhdjst.v4n2y2020.pp91-98 e-issn: 2521-4217 p-issn: 2521-4209 copyright © 2020 kakarash, et al. this is an open access article distributed under the creative commons attribution non-commercial no derivatives license 4.0 (cc by-nc-nd 4.0) o r i g i n a l re se a rc h a rt i c l e uhd journal of science and technology kakarash, et al.: fall detection using ann 92 uhd journal of science and technology | jul 2020 | vol 4 | issue 2 in the method of detecting regularities and irregularities in streaming data for fall detection [9], [10]. fall in the human daily life is one of the main health risks and dangerous, mostly for the older community in our today’s society, because of the rose growing in mortality, morbidity, incapacity, disability, and frailty [11]. nearly for fall initiate injuries as collected and represented, over 80% of all damage related clinic confirmations among peoples for more than 65 years [12], [13]. according to the reasons as we mentioned before, falls affect a huge number of the elderly all through the world. for instance, falls some of the aged cost the national health service more than £4.6 million every day as indicated by a report by the centre for social justice uk [14]. the notable studies and researches in this filed are detecting fall or anomaly in real streaming data [15]-[20] and outliers [21], [22]. 
we will probably investigate the circumstance of unpredictable human development identification, for example, fall, by utilizing continuous sensor information. at that point, we map that issue as sporadic example location issue, thinking about a fall as a capricious action with respect to standard human action and attempt to perceive the exceptional fall situations from ordinary development or human activities, for example, walking, sitting, lying (lyi), sleeping, standing, and every other movement such as playing, cooking, and running. in general, artificial neural networks (ann) have systematically achieved higher results in the detection falls from physical activity observation knowledge. ozdemir and barshan have used a pair of 2520 trials to make a huge amount of dataset [23]. their fall detection system achieved 95% accuracy by employing a multi-layer perceptron (mlp) for binary classification between activities of daily living (adl) and fall. kerdegari et al. recorded 1000 movement acceleration data collection using a waist-worn measuring system and obtained 91.6% accuracy for binary classification for adl against fall using mlp [24]. nukala et al. collected knowledge from 322 tests achieved 98.7% accuracy with mlp victimization scaled conjugate graduate learning [25]. theodoridis et al. [26] developed two long short-term memory (lstm) models, one with easy measuring system data collection associated another with accelerometer data revolved at an angle, using a published dataset referred to as ur fall detection. the lstm model with rotation obtained the most effective results with 98.57% accuracy. the work by class musci et al. used one public dataset (sisfall) dataset implementing recurrent neural network with underlying lstm blocks for developing online fall detection system [27]. they achieved 97.16% accuracy victimization associate best window breadth of w = 256. ajerla et al. [42] slightly changed the preprocessing done by vavoulas et al. on other public dataset (mobiact) dataset and achieved quite 90% accuracies on most of their experiments victimization mlp and lstm [28], [29]. they also obtained 99% accuracy victimization lstm in two of their experiments. table 1 shows the compares performance of the ann techniques. nowadays, there are a lot of wearable sensors devices exist which may observe falls automatically and send a notification to the caregivers, machine services, or ambulance offerings. however, most of them are expensive or need a subscription of month-to-month service. a large number of studies and research has been done on fall detection the usage of sensors such as accelerometers and gyroscopes, because of low cost and incorporated into a large number of cell phones accessible todays in the world such as smartphones and tablets. rather than fall identification, there is a dire need for prediction and prevention systems [30], [31]. in this paper, we propose a system for detecting abnormal pattern of falling behavior in real-time streaming data in internet of things (iot). the data were obtained from wearable sensor systems used for human health and activity tracking and control. in our research, we focus on how to train the proposed system to recognize and distinguish irregular activity pattern related to specific kinds of fall according to three different annotated published datasets: mobiact, using vavoulas et al. in 2016 [32], sisfall, using sucerquia et al. in 2017 [33], and umafall by way of eduardo et al. 
in 2018 [34] for implementing our framework from streaming data for fall detection, we used an ann model for fall recognition giving (96.71%) accuracy. at that point, we integrate our system for that huge amount of data recordings from streaming sensor (online) into free ibm streams to fabricate an ibm cloudbased iot information preparing structure. table 1: comparison of the ann techniques research paper technique accuracy ozdemir et al., 2014 [35] mlp 95% kerdegari et al., 2013 [36] mlp 91.6% nukala et al., 2014 [37] mlp 98.7% theodoridis et al., 2018 [38] lstm 98.57% abbate et al., 2012 [39] feed-forward ann 100% musci et al., 2018 [40] rnn with lstm 97.16% ajerla et al., 2018 [41] mlp and lstm 99% (lstm) mlp: multi-layer perceptron, rnn: recurrent neural network, lstm: long short-term memory, ann: artificial neural networks kakarash, et al.: fall detection using ann uhd journal of science and technology | jul 2020 | vol 4 | issue 2 93 2. methodology we proposed a profound learning model to recognize fall and in this manner, another structure for utilizing the model with steaming information from wearable sensor frameworks is proposed. simulated fall detection is used due to the scarcity of actual fall data, as fall is a dangerous event. we consider a framework that includes ann as one of the evaluating procedure. ann has ceaselessly achieved better outcomes in fall recognition from physical checking spilling information. the architecture comprises three essential layers (i) gathering stream for information ingestion, (ii) streaming extract, transform, and load engine for real-time query processing, and (iii) online analytical processing backend for taking care of long-running inquiries. the means of the proposed framework are as per the following data collection 2.1. dataset all open datasets or we can say public data sets for wearable fall detection frameworks are distinguished that the selection criteria for any datasets should give priority to the experimental subjects, the quantity of tests and sorts of adl and falls included in the study [42]. based on that, we have chosen three different datasets, mobiact, sisfall, and umafall, to evaluate our framework dependent on the number of subjects and number of exercises secured by each dataset. it is possible to collect data from one or several sources. for instance, in an iot workload, data are ingested simultaneously from thousands of separate data sources. each source submits fresh tuples to a stream (likely to send them through a socket), which a data collector mechanism then receives. this data collector mainly act as a queue for messaging [43]. as we explained in table 2, mobiact dataset contains labeled data for four differing kinds of falls and nine completely different adls were collected from 70 subjects (male/female) and quite a pair of 2500 trials buy using so many different kind of a smartphone. the activities are depicted employing a time stamp, raw measuring system values, raw rotating mechanism values, and orientation information. table 3 shows the activities covered within the mobiact dataset. in the mobiact dataset is that it does not embrace fall data from any old people [44]. sisfall remedies this situation by mixing with information from 15 old people aged between 60 and 75 years. still, in these two datasets real fall data it does not available or do not included. 
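the data collector described above behaves as a message queue fed by many independent sources. a purely hypothetical wearable-side sketch of this ingestion step is shown below using a generic mqtt client; the broker address, topic, and payload fields are assumptions and do not reflect the ibm watson iot platform's actual topic scheme.

```python
# hypothetical sketch of a wearable publishing accelerometer tuples to the
# data collector over MQTT; broker, topic, and payload layout are assumptions.
import json
import time

import paho.mqtt.client as mqtt

client = mqtt.Client()
client.connect("broker.example.org", 1883)   # placeholder broker address
client.loop_start()

def publish_sample(ax: float, ay: float, az: float) -> None:
    # one accelerometer tuple per message, timestamped on the device
    payload = {"t": time.time(), "ax": ax, "ay": ay, "az": az}
    client.publish("wearables/device01/accel", json.dumps(payload))

publish_sample(0.01, -0.02, 9.79)   # example reading while standing
```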
sisfall contains annotated data for 15 different kinds of falls and 19 different types of adls, collected from 38 subjects over more than 4500 trials using a custom measurement device containing two different models of 3d accelerometer and a gyroscope positioned on a belt buckle. furthermore, the umafall dataset includes 12 different types of falls and 15 types of adls, collected from 2600 trials using several kinds of smartphones and wireless accelerometers. as with the first dataset mentioned above, these activities are also described using a time stamp, raw accelerometer values, raw gyroscope values, and orientation information.

2.2. data preprocessing
2.2.1. segmentation
using the same procedure given by aziz et al. [45], the raw data are segmented into 200 blocks and then the feature sets for each block are generated.

2.2.2. feature extraction
list of features a: a total of 54 features was created in feature set a [46]. for every axis (x, y, and z) of the acceleration, 21 features were determined from the mean, median, standard deviation (std), skew, kurtosis, minimum, and maximum. using the absolute values of every axis (|x|, |y|, |z|) of the acceleration, another 21 features were determined from the mean, median, std, skew, kurtosis, minimum, and maximum. slope, another feature, was determined using eq. 1, once for the given axes values (x, y, and z) and once for the absolute values (|x|, |y|, |z|).

table 2: activities covered in the datasets
code | activity | description
falls:
fol | forward-lying | fall forward from standing, use of hands to dampen fall
fkl | front-knees-lying | fall forward from standing, first impact on knees
sdl | sideward-lying | fall sideward from standing, bending legs
bsc | back-sitting-chair | fall backward while trying to sit on a chair
adls:
std | standing | standing with subtle movements
wal | walking | normal walking
jog | jogging | jogging
jum | jumping | continuous jumping
stu | stairs up | 10 stairs up
stn | stairs down | 10 stairs down
sch | sit chair | sitting on a chair
sci | car step in | step in a car
sco | car step out | step out of a car

$$\mathrm{slope} = \sqrt{(\max_x - \min_x)^2 + (\max_y - \min_y)^2 + (\max_z - \min_z)^2} \tag{1}$$

four further features were determined using the mean, std, skew, and kurtosis of the tilt angle (tai) between the gravitational vector and the y-axis, computed using eq. 2.

$$\mathrm{tai}_i = \sin^{-1}\!\left(\frac{y_i}{\sqrt{x_i^2 + y_i^2 + z_i^2}}\right) \tag{2}$$

where i denotes the sequence of the sample. using the magnitude of the acceleration vector, six features were derived from the mean, std, minimum, maximum, difference between maximum and minimum, and zero-crossing rate. the magnitude was determined using eq. 3.

$$\mathrm{magnitude}_i = \sqrt{x_i^2 + y_i^2 + z_i^2} \tag{3}$$

where i denotes the sequence of samples. for each of the three axes (x, y, and z), the average absolute difference was calculated [47]. furthermore, the average resultant acceleration over all three axes was generated using eq. 4.

$$\mathrm{average\ resultant\ acceleration} = \frac{1}{n}\sum_{i} \sqrt{x_i^2 + y_i^2 + z_i^2} \tag{4}$$

where i denotes the sequence of samples.

combined feature set: the feature sets a, b, and c are merged to generate the dataset with the combined features. we had 7670 samples in the dataset. after feature extraction, each sample had one of six classification values and 58 extracted features. the class values include four kinds of falls and two kinds of non-falls as defined in the mobiact dataset.
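a minimal numpy sketch of the block-level features defined in eqs. (1)-(4) is given below; it assumes x, y, and z hold the raw accelerometer samples of one block and computes only a small subset of the 54 features in set a.

```python
# minimal sketch of the block features in Eqs. (1)-(4); x, y, z are one block
# of raw accelerometer samples. Only a subset of feature set A is computed.
import numpy as np

def block_features(x: np.ndarray, y: np.ndarray, z: np.ndarray) -> dict:
    mag = np.sqrt(x**2 + y**2 + z**2)                      # Eq. (3), per-sample magnitude
    slope = np.sqrt((x.max() - x.min())**2
                    + (y.max() - y.min())**2
                    + (z.max() - z.min())**2)              # Eq. (1)
    tai = np.arcsin(np.clip(y / mag, -1.0, 1.0))           # Eq. (2), tilt angle per sample
    return {
        "slope": slope,
        "tai_mean": tai.mean(), "tai_std": tai.std(),
        "mag_mean": mag.mean(), "mag_std": mag.std(),
        "mag_range": mag.max() - mag.min(),
        "avg_resultant_acc": mag.mean(),                   # Eq. (4), mean of the magnitude
    }

# example: one simulated block of samples
rng = np.random.default_rng(0)
x, y, z = rng.normal(0, 1, 200), rng.normal(9.8, 1, 200), rng.normal(0, 1, 200)
print(block_features(x, y, z))
```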
the two non-fall classes denote the standing (std) and lying (lyi) positions.

2.2.3. normalization
the extracted features were normalized using matlab r2017b with the min-max scaling formula given in eq. 5.

$$x' = \frac{x - \min(\mathrm{feature})}{\max(\mathrm{feature}) - \min(\mathrm{feature})} \tag{5}$$

2.2.4. data balancing
the dataset obtained after normalization was very unbalanced, containing 5830 non-fall and 1840 fall samples. we balanced the dataset using matlab r2017b as follows: the various fall data categories were merged into a single fall classification, while the two non-fall categories were merged into a single non-fall classification. then, the fall data were oversampled to create 2000 samples and the non-fall data were under-sampled to create 2000 samples, giving a combined balanced dataset of 4000 samples.

2.2.5. feature selection
selection of discriminable features affects the performance of the classification in terms of accuracy and complexity. among the different feature selection methods, the relief-f method is used here; the relief-f implementation in python is used with the number of neighbors set to two and the number of features to be kept set to four. compared to the relief method, relief-f is more robust and can deal with incomplete and noisy data [48].

2.3. predictive modeling
we created a simple predictive deep neural network (dnn) model consisting of five layers, including three hidden layers, to detect falls based on the streaming sensor data. the structure of the dnn is displayed in fig. 1. the model was constructed and designed using offline datasets and then implemented in our streaming data processing system. we built a streaming data processing system with ibm tools, and then tested and validated our prediction model with static fall data.

2.4. streaming data processing framework
after finalizing, testing, and approving our proposed model using static fall detection data, we built a streaming data processing framework with ibm tools. the framework structure is shown in fig. 2.

table 3: datasets used for training
dataset | no. of subjects (male/female) | age range | no. of activities (adls/falls) | no. of samples (adls/falls)
mobiact [32] | 70 (45/25) | 20–47 | 9/4 | 2526 (1879/647)
sisfall [33] | 60 (30/20) | 19–30, 60–75 | 19/15 | 4505 (2707/1798)
umafall [34] | 55 (35/15) | 30–47 | 15/12 | 2650 (1700/950)
adl: activities of daily living

the python code used in this step, and the intermediate data records, have been posted on github. we planned to use ibm streaming analytics [52], a deployment of ibm streams on ibm cloud, to continually monitor the sensor data from the mobile phone or wearable sensors and send a warning to the monitoring application, alerting healthcare providers about the emergency care required by a patient in the event of a fall.
• data store: we plan to use ibm cloud [53] as the nosql database to store the sensor data from the watson iot platform and the classification of that streaming data obtained from watson studio.
• data sink: the framework will have different output channels:
  • knowledge base
  • visualization
  • monitor
  • ibm cloud
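a minimal keras sketch of the five-layer dnn (input, three hidden layers, and a sigmoid output) described in section 2.3 is shown below. the hidden-layer widths are assumptions, since the paper does not list them; the input size of four matches the relief-f step that keeps four features.

```python
# minimal sketch of the five-layer fall/non-fall classifier described in
# Section 2.3; hidden-layer widths are assumptions, not the authors' values.
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(4,)),                # four Relief-F-selected features
    layers.Dense(64, activation="relu"),    # hidden layer 1 (width assumed)
    layers.Dense(32, activation="relu"),    # hidden layer 2 (width assumed)
    layers.Dense(16, activation="relu"),    # hidden layer 3 (width assumed)
    layers.Dense(1, activation="sigmoid"),  # fall probability
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(x_train, y_train, epochs=200, validation_split=0.2)  # 200 epochs, as in Section 3
```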
implementation: the tests were run on a framework with the accompanying least equipment and programming prerequisites: • platform: keras with tensor flow backend • cpu: intel core i5-2467m, 1.6 ghz • ram: 10 gb • hard disk: 128 gb • os: windows 10 64 bit . to actualize the framework, we utilized a free ibm cloud account and the relevant services. we included the following accompanying assets. • jupyter notebook 6.0.0 • cloud object storage • streaming analytics • iot platform. 3. results and discussion the machine learning model for the complete feature set was first tried on mobiact as explained in section 2.1. this resulted in our model in an accuracy of (96.71%) for binary classification, fall and adl (non-fall) using eq 6. the model was lept running for 200 epochs. acc = (true positive+true negative)/ number of all samples (6) the normalized and un-normalized confusion matrices were implemented for classifying fall and non-fall (adl) cases out of 7670 samples are appeared in fig. 4. this shows the fig. 1. deep learning model utilized for the proposed framework. 2.5. architecture and components • data source: sensor information is recovered from the cell phone or any wearable devices must be fixed in the patient body. • data ingestion system: to receiving the sensor data from a smartphone or a wearable sensor and go about as the message queuing telemetry transport [49] message broker, we used ibm watson iot platform. and also apache kafka [50] is used here as an open source alternative, which uses depends on its own protocol. • data stream processing: to start with we utilized jupyter notebook with ibm cloud [51] as appeared in fig. 3, for offline information preparing to run the element extraction and to execute the ai model utilizing the sample data. the python code utilized in jupyter fig. 2. the framework of the data stream processing system to detect irregular patterns for internet of things streaming data. kakarash, et al.: fall detection using ann 96 uhd journal of science and technology | jul 2020 | vol 4 | issue 2 created model performs incredible and can be utilized to develop the proposed framework. 4. conclusion in this paper, we addressed the problem of irregularity pattern detection from online streaming data. here, we especially focused on detecting fall as an irregular human activity which is common amongst elderly people. we implemented a dnn model that can classify fall from nonfall activities based on the dataset mobiact and two more datasets with an accuracy of 96.71%. for future, we plan to compare the proposed system with other systems that use open source streaming data analytics tools to evaluate the functionality and performance of the ibm tools used in our framework. references [1] h. tankovska. “statistic”, 2020. available from: https://www.statista. com/statistics/487291/global-connected-wearable-devices. [2] mahfuz. “detecting detecting irregular patterns in iot streaming data for fall detection irregular patterns in iot streaming data for fall detection”. 2018 ieee 9th annual information technology, electronics and mobile communication conference (iemcon), 2018. [3] e. bahmani, m. jamshidi and a. shaltooki. “breast cancer prediction using a hybrid data mining model”. international journal on informatics visualization, vol. 3, no. 4, pp. 327-331, 2019. [4] j. liu. “development and evaluation of a prior-to-impact fall event detection algorithm”. ieee transactions on biomedical engineering, vol. 61, no. 7, pp. 2135-2140, 2014. [5] p. pierleoni. 
“a high reliability wearable device for elderly fall detection”. ieee sensors journal, vol. 15, no. 8, pp. 4544-4553, 2015. [6] r. freitas. “wearable sensor networks supported by mobile devices for fall detection”. sensors, ieee, 2014. [7] x. zhuang. “acousticfalldetection using gaussian mixture models and gmm super vectors”. 2009 ieee international conference on acoustics, speech and signal processing, 2009. [8] y. li. “efficient source separation algorithms for acoustic fall detection using a microsoft kinect”. ieee transactions on biomedical engineering, vol. 61, no. 3, pp. 745-755, 2014. [9] r. hamedanizad, e. bahmani, m. jamshidi and a. m. darwesh. “employing data mining techniques for predicting opioid withdrawal fig. 3. machine learning model executed using jupyter notebook. fig. 4. normalized confusion matrix for the trained model. kakarash, et al.: fall detection using ann uhd journal of science and technology | jul 2020 | vol 4 | issue 2 97 in applicants of health centers”. journal of science and technology, vol. 3, no. 2, pp. 33-40, 2019. [10] a. shaltooki and m. jamshidi. “the use of data mining techniques in predicting the noise emitted by the trailing edge of aerodynamic objects”. international journal on informatics visualization, vol. 3z, no. 4, pp. 388-393, 2019. [11] a. s. cook. “falls in the medicare population: incidence, associated factors, and impact on health care”. physical therapy, vol. 2009, no. 2, pp. 324-332, 2009. [12] p. kannus. “fall-induced injuries and deaths among older adults”. jama, vol. 281, no. 20, pp. 1895-1899, 1999. [13] y. cheng. “a fall detection system based on sensortag and windows 10 iot core”. 15th international conference on mechanical science and engineering, pp. 238-244, 2015. [14] o. ojetola, e. i. gaura and j. brusey. “fall detection with wearable sensors safe (smart fall detection)”. intelligent environments, 2011 7th international conference on, pp. 318-321, 2011. [15] s. m. s. forbes. “fall prediction using behavioural modelling from sensor data insmarthomes”. artificial intelligence review, vol. 53, pp, 1071-1091, 2019. [16] s. k. gharghan. “accurate fall detection and localization for elderly people based on neural network and energy-efficient wireless sensor network”. energies, vol. 11, pp. 1-32, 2018. [17] d. yacchirema. “fall detection system for elderly people using iot and big data”. procedia computer science, vol. 130, pp. 603-610, 2018. [18] c. c. h. hsu. “fallcare+: an iot surveillance system for fall detection”. proceedings of the 2017 ieee international conference on applied system innovation ieee-icasi 2017 meen, pp. 921922, 2017. [19] ahmad s. “real-time anomaly detection for streaming analytics”. arxiv, 2016. [20] s. c. tan. “fast anomaly detection for streaming data”. proceedings of the 22nd international joint conference on artificial intelligence, pp. 1511-1516, 2011. [21] m. gupta. “outlier detection for temporal data: a survey”. ieee transactions on knowledge and data engineering, vol. 26, no. 9, pp. 2250-2267, 2014. [22] s. sadik and l. gruenwald. “research issues in outlier detection for data streams”. vol. 15. sigkdd explorations, pp. 33-40, 2013. [23] t. özdemir and b. barshan. “detecting falls with wearable sensors using machine learning techniques”. sensors, vol. 14, no. 6, pp. 10691-10708, 2014. [24] h. kerdegari, k. samsudin, a. r. ramli and s. mokaram. “development of wearable human fall detection system using multilayer perceptron neural network”. 
international journal of computational intelligence systems, vol. 6, no. 1, pp. 127-136, 2013. [25] t. nukala, n. shibuya, a. rodriguez and j. tsay. “an efficient and robust fall detection system using wireless gait analysis sensor with artificial neural network (ann) and support vector machine (svm) algorithms”. open journal of applied biosensor, vol. 3, pp. 29-39, 2014. [26] t. theodoridis, v. solachidis, n. vretos and p. daras. “human fall detection from acceleration measurements using a recurrent neural network”. in: precision medicine powered by phealth and connected health. springer, berlin, germany, pp. 145-149, 2018. [27] m. musci, d. de martini, n. blago, t. facchinetti and m. piastra. “online fall detection using recurrent neural networks”. arxiv, 2018. [28] g. vavoulas, c. chatzaki, t. malliotakis, m. pediaditis and m. tsiknakis. “the mobiact dataset: recognition of activities of daily living using smartphones”. ict4ageingwell, pp. 143-151, 2016. [29] d. dharmitha, m. sazia and z. farhana. “fall detection from physical activity monitoring data”. presented at the international sigkdd workshop on big data, streams and heterogeneous source mining: algorithms, systems, programming models and applications (bigmine), london, 2018. [30] d. ajerla, s. mahfuz and f. h. zulkernine. “a real-time patient monitoring framework for fall detection”. queen’s university, kingston, ontario, canada, 2018. [31] d. ajerla. “fall detection from physical activity monitoring data”. bigmine, london, uk, 2018. [32] t. xu, y. zhou and j. zhu. “new advances and challenges of fall detection systems: a survey”. applied science, vol. 8, no. 3, p. 418, 2018. [33] h. chen, (eds.). “ngu3fall detection using smartwatch sensor data with accessor architecture”. springer international publishing, berlin, germany, pp. 81-93, 2017. [34] e. c. j. santoyo-ramón. “umafall: fall detection dataset (universidad demalaga)”. available from: https://www.figshare. com/articles/uma_adl_fall_dataset_zip/4214283. [last accessed on 2018 jun 04]. [35] s. ahmad, a. lavin, s. purdy and z. agha. “unsupervised real-time anomaly detection for streaming data”. neurocomputing, vol. 262, pp. 134-147, 2017. [36] m. ahmed, a. n. mahmood and m. r. islam. “a survey of anomaly detection techniques in financial domain”. future generation computer systems, vol. 55, pp. 278-288, 2016. [37] s. ahmad and s. purdy. “real-time anomaly detection for streaming analytics”. arxiv, 2016. [38] m. mohammadi, a. al-fuqaha, s. sorour, and m. guizani, “deep learning for iot big data and streaming analytics: a survey,” ieee communications surveys & tutorials, 2018. [39] j. meehan, c. aslantas, s. zdonik, n. tatbul and j. du. “data ingestion for the connected world”. classless inter-domain routing, 2017. [40] s. sadik and l. gruenwald. “research issues in outlier detection for data streams”. acm sigkdd explorations newsletter, vol. 15, no. 1, pp. 33-40, 2014. [41] m. gupta, j. gao, c. c. aggarwal and j. han. “outlier detection for temporal data: a survey”. ieee transactions on knowledge and data engineering, vol. 26, no. 9, pp. 2250-2267, 2014. [42] e. casilari, j. a. santoyo-ramón and j. m. cano-garcía. “analysis of public datasets for wearable fall detection systems”. sensors, vol. 17, no. 7, p. 1513, 2017. [43] j. meehan. “data ingestion for the connected world”. creative commons attribution license, 2017. [44] m. p. g. vavoulas. “the mobifall dataset: fall detection and classification with a smartphone”. 
international journal of monitoring and surveillance technologies research, vol. 2, no. 1, p. 13, 2014. [45] o. aziz, m. musngi, e. j. park, g. mori and s. n. robinovitch. “a comparison of accuracy of fall detection algorithms (thresholdbased vs. machine learning) using waist-mounted tri-axial accelerometer signals from a comprehensive set of falls and non-fall trials”. international federation for medical and biological engineering, vol. 55, no. 1, pp. 45-55, 2017. kakarash, et al.: fall detection using ann 98 uhd journal of science and technology | jul 2020 | vol 4 | issue 2 [46] g. vavoulas1. “the mobiact dataset: recognition of activities of daily living using smartphones”. in” international conference on information and communication technologies for ageing well and e-health, 2016. [47] j. r. kwapisz, g. m. weiss and s. a. moore. “activity recognition using cell phone accelerometers”. sigkdd explorations, vol. 12, no. 2, pp. 74-82, 2010. [48] y. z. z. wang. “application of relieff algorithm to selecting feature sets for classification of high resolution remote sensing image”. 2016 ieee international geoscience and remote sensing symposium, pp. 755-758, 2016. [49] mqtt. available from: https://www.mqtt.org/getting-started. [last accessed on 2020 jan 01]. [50] a. kafka. available from: https://www.kafka.apache.org/intro. [last accessed on 2020 jan 05]. [51] ibm. “ibm watson studio ibm watson and cloud platform learning center”, 2020. available from: https://www.developer.ibm.com/ technologies/data-science. [52] ibm. “ibm streaming analytics ibm watson and cloud platform learning center”. 2016-07-18 2016 [53] ibm. “ibm cloudant-ibm watson and cloud platform learning center”, 2020. available from: https://www.developer.ibm.com/ components/cloud-ibm. tx_1~abs:at/tx_2:abs~at 10 uhd journal of science and technology | jul 2020 | vol 4 | issue 2 1. introduction power system planning starts with the forecast of load requirements. with the fast growth of power systems networks and increase in their complexity, many factors have become influential in electric power generation, demand, or load management. the forecasting of electricity demand has been one of the major research fields in electrical engineering. massive investment decisions for network reinforcement and expansions are made based on the load forecast. hence, it is necessary to have accurate load forecast to carry out proper planning [1], [2]. although, load forecasting is one of the major factors for economic operation of power systems. future load forecasting is also important for network planning, infrastructure development, and so on. power system load forecasting can be classified into three categories, namely, short-term, medium-term, and long-term load forecasting. the periods for these categories are not defined clearly in literature [2]. thus, different authors use different time periods to define these categories. however, roughly, short-term load forecasting (stlf) covers hourly to weekly forecast. these forecasts are often needed for day-to-day economic operation of power generating units [3]. midterm load forecasting has period time in 3 months–3 years, maintenance of plants and networks is often roofed in these types of forecast. 
monthly maximum load demand forecasting for sulaimani governorate using different weather conditions based on artificial neural network model najat hassan abdulkareem moel (ministry of electricity-krg), electricity control center, sulaimani/iraq o r i g i n a l re se a rc h a rt i c l e a b s t r a c t medium-term forecasting is an important category of electric load forecasting that covers a time span of up to 1 year ahead. it suits outage and maintenance planning, as well as load switching operation. there is an on-going attention toward putting new approaches to the task. recently, artificial neural network has played a successful role in various applications. this paper is presents a monthly peak load demand forecasting for sulaimani (located in north iraq) using the most widely used traditional method based on an artificial natural network, the performance of the model is tested on the actual historical monthly demand of the governorate for the years 2014–2018. the standard mean absolute percentage error (mape) method is used to evaluate the accuracy of forecasting models, the results obtained show a very good estimation of the load. the mape is 0.056. index terms: actual load, artificial neural network, midterm monthly load forecast, multilayer perceptron, predicted load. yearly ahead access this article online doi: 10.21928/uhdjst.v4n2y2020.pp10-17 e-issn: 2521-4217 p-issn: 2521-4209 copyright © 2020 abdulkareem. this is an open access article distributed under the creative commons attribution non-commercial no derivatives license 4.0 (cc by-nc-nd 4.0) uhd journal of science and technology corresponding author’s e-mail: najat hassan abdulkareem, moel (ministry of electricity-krg), electricity control center, sulaimani/iraq. e-mail: qaradakhi@gmail.com received: 03-05-2020 accepted: 07-01-2020 published: 07-05-2020 abdulkareem: sulaimani load demand forecasting uhd journal of science and technology | jul 2020 | vol 4 | issue 2 11 long-term forecasting, on the other hand, deals with forecast from few months to 1 year. it is primarily intended for capacity expansion plans, capital investments, and corporate budgeting. these types of forecasts are often complex in nature due to future uncertainties such as planning and extension of existing power system networks for both the utility and consumers required long-term forecasts [3]. the system load is a random non-stationary process composed of thousands of individual components. the system load behavior is influenced by a number of factors, which can be classified as economic factors, time, weather, and random effects. the economic environment in which the utility operates has a clear effect on the electric demand consumption patterns, such as the service area demographics, level of industrial activity, and changes in farming sector [4]. for the present, there are many algorithms for load forecasting in the computation intelligence such as fuzzy logic (fs) [5], neural network, and genetic [6]. many research purposed the article for load forecasting in the power system field: stlf using autoregressive integrated moving average (arima) and artificial neural network (ann) method based on non-linear load, a novel method approach to load forecasting using regressive model and ann the combination of ann, genetic algorithm, and fs method is proposed for adjusting stlf of electric system. 
genetic algorithm is used for selecting better rules and backpropagation algorithm is also for this network, papers show that more accuracy results and faster processor than other forecasting methods [7]. the aim of this paper is to provide a monthly peak demand forecast for sulaimani governorate (located in north iraq). this forecast is of special importance to this region because of the present shortage of generating capacity and the need for extensive load shedding. it is thus important to estimate what the near-term demand will be, especially in the peak demand months, as a key input in determining the availability of enough generating capacity to meet the demand [8]. sulaimani is one of the four northern governorates of iraq (iraqi kurdistan region). it is bounded by iran; erbil governorate and kirkuk governorate. the land area of the governorate is about 18,240 sq.km. the city of sulaimani is located in 335 km northeast to baghdad. this study is tried to find out the way to forecast the electrical power load in sulaimani governorate power distribution network using ann with the help of matlab software. the input is involved three feature; precipitation, temperature, and humidity. these three parameters effect on power consumption directly. mean square error is used as performance measure. the method of learning in ann which is used in this paper is feed forward backpropagation. finally, by comparing different cases is tried to find best solution to estimate the load demand. this dataset consists of 5 years of average monthly load collected from the electricity control center (ecc) of kurdistan region in iraq. the data are accumulated daily and involve the maximum electrical demand load in mega watt (mw) from january 1, 2014, to october 31, 2018. the dataset is also accompanied by temperature, humidity, and precipitation collected, which can be used to forecast the demand. 2. the load profile of sulaimani governorate there is a severe shortage of electricity supply in sulaimani governorate. main sources of electricity supply are dokan hydropower station (5×80 mw), derdandikhan hydropower station (3×83 mw), sulaimani combined cycle gas power plant which is consists of 10 units (8 simple cycle 125 mw per unit and 2 combined cycle 250 mw per unit), tasluja heavy fuel power plant (51 mw), and bazyan gas power plant (4×125 mw), which is also supply erbil and dhuk governorates. (for explain this: all of this power stations cannot work in its full capacity. and the generate power also go to erbil and duhok, therefore, the sulaimani governorate demand every time is bigger than the power generation and that is the problem). however, the available supply from the above sources dose not meets the power requirements in the governorate. consumers are provided for a very short duration, sometimes 10 h per day depending on the generation capacity. monthly energy consumption demand (unit in mwh) data is recorded form (ecc) kurdistan region from 2014 to 2018. fig. 1 shows the relationship between energy consumption demand and time (corrected divide each year into 12 months). we consider the period from 2014 to 2018 to establish the parameters in forecast model. the original signal (behavior) of energy consumption demand is shown in fig. 1. it grew the higher demand every year. the maximum demand is occurred on months 11, 12, 1, 2, 6, 7, and 8 and minimum demand is occurred in months 3, 4, 5, 9, and 10. kurdistan regions climate is characterized by cool winters and hot summers. 
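the monthly peak-demand series and its weather covariates described above can be assembled as in the minimal pandas sketch below; the file name and column names are hypothetical, not part of the paper.

```python
# minimal sketch (hypothetical file and column names) of preparing the monthly
# peak-demand series with its weather covariates for the ANN inputs.
import pandas as pd

df = pd.read_csv("sulaimani_monthly_load.csv", parse_dates=["month"])
df = df.set_index("month").sort_index().loc["2014-01":"2018-10"]

# average peak demand by calendar month highlights the winter and summer peaks
seasonal_profile = df["peak_mw"].groupby(df.index.month).mean()
print(seasonal_profile)

X = df[["temperature", "humidity", "precipitation"]]   # the three ANN input features
y = df["peak_mw"]                                      # target: monthly peak demand
```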
this extreme temperature swing affects the demand and abdulkareem: sulaimani load demand forecasting 12 uhd journal of science and technology | jul 2020 | vol 4 | issue 2 producing a typical summer and winter peak demand periods each year. the monthly peak temperature profile for the years 2014–2018 is shown in fig. 2. 2.1. forecasting methods in terms of lead time, load forecasting is divided into four categories: • long-term forecasting with the lead time of more than 1 year • midterm forecasting with the lead time of 1 week–1 year • stlf with the lead time of 1 day–1 week • very stlf with the lead time shorter than 1 day. the research approaches of load forecasting can be mainly divided into two categories: statistical methods and artificial intelligence methods. in statistical methods, equations can be obtained showing the relationship between load and its relative factors after training the historical data, while artificial intelligence methods try to imitate human beings way of thinking and reasoning to get knowledge from the past experience and forecast the future load. some main stlf methods are introduced as follows. regression methods regression is one of most widely used statistical techniques. for load forecasting, regression methods are usually employed to model the relationship of load consumption and other factors such as weather, day type, and customer class. time series methods are based on the assumption that the data have an internal structure, such as autocorrelation, trend, or seasonal variation. the methods detect and explore such a structure. time series have been used for decades in such fields as economics, digital signal processing, as well as electric load forecasting. in particular, arma (autoregressive moving average), arima, and arimax (arima with exogenous variables) are the most often used classical time series methods. arma models are usually used for stationary processes while arima is an extension of arma to non-stationary processes. arma and arima use the time and load as the only input parameters. since load generally depends on the weather and time of the day, arimax is the most natural tool for load forecasting among the classical time series models. similar day approach, this approach is based on searching historical data for days within 1, 2, or 3 years with similar characteristics to the forecast day. similar characteristics include weather, day of the week, and the date. the load of a similar day is considered as a forecast. instead of a single similar day load, the forecast can be a linear combination or regression procedure that can include several similar days. the trend coefficients can be used for similar days in the previous years. expert systems are heuristic models, which are usually able to take both quantitative and qualitative factors into account. a typical approach is to try to imitate the reasoning of a human operator. the idea is then to reduce the analogical thinking behind the intuitive forecasting to formal steps of logic. a possible method for a human expert to create the forecast is to search in history database for a day that corresponds to the target day with regard to the day type, social factors, and weather factors. then, the load values of this similar day are taken as the basis for the forecast. fs is a generalization of the usual boolean logic used for digital circuit design. an input under boolean logic takes on a value of “true” or “false.” under fs an input is associated with certain qualitative ranges. 
for instance, the temperature of a day may be "low," "medium," or "high." fs allows one to logically deduce outputs from fuzzy inputs; in this sense, fs is one of a number of techniques for mapping inputs to outputs. among the advantages of fs are the absence of a need for a mathematical model mapping inputs to outputs and the absence of a need for precise inputs. with such generic conditioning rules, properly designed fs systems can be very robust when used for forecasting.
integration of different algorithms: since many methods have been presented for stlf, it is natural to combine the results of several methods. one simple way is to take their average value, which can lower the risk of an individual unsatisfactory prediction. a more complicated and reasonable way is to obtain a weight coefficient for every forecasting method by reviewing the historical prediction results; the comprehensive result is then deduced by the weighted average method.
fig. 1. monthly energy consumption demand of sulaimani city from january 2014 to december 2018 (dcc).
fig. 2. average temperature of sulaimani governorate from 2013 to 2017.
figure sources: 1. http://www.motac.gov.krd/news.aspx?id=1226. 2. the (sulaimani directory of meteorological and seismology data) page on facebook: shorturl.at/gmyy9.
2.2. forecasting requirements
this subsection lists and describes the requirements for developing a user-friendly and good load forecasting tool. a good load forecasting tool should fulfill the requirements of accuracy, fast speed, a friendly interface, and automatic data access.
accuracy: the most important requirement of a load forecasting tool is its prediction accuracy. as mentioned before, good accuracy is the basis of economic dispatch, system reliability, and electricity markets. the main goal of this paper is to make the forecasting result as accurate as possible.
fast speed: employing the latest historical data helps to increase the accuracy. when the deadline for the forecasted result is fixed, the longer the runtime of the forecasting program, the older the historical data that can be employed by the program. therefore, the speed of the forecasting is a basic requirement; programs with too long a training time should be abandoned, and new techniques that shorten the training time should be employed.
friendly interface: the graphical user interface of the load forecasting tool should be easy, convenient, and practical. users can easily define what they want to forecast, whether through graphics or tables. the output should also be available in graphical and numerical formats so that users can access it easily.
automatic data access: the historical data are stored in a database. the load forecasting tool should be able to access it automatically and get the needed data.
2.3. advantages and disadvantages of load forecasting
2.3.1. advantages
1. it enables the utility company to plan well, since it has an understanding of the future consumption or load demand
2. it is useful for determining the required resources, such as the fuels required to operate the generating plants, as well as other resources needed to ensure uninterrupted and yet economical generation and distribution of power to the consumers.
this is important for all short-, medium-, and long-term planning
3. planning the future in terms of the size, location, and type of future generating plants is determined with the help of load forecasting
4. it provides maximum utilization of power generating plants, since the forecasting avoids under-generation or over-generation
2.3.2. disadvantages
1. it is not possible to forecast the future with complete accuracy; because of the qualitative nature of forecasting, a business can come up with different scenarios depending on the interpretation of the data
2. organizations should never rely 100% on any forecasting method; however, an organization can effectively use forecasting together with other tools of analysis to obtain the best possible information about the future
3. making a decision based on a bad forecast can result in financial ruin for the organization, so the decisions of an organization should never be based solely on a forecast
3. load forecasting using ann
a large variety of artificial intelligence and statistical techniques has been developed for load forecasting. methods such as the similar day approach, regression methods, time series, neural networks, expert systems, and fs are used nowadays. among them, the ann is a good choice for the load demand forecasting problem because this technique does not require explicit models to represent the complex relationship between the load demand and its factors. ann methods are particularly attractive, as they have the ability to capture the non-linear relationships between the load and the factors affecting it directly from historical data. anns use a dense interconnection of computing nodes to approximate non-linear functions: each node constitutes a neuron that multiplies the input signals by constant weights, sums up the results, and maps the sum through a non-linear activation function; the result is then transferred to its output [9].
3.1. basic theory – feed forward backpropagation
in this paper, we use a backpropagation feed forward neural network to model the problem. fig. 3 shows an example of the ann structure; each neural network has at least three layers: an input layer, a hidden layer, and an output layer. in a typical multilayer network, the input units, denoted by x_i, are connected to all hidden layer units, denoted by y_j, and the hidden layer units are connected to all output layer units, denoted by z_k. the elements w_ij and v_jk of the weight matrices hold the weight of each connection between the input-to-hidden and hidden-to-output layer units, respectively. the hidden and output layer units also receive signals through weighted connections (biases) from units whose values are always 1. in each hidden and output unit, the incoming signals from the previous layer are summed and an activation function is applied to form the response of the net for a given input pattern:
x_i: inputs, i = 1, 2, …, n   (1)
y_j = f(b_j + Σ_i x_i·w_ij)   (2)
z_k = f(b_k + Σ_j y_j·v_jk)   (3)
to determine the error, each output z_k is compared with the actual (target) output value, denoted by d_k. then, according to the calculated error, the factor δ_k is determined; δ_k is used to distribute the error at z_k back to all units in the previous layer:
δ_k = f′(z_k)·(d_k − z_k)   (4)
where z_k is the calculated output of the output layer, d_k is the actual output, and δ_k is the factor used for calculating the errors.
the factor δ_j is then computed for each hidden unit. this factor is the weighted sum of the back-propagated delta terms from the units in the following (output) layer, multiplied by the derivative of the activation function of that hidden unit:
δ_j = f′(y_j)·Σ_k δ_k·v_jk   (5)
in the next step, the new values of the biases and of each element of the weight matrices are calculated, where η is a learning rate coefficient that is given a value between 0 and 1 at the start of training:
b_j(new) = b_j(old) + η·δ_j   (6)
w_ij(new) = w_ij(old) + η·δ_j·x_i   (7)
the bias and weights of the output layer are updated analogously using δ_k. in each iteration, it is checked whether the stopping condition has occurred or not; the stopping condition can be reaching an error threshold, a defined number of iterations, etc. [10], [11].
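to make the update rule of equations (1)-(7) concrete, the following minimal numpy sketch implements one training step for a single hidden layer with sigmoid activations. it is an illustration only: the array shapes, the learning rate, and the use of a sigmoid at the output unit (the case study below uses a linear output) are assumptions, not the authors' matlab implementation.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def train_step(x, d, w, b_h, v, b_o, eta=0.1):
    """one backpropagation step for a single training pattern x with target d."""
    # forward pass, eqs. (2) and (3)
    y = sigmoid(b_h + x @ w)            # hidden activations y_j
    z = sigmoid(b_o + y @ v)            # output activations z_k
    # output-layer error factor, eq. (4); for the sigmoid, f'(z) = z * (1 - z)
    delta_k = z * (1.0 - z) * (d - z)
    # hidden-layer error factor, eq. (5)
    delta_j = y * (1.0 - y) * (delta_k @ v.T)
    # bias and weight updates, eqs. (6) and (7), applied to both layers
    v   += eta * np.outer(y, delta_k)
    b_o += eta * delta_k
    w   += eta * np.outer(x, delta_j)
    b_h += eta * delta_j
    return w, b_h, v, b_o

# hypothetical sizes: 4 inputs (lagged load, temperature, humidity, precipitation),
# 12 hidden neurons, 1 output (peak demand); values are placeholders
rng = np.random.default_rng(0)
w, b_h = rng.normal(size=(4, 12)) * 0.1, np.zeros(12)
v, b_o = rng.normal(size=(12, 1)) * 0.1, np.zeros(1)
w, b_h, v, b_o = train_step(rng.random(4), np.array([0.8]), w, b_h, v, b_o)
```

repeated over all training patterns and epochs until the stopping condition of section 3.1 is met, this corresponds to plain gradient backpropagation; matlab's levenberg–marquardt training (trainlm), mentioned below, follows a different, hessian-based update.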
3.2. problem formulation and methodology
the following steps have been followed by the investigator to formulate the above problem:
i. first of all, the historical weather and load data are scrutinized, and all monthly and daily records are read
ii. then, a database is created by the investigator for developing the load forecasting model
iii. accordingly, temperature and humidity are represented as average values
iv. furthermore, the rainy season and the predicted rainfall are considered in building the algorithm for midterm load forecasting
v. after this classification, an ann technique is used to train these input variables to obtain the expected outcome; the system is simulated with the help of matlab/simulink
vi. then, the percentage error (pe) is calculated for the given forecasting model.
3.3. accuracy of the forecast
to evaluate the forecasting accuracy of the whole procedure, the following indices are calculated in equations (8) and (9). the mean square error (mse) for each month of forecasting is:
mse = (1/m)·Σ_{i=1..m} (actual_i − forecast_i)²   (8)
and the mean absolute percentage error (mape) is given by:
mape = (1/m)·Σ_{i=1..m} |actual_i − forecast_i| / actual_i × 100   (9)
where actual_i is the real value of the monthly load demand in each year, forecast_i is the forecasted value in the same year, and m is the number of months [12], [13]. after forecasting the load patterns for each test month, these forecasts were compared with the real load data, and the average error percentages were calculated. in comparing different models, the average percentage forecasting error is used as the measure of performance. the reason for using the average percentage error is that its meaning can be easily understood; it is also the error measure most often used in the load forecasting literature referenced in this work and therefore allows for some comparative considerations (although the results are not directly comparable across different situations). moreover, when both measures were calculated on test models with relatively small errors, the orders of preference were in practice the same, so the average forecasting error is used throughout this work. for the monthly forecast, the training algorithms of gradient backpropagation and levenberg–marquardt were compared in simulation; although levenberg–marquardt is a very fast training algorithm, it often gives fairly inaccurate results due to the large approximations it makes while calculating the hessian matrix [14], [15]. test months from the years 2014 to 2018 were chosen across different seasons of the year, and the load of each month of the dataset was forecast without using the actual load data of that month. the monthly forecasting model was thereby applied recursively to the monthly peak demand for all test months of the year; after forecasting the load patterns for each test month, these forecasts were compared with the real load data, and the average error percentages were calculated.
4. case study and test results
the present study develops midterm electric load forecasting using a neural network based on the historical series of power demand; the neural network chosen is a feed forward network. the case study and test results are presented in this section as follows.
4.1. case study
this study used the following historical information for the proposed ann load forecasting model: monthly energy consumption demand (mwh), humidity (h), precipitation (p), and temperature (t), all recorded from 2014 to 2018. table 1 shows the block model for the 2 years ahead demand forecast, which has four inputs: the historical load demands from −12 months to −48 months, the maximum temperature from 1 month to −48 months, the humidity from 1 month to −48 months, and the precipitation from 1 month to −48 months. these are the feature inputs to the ann. the output of this model is the load demand of sulaimani governorate +24 months ahead, i.e., 2 years ahead. note that the historical load demands correspond to year 2018 (0), 2017 (−12), 2016 (−24), 2015 (−36), and 2014 (−48). various network models based on a multilayer backpropagation feed forward architecture were tested with different designs and different configurations of hyperparameters: one and two hidden layers with 10, 12, or 15 neurons, and different transfer functions. after several trials, the near-optimal values were taken. the information about the different cases (for 1 month, january 2014, for example) is given in table 2. a backpropagation network with momentum and an adaptive learning rate was trained, and the neural network can forecast the future 1-day peak load per month ahead given the various inputs to the network. a sigmoid transfer function was used in the hidden layer, while a linear transfer function was used in the output layer. it has been observed that using 12 neurons and the logsig transfer function gave the lowest percentage error for nearly all months, and based on this result the forecasted values (using the model in table 1) were obtained. fig. 4 presents the results of the 2 years ahead forecast.
fig. 3. artificial neural network structure used in midterm load forecasting.
fig. 4. (a) actual and forecasted load demand for 24 months. (b) percentage error for 24 months.
table 1: the model for 2 years ahead forecasting
offset (months):  −48    −36    −24    −12     0     +12    +24
year:             2014   2015   2016   2017   2018   2019   2020
role:             train  train  train  train  train  test   forecasted value
(the error differs because of the difference between the actual output and the output of the nn.) fig. 4(a) and (b) show the actual load demand for 2016–2017 (dashed line) and the forecasted values (solid line); the mean absolute percentage error is 5.6%. using the developed model, the load, error, and % error of all the months have been calculated, where
error = output by nn − actual output
% error = (error / actual output) × 100
the actual output, the output by the nn, the error, and the % error for 2016 and 2017 are given in tables 2 and 3.
table 2: percentage error of different cases for january 2013
no. of hidden layers   neurons   transfer function   pe
1                      10        logsig              −53.5031
1                      12        logsig              −25.4602
1                      15        logsig              −25.4602
1                      10        tansig              0
1                      12        tansig              −119.092
1                      15        tansig              0
2                      10        logsig              −0.33203
2                      12        logsig              −40.0727
2                      15        logsig              −56.5244
table 3: actual, forecasted load demand, and percentage error
month   actual demand   forecasted demand   % error
1       2050            2050                0
2       2176            2175.797            0.011883
3       2179            2178.866            0.00774
4       1839            1835.985            0.249687
5       1199            1199                0
6       1393            1395.096            −0.16384
7       1524            1520.822            0.234717
8       1468            1474.335            −0.54235
9       1324            1324                0
10      1421            1420.9246           0.009407
11      1640            1699                −3.59756
12      1740            1696.65             2.491406
13      2119            2149.73             −1.60011
14      2119            1918.241            10.67865
15      2086            1907.967            10.5034
16      1908            2179.83             −17.3104
17      1440            1358.92             8.273469
18      1102            1037                5.914468
19      1268            1332.29             −5.29136
20      1194            1219.61             −2.14489
21      1390            1362.76             2.508287
22      1391            1397.36             −0.74736
23      1191            1211                −1.18554
24      1285            1228.05             3.274871
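as a rough illustration of how such a configuration could be reproduced outside matlab, the sketch below fits a 12-neuron logistic ("logsig") hidden layer with scikit-learn and computes the error measures of equations (8) and (9). the feature matrix, the scaling, and the lbfgs solver are assumptions; matlab's levenberg–marquardt training (trainlm) has no direct scikit-learn equivalent, and the numbers here are placeholders, not the paper's data.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

# hypothetical feature matrix: one row per month, columns = lagged load,
# temperature, humidity, precipitation (placeholder values)
rng = np.random.default_rng(1)
X = rng.random((48, 4))
y = 1000.0 + 1200.0 * rng.random(48)          # monthly peak demand in mw

scaler = StandardScaler().fit(X)
# 12 logistic hidden neurons and a linear output, as in the best case reported above
model = MLPRegressor(hidden_layer_sizes=(12,), activation="logistic",
                     solver="lbfgs", max_iter=5000, random_state=0)
model.fit(scaler.transform(X), y)

forecast = model.predict(scaler.transform(X))
error = forecast - y                           # error = output by nn - actual output
pct_error = error / y * 100.0                  # % error, as in the tables above
mse = np.mean(error ** 2)                      # eq. (8)
mape = np.mean(np.abs(error) / y) * 100.0      # eq. (9)
print(round(mse, 2), round(mape, 2))
```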
5. conclusions
ann models provide a very useful tool for midterm load forecasting. being radically different from statistical methods, these models have shown promising results in load forecasting. the aim of this paper was to develop a practical model for the peak load demand of sulaimani governorate that gives the best expected values with minimum errors, so that the planners of the directorate of central control can estimate what the near-term demand will be. the monthly peak demand estimate is a key input in determining whether enough generating capacity is available to meet the demand. the result of this research shows the high efficiency of the neural network in estimating the electrical power load; this is because the ann can capture the non-linear relation between the weather data and the load with high accuracy. midterm monthly load forecasting using an ann can lead to very good results if the ann structure is well designed and the training data selection is appropriate. as can be seen in the table above, the electricity demand rises to its peak in january, february, and december every year; from this point of view, sulaimani city needs extra investment in electricity generation to satisfy the consumption demand. in 2015 and 2016, the electricity demand was above 2000 mw in january. in this work, the maximum electrical energy demand of sulaimani has been forecasted by considering different ann models with high accuracy; to model the effects of weather, proper ann models have been implemented. because no single prediction model gives the best results, nine different ann models were run on the same period of data and the superior ann model was identified for forecasting the electricity demand. the forecast models vary the number of neurons in the hidden layer among 10, 12, and 15. the results show that the 2 years ahead midterm load forecasting model with 12 neurons in the hidden layer reduces the error; the mape of this model is 5.6%. this research, and research on load forecasting in general, can be helpful for scheduling the requirements of developing the electrical distribution network, switching, energy selling, maintenance, and repairs. to obtain even better results, we may need a more sophisticated topology for the neural network, one that can discriminate start-up months from other months. here, we utilized only temperature, humidity, and precipitation among the available weather information.
nevertheless, ann may enable us to model such weather information for midterm procedure. the use of additional weather variables such as cloud coverage and wind speed should yield even better result. 6. acknowledgment the author would like to thank the staff directory of sulaimani dispatch control center, especially operation department, and sulaimani directory of metrological and seismology data for the contribution of this work and providing. references [1] h. s. hippert, c. e. pedreira and r. c. souza. “neural networks for short-term load forecasting: a review and evaluation”. vol. 16. in: ieee transactions on power systems, piscataway, new jersey, 2001. [2] united nations development programme. “electricity network development plan sulaimani governorate, undp-enrp, distribution sector revision 1 february”. united nations development programme, new york. 2002. [3] a. mohan. “mid term electrical load forecasting for state of himachal pradesh using different weather conditions via ann model”. international journal of research in management, science and technology, vol. 1, no. 2, 80, 2013. [4] m. r. g. al-shakarchi and m. m. ghulaim. “short-term load forecasting for baghdad electricity region. electric machines and power systems, vol. 28, pp. 355-371, 2000. [5] s. h. ling, f. h. f. leung, h. k. lam and p. k. s. tam. “short-term electric load forecasting based on a neural on a neural fuzzy network”. vol. 50. in: ieee transactions on industrial electronics, 2003. [6] g. c. liao and t. p. tsao. “integrated genetic algorithm/tabu search and neural fuzzy networks for short-term load forecasting”. power engineering society general meeting, vol. 1, pp. 1082-1087, 2004. [7] p. k. dash, s. mishra, s. dash, a. c. liew. “genetic optimization of a self-organizing fuzzy-neural network for load forecasting”. in: ieee power engineering society winter meeting, conference proceedings, 2000. [8] united states agency for international development. electricity sector master plan for iraq. united states agency for international development, washington, dc, united states, 2004. [9] b. islam. “comparison of conventional and modern load forecasting techniques based on artificial intelligence and expert systems”. ijcsi international journal of computer science issues, vol. 8, no. 3, pp. 504-513, 2011. [10] g. b. huang, q. y. zhu, k. mao, c. k. siew, p. saratchandran and n. sundararajan. “can threshold networks be trained directly”. vol. 53. in: iee transactions on circuits and systems part 2: express briefs, pp. 187-191, 2006. [11] a. nahari, h. rostami, r. dashti. “electrical load forecasting in power distribution network by using artificial neural network”. international journal of electronics communication and computer engineering, vol. 4, no. 6, 2013. [12] y. y. hsu and c. c. yang. “design of artificial neural networks for short-term load forecasting. part i: self-organizing feature maps for day type selection”. vol. 138. in: ieee proceedings-c, pp. 407-413, 1991. [13] m. djukanovic, b. babic, d. j. sobajic and y. h. pao. “unsupervised/ supervised learning concept for 24-hour load forecasting”. vol. 140. in: iee proceedings-c, pp. 311-318, 1993. [14] y. wang and d. gu. “back propagation neural network for short-term electricity load forecasting with weather features”. in: international conference on computational intelligence and natural computing, 2009. [15] m. buhari and s. adamu. “short term load forecasting using artificial neural naetwork”. vol. 1. 
in: proceeding of the international multi conference of engineering and computer scientists, pp. 221-226, 2012. . uhd journal of science and technology | jan 2020 | vol 4 | issue 1 29 1. introduction in the past two decades, the internet and more specifically social media have become the main huge source of opinionated data. people broadly use social media such as twitter, facebook, instagram to express their attitude and opinion toward things such as products of commercial companies, services, social issues, and political views in the form of short text. this steadily growing subjective data makes social media a tremendously rich source of information that could be exploited for the decision-making process [1], [2]. as mentioned before, one of the most popular and widespread social media is twitter. it is a microblogging platform that allows people to express their feelings toward vital aspects in the form of a 280-character length text called sentiment analysis using hybrid feature selection techniques sasan sarbast abdulkhaliq1, aso darwesh2 1department of computer science, university of sulaimani, sulaymaniyah, iraq, 2department of information technology, university of human development, sulaymaniyah, iraq a b s t r a c t nowadays, people from every part of the world use social media and social networks to express their feelings toward different topics and aspects. one of the trendiest social media is twitter, which is a microblogging website that provides a platform for its users to share their views and feelings about products, services, events, etc., in public. which makes twitter one of the most valuable sources for collecting and analyzing data by researchers and developers to reveal people sentiment about different topics and services, such as products of commercial companies, services, well-known people such as politicians and athletes, through classifying those sentiments into positive and negative. classification of people sentiment could be automated through using machine learning algorithms and could be enhanced through using appropriate feature selection methods. we collected most recent tweets about (amazon, trump, chelsea fc, cr7) using twitter-application programming interface and assigned sentiment score using lexicon rule-based approach, then proposed a machine learning model to improve classification accuracy through using hybrid feature selection method, namely, filter-based feature selection method chi-square (chi-2) plus wrapper-based binary coordinate ascent (chi-2 + bca) to select optimal subset of features from term frequency-inverse document frequency (tf-idf) generated features for classification through support vector machine (svm), and bag of words generated features for logistic regression (lr) classifiers using different n-gram ranges. after comparing the hybrid (chi-2+bca) method with (chi-2) selected features, and also with the classifiers without feature subset selection, results show that the hybrid feature selection method increases classification accuracy in all cases. the maximum attained accuracy with lr is 86.55% using (1 + 2 + 3-g) range, with svm is 85.575% using the unigram range, both in the cr7 dataset. 
index terms: binary coordinate ascent, bag of words, chi-square, logistic regression, n-grams, opinion mining, sentiment analysis, support vector machine, twitter-application programming interface, term frequency-inverse document frequency corresponding author’s e-mail: sasan sarbast abdulkhaliq, department of computer science, university of sulaimani, sulaymaniyah, iraq. e-mail: sasan.abdulkhaliq@uhd.edu.iq received: 19-12-2019 accepted: 07-01-2020 published: 13-02-2020 access this article online doi: 10.21928/uhdjst.v4n1y2020.pp29-40 e-issn: 2521-4217 p-issn: 2521-4209 copyright © 2020 abdulkhaliq and darwesh. this is an open access article distributed under the creative commons attribution noncommercial no derivatives license 4.0 (cc by-nc-nd 4.0) o r i g i n a l re se a rc h a rt i c l e uhd journal of science and technology sasan sarbast abdulkhaliq and aso darwesh: sentiment analysis in social media 30 uhd journal of science and technology | jan 2020 | vol 4 | issue 1 tweet. moreover, twitter is used by almost all famous people and reputable companies around the world [3]. this has made twitter become a very rich source for data, as it has approximately 947 million users and 500 million generated tweets each day. hence, companies and organizations trying to benefit from this huge and useful data to find their customer’s satisfaction with their products and service levels they offer, politicians wish to envisage their fans’ sentiments. however, it is impractical for a human to analyze this massive data, to avoid this, sentiment analysis, or opinion mining techniques can be used to automatically discover knowledge and recognize predefined patterns within large sets of data [4]. sentiment analysis is a natural language processing (nlp) technique for detecting or calculating the mood of people about a particular product or topic that has been expressed in the form of short text using machine learning algorithms. the main goal of sentiment analysis is to build a model to collect and analyze views of people about a particular topic and classify them into two main classes positive or negative sentiment [5]. one major step in sentiment analysis is feature extraction, which is the numerical representation of tokens in a given document. but actually, features can be noisy due to the data collection step as a consequence of data collecting technologies imperfection or the data itself which contains redundant and irrelevant infor mation for a specific problem. this degrades the performance of the learning process, reduces the accuracy of classification models, increases computational complexity of a model, and leads to overfitting. thus, high dimensionality problem should be handled when applying machine learning and data mining algorithms with data that have high dimensional nature. to handle this problem, feature selection techniques could be used to select the best features from available feature space for classification, regression, and clustering tasks. besides, one of the most important aspects of classification is accuracy. feature selection plays a key role in improving accuracy by identifying and removing redundant and irrelevant features. feature selection techniques are broadly classified into three categories, which are filter-based, wrapper-based, and embedded or hybrid methods [6]. 
in this work, twitter-application programming interface (twitter-api) being used to extract most recent tweet according to specific keywords such as (amazon, trump, chelsea fc, cr7) and label them using lexicon-based approach into two categories which are positive and negative, followed by pre-processing step to remove irrelevant terms. bag of word (bow) technique and term frequency-inverse document frequency (tf-idf) weighting scheme along with different n-gram ranges such as (unigram, bigram, trigram, and uni + bi + tri-gram) are used to extract features from tweets. the next step is selecting the best features to feed to our model which consists of a filter-based method chi-square (chi-2) to select the most relevant attributes within generated features, followed by selecting the best feature-subset within features using wrapper-based method binary coordinate ascent (bca) to improve classification accuracy. two supervised machine learning algorithm has been chosen for their performance and simplicity, namely, logistic regression (lr) and linear support vector machine (svm) to perform binary sentiment classification on selected feature-subsets. 1.1. related works in zhai et al. [7], the authors utilize chi-2 for feature selection with single and double words, together with svm and naïve bayesian as classifiers. obtained is that accuracy gradually increases as the number of features increase. with 1300 features accuracy hits 96%, then it remains slightly stable until 2000 features. meanwhile, the accuracy of information (information gain [ig]) remains below chi-2 for the whole features. besides, we can see that the feature selection applied as combination features could also affect the performance of the classification. it extracts context-related multifeatures with little redundancy which can help to reduce the internal redundancy, consequently improve the classification performance. shortcomings of chi-2 is also pointed, as it only considers the frequency of words within the document, regardless of the effectiveness of the word, and as result, it can cause the removal of some effective but low-frequency words during feature selection step. in contrast, the researchers in kurniawati and pardede [8] proposed a hybrid feature selection technique on a balanced dataset, which composed of particle swarm optimization (pso) plus ig together, followed by classification step using svm, achieving a better result than using each one separately. the results are as follows: compared to using svm alone, the proposed method achieves 1.15% absolute improvements. compared to ig + svm and pso + svm, the method achieves 1.97% and 0.6% improvements, respectively. overall, the system achieved 98% accuracy using area under the curve accuracy measure. in the research done by kaur et al. [9], the proposed system uses k-nearest neighbors (knn) as a classifier for classifying sasan sarbast abdulkhaliq and aso darwesh: sentiment analysis in social media uhd journal of science and technology | jan 2020 | vol 4 | issue 1 31 sentiments of text on e-commerce sites into positive, negative, and neutral sentiments on tweeter dataset. features generated using n-gram before the knn classifier took place. the performance of the proposed model was analyzed using precision, recall, and accuracy followed by comparing them to results obtained from the svm classifier. the outcome was that the proposed system could outperform svm classifier by 7%. 
on the other hand, the work done by zhang and zheng [10] incorporates part of speech tagging to specify adverbs, adjectives, and verbs in the text first, then applied term frequency-inverse document frequency (tf-idf) for generating features as a result of their corresponding word weights. then, features were adopted for classification and fed to both svm and extreme learning machine with kernels to classify sentiments of hotel reviews in chinese. they attained that essential medicines list accuracy is slightly better than svm when introduced with the kernel and takes an effectively shorter time of training and testing than svm. in the work joshi and tekchandani [11], researchers have made a comparative study among supervised-learning algorithms of machine learning such as svm, maximum entropy (maxent), and naïve bayes (nb) to classify twitter movie review sentiments using unigram, bigram, and unigram-bigram combination features. their study result shows that svm reaches maximum accuracy of 84% using hybrid feature (unigram and bigram), leaving other algorithms behind. furthermore, they observed that maxent excels nb algorithm when used with bigram feature. in luo and luo [12] researchers proposed a new odds ratio (or) + svm-recursive feature elimination (rfe) algorithm that combines or with a recursive svm (svm-rfe), which is an elimination based function. or is used first as a filtering method to easily select a subset of features in a very fast manner, followed by applying svm-rfe to precisely select a smaller subset of features. their observation result emphasizes that or + svm-rfe attains better classification performance with a smaller subset of features. in maipradit et al. [13], a group of researchers suggests a method for classifying sentiments with a general framework for machine learning. n-gram idf has been used in feature generation and selection stage. as the classification stage, an automated machine learning tool has been used which makes use of auto-sklearn for choosing the best classifier for their datasets and also choosing the best parameter for those classifiers automatically. classification is applied on different publicly available datasets (stack overflow, app reviews, jira issues). however, their study might not be feasible to be generalized for every other dataset; their datasets were specifically chosen for comments, reviews, and questions and answers. their classification result achieved the best average model evaluation metrics f1, precision, and recall score values for all datasets in predicting class labels for positive, negative, and neutral classes for abovementioned datasets. moreover, the highest f1-score value achieved was 0.893 in positive comments, 0.956 in negative comments of jira issues dataset, and 0.904 f1-score value for in neutral comments of stack overflow dataset. in work done by researchers in rai et al. [14], tweets have been gathered from twitter’s api first. later on weights for each word within review tweets have been calculated. followed by selecting the best features using the nb algorithm, and consequently classifying the sentiment of reviews using three different machine learning classifiers, namely, nb classifier, svm, and random forest algorithm. after measuring they realized that all three algorithms are performing the same for 50 tweets, but increasing the number of tweets and adding more features changes the accuracy and other measures dramatically. 
as a part of their observation, they noticed that increasing the number of tweets from 50 to 250 will increase the accuracy of nb and svm up to 83% approximately while adding more features to each algorithm gives slightly better classification accuracy up to 84% for 250 tweets. another group of researchers in naz et al. [15] has employed another method to classify sentiments of twitter data. the method composed of a model that employs a machine learning algorithm utilizing different feature combinations (unigram, bigram, trigram, and the combination of unigram + bigram + trigram) + svm to improve classification accuracy. furthermore, three different weighting approaches (tf and tf-idf and binary) have been tried with the classifier using different feature combinations to see the effect of changing weights on classification accuracy. the best accuracy achieved by this approach was 79.6% using unigram with tf-idf. furthermore, sentiment score vector is created to save overall scores tweets and then associated with the feature vector of tweets, then classified them using svm with different n-grams of features from different feature selection methods as mentioned before. the result shows that using a sentiment score vector with unigram + svm gives the best accuracy result compared to other n-grams which were 81%. sasan sarbast abdulkhaliq and aso darwesh: sentiment analysis in social media 32 uhd journal of science and technology | jan 2020 | vol 4 | issue 1 another research has been carried out by wagh and punde [16], a comparative study among different machine learning approaches have been applied by other researchers. the focus of their work was to discuss the sentiment analysis of twitter tweets, considering what people like or dislike. they perceived that applying machine learning algorithms such as svm, nb, and max-entropy on results of semantic analysis wordnet to form hybrid approach can improve accuracy of sentiment analysis classification by 4–5% approximately. another research has been performed by iqbal et al. [17], in which multiple feature combinations are fed to (nb, svm, and maxent) classifiers for classifying movie reviews from the imdb dataset and tweets from stanford twitter sentiment 140 dataset, in the term of people’s opinion about them. the experiment incorporates four different sets of features, each of which are a combination of different single features as following: combined single word features with stopword filtered word features as (set 1), unigram with bigram features as (set 2), bigram with stopword filtered word features as(set 3), and most informative unigram with most informative bigram features as(set 4). chi-2 has been used as a supervised feature selection technique to obtain more enhanced performance by selecting the most informative features. and also, chi-2 helps to decline the size of training data. their result shows that combining both unigram and bigram features and subsequently feeding it to maxent algorithm gives the best result in term of f1-score, precision, and recall compared to two other algorithms, and also compared to using single feature and baseline model which is sentiwordnet (swn) method by 2–5%. another research was done by rane and kumar [18] on a dataset containing tweets of 6 us airlines and carried out sentiment analysis to extract sentiments as (positive, negative, and neutral). 
the motivation of the research was to provide airline companies a general view of their customer’s opinions about airline services to provide them a good level of service to them. as the first step preprocessing has been performed, followed by a deep learning concept (doc2vec) to represent tweets as vectors which makes use of distributed bow and distributed memory model, which preserves ordering of words throughout a paragraph, to do phrase-level sentiment analysis. the classification task has been done using seven different supervised and unsupervised learning, namely, decision tree, random forest, svm, knn, lr, gaussian nb, and adaboost. after classification, they attained acceptable accuracy that can be used by airline companies with most of the classifiers as follows: random forest (85.6%), svm (81.2%), adaboost (84.5%), and lr (81%) are among the best classifiers as result, they concluded that the accuracy of the classifiers are high enough, that makes them reliable to be used by airline industry to explore customer satisfaction. another work done by jovi et al. [19] to review available feature selection approaches for classification, clustering and regression tasks, along with focusing on their application aspects. among which ig (precision) and normal separation (accuracy, f-measure, and recall) have the best performance for text classification tasks, whereas iterative feature selection (entropy, precision) attains the best performance for text clustering. results show that using hybrid approaches for feature selection, consisting of a combination of the best properties from filter, and wrapper methods giving out the best result by applying first, a filter method to reduce feature dimensions to obtain some candidate subsets. then applying a wrapper method was based on a greedy approach to find the best candidate subset. in rana and singh [20] authors have proposed a model for classifying movie reviews using nb classifier and linear svm classifiers. they realized that applying the classifiers after omitting synthetic words gives a more accurate result. their result shows that svm achieves better accuracy than the nb classifier. furthermore, both algorithms distinctly performed better for genre drama, reaching 87% with svm and 80% with the nb algorithm. in kumar et al. [21] authors have developed a classification model to classify reviews from websites such as amazon. com. after extracting reviews of three different products, namely, apple iphone 5s, samsung j7, and redmi note 3 from the website automatically, they applied nb, lr, and swn algorithms for classifying reviews in the term of positive and negative. after using quality measure metrics (f1 score, recall, and precision), nb has achieved the best result among three classifiers with f1-scores: 0.760, 0.857, and 0.802 for three above-mentioned datasets, respectively. in iqbal et al. [22] researchers proposed a hybrid framework to solve scalability problems that appear when feature set grows in sentiment analysis. using genetic algorithm (ga) based technique to reduce feature set size up to 42% without effecting accuracy. comparing their proposed (ga) based feature reduction technique against two other well-known techniques: principal component analysis (pca) and latent symantec analysis, they affirmed that ga based technique had 15.4% increased accuracy over pca and up to 40.2% increased accuracy over latent semantic analysis. 
furthermore, they employed three different methods of sentiment analysis, which are swn, machine learning, and sasan sarbast abdulkhaliq and aso darwesh: sentiment analysis in social media uhd journal of science and technology | jan 2020 | vol 4 | issue 1 33 machine learning with ga optimized feature selection. in all cases, the swn approach has lower accuracy than two other mentioned approaches achieving its best accuracy of 56%, which impractical for real-time analysis. their developed model which incorporates ga results in reducing feature size by 36–43% in addition to 5% increased efficiency when compared to the ml approach due to reduced feature size. they have tested their proposed model using six different classifiers on different datasets, the classifiers are, namely, j48, nb, part, sequential minimal optimization, ib-k, and jrip. among all classifiers, nb classifier has shown the highest accuracy (about 80%) while using ga based feature selection on twitter and reviews dataset, on the other hand, ib-k outperformed other classifiers with accuracy 95% while applying on the geopolitical dataset. another evaluation is done for the scalability and usability of their proposed technique using execution time comparison. they found that the system showed a linear speedup with the increased dataset size. however, the technique consumed 60–70% of the aggregate execution time on customer reviews dataset, but it results in a speedup of modeling the classifiers up to 55% and remains linear, confirming that proposed algorithm is fast, accurate, and scales well as the dataset size grows. 1.2. problem statement thus, twitter is one of the richest sources of opinionated data; there is a big demand on analyzing twitters’ data nowadays for the process of decision making. however, these data are unstructured and contain a lot of irrelevant and redundant information that leads to high-dimensionality of feature space, consequently analyzing them properly and accurately by machine learning, and data mining techniques is a big challenge. high dimensionality degrades the performance of the learning process, reduces the accuracy of classification models, increases computational complexity of a model, and leads to model overfitting. to overcome this problem, a hybrid of two feature selection methods proposed to remove the redundant and irrelevant features to select the best feature subset for classification task automatically. in this work, we use chi-2 to calculate the correlation between attributes and the class labels. low correlation of a particular feature means that the feature is irrelevant to the class label and needs to be removed prior to classification. in this way features are reduced. but there is still the problem of redundant features. the redundant features were removed by applying bca, which uses an objective function to selects optimal feature subset from features were selected by chi-2. as result irrelevant and redundant features were removed, which lead to solve high dimensionality problem in model building, and eventually classification accuracy is improved. 2. methodology the main objective of this study is to select optimal or sub-optimal feature-subset to perform twitter sentiment analysis throughout utilizing filter-based method (chi-2) and hybrid filter + wrapper method, namely, chi-2 + bca to improve the accuracy of the classification model. 
the work was implemented on an acer vn7-591g series laptop with an intel(r) core(tm) i7-4710hq cpu at 2.50 ghz (8 cpus), 16 gb of ram, and windows 10 home 64-bit. fig. 1 depicts the flow diagram of the proposed system model, and the following subsections explain each step of developing the proposed model in detail.
2.1. data collection
the twitter-api with python code is used to automatically download the most recent tweets about the (amazon, trump, chelsea fc, and cr7) keywords, respectively; a lexicon rule-based method is utilized to assign positive and negative scores to each tweet, and the result is then persisted in a comma separated value (.csv) file. table 1 details each keyword dataset, and fig. 2 shows a sample of collected tweets.
2.2. pre-processing
text pre-processing is the first step in twitter sentiment analysis. the tweets go through pre-processing steps such as removing duplicate tweets, converting to lowercase, replacing emojis with their meaning, removing urls and usernames, expanding contractions such as (can't → cannot), replacing slang words such as (omg → oh my god), reducing repeated characters to only two characters, removing numbers, special characters, punctuation marks, and multiple spaces, tokenizing, removing stop-words, and lemmatizing, followed by removing tweets that become duplicates after pre-processing. finally, the cleaned tweets are persisted in another (.csv) file. table 2 shows the emojis and their meaning.
table 1: dataset size description
keyword      positive   negative   total # of tweets
amazon       432        791        1223
trump        1054       1200       2254
chelsea fc   2239       761        3000
cr7          2816       1184       4000
table 2: emojis and their meaning
emoji                     meaning
:), :-d, :-j, =p, :3      positive
:(, :|, :^), +o(, :=&     negative
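a minimal python sketch of the cleaning pipeline of section 2.2 is given below, assuming nltk with its punkt, stopwords, and wordnet resources installed. the emoji, slang, and contraction dictionaries shown are illustrative placeholders, not the full lists used by the authors.

```python
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# illustrative dictionaries; the paper's full emoji/slang/contraction tables are larger
EMOJIS = {":)": "positive", ":(": "negative"}
SLANG = {"omg": "oh my god"}
CONTRACTIONS = {"can't": "cannot"}

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def clean_tweet(text):
    text = text.lower()
    for emo, meaning in EMOJIS.items():                 # replace emojis with their meaning
        text = text.replace(emo, " " + meaning + " ")
    text = re.sub(r"http\S+|www\.\S+", " ", text)       # remove urls
    text = re.sub(r"@\w+", " ", text)                   # remove usernames
    for short, full in {**CONTRACTIONS, **SLANG}.items():
        text = text.replace(short, full)                # expand contractions and slang
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)          # squeeze repeated characters to two
    text = re.sub(r"[^a-z\s]", " ", text)               # drop numbers, punctuation, specials
    tokens = [t for t in word_tokenize(text) if t not in stop_words]
    return " ".join(lemmatizer.lemmatize(t) for t in tokens)
```

each raw tweet would be passed through clean_tweet() before the final duplicate removal and before feature extraction.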
2.3. feature extraction
feature extraction is the process of converting text data into a set of features, i.e., numerical representations of words or phrases. the performance of the machine learning process depends heavily on its features, so it is crucial to choose appropriate features for the classification model. on the other hand, applying different n-grams, which are different combinations of words within the document, gives different accuracy results; we used unigram, bigram, trigram, and the combination of all of them as the most commonly used ranges to see the impact of each on the classification results. the following are the two feature extraction methods used by the proposed model.
1) tf-idf: tf-idf stands for term frequency – inverse document frequency. it is a simple and effective metric that represents how "important" a word is to a document in the document set. it has many uses; one of its common uses is automated text analysis, and it is very useful for scoring words in machine learning algorithms for nlp. the tf-idf of a word in a document is calculated by multiplying two different metrics:
• tf: counts how many times a term occurs in a document. the reasoning is that words that occur frequently in a document are probably more important than words that occur rarely. the result is then normalized by dividing it by the number of words in the whole document; this normalization prevents a bias toward longer documents. tf(t) = (number of times term t appears in a document)/(total number of terms in the document), or, mathematically:
tf_td = n_td / Σ_k n_kd   (1)
where n_td is the number of times term t occurs in document d, and n_kd is the number of occurrences of any term k in document d.
• idf: measures how important a term is by taking the total number of documents in the corpus and dividing it by the number of documents in which the term appears. it is calculated by idf(t) = log(total number of documents/number of documents with term t in it), or, mathematically:
idf_t = log(|d| / |d_t|)   (2)
where |d| is the total number of documents in the corpus and |d_t| is the number of documents in which term t appears. hence, tf-idf is the multiplication of eq. (1) and eq. (2):
tf-idf_td = (n_td / Σ_k n_kd) × log(|d| / |d_t|)   (3)
2) bow: one of the simplest feature extraction models is the bag of words (bow). the name refers to the fact that it does not take the order of the words into account; one can imagine that every word is put into a bag. it simply counts the number of occurrences of each word within a document and keeps the result in a vector known as a count-vector.
fig. 1. flow diagram of proposed system model.
fig. 2. sample of collected tweets.
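both representations can be generated, for example, with scikit-learn's vectorizers, as sketched below; note that scikit-learn's tf-idf uses a smoothed idf and l2 normalization by default, so its values differ slightly from eq. (3). the example tweets are placeholders.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# cleaned tweets (placeholders)
tweets = ["great phone love it", "worst service ever", "love the new update"]

# bow counts (used with lr in this work) and tf-idf weights (used with svm),
# each built over one of the studied n-gram ranges: (1,1), (2,2), (3,3), or (1,3)
bow = CountVectorizer(ngram_range=(1, 1))
tfidf = TfidfVectorizer(ngram_range=(1, 3))

X_bow = bow.fit_transform(tweets)       # sparse count-vector matrix
X_tfidf = tfidf.fit_transform(tweets)   # sparse tf-idf matrix

print(X_bow.shape, X_tfidf.shape)
print(list(tfidf.get_feature_names_out())[:5])
```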
2.4. feature selection
the main goal of feature selection is to select an optimal group of features for the learning algorithm from the original feature set, retaining as many of the most useful features as possible and removing useless features that do not affect the classification result. in this way, feature selection reduces the high dimensionality of the data by eliminating irrelevant and redundant features; thus, it improves the model accuracy, reduces computation and training time, reduces storage requirements, and avoids overfitting. feature selection methods are mainly divided into three categories: filter methods, wrapper methods, and hybrid or embedded methods. in general, feature selection methods are composed of four main steps, namely, feature subset generation, subset evaluation, stopping criterion, and result validation; fig. 3 illustrates the basic steps of the feature selection process.
fig. 3. a general framework of feature selection.
in this work, we use the chi-2 filter-based method in conjunction with the bca wrapper-based method to form a hybrid feature subset selection technique that selects the best subset for our classification models. first, chi-2 is employed to remove irrelevant features, producing a reduced feature set; bca is then applied to select a further reduced, more optimal subset of features. the operation of both methods is described below.
2.5. chi-2
chi-2 is a filter-based feature selection method; it is used to score informative features and rank them so that irrelevant features with low ranks can be removed. in statistics, the chi-2 test is used to examine the independence of two events: two events x and y are assumed to be independent if
p(xy) = p(x)·p(y)   (4)
in text feature selection, these two events correspond to the occurrence of a particular term and a class, respectively. chi-2 can be computed using the following formula:
chi²(t, c) = Σ_{e_t ∈ {0,1}} Σ_{e_c ∈ {0,1}} (n_{e_t e_c} − e_{e_t e_c})² / e_{e_t e_c}   (5)
where n is the observed frequency and e is the expected frequency for term t and class c. chi-2 is a measure of how much the expected count e and the observed count n deviate from each other. a high value of chi-2 indicates that the hypothesis of independence is not correct: if the two events are dependent, the occurrence of the term makes the occurrence of the class more likely, and consequently the term is relevant as a feature. the chi-2 score of a term is calculated for individual classes. this score can be globalized over all classes in two ways: the first is to compute the weighted average score over all classes, and the second is to choose the maximum score among all classes. in this paper, the former approach is used to globalize the chi-2 value over all classes in the corpus:
chi²(t) = Σ_i p(c_i)·chi²(t, c_i)   (6)
where p(c_i) is the class probability and chi²(t, c_i) is the class-specific chi-2 score of term t.
2.6. bca
bca is a wrapper-based feature selection method introduced by zarshenas and suzuki [23] in 2016. the goal of the bca algorithm is to choose, from the available feature space, an optimal or sub-optimal subset of features that gives a machine-learning algorithm the highest possible performance on a specific task, such as classification. the bca algorithm iteratively adds features to, and removes features from, the selected subset based on the objective function values, starting from an empty subset. at each iteration, bca checks whether the presence of a particular feature in the given subset improves or degrades the classification performance. if a feature was accidentally included in, or removed from, the feature subset, the bca algorithm is capable of correcting the wrongly taken decisions in the subsequent scans, so as to approximate the optimal solution as closely as possible. compared with sequential feature selection (sfs) and sequential forward floating selection (sffs), two of the most popular wrapper-based fss techniques, and with the filter-wrapper incremental wrapper subset selection (iwssr), bca is more efficient in terms of both processing time and classification accuracy. we therefore consider this algorithm an effective feature selection method for classifying datasets with a high number of initial attributes. fig. 4 shows the operation of the algorithm.
2.7. hybrid feature selection method
filter methods are fast because they use mathematics and statistics for selecting features; the proposed model uses chi-2 as the filter-based method to remove irrelevant features, producing a reduced feature subset from the original feature set. wrapper-based methods are more accurate because they work as part of the classification algorithm to evaluate the usefulness of a particular feature, but they are computationally slow when applied to the original feature set. hence, the proposed model takes advantage of the characteristics of both: it first removes irrelevant features from the original feature set using chi-2 and then applies bca to the features selected by chi-2 to obtain a more optimal feature subset and enhance the classification accuracy.
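the following sketch shows one way the two stages could be wired together with scikit-learn. it is only an illustration of the description in sections 2.5-2.7, not the authors' exact implementation: the chi-2 cut-off k, the use of a linear svm with 5-fold cross-validated accuracy as the bca objective, and the simple one-feature-at-a-time scan are assumed choices.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

def chi2_filter(X, y, k=1000):
    """stage 1: keep the k features most dependent on the class label (section 2.5)."""
    selector = SelectKBest(chi2, k=min(k, X.shape[1])).fit(X, y)
    return selector.transform(X), selector

def bca_select(X, y, clf=None, max_scans=3, cv=5):
    """stage 2: binary coordinate ascent over the chi-2-reduced features (section 2.6).
    starting from an empty subset, each feature is toggled in turn and the change is
    kept only if the cross-validated accuracy of the classifier improves."""
    clf = clf if clf is not None else LinearSVC(max_iter=5000)
    mask = np.zeros(X.shape[1], dtype=bool)
    best_score = 0.0
    for _ in range(max_scans):
        improved = False
        for j in range(X.shape[1]):
            mask[j] = not mask[j]                       # tentatively flip feature j
            idx = np.flatnonzero(mask)
            score = (cross_val_score(clf, X[:, idx], y, cv=cv).mean()
                     if idx.size else 0.0)
            if score > best_score:
                best_score, improved = score, True      # keep the flip
            else:
                mask[j] = not mask[j]                   # revert the flip
        if not improved:
            break
    return mask, best_score

# usage sketch: X_tfidf and labels come from the earlier steps
# X_red, selector = chi2_filter(X_tfidf, labels, k=1000)
# mask, acc = bca_select(X_red, labels)
```

in practice the repeated cross-validation over thousands of candidate features is expensive, which is exactly why the cheap chi-2 pre-filtering stage matters.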
2.8. classification algorithms
in sentiment analysis, classification essentially means categorizing data into different classes, based on some calculation, to determine the sentiment of the text. in our study, we applied two machine learning algorithms, namely, linear svm (lsvm) and lr, for binary classification (positive and negative) of twitter data. svm is a non-probabilistic machine learning algorithm; it is primarily used for classification and can be fine-tuned for regression. the aim of svm is to find the optimal decision boundary between classes by transforming the data with the help of mathematical functions called kernels; the best decision boundary found is called a hyperplane. with linearly separable data, a linear kernel is used, and since our problem has only two class labels (positive and negative), we perform the classification with a linear svm. lr is a statistical machine learning algorithm for predicting classes of a dichotomous nature, i.e., having just two possible classes (binary). the term logistic refers to the logistic (sigmoid) function, a probabilistic function that returns values only in [0, 1].
3. results and discussion
based on the results attained from the two classifiers, tf-idf with svm and bow with lr, with different n-grams, five-fold cross-validation, and the chi-2 and hybrid chi-2 + bca feature selection methods, we achieved the accuracy levels illustrated in figs. 5-12.
fig. 4. binary coordinate ascent.
fig. 5. accuracies of amazon dataset.
the graph shows that the lr classifier attains its best result in the unigram range, followed by 1 + 2 + 3-g, bigram, and trigram, respectively. however, after applying chi-2, the unigram and 1 + 2 + 3-g accuracies increased dramatically by more than 5%, followed by bigram and trigram with a slight increase. finally, applying bca achieves a further dramatic accuracy increase of approximately 5% with unigram, followed by the other three ranges, bigram, trigram, and 1 + 2 + 3-g, with a slight increase.
finally, applying bca achieves a further accuracy increase of approximately 1.5% with unigram, bigram, and 1 + 2 + 3-g, and a small change can be observed with trigram.

all the bar charts illustrate that the accuracy attained from the hybrid chi-2 + bca outperforms the accuracy of lr alone and of lr applied only with the features selected by chi-2, in all n-gram ranges. moreover, unigram and 1 + 2 + 3-g achieve higher results than bigram and trigram with chi-2 + bca feature selection. the results also show that with the growth of the datasets, the accuracy of the classifier increases.

fig. 9. accuracies of amazon dataset.
fig. 10. accuracies of trump dataset.
fig. 11. accuracies of chelsea fc dataset.
fig. 12. accuracies of cr7 dataset.

fig. 9 shows that the svm classifier attains its best result in the unigram range, followed by 1 + 2 + 3-g, bigram, and trigram, respectively. after applying chi-2, the unigram and 1 + 2 + 3-g accuracy increased dramatically by more than 3% and 5%, respectively, followed by bigram and trigram with a slight increase. finally, applying bca achieves a further accuracy increase of approximately 5% with unigram and 1 + 2 + 3-g, followed by bigram and trigram with a slight increase.

fig. 10 shows that the svm classifier attains its best result with unigram and 1 + 2 + 3-g, followed by bigram and trigram, respectively. applying chi-2, the unigram, bigram, and 1 + 2 + 3-g accuracy increased dramatically by more than 10%, 8%, and 7%, respectively, followed by trigram with a slight increase. finally, applying bca to the features selected by chi-2 achieves a further accuracy increase of approximately 3% with unigram and more than 1% with 1 + 2 + 3-g, followed by trigram with a slight increase.

fig. 11 shows that the svm classifier achieves its best result in the unigram range, followed by 1 + 2 + 3-g, bigram, and trigram, respectively. after applying chi-2, the unigram accuracy increased dramatically by more than 3%, followed by 1 + 2 + 3-g, bigram, and trigram with a slight increase. finally, applying bca to the features selected by chi-2 achieves a further accuracy increase of approximately 3% with unigram, followed by 1 + 2 + 3-g, bigram, and trigram with a slight increase.

fig. 12 shows that the svm classifier attains its best result in the unigram range, followed by 1 + 2 + 3-g, bigram, and trigram, respectively. after applying chi-2, the unigram and 1 + 2 + 3-g accuracy increased by more than 2% and 1%, respectively, followed by bigram with a slight change, while the trigram accuracy remained the same. finally, applying bca achieves a further accuracy increase of approximately 4% with bigram, 1% with unigram, and more than 2% with 1 + 2 + 3-g, followed by trigram with a slight increase.

all the bar charts illustrate that the accuracy achieved from the hybrid chi-2 + bca outperforms the accuracy of svm alone and of svm applied only with the features selected by chi-2, in all n-gram ranges. moreover, unigram and 1 + 2 + 3-g achieve higher results than bigram and trigram in most cases of chi-2 + bca feature selection. the results also show that with the growth of the datasets, the accuracy of the classifier increases, the same as with lr.

4. conclusion
in the context of our work, we developed a sentiment classification model for classifying tweets into positive and negative based on the sentiment of the author.
as the amount of data becomes huge, the task of classifying it becomes more challenging, and the need to reduce the number of features arises in order to improve classification accuracy. we proposed a hybrid feature selection method by incorporating a filter-based method, chi-2, followed by a wrapper-based method, bca, for reducing the number of irrelevant features and selecting optimal or sub-optimal features, respectively, from the features generated by bow and tf-idf, each of which is used with a different classifier. after training our model with different n-gram ranges and five-fold cross-validation, we conclude that applying our proposed hybrid feature selection method (chi-2 + bca) reduces the number of features and improves classification performance in terms of accuracy by up to 11.847% compared to using the original feature set with linear svm and by 10.882% with the lr classifier, both in the unigram range. moreover, the maximum improvement of chi-2 + bca over using only chi-2 was 4.915% and 4.826% for lr and svm, respectively.

4.1. future work
using the same system with a greater number of tweets to inspect the effectiveness of bca with the growth of the dataset. using bca as a feature subset selection algorithm with deep learning algorithms such as lstm and rnn. applying bca to other feature generation techniques such as word2vec or doc2vec. hybridizing bca with other filter methods.

references
[1] h. p. patil and m. atique. "sentiment analysis for social media: a survey". 2015 ieee 2nd international conference information science secur, 2016.
[2] m. k. das, b. padhy and b. k. mishra. "opinion mining and sentiment classification: a review". proceeding international conference inventory system control, pp. 4-6.
[3] a. s. al shammari. "real-time twitter sentiment analysis using 3-way classifier". 21st saudi computer society national computer conference, pp. 1-3, 2018.
[4] r. d. desai. "sentiment analysis of twitter data". proceeding 2nd international conference intelligence computing control system, no. iciccs, pp. 114-117, 2019.
[5] p. m. mathapati, a. s. shahapurkar and k. d. hanabaratti. "sentiment analysis using naïve bayes algorithm". international journal of computational science and engineering, vol. 5, no. 7, pp. 75-77, 2017.
[6] n. krishnaveni and v. radha. "feature selection algorithms for data mining classification: a survey". indian journal of science and technology, vol. 12, no. 6, pp. 1-11, 2019.
[7] y. zhai, w. song, x. liu, l. liu and x. zhao. "a chi-square statistics based feature selection". 2018 ieee 9th international conference software engineering services science, pp. 160-163, 2018.
[8] i. kurniawati and h. f. pardede. "hybrid method of information gain and particle swarm optimization for selection of features of svm-based sentiment analysis". 2018 international conference information technology system innovation, pp. 1-5, 2019.
[9] s. kaur, g. sikka and l. k. awasthi. "sentiment analysis approach based on n-gram and knn classifier". icsccc 2018 1st international conference security cyber computer communication, pp. 13-16, 2019.
[10] x. zhang and x. zheng. "comparison of text sentiment analysis based on machine learning". proceeding 15th international symposium parallel distributed computing ispdc 2016, pp. 230-233, 2017.
[11] r. joshi and r. tekchandani. "comparative analysis of twitter data using supervised classifiers".
proceeding international conference invention computer technology icict 2016, vol. 2016, 2016. [12] m. luo and l. luo. “feature selection for text classification using or+svm-rfe”. 2010 chinese control decision conference ccdc 2010, pp. 1648-1652, 2010. [13] r. maipradit, h. hata and k. matsumoto. “sentiment classification using n-gram idf and automated machine learning”. ieee software, vol. 7459, pp. 10-13, 2019. [14] s. rai, s. m. shetty and p. rai. “sentiment analysis of movie reviews using machine learning classifiers”. international journal of computer applications, vol. 182, no. 50, pp. 25-28, 2019. [15] s. naz, a. sharan and n. malik. “sentiment classification on twitter data using support vector machine”. proceeding 2018 ieee/wic/ acm international confernce web intell. wi 2018, pp. 676-679, 2019. [16] r. wagh and p. punde. “survey on sentiment analysis using twitter dataset”. proceeding 2nd international conference electronic communications aerospace technology iceca 2018, no. iceca, pp. 208-211, 2018. [17] n. iqbal, a. m. chowdhury and t. ahsan. “enhancing the performance of sentiment analysis by using different feature combinations”. international conference compututing communication ic4me2 2018, pp. 1-4, 2018. [18] a. rane and a. kumar. “sentiment classification system of twitter data for us airline service analysis”. proceeding international computing software appl conference, vol. 1, pp. 769-773, 2018. [19] a. jovi, k. brki and n. bogunovi. “a review of feature selection methods with applications”. 2015 38th international convention on information and communication technology, electronics and microelectronics, pp. 25-29, 2015. [20] s. rana and a. singh. “comparative analysis of sentiment orientation using svm and naive bayes techniques”. proceeding 2016 2nd interenational confernce next general computer technologies, 2016, pp. 106-111, 2017. [21] k. l. s. kumar, j. desai and j. majumdar. “opinion mining and sentiment analysis on online customer review”. 2016 ieee interenatioanl conference computing intelligence computing research iccic 2016, 2017. [22] f. iqbal, j. maqbool, b. c. m. fung, r. batool, a. m. khaytak, s. aleem and p. c. k. hung. “a hybrid framework for sentiment analysis using genetic algorithm based feature reduction”. ieee access, vol. 7, pp. 14637-14652, 2019. [23] a. zarshenas and k. suzuki. “binary coordinate ascent: an efficient optimization technique for feature subset selection for machine learning”. knowledge-based systems, vol. 110, pp. 191201, 2016. tx_1~abs:at/tx_2:abs~at 48 uhd journal of science and technology | jan 2021 | vol 5 | issue 1 1. introduction water is crucial for urban sustainability and in maintaining the sustainability of the environment [1], [2]. the extreme urbanization, industrial development, and agricultural expansion lead to increase demand of water in many parts of the world [3], [4]. urban area development continuously reduces the groundwater recharging areas and increases depletion of groundwater [5]. rainwater harvesting (rwh) is the collection and concentration of rainwater and runoff from catchment areas such as roofs or other urban structure and can be used for irrigation, industry, domestic, and for groundwater recharge purposes [6], [7], [8], [9]. 
rwh techniques have been used throughout time for irrigation purposes by the ancient iraqi people since around 4500 bc [10], and rwh is an environmentally sound decision to address issues brought about by large projects utilizing centralized water resources management approaches [11]. many previous studies mentioned the successful use of rwh as an effective alternative water supply solution [12], [13]. patra and gautam [4] conducted a study to assess the runoff coefficient (rc) method for rwh in dhanbad city in india. the runoff results indicated that the rwh system is an economic option in areas where rainfall is adequate and could supply part of the water demand of the city. zakaria et al., 2013 [14], used macro rwh at koysinjaq (koya), in the kurdistan region, based on the soil conservation service curve number (scs-cn) method.

urban rainwater harvesting assessment in sulaimani heights district, sulaimani city, krg, iraq
kani namiq gharib, nawbahar faraj mustafa, haveen muhammed rashid
department of water resources, college of engineering, university of sulaimani, krg, iraq
a b s t r a c t
rainwater harvesting is the collection of rainwater and runoff from catchment areas such as roofs or other urban surfaces. collected water has productive end-uses such as irrigation, industry, and domestic use, and can recharge groundwater. sulaimani heights has been selected as the study area, which is located in sulaimani governorate in the kurdistan region, north iraq. the main objective of this study was to estimate the amount of harvested rainwater from the sulaimani heights urban area in sulaimani city. three methods for runoff calculation have been compared, the storm water management model (swmm), the soil conservation service (scs) method, and the runoff coefficient (rc), using daily rainfall data from 1991 to 2019. the annual harvested runoff results with the three different methods swmm, scs, and rc were estimated as 836,470 m3, 508,454 m3, and 737,381 m3, respectively. the results showed that the swmm method has the highest runoff result and could meet 31% of the total demand of the study area, and 28% and 19% for the rc and scs methods, respectively.
index terms: rainwater harvesting, storm water management model, soil conservation service, runoff coefficient, runoff, sulaimani heights
corresponding author's e-mail: nawbahar faraj mustafa, department of water resources, college of engineering, university of sulaimani, krg, iraq. e-mail: nawbahar.mustafa@univsul.edu.iq
received: 22-10-2020 accepted: 25-04-2021 published: 27-04-2021
access this article online doi: 10.21928/uhdjst.v5n1y2021.pp48-55 e-issn: 2521-4217 p-issn: 2521-4209
copyright © 2021 rashid. this is an open access article distributed under the creative commons attribution non-commercial no derivatives license 4.0 (cc by-nc-nd 4.0)
o r i g i n a l re se a rc h a rt i c l e uhd journal of science and technology

the findings demonstrated that the macro-rwh method can be a new source of water to reduce the problem of water scarcity and to minimize the water shortage problem. in a research conducted by harb [9], different rwh techniques were evaluated to identify the most significant method for the metu-ncc campus in the west of north cyprus.
the runoff from roofs and from pervious and impervious areas was collected and utilized, applying two approaches, the traditional scs method and the storm water management model (swmm), for the calculation of runoff volume, and the findings showed that the harvested water could meet 41.2% of the campus irrigation demand. in 2016, a paper was published by gnecco et al. [15] in which swmm was used to investigate the effect of domestic rwh and of the storage unit on control efficiency. the study area was located in a neighborhood in albaro in italy, which covers 6000 m2. the survey of the land use data shows that 57% of the land cover was impervious surfaces and 33% rooftops of the total area. the findings pointed out that rwh can be applied in urban water management and in methods for the assessment and optimization of runoff storage and use as potable water.

2. study area
the sulaimani heights district is located in sulaimani governorate in the kurdistan region, north iraq. the latitudes are between 35°35'55" and 35°36'51" n and the longitudes are between 44°26'25" and 45°27'35" e. the area has a topography with elevations ranging from 950 m to 1113 m. sulaimani has a mean annual rainfall of 715 mm and a mean daily temperature of 19°c [16]. sulaimani heights spreads over an area of 2.12 km2 and contains 2899 units of various sizes. the study area consists of three subcatchments, as shown in fig. 1, and the detailed information about each subcatchment is shown in table 1. according to the map from the sulaimani heights authority (qaiwan company), the area is divided into five zones; the green areas cover 17.14% and the water pools cover 1.1% of the total area, as shown in fig. 2.

fig. 1. location of sulaimani heights on sulaimani map.
fig. 2. land use of sulaimani heights.

table 1: detail information of the study area
parameter – value
elevation (m) – 915–1113
area (km2) – 2.12
zone no. – 5
subcatchment no. – 3
residential no. – 2899
mean annual rainfall (mm) – 715
mean daily temperature (°c) – 19

3. materials and methods
3.1. data sets and data collection
3.1.1. climatology data
the daily precipitation data for sulaimani city from 1991 to 2019 were obtained from the directorate of meteorology and seismology of sulaimani (domsos). as there is no rain gauge station in the studied basins, the closest meteorological station had to be used; the sulaimani rain gauge in ibrahim pasha street is only 4 km away from the studied area, which is an acceptable distance. daily rainfall data were used to represent the basin rainfall for the study area [17]. other climatology data which have an effect on the volume of the runoff should also be considered [9]; the monthly average wind speed and pan evaporation data from 1991 to 2019 were obtained from the directorate of meteorology and seismology of sulaimani.

3.1.2. soil classification
to find the common soil characteristics of the study area, the harmonized world soil database viewer (hwsd) version 1.21 and a soil map of the area were used. the software is developed through the cooperation of the food and agriculture organization of the united nations (fao), the chinese academy of sciences (cas), the international institute for applied systems analysis (iiasa), the international soil reference and information centre (isric), and the joint research centre of the european commission (jrc).
the coordinates of the study area were located on the hwsd viewer software, and the dominant soil group was found to be chromic vertisols, with 100% light clay as the most prominent soil texture. therefore, the dominant soil texture is clay, and hence the area satisfies hydrologic soil group d [18].

3.2. swmm
swmm is widely utilized software throughout the world for assessing urban runoff quantity and quality [19]. swmm is a rainfall-runoff simulation model developed by the us environmental protection agency to assist and support local storm water management in minimizing runoff discharges. swmm can forecast a single event or a long-term (continuous) simulation set of model output parameters and inputs of runoff quantity and quality, primarily from urban areas [20], [21], [22], [23], [24]. in accordance with the subcatchment properties, the average monthly surface runoff can be calculated through the swmm software; to estimate monthly results, the dates of simulation should be set from the options tab, and the software run gives the runoff depth, infiltration depth, and runoff volumes in the form of a table. swmm uses the manning equation to express the relationship between flow rate (q), cross-sectional area (a), hydraulic radius (r), and slope (s) in all conduits [21], [25]. for standard s.i. units:

$$Q=\frac{1}{n}A\,R^{2/3}S^{1/2} \qquad (1)$$

where n is the manning roughness coefficient. the slope s stands for either the conduit slope or the friction slope (i.e., head loss per unit length), depending on the flow routing method used. r is the hydraulic radius, which is the ratio of the flow area to the wetted perimeter of the conduit or the channel.

3.3. traditional scs method
the scs method is another suitable method for this case, as it includes all types of abstractions in the runoff calculation and the parameters needed for runoff estimation. the runoff volumes will be estimated based on the scs-cn method. the cn method is widely used for estimating the direct runoff volume for a particular rainfall event [26], [27]. for the scs, 1972 (scs-cn) method, cn(i) stands for the dry condition, cn(iii) stands for the wet condition, and the tabulated cn is equal to cn(ii) for normal (average) conditions; it can be modified for dry and wet conditions, as explained by chow et al. [28], through the following equations 2 and 3 [29]:

$$CN(I)=\frac{4.2\,CN(II)}{10-0.058\,CN(II)} \qquad (2)$$

$$CN(III)=\frac{23\,CN(II)}{10+0.13\,CN(II)} \qquad (3)$$

the expression used in the scs method for estimating runoff can be calculated through equation 4 [18]:

$$Q=\frac{(P-I_{a})^{2}}{(P-I_{a})+S} \qquad (4)$$

where q is the accumulated storm runoff in (mm); p is the accumulated storm rainfall in (mm); s is the potential maximum retention of water by the soil, which can be quantified through equation 5; and i_a is the initial abstraction (interception, infiltration, and depression storage).

$$S=\frac{25400}{CN}-254 \qquad (5)$$

since the data needed to calculate the runoff volume are present, the scs method is also used to compute the runoff volume. the runoff depth from the subcatchments is calculated using the cn and the rainfall depth. after the runoff depth is calculated, the volume of runoff from each subcatchment is computed by multiplying by the area of each subcatchment, as shown in equation 6.

$$V=R \times A \qquad (6)$$

where v is the volume of runoff (m3), r is the rainfall-runoff depth (m), and a is the area of the subcatchment (m2).
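to make equations 1-6 concrete, the following is a minimal python sketch (our own illustration, not part of the study); the cn(ii) value, the storm depth, and the i_a = 0.2·s relation are assumptions for demonstration only, the last being the common scs default rather than a value stated in the paper.

```python
# minimal sketch of the manning relation (equation 1) and the scs-cn runoff
# calculation (equations 2-6); all inputs below are assumed example values.

def manning_flow(n, area_m2, hydraulic_radius_m, slope):
    """equation 1: q = (1/n) * a * r^(2/3) * s^(1/2), in si units (m3/s)."""
    return (1.0 / n) * area_m2 * hydraulic_radius_m ** (2.0 / 3.0) * slope ** 0.5

def cn_dry(cn2):
    """equation 2: tabulated cn(ii) converted to the dry condition cn(i)."""
    return 4.2 * cn2 / (10 - 0.058 * cn2)

def cn_wet(cn2):
    """equation 3: tabulated cn(ii) converted to the wet condition cn(iii)."""
    return 23 * cn2 / (10 + 0.13 * cn2)

def scs_runoff_depth(p_mm, cn, ia_ratio=0.2):
    """equations 4 and 5: runoff depth q (mm) for a storm depth p (mm);
    ia_ratio = 0.2 is the common scs assumption for the initial abstraction."""
    s = 25400.0 / cn - 254.0          # potential maximum retention (mm)
    ia = ia_ratio * s                 # initial abstraction (mm)
    if p_mm <= ia:                    # rainfall fully abstracted, no runoff
        return 0.0
    return (p_mm - ia) ** 2 / ((p_mm - ia) + s)

def runoff_volume_m3(runoff_mm, area_m2):
    """equation 6: v = r * a, with the runoff depth converted from mm to m."""
    return (runoff_mm / 1000.0) * area_m2

# hypothetical example: a 40 mm storm on a 2.12 km2 clay (group d) catchment
q_mm = scs_runoff_depth(40.0, cn_wet(85))     # cn(ii) = 85 is an assumed value
print(round(runoff_volume_m3(q_mm, 2.12e6)), "m3 of runoff")
```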
3.4. assessment of rc
the rc for any catchment is the ratio of the volume of water that runs off a surface to the volume of rainfall that falls on the surface [30]. the rc takes into account any losses due to evaporation, leakage, surface material texture, overflow, transportation, and inefficiencies in the collection process [17]. the rwh potential, or the volume of water received from a given catchment, can be obtained using the following equation 7 [17]:

$$V_{r}=R \times A_{c} \times R_{c} \qquad (7)$$

where v_r is the monthly volume of rainwater, r is the average monthly rainfall depth, a_c is the area of the catchment, and r_c is the runoff coefficient. to calculate the monthly runoff produced for each subcatchment, the rcs (equation 7) are used. the average r_c for the different types of areas was selected [31]: for the areas of constructed concrete and asphalt, the r_c was selected as 0.65, 0.075 for green areas, and 0.9 for water bodies. the flowchart in fig. 3 shows the steps followed for the calculation of runoff using the three mentioned methods.

fig. 3. flowchart summary for runoff calculation methods.

3.5. water demand
to determine the domestic water demand for indoor and outdoor household purposes, the standard average daily water demand per capita (sulaimani water supply directorate), which is 250 l/capita/day, is used to calculate the average monthly demand for the study area [32]. harvested rainwater should be treated before being used for drinking purposes [33]. in the study area, there are 2899 residences, and an average of 5 members per household is counted to estimate the total water demand; from these, the total population of the study area and the total daily water demand for sulaimani heights are calculated. using the map of the study area, the areas for each type of vegetation group are calculated in autocad, and the irrigation months are selected together with the amount of water for each m2 of vegetation area. thus, the monthly demands for the study area are calculated by multiplying the number of days of the month by the total daily demand. the results of the water demands are shown in tables 2-5.

table 2: total daily domestic water demand in sulaimani heights
sub. no. | no. of residences | population | water demand (m3/day)
1 | 2541 | 12,705 | 3176
2 | 358 | 1790 | 448
3 | 0 | 0 | 0
total | 2899 | 14,495 | 3624

table 3: vegetation area and type of each subcatchment
subcatchment | green area (m2) | ground cover area (m2) | trees and bushes area (m2)
sub-1 | 447,182 | 302,080 | 145,102
sub-2 | 50,087 | 42,070 | 8017
sub-3 | 127,650 | 108,400 | 19,250
total area (m2) | 624,919 | 452,550 | 172,369

4. results and discussion
the results from the three models are shown in table 6, which shows that the swmm method has the largest annual runoff volume of 836,470 m3, the rc method gives 737,381 m3, and the scs method gives 508,454 m3 for the average annual rainfall of 719 mm. table 7 and table 8 present the monthly and annual water demand, respectively, with the corresponding percent demand met. the results showed that the swmm method has the highest runoff result and could meet 31% of the total demand of the study area, and 28% and 19% for the rc and scs methods, respectively.
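as a small illustration (not part of the paper), the following python sketch applies equation 7 per land-cover type and reports the share of demand met; the rainfall value and the january demand figure are taken from the numbers quoted in this article, while the land-cover split of the 2.12 km2 area (17.14% green, 1.1% water, remainder treated as impervious) is our own simplifying assumption.

```python
# minimal sketch of the rc method (equation 7) and the percent-of-demand-met
# comparison; all area and rainfall inputs are illustrative assumptions.

RC = {"impervious": 0.65, "green": 0.075, "water": 0.9}   # runoff coefficients [31]

def monthly_rc_volume(rain_mm, areas_m2):
    """equation 7: v_r = r * a_c * r_c, summed over the land-cover types (m3)."""
    rain_m = rain_mm / 1000.0
    return sum(rain_m * area * RC[cover] for cover, area in areas_m2.items())

def percent_demand_met(runoff_m3, demand_m3):
    """share of the monthly water demand that the harvested runoff could cover."""
    return min(100.0, 100.0 * runoff_m3 / demand_m3)

# assumed split of the 2.12 km2 study area by land cover (m2)
areas = {"impervious": 1.73e6, "green": 0.363e6, "water": 0.023e6}
v = monthly_rc_volume(119.43, areas)            # january average rainfall (mm)
print(round(v), "m3 harvested,", round(percent_demand_met(v, 112344), 1), "% of demand met")
```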
comparison between respective runoff results clearly demonstrates that the runoff results are influencing by the serial urbanization [34]. gharib, et al.: urban rainwater harvesting uhd journal of science and technology | jan 2021 | vol 5 | issue 1 53 table 4: total monthly irrigation water demand in sulaimani heights crop type month of irrigation irrigation period (day/month) required water per (m2) (l/day) water demand (m2/month) sub-1 sub-2 sub-3 ground cover may 15 12 54,374.40 7572.60 19,512.00 june 30 16 144,998.40 20,193.60 52,032.00 july 31 16 149,831.68 20,866.72 53,766.40 august 31 16 149,831.68 20,866.72 53,766.40 september 30 16 144,998.40 20,193.60 52,032.00 october 15 12 54,374.40 7572.60 19,512.00 trees and bushes may 15 8 17,412.24 962.04 2310.00 june 30 12 52,236.72 2886.12 6930.00 july 31 12 53,977.94 2982.32 7161.00 august 31 12 53,977.94 2982.32 7161.00 september 30 12 52,236.72 2886.12 6930.00 october 15 8 17,412.24 962.04 2310.00 table 5: total demand in the three subcatchments month no. of days water demand (m2/month) total water demand (m3/month) sub-1 sub-2 sub-3 january 31 98,456 13,888 0 112,344 february 28 88,928 12,544 0 101,472 march 31 98,456 13,888 0 112,344 april 30 95,280 13,440 0 108,720 may 31 170,242.6 22,422.64 21,822 214,487.28 june 30 292,515.1 36,519.72 58,962 387,996.84 july 31 302,265.6 37,737.04 60,927.4 400,930.06 august 31 302,265.6 37,737.04 60,927.4 400,930.06 september 30 292,515.1 36,519.72 58,962 387,996.84 october 31 170,242.6 22,422.64 21,822 214,487.28 november 30 95,280 13,440 0 108,720 december 31 98,456 13,888 0 112,344 total 365 2,104,903 274,447 283,423 2,662,772 table 6: the runoff volume results of the three methods month sum of average monthly rainfall (mm) volume of runoff by swmm (m3/month) volume of runoff by scs (m3/month) volume of runoff by rc (m3/month) january 119.43 150,040 76,912.01 122,350.48 february 116.84 149,720 90,680.95 119,697.14 march 105.11 124,320 69,667.31 107,680.31 april 96.53 111,190 53,480.01 98,890.49 may 41.84 33,620 25,415.88 42,863.13 june 0 0 0 0 july 0 0 0 0 august 0 0 0 0 september 0 0 0 0 october 44.43 41,920 53,829.71 45,516.47 november 81.84 87,510 58,195.91 83,841.27 december 113.76 138,150 80,271.92 116,541.83 total 719.78 836,470 508,453.71 737,381 swmm: storm water management model, scs: soil conservation service, rc: runoff coefficient from the methods discussed previously, it appears that the traditional scs method and assessment of rc are respectable to be a combined with more losses method since the initial abstraction includes infiltration, evaporation, interception, and surface texture caused by these processes are calculated simultaneously [9], [35]. 
gharib, et al.: urban rainwater harvesting 54 uhd journal of science and technology | jan 2021 | vol 5 | issue 1 table 7: monthly water demand and corresponding percent demand met month percent of water demand met using swmm (%) percent of water demand met using scs (%) percent of water demand met sing rc (%) sub-1 sub-2 sub-3 sub-1 sub-2 sub-3 sub-3 sub-2 sub-3 october 18 44 5 23 49 16 20 50 2 november 68 100 100 42 100 100 65 100 100 december 100 100 100 57 100 100 88 100 100 january 100 100 100 54 100 100 92 100 100 february 100 100 100 71 100 100 100 100 100 march 94 100 100 49 100 100 81 100 100 april 86 100 100 39 94 100 77 100 100 may 15 36 4 10 27 8 19 47 2 june 0 0 0 0 0 0 0 0 0 july 0 0 0 0 0 0 0 0 0 august 0 0 0 0 0 0 0 0 0 september 0 0 0 0 0 0 0 0 0 swmm: storm water management model, scs: soil conservation service, rc: runoff coefficient table 8: annual water demand and corresponding percent demand met method swmm scs rc runoff (m3) 836,470 508,454 737,381 annual demand (m3) 2,662,772 2,662,772 2,662,772 annual demand met % 31 19 28 swmm: storm water management model, scs: soil conservation service, rc: runoff coefficient in addition, in the scs and rc methods, the infiltration in the initial abstraction does not change with rainfall events variation on a subcatchment, conversely, it would stay the same before and during the rainfall event [9], [35]. some parameters implied in the swmm model but not computed in the scs such as the depression storage, percent of impervious layer, the pervious roughness coefficient, and the soil drying time [9], [21], [24]. on the other hand, the swmm model has flexibility to route runoff and external inflows through the drainage systems, and the abstractions such as evaporation and infiltration vary with changing rainfall events [20]. due to these limitations, the swmm model is established in the prediction of comparable runoffs [36]. the swmm differs from the scs and rc approaches by that the swmm model can perform helpful and time saver tool in designing large catchments and swmm has better feasibility of determining peak flow and volume of runoff with in the nodes and pipes for designing urban drainage system [21], [24]. 5. conclusion this research studied the feasibility of applying rwh techniques as a water resource that should be associated into the management of urban areas. rwh for different types of catchments such as roofs, roads, and open areas has been founded. three approaches for runoff calculation were adopted, the swmm, the traditional scs method, and the rc. daily rainfall data from 1991 to 2019 were used to obtain the monthly and annual volume. moreover, to demonstrate the potential rwh system, the annual demand for the study area was found and compared with the total annual runoff volume using three methods, however, harvested rainwater harvested should be treated before using for drinking purpose. for the estimated total yearly water demand in the study area of demand in the study area of 2,662,772 m3, the annual runoff results with the methods swmm, scs, and rc were estimated of 836,470 m3, 508,454 m3 and 737,381 m3 respectively. the final results showed that swmm method has the highest runoff result and could meet 31% of the total demand of the study area and 28% and 19% for rc and scs methods, respectively. 6. acknowledgment i truly want to thank qaiwan group for assistance in complimenting this study and my family for their support and guidance. 
gharib, et al.: urban rainwater harvesting uhd journal of science and technology | jan 2021 | vol 5 | issue 1 55 references [1] a. daoud, k. swaileh, r. m. hussein and m. matani. “quality assessment of roof-harvested rainwater in the west bank, palestinian authority”. journal of water and health, vol. 9, pp. 525-533, 2011. [2] t. m. pinzón. “modelling and sustainable management of rainwater harvesting in urban systems”. theses, 2013. [3] u. wwdr. water and energy, the united nations world water development report 2014 (2 volumes). un world water assessment programme, unesco, paris. available from: https://sustainabledevelopment.un.org/content/ documents/1714water%20development%20report%202014.pdf. [last accessed on 2019 mar 19]. [4] a. k. patra and s. gautam. “a pilot scheme for rooftop rainwater harvesting at centre of mining environment, dhanbad”. international journal of environmental sciences, vol. 1, pp. 1542-1548, 2011. [5] s. n. baby, c. arrowsmith and n. al-ansari. “application of gis for mapping rainwater-harvesting potential: case study wollert, victoria”. engineering, vol. 11, pp. 14-21, 2019. [6] k. subagyono and h. pawitan. “water harvesting techniques for sustainable water resources management in the catchment area”. in: proceedings of international workshop on integrated watershed management for sustainable water use in a humid tropical region, tsukuba, 2008, pp. 18-30. [7] d. prinz, t. oweis and a. hachum. “the concept, components, and methods of rainwater harvesting”. in: 2nd arab water forum living with water scarcity, cairo, 2011, pp. 1-25. [8] c. bari. “emerging practices from agricultural water management in africa and the near east”. thematic workshop, 2017. [9] r. harb. “assessing the potential of rainwater harvesting system at the middle east technical university-northern cyprus campus”. middle east technical university library. available from: http://www.etd.lib.metu.edu.tr/upload/12619225/index.pdf. [last accessed on 2016 nov 10]. [10] r. h. handbook. “assessment of best practises and experience in water harvesting”. african development bank, abidjan, 2001. [11] j. julius, r. a. prabhavathy and g. ravikumar. “rainwater harvesting (rwh)-a review”. international journal of innovative research and development, vol. 2, p. 925, 2013. [12] g. freni and l. liuzzo. “effectiveness of rainwater harvesting systems for flood reduction in residential urban areas”. water, vol. 11, p. 1389, 2019. [13] i. a. alwan, n. a. aziz and m. n. hamoodi. “potential water harvesting sites identification using spatial multi-criteria evaluation in maysan province, iraq”. isprs international journal of geoinformation, vol. 9, p. 235, 2020. [14] s. zakaria, n. al-ansari, y. mustafa, s. knutsson, p. ahmed and b. ghafour. “rainwater harvesting at koysinjaq (koya), kurdistan region, iraq”. journal of earth sciences and geotechnical engineering, vol. 3, pp. 25-46, 2013. [15] i. gnecco, a. palla and p. la barbera. “the role of domestic rainwater harvesting systems in storm water runoff mitigation”. eur water, vol. 58, pp. 497-503, 2017. [16] n. f. mustafa, h. m. rashid and h. m. ibrahim. “aridity index based on temperature and rainfall data for kurdistan regioniraq”. journal of duhok university, vol. 21, pp. 65-80, 2018. [17] j. worm. “ad43e rainwater harvesting for domestic use”. agromisa foundation, 2006. [18] z. ara and m. zakwan. “estimating runoff using scs curve number method”. international journal of emerging technology and advanced engineering, vol. 8, pp. 195-200, 2018. [19] j. gironás, l. 
a. roesner, l. a. rossman and j. davis. “a new applications manual for the storm water management model (swmm)”. environmental modelling and software, vol. 25, pp. 813-814, 2010. [20] w. r. c. james and l. a. rossman. “user’s guide to swmm 5”. computational hydraulics international, 2010. [21] l. a. rossman. “storm water management model user’s manual, version 5.0”. national risk management research laboratory, cincinnati, 2010. [22] j. nipper. “measurement and modeling of stormwater from small suburban watersheds in vermont”. theses, 2016. [23] s. agarwal and s. kumar. “applicability of swmm for semi urban catchment flood modeling using extreme rainfall events”. the international journal of recent technology and engineering, vol. 8, pp. 245-251, 2019. [24] m. waikar and u. namita. “urban flood modeling by using epa swmm 5”. srtm university’s research journal of science, vol. 1, p. 20, 2015. [25] h. tikkanen. “hydrological modeling of a large urban catchment using a stormwater management model (swmm), thesis, 2013. [26] r. h. hawkins, t. j. ward, d. e. woodward and j. a. van mullem. “curve number hydrology: state of the practice”. american society of civil engineers, 2008. [27] a. bansode and k. patil. “estimation of runoff by using scs curve number method and arc gis”. international journal of scientific and engineering research, vol. 5, pp. 1283-1287, 2014. [28] v. chow, d. maidment and w. l. mays. “applied hydrology”. macgraw-hill,” inc., new york, p. 149, 1988. [29] n. al-ansari, s. knutsson, s. zakaria and m. ezz-aldeen. “feasibility of using small dams in water harvesting, northern iraq”. in: icold congress 2015: international commission on large dams 15/06/2015-16/07/2015, 2015. [30] m. awawdeh, s. al-shraideh, k. al-qudah and r. jaradat. “rainwater harvesting assessment for a small size urban area in jordan”. international journal of water resources and environmental engineering, vol. 4, pp. 415-422, 2012. [31] g. dadhich and p. mathur. “a gis based analysis for rooftop rain water harvesting”. international journal of computer science and engineering technology, vol. 7, pp. 129-143, 2016. [32] k. n. sharief. “water supply management of the sulaymaniyah city”. unpublished phd thesis, university of duhok, p. 24, 2013. [33] c. a. novak, e. van giesen and k. m. debusk. “designing rainwater harvesting systems: integrating rainwater into building systems”. john wiley and sons, hoboken, new jersey, 2014. [34] r. n. eli and s. j. lamont. “curve numbers and urban runoff modeling application limitations”. in: low impact development 2010: redefining water in the city, pp. 405-418, 2010. [35] d. a. size. “computing stormwater runoff rates and volumes”. new jersey storm water best management practices, pp. 5-22, 2004. [36] g. r. ghimire, r. thakali, a. kalra and s. ahmad. “role of low impact development in the attenuation of flood flows in urban areas”. world environmental and water resources congress, pp. 339-349, 2016. tx_1~abs:at/tx_2:abs~at 132 uhd journal of science and technology | jul 2020 | vol 4 | issue 2 1. introduction supply chains can be defined as a series of interconnected activities between manufacturers and customers, which include the coordination, planning, and controlling of products and services [1]. the classical supply chain has many issues such as need for paperwork; speed limits; and lack of traceability, security, and reliability [2]. the esc records all the activities and transactions of supply chains electronically [1], [2], [3]. 
in the logistics process in an esc, the products can be transported in a way that the parties (manufacture, supplier, distributor, wholesaler, retailer, and customer) of the network can ensure the quality of goods [1], [4], [5]. the blockchain (bc) is a decentralized peer-to-peer (p2p) network distributed ledger, with unlimited digital transactions across the network without the need of third-parties. this technology can be used in different areas such as cryptocurrency, since it has a useful number of features such as reliability and traceability [6], [7], [8]. bc has different types depending on the organization or the company that uses it or on the architecture of the network [6]. in bc technology, every single node or client has a copy of the ledger in the network after a consensus from all participants [2], [6]. bc has four basic features: decentralization, openness, security, and privacy [2], [4]. bc and distributed databases (dbs) are different in their structures. in addition, in bc systems the transactions are controlled and managed by all the participants while in database system dbs the transactions are managed by a single entity [6]. a traceable and reliable electronic supply chain system based on blockchain technology shaniar tahir mohammed*, jamal ali hussien department of computer, college of science, university of sulaimani, sulaymaniyah, iraq a b s t r a c t electronic supply chain (esc) is a network among the parties of a supply chain system, such as manufacturers, suppliers, and retailers. it recodes all the processes involved in the distribution of specific products until transported to final customers. blockchain (bc) technology is a decentralized network that records all the transactions in real-time and is used in many areas such as cryptocurrency. in this paper, we work on an esc system that records all the transactions based on bc technology using a drug supply chain system as a case study. the recording of the transactions consists of three main stages. first, all the parties of the esc system are represented in the bc network as clients with unique identities. second, all the information related to a specific drug is recorded inside the transaction and each transaction has its own signature. finally, all the transactions of the drug from the manufacture to the patient are recorded inside a block with a unique identity for each block. these steps inside the bc are performed based on security cryptography mechanisms, such as rivest-shamir-adleman (rsa) and secure hash algorithm sha. the results illustrate that the proposed approach protects the drugs from counterfeiting, ensures the reliability, and provides a real-time tracking system for the transactions that have occurred among esc parties. index terms: blockchain, block, electronic supply chain, monitoring, reliability, traceability corresponding author’s e-mail: shaniar tahir mohammed, department of computer, college of science, university of sulaimani, sulaymaniyah, iraq. e-mail: shaniar.mohammed@univsul.edu.iq received: 28-08-2020 accepted: 03-12-2020 published: 14-12-2020 access this article online doi: 10.21928/uhdjst.v4n2y2020.pp132-140 e-issn: 2521-4217 p-issn: 2521-4209 copyright © 2020 shaniar. 
this is an open access article distributed under the creative commons attribution non-commercial no derivatives license 4.0 (cc by-nc-nd 4.0) o r i g i n a l re se a rc h a rt i c l e uhd journal of science and technology shaniar: electronic supply chain with blockchain uhd journal of science and technology | jul 2020 | vol 4 | issue 2 133 many researchers have worked on the process of merging the electronic supply chain (esc) with bc using different approaches. however, either the researches stay theoretical and have not been implemented [4] or it does not fully benefit from all the security features of bc for improving reliability, traceability, and the monitoring process of the esc systems [9]. in this paper, bc technology is used to create a reliable architecture for the esc process and solves the problem of supply chain systems, such as reliability and traceability. regarding the drug safety issues, the aim of this research was to help authorities in iraqi kurdistan to monitor and trace drug transportation from the manufacturers all the way to the patients in a secure and reliable way. in particular, this model is helpful in reducing paperwork, monitoring drug flow between esc clients, protecting the drug from counterfeiting, tracing the transactions, reducing cost, and protecting the information flow between clients. the proposed model is based on three main parts of bc technology used in esc architecture: clients, transactions, and blocks. the clients represent the esc parties, such as suppliers and manufacturers. each client has a unique identity based on the security mechanism rsa-1024. the transactions hold the drug flow information in esc between all parties. the proposed model generates a signature for each transaction based on rsa-1024 algorithm, for protecting the esc from unknown transactions and also from modification. the blocks hold the transaction between the esc parties, each block holds five transactions from the manufacturer to the final destination, the patient. each block has a unique identity based on the sha-256 algorithm. the information flow of drug esc model is decentralized and shared between all the clients and recorded in real-time enabling tracking and monitoring of the activities by all the participants. 1.1. problem statement the classical supply chain has some problems such as depending on paperwork to exchange information; high cost; and lack of reliability, traceability, and trust. while in the esc, all the information and processes are captured and recorded in an electronic way, which is helpful in eliminating paperwork, reducing cost, and increasing reliability, since all processes can be monitored by all parties of the esc system. bc is a helpful technology that can be used with an esc to record all the information in a decentralized way. then, the processes of the ecs and can be monitored by all participants of the network without needing a third party. in addition, it increases reliability because in bc various security mechanisms are used when recording supply chain processes. furthermore, traceability can be improved based on bc technology, since the encryption of the information prevents unknown transactions and unknown clients protecting the data from modification. all the transactions that have occurred in the esc can be monitored in real-time. in iraqi kurdistan, there does not exist an electronic system for recording and monitoring drug information and the transactions between the parties of the supply chain system. 
therefore, it is difficult to establish reliability and provides an efficient traceability of drugs. for our case study, we work on the pharmaceutical system for drug transportation from manufacturers to final destination patients in iraqi kurdistan using esc with bc technology. the proposed system allows us to protect the drug from counterfeiting, enables traceability of drugs, and increase reliability between all parties in the esc system. the information flow between all parties is held in bc in a secure way without a third-party involvement; therefore, the authorities can easily monitor drug transportation and transactions. 2. background 2.1. the structure of esc systems the esc is a network between the producer of the specific product and the supplier to record the information about all the processes of product transportation and improving health and safety, speed, cost, scalability, and transparency while distributing to consumers [10], [11], [12]. in this network, there exist various entities, such as people, resources, and information that need to be recorded in a fast and secure way. the general structure of the esc includes the following participants: a) provider: it provides raw materials to the suppliers, such as the raw materials used in producing drugs, food ingredients, and automobile parts [13]. b) supplier: it is somebody who is responsible for the actions of producing the raw materials, like a farmer. c) manufacturer: the manufacturer performs various actions to produce specific products. d) distributor: the distributor is responsible for moving the product to the retailers. e) retailer: it is responsible for marketing the goods, such as local stores, supermarkets, pharmacies, or car shops [3], [14]. shaniar: electronic supply chain with blockchain 134 uhd journal of science and technology | jul 2020 | vol 4 | issue 2 f) consumer: the consumer buys a product and checked it to ensure originality. for instance, a patient who buys the drug in a pharmacy [13]. fig. 1 illustrates all the stages of the general esc process: 2.2. bc architecture bc technology was conceptualized for the first time by satoshi nakamoto in 2008 and used in various areas especially for cryptocurrencies like “bitcoin” [15], [16], [17]. this technology is a decentralized p2p mesh network of nodes linked to each other and contains blocks without the need to be managed by a third-party. several layers govern bc operations and generate the protocols for bc applications. bc generally has two main types: public bc and private bc. public bc is decentralized where parties can access the current and previous records. while private bc is a centralized network that is privately available for organizations that have a limited number of participants. private bc is less secure than public bc. fig. 2 shows an overall structure of the bc architecture [18]: the bc contains a sequence of techniques that are used to recording transactions in real-time between the parties of the network (sender and recipient). the information is held inside blocks and each block links to a previous block, as shown in fig. 2. the bc provides security features such as cryptographic hash, digital signature, and distributed consensus mechanism [13], [18], [19]. fig. 3 illustrates the general architecture of bc in the bitcoin process [15]. each transaction that occurs in a bc encapsulates the phases is shown in fig. 3. 
in the case of supply chain systems, all logistic processes from producing a specific good until delivering it to customers are recorded. the bitcoin cryptocurrency process consists of the following six stages; the same stages are used for our proposed system.
1. the transaction: the transaction holds information; for example, in an esc transaction it includes the sender and recipient identities, date and time, the quantity of the goods, the name of the product, the location, and so on.
2. cryptographic signature: the second stage of the bc architecture provides a secure algorithmic function to apply a cryptographic signature to both the sender and recipient identities using cryptographic algorithms, such as sha-256, sha-512, and rsa.
3. broadcasting the transaction: in this stage, the transaction must be broadcast to the network for the authentication process.
4. transaction verification: the transaction must be authenticated and verified by all the parties inside the decentralized network.
5. digital ledger: after notifying all the parties of the bc of the new transactions, they are added to the digital ledger and appended to the bc.
6. transaction completion: finally, the transaction is completed and the money or the information related to it is transferred to the recipient and added to the block. each block is linked to the previous one.

fig. 1. general structure of the electronic supply chain.
fig. 2. block structure in the blockchain technology.
fig. 3. cryptocurrency with blockchain architecture.

2.3. cryptographic algorithms
the secure hash algorithm sha is one of the cryptographic hash function families designed to hold data securely by transforming the data into a hash code, with variants such as sha-1, sha-224, sha-256, and sha-512, each of which has a different digest size [20]. the rsa cryptographic system is used for data encryption and decryption with two keys (a private key and a public key), where the private key is known only to the owner and the public key is publicly available and can be accessed by everyone [21], [22]. the key size of rsa should be 1024 bits or higher. the public key is used for encryption while the private key is used for decryption [21].

3. the running example
we described the supply chain process in an earlier section. here we show a small example of a supply chain process between two clients, the manufacturer and the supplier, for transporting a drug between them. table 1 shows the information related to one transaction that occurred between two clients in the supply chain. this transaction is taken from a large set of transactions from the pharmaceutical system in iraqi kurdistan. we use the information about a specific drug to demonstrate various transactions in our proposed esc with bc. in the upcoming sections, we show more transactions about a specific drug. the structure of the transactions contains all the information needed for the drug supply chain process, such as drug_id, drug name, sender_id, recipient_id, location, size, and date and time. in esc systems, the information can be stored electronically in a centralized system without real-time recording of the process, which is not secure. however, public bc systems with esc are secure since they are decentralized and there is no need for third parties to manage the process. public bc systems are reliable, and transactions can be traced at each stage of the esc in real-time.
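to make the transaction and signature stages concrete, here is a minimal python sketch (our own illustration under stated assumptions, not the system's actual code) that signs one drug transaction with an rsa-1024 key using the third-party cryptography package; the field values come from the running example in table 1, and the json serialization is our own choice.

```python
# minimal sketch of an esc transaction and its rsa-1024 signature
# (illustrative only; requires the third-party `cryptography` package).
import json
from cryptography.hazmat.primitives import hashes, serialization
from cryptography.hazmat.primitives.asymmetric import rsa, padding

# each client would hold its own key pair; the public key acts as its identity
sender_key = rsa.generate_private_key(public_exponent=65537, key_size=1024)
sender_id = sender_key.public_key().public_bytes(
    serialization.Encoding.DER, serialization.PublicFormat.SubjectPublicKeyInfo
).hex()

# transaction fields as described in the running example (values from table 1)
transaction = {
    "drug_id": "3622554532",
    "drug_name": "plavix",
    "sender": "sanofi (manufacturer)",
    "sender_id": sender_id,
    "recipient": "awafi (supplier)",
    "datetime": "october 10, 2019, 02:00:13",
}

# the sender signs the serialized transaction; any later change to the fields
# makes verification fail, which is how unknown or altered transactions are rejected
payload = json.dumps(transaction, sort_keys=True).encode()
signature = sender_key.sign(payload, padding.PKCS1v15(), hashes.SHA256())

# every participant can verify the signature with the sender's public key
sender_key.public_key().verify(signature, payload, padding.PKCS1v15(), hashes.SHA256())
print("transaction verified, signature:", signature.hex()[:32], "...")
```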
4. related work randhit kumar and rakish tripathi [9] proposed a traceability structure for supply chain based on bc technology to medicine system or drug manufacturing to protect the drug while transported from the manufacturer to the consumer, they used bc to encrypt the qr code of the drugs to protect the drug from counterfeiting. the methodology of this research provides a structure to protect the medicine supply chain from a man-in-the-middle attack. bc in various case studies and different sectors of industry such as the pharmaceutical supply chain is used by s. aich, s. chakraborty, m. sain, h. lee, and h. kim [13] to ensure the tractability of drugs and to improve the efficiency of supply chain network. they mentioned the problems occurred in different sectors of the conventional supply chain and find solutions. they compare the traditional supply chain to a digital supply chain based on bc and internet of things (iot) for pharmaceutical process to track and protect drugs from counterfeiting. iot was used to record digital identification for all products to provide trustworthiness in the digital supply chain system. all the participants of the system in the network can ensure the transparency of the information recorded in the bc. m. p. caro, m. s. ali, m. vecchio, and r. giaffreda [23] proposed a new agriculture food supply chain based bc architecture with iot to keep the system from data tampering. the integrity of the data and traceability of the process is table 1: a simple transaction information occurred between two clients in a supply chain system manufacturer name sanofi supplier name awafi drug name plavix drug id 3622554532 date and time october 10, 2019, 02:00:13 shaniar: electronic supply chain with blockchain 136 uhd journal of science and technology | jul 2020 | vol 4 | issue 2 conducted using various bc mechanisms such as ethereum and hyperledger sawtooth. a three-tier architecture based on bc was proposed by s. malik [10] to ensure data availability to consumers and to provide scalability for handling transaction loads while keeping the history and confidential information safe when delivering to other parties of a supply chain process. 5. bc -based esc system in this section, we propose an esc model, which enables traceability, reliability, monitoring of drug transactions in the iraqi kurdistan based on the bc technology. we worked on the pharmaceutical supply chain by recording logistic steps of the esc to transport a drug from the manufacturer to a patient using bc technologies. fig. 4 shows a block diagram of the proposed model. in our proposed system, transactions occur among supply chain parties in pharmaceutical process according to these instructions: 1. capturing the transaction n in the pharmaceutical supply chain system between client a and client b. 2. based on the rsa-1024 algorithm, generating a unique identity for each one of the five different types of clients. the identities are shared between all the participants of the bc. table 2 shows the client identities. 3. build a unique transaction signature using rsa-1024, which facilitates traceability and the identification of the drug transported between the clients. 4. each transaction holds the sender and recipient identities, item information, date and time, and other related information. 5. typically, each drug requires five transactions between the clients to be delivered from the manufacturer to the patient in the supply chain process. these transactions are added to a single block m. 
6. before adding them to the bc, a unique identity based on the sha-256 algorithm is generated for each block.
7. the block m is added to the bc, and each block is linked to the previous one. this makes the blocks easily traceable and reliable, and facilitates real-time monitoring.

table 2: client identity when applying the rsa-1024 algorithm in esc
client type | client name | client identity
manufacturer | sanofi | 30819f300d06092a864886f70d01010105000381…
supplier | awafi medicine store | 06c48043be02d767f023a62ff5b7ec7c0bc08e9bd…
wholesaler | kmca | c873e055bf23971a944882e78eb82ed0045af93c2…
retailer | shar | f3a6765fb8755bfc8543ef2d691d3799ec281cda96…
patient | ahmed | 0dfb14bfb66ada3c097353d2cc58608e8659cbbe9…

fig. 4. block diagram of the electronic supply chain process using blockchain technology.

in fig. 5, the main parts of the proposed esc system are illustrated. the system records any transaction that has occurred inside the esc using bc. the esc includes five types of clients: manufacturer, supplier, wholesaler, retailer, and patient. drugs are produced by the manufacturer, transported to the supplier, the wholesaler, and then to the warehouse or the retailer. finally, the drugs are delivered to the patient. all these transactions can be tracked and recorded securely without the need for a third party's involvement.

fig. 5. pharmaceutical supply chain process in esc.

5.1. transaction processes
each transaction has its own signature; the signatures are created with the rsa-1024 algorithm. with these signatures, we can protect our esc transactions from any unknown transactions. fig. 6 depicts a small transaction between two clients in the esc. we hold five transactions inside each block of the bc, one transaction between each two clients, starting from the manufacturer and ending with the patient. table 3 shows three transaction signatures as an example.

fig. 6. transaction structure and signature in esc.

5.2. the block structure of our bc system
another main part of this research is the generation of a unique block identity for the esc's transactions before adding the block to the bc. block identities are generated with the sha-256 algorithm. in general, the bc can contain an unlimited number of blocks, which means that the system can generate any number of blocks, but the proposed approach worked on 2000 blocks, where each block holds five transactions in the pharmaceutical supply chain process in iraqi kurdistan. the size of each block is between 830 and 890 bytes, depending on the amount of information saved inside each transaction held in the block; the bc is not reset after holding the transactions, and the transaction history remains inside the blocks. each block in the bc system is protected from modification because the blocks are shared in a decentralized network and each block is linked to the previous block through its own identity. only the participants of the supply chain system can see the information inside the blocks, which enables the clients to easily monitor the processes of the esc. the average time duration between the current block and the previous block is 0.0056 ms. table 4 shows four blocks with current and previous identities.
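the following is a minimal sketch (our own illustration, not the authors' code) of how a block of five transactions can be given a sha-256 identity and chained to the previous block; the genesis identity, field names, and placeholder transactions are assumptions for demonstration.

```python
# minimal sketch of block identities and chaining with sha-256
# (illustrative only; field names and values are placeholders).
import hashlib
import json
from datetime import datetime

def block_identity(transactions, previous_identity):
    """derive the block's unique identity from its transactions and the
    identity of the previous block, so any tampering changes the hash."""
    payload = json.dumps({"prev": previous_identity, "txs": transactions},
                         sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

def append_block(chain, transactions):
    """append a new block holding (up to) five transactions to the chain."""
    prev_id = chain[-1]["identity"] if chain else "0" * 64   # genesis placeholder
    block = {
        "previous_identity": prev_id,
        "identity": block_identity(transactions, prev_id),
        "timestamp": datetime.utcnow().isoformat(),
        "transactions": transactions,
    }
    chain.append(block)
    return block

# hypothetical usage: one block per drug, five signed transactions per block
chain = []
txs = ["signed transaction %d for drug 0103050p0aaafaf" % i for i in range(5)]
b = append_block(chain, txs)
print(b["identity"], "links to", b["previous_identity"][:12], "...")
```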
our system creates a secure channel among parties based on bc technology, which provides traceability of drugs, reduces cost, enhances reliability, eliminates paperwork, and facilitates monitoring by all clients in the pharmaceutical process. table 5 contains a chain of five transactions inside a single block (#block112) regarding the transportation of a specific drug from the manufacturer all the way to the patient. the fig. 5. pharmaceutical supply chain process in esc. shaniar: electronic supply chain with blockchain 138 uhd journal of science and technology | jul 2020 | vol 4 | issue 2 drug has its own unique identifier. there are unique identifiers for each of the sender and the receiver per transaction along with the date and time of the transaction. in table 6, we choose a transaction (#transaction21) among the five transactions of the block (#block112) between the supplier, rasan, and the wholesaler, kmca, showing all the information of that transaction. the detail about the transaction includes the location, date, and time, and size. 6.1. traceability in our proposed system, each block contains the transactions that are related to one single drug. each drug has a unique identifier, for example, #block112 is encrypted with the sha256 cryptography algorithm and linked to the previous block (table 6). in this way, we can trace the transaction between these two clients. we can determine when the transaction has started (datetime), which clients participate in the transaction (sender and recipient ids), where is the current location of the drug (supplier and wholesaler locations), and whether the transaction has occurred or it is in a waiting state. the information is recorded in real-time, which means we can trace each drug transported between the clients in real-time. the transactions can be monitored by all the participants, without modification by an unknown client or user. since the network is decentralized, we do not need a third-party for managing the information of the transactions that are held inside the blocks. 6.2. reliability and monitoring the reliability of the information is another feature of the proposed approach. since our system uses several security mechanisms with bc technolog y, such as rsa-1024 and sha-256 algorithms. these algorithms are used for generating a unique identity for each participant (sender identity and recipient identity), transaction (transaction identity), and block (block identity) of the esc system, as shown in table 5. with these identities, we can trust the transactions. bc has an important process called consensus algorithm, which provides agreements among clients of the bc network about the data. based on the consensus algorithm a participant can generate new blocks that must be accepted by the other parties of the system. this is useful for improving the reliability and ensuring the traceability of the system. there is an important feature inside the transactions, which is the transaction type. its value is either verified or unverified. if a client notices that some identities have been changed or the fig. 6. transaction structure and signature in esc. 
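the rsa-based signing and verification that underlies the verified/unverified transaction flag can be sketched in r with the openssl package; the key size and hash follow the paper (rsa-1024, sha-256), while the record layout and names are hypothetical:

```r
library(openssl)

key    <- rsa_keygen(bits = 1024)        # a client's key pair (rsa-1024, as in the paper)
pubkey <- as.list(key)$pubkey

# a client identity can be represented by the der-encoded public key as a hex string
client_id <- paste(as.character(write_der(pubkey)), collapse = "")

# sign a transaction record; the signature plays the role of the transaction identity
record <- charToRaw("sender=supplier|recipient=wholesaler|drug=0103050P0AAAFAF|2020-02-05 11:15:22")
sig    <- signature_create(record, hash = sha256, key = key)
signature_verify(record, sig, hash = sha256, pubkey = pubkey)   # TRUE: record stays "verified"

# any modification of the record invalidates the signature, so it would be flagged "unverified"
tampered <- charToRaw("sender=supplier|recipient=wholesaler|drug=0000000000000|2020-02-05 11:15:22")
tryCatch(signature_verify(tampered, sig, hash = sha256, pubkey = pubkey),
         error = function(e) FALSE)                             # FALSE
```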
table 3: three transaction signature while applied rsa-1024 algorithm in esc 1bee9567d03e61c48e748cb2983ee4bfd308009555e45d6fe9bbbed051e338055e9950a664e280048268b8457a16ce22a8b699b3e11e0f8d553e 23541b328e949f18c7db20cd44a9c60153d3dff3538984aae8eb1fa117d54fe84e2cfd261a659cc05c97e87e7b7078c101d81595ebcf4610f ca0172799935f82df0125b04e82 6f606e75128216b1562ca131ffdc6bae4d9967636244cac72478ea5e8764bf86824f6ad9ed250b7c031896604e969dd3cceb5989a556bf31 e0a39aca8d84d69e41740221592f4b0c0feacb34f0068c5ef5d731108cbe8e6c36503d1d158332eeb6b6cd62b734dd147409d11d085eb65 dccb4d0874c384ddbf2b0495aec858c8c 8b3c55fbe861e80937fb1ac70fa7781c041bd7af30249b0e42f7f6553e265a260a87d00d44a5942a34f3cbf168a7a54a1afb9dc40e946ff43d0 d6a7e69d72eea8c96b8da2306e7e334597cc7192e5f040318bfeea61af225c2b0da26b68b6195ef4066c6b8a9f65c9d49607bf68b340359ae bb8e464de7d1b6c87f687851bb70 table 4: block identity in our blockchain system generating based on sha-256 block number block current identity (sha-256) block previous identity (sha-256) #block112 6ad6d74005e15af42 18b693b3c…… 0a6b88cb1c75ab 970fd11e4………… #block251 23075fc5076b836 136a3a92e9…… a9f893431c9c961 bfb7896483…… #block512 97fb8026b07ed4f1 f73d23f31…… 1ca161f2094af3f6e fba34h11…… #block600 906886c8477237b 80b0de1fb2…… aa45b1772c74e330 74d0c061a…… shaniar: electronic supply chain with blockchain uhd journal of science and technology | jul 2020 | vol 4 | issue 2 139 drug information has been modified, then the value of the field transaction type is changed from verified to unverified and all the participants of the network are notified about this modification. this prevents the transaction to be faked and protects the drug from counterfeiting. this process enables the reliability of the esc system and establishes trust among the participants of the network. since all the transactions happen in real-time, all participants as well as the authorities can monitor the process of drug transportation between all the clients of the esc system. 7. conclusions and future work in this research work, bc technology is used for recording information flow between supply chain parties using the pharmaceutical supply chain system in iraqi kurdistan as a running example. the proposed system generates unique table 5: simple drug information for five transactions inside one block in the blockchain system #block112 drug name: omeprazole_cap e/c 10mg, drug id: 0103050p0aaafaf tr sender name sender id receiver name receiver id date time 20 sanofi (manufacturer) 30819f300d06… rasan (supplier) 300d06092a86… 2020-02-04 17:25:48 21 rasan (supplier) 08b1058b43…. kmca (wholesaler) dbded6379261… 2020-02-05 11:15:22 22 kmca (wholesaler) 002b6bd609e7... zhin (warehouse) 10af53600f2c… 2020-04-11 9:08:40 23 zhin (warehouse) 6a2744af1447… lia (retailer) 236a2744af14… 2020-04-21 16:21:54 24 lia (retailer) 49ad4dbd1e… ahmed (patient) a6a8c08fc87cd... 2020-04-29 06:22:11 table 6: one transaction inside a block in our bc system from the esc #block112 #transaction21 transaction id 1bee9567d03e61c48e748cb2983ee4bfd 308009555e45d6f……….. sender id 08b1058b43……….. recipient id 300d06092a86…….. 
supplier location iraq wholesaler location iraq datetime 10-1-2020 08:22:12 drug id 0103050p0aaafaf drug name omeprazole_cap e/c 10mg size in block 150 byte block_id 6ad6d74005e15af4218b693b3cb1fb79538 159d5db568f7528c…… block _prev_id 0a6b88cb1c75ab970fd11e49f1db8d237d9 bf819ef1165fdb0fb…… transaction type verified situation occurred identifiers for the clients and the transactions regarding a specific drug and stores them along with the drug information inside a block of the bc system. the generation of identifiers and storing of information happen in real-time when the transactions occur. therefore, the authorities and supply chain parties can easily monitor or track the transactions to protect the drug from counterfeiting and information from modification, thus establishing trust and reliability of the process. in addition, the process of recording and monitoring supply chain transactions electronically reduces cost and time and eliminates paperwork. in the future, we want to implement our proposed model in other sectors, such as food and clothes, to provide reliability and traceability among vendors and customers. furthermore, the internet of things iot can be used to hold transaction information about physical things that are included in the esc systems. 8. acknowledgment we would like to express thanks to dr. firdous nuri, dr. dana sardar, and sulaimani health center for providing information about the drug manufacturer companies and pharmaceutical parties in iraqi kurdistan. references [1] k. korpela and t. dahlberg. “digital supply chain transformation toward blockchain integration”. hawaii international conference on system sciences (hicss)at: big island, hawaii, pp. 41824191, 2017. [2] d. schniederjans, c. curado and m. khalajhedayati. “supply chain digitisation trends: an integration of knowledge management”. international journal of production economics, vol. 220, p. 107439, 2020. shaniar: electronic supply chain with blockchain 140 uhd journal of science and technology | jul 2020 | vol 4 | issue 2 [3] s. chen, r. shi, z. ren, j. yan, y. shi and j. zhang. “a blockchainbased supply chain quality management framework”. ieee 14th international conference on e-business engineering (icebe), shanghai, pp. 172-176, 2017. [4] d. tse, b. zhang, y. yang, c. cheng and h. mu. “blockchain application in food supply information security”. ieee international conference on industrial engineering and engineering management (ieem), singapore, pp. 1357-1361, 2017. [5] w. kenton. “supply chain”. available from: https://www. investopedia.com/terms/s/supplychain.asp. [last accessed on 2020 apr 01]. [6] n. b. al barghuthi, h. j. mohamed and h. e. said. “blockchain in supply chain trading”. 5th hct information technology trends (itt), dubai, united arab emirates, 2018, pp. 336-341, 2018. [7] h. min. “blockchain technology for enhancing supply chain resilience”. business horizons, vol. 62, no. 1, pp. 35-45, 2019. [8] m. nofer, p. gomber, o. hinz and d. schiereck. “blockchain”. business and information systems engineering, vol. 59, no. 3, pp. 183-187, 2017. [9] r. kumar and r. tripathi. “traceability of counterfeit medicine supply chain through blockchain”. 11th international conference on communication systems and networks (comsnets), bengaluru, india, pp. 568-570, 2019. [10] s. malik, s. s. kanhere and r. jurdak. “productchain: scalable blockchain framework to support provenance in supply chains”. ieee 17th international symposium on network computing and applications (nca), cambridge, ma, pp. 
1-10, 2018. [11] g. büyüközkan and f. göçer. “digital supply chain: literature review and a proposed framework for future research”. computers in industry, vol. 97, pp. 157-177, 2018. [12] s. kamble, a. gunasekaran and s. gawankar. “achieving sustainable performance in a data-driven agriculture supply chain: a review for research and applications”. international journal of production economics, vol. 219, pp. 179-194, 2020. [13] s. aich, s. chakraborty, m. sain, h. lee and h. kim. “a review on benefits of iot integrated blockchain based supply chain management implementations across different sectors with case study”. 21st international conference on advanced communication technology (icact), pyeongchang kwangwoon do, korea (south), pp. 138-141, 2019. [14] s. r. niya, d. dordevic, a. g. nabi, t. mann and b. stiller. “a platform-independent, generic-purpose, and blockchain-based supply chain tracking”. ieee international conference on blockchain and cryptocurrency (icbc), seoul, korea (south), pp. 11-12, 2019. [15] v. morkunas, j. paschen and e. boon. “how blockchain technologies impact your business model”. business horizons, vol. 62, no. 3, pp. 295-306, 2019. [16] c. schmidt and s. wagner. “blockchain and supply chain relations: a transaction cost theory perspective”. journal of purchasing and supply management, vol. 25, no. 4, p. 100552, 2019. [17] d. macrinici, c. cartofeanu and s. gao. “smart contract applications within blockchain technology: a systematic mapping study”. telematics and informatics, vol. 35, no. 8, pp. 2337-2354, 2018. [18] s. s. hazari and q. h. mahmoud. “a parallel proof of work to improve transaction speed and scalability in blockchain systems”. ieee 9th annual computing and communication workshop and conference (ccwc), las vegas, nv, usa, pp. 916-921, 2019. [19] a. a. maksutov, m. s. alexeev, n. o. fedorova and d. a. andreev, "detection of blockchain transactions used in blockchain mixer of coin join type," 2019 ieee conference of russian young researchers in electrical and electronic engineering (eiconrus), saint petersburg and moscow, russia, pp. 274-277, 2019. [20] s. gueron. “speeding up sha-1, sha-256 and sha-512 on the 2nd generation intel® core™ processors”. 9th international conference on information technology new generations, las vegas, nv, pp. 824-826, 2012. [21] a. karakra and a. alsadeh. “a-rsa: augmented rsa”. sai computing conference (sai), london, pp. 1016-1023, 2016. [22] s. a. nagar and s. alshamma. “high speed implementation of rsa algorithm with modified keys exchange”. 6th international conference on sciences of electronics, technologies of information and telecommunications (setit), sousse, pp. 639-642, 2012. [23] m. p. caro, m. s. ali, m. vecchio and r. giaffreda. “blockchainbased traceability in agri-food supply chain management: a practical implementation”. iot vertical and topical summit on agriculture tuscany (iot tuscany), tuscany, pp. 1-4, 2018. tx_1~abs:at/tx_2:abs~at uhd journal of science and technology | jul 2020 | vol 4 | issue 2 1 1. introduction in economic and social research, many types of regression models applied. their use is dependent on the nature of the data. the tobit model is regarded as the most appropriate statistical model for solving those cases that the dependent variable is censored or truncated [1]. tobit regression has been the subject of great theoretical interests in numerous practical applications. it has been developed and used in many fields, such as econometrics, finance, and medicine [1], [2]. 
furthermore, it is regarded as a linear regression model where only data on the response variable incompletely observed; the response variable is censored at zero. kidney diseases are common diseases worldwide; it is a global public health problem affecting 750 million persons globally [3]. it plays an important role in preserving normal body functions. most people are not aware of their impaired kidney functions. in using tobit model for studying factors affecting blood pressure in patients with renal failure raz muhammad h. karim*, samira muhamad department of statistics, college of administration and economic, university of sulaimani, sulaymaniyah, iraq a b s t r a c t in this study, the tobit model as a statistical regression model was used to study factors affecting blood pressure (bp) in patients with renal failure. the data have been collected from (300) patients in shar hospital in sulaimani city. those records contain bp rates per person in patients with renal failure as a response variable (y) which is measured in units of millimeters of mercury (mmhg), and explanatory variables (age [year], blood urea measured in milligram per deciliter [mg/dl], body mass index [bmi] expressed in units of kg/m2 [kilogram meter square], and waist circumference measured by the centimeter [cm]). the two levels of bp; high and low were taken from the patients. the mean arterial pressure (map) was used to find the average of both levels (high and low bp). the average bp rate of those patients equal to or >93.33 mmhg only remained in the dataset. the 93.33 mmhg is a normal range of map equal to 12/8 mmhg normal range of bp. the others have been censored as zero value, i.e., left censored. furthermore, the same data were truncated from below. then, in the truncated samples, only those cases under risk of bp (greater than or equal to bp 93.33mmhg) are recorded. the others were omitted from the dataset. then, the tobit model applied on censored and truncated data using a statistical program (r program) version 3.6.1. the data censored and truncated from the left side at a point equal to zero. the result shows that factors age and blood urea have significant effects on bp, while bmi and waist circumference factors have not to affect the dependent variable(y). furthermore, a multiple regression model was found through ordinary least square (ols) analysis from the same data using the stratigraphy program version 11. the result of (ols) shows that multiple regression analysis is not a suitable model when we have censored and truncated data, whereas the tobit model is a proficient technique to indicate the relationship between an explanatory variable, and truncated, or censored dependent variable. index terms: tobit model, censored regression, truncated regression, renal failure, blood pressure corresponding author’s e-mail: raz muhammad h. karim department of statistics, college of administration and economic, university of sulaimani, sulaymaniyah, iraq. e-mail: razmauhammad@gmail.com received: 16-05-2020 accepted: 26-07-2020 published: 01-07-2020 access this article online doi: 10.21928/uhdjst.v4n2y2020.pp1-9 e-issn: 2521-4217 p-issn: 2521-4209 copyright © 2020 karim and muhamad. this is an open access article distributed under the creative commons attribution non-commercial no derivatives license 4.0 (cc by-nc-nd 4.0) o r i g i n a l re se a rc h a rt i c l e uhd journal of science and technology raz muhammad h. 
karim and samira muhamad: tobit model for studying factor affecting blood pressure 2 uhd journal of science and technology | jul 2020 | vol 4 | issue 2 fact, kidney failure is a “silent illness” that sometimes has no obvious early symptoms. many people with kidney diseases are not conscious that they are at high risk of kidney failure, which could require dialysis or transplantation. often the disease such as diabetes with high blood pressure (bp) may cause kidney damage. hypertension (high bp) is both a cause and a consequence of renal diseases, which are difficult to distinguish its types clinically [4]. hence, the importance of this research comes as studies the factors that affect bp in patients with renal disease and knowing the real causes of it. this is crucial for medical staff and specialists (doctors) to eliminate problems and limit the spread of kidney diseases because high bp is both a cause and a consequence of kidney diseases. in this study, we find an influence on each independent variable of the dependent variable (bp). it is known that the normal bp range is 12/8 mmhg [5]. this value change due to many factors, and any change in this range make many health problems. therefore, controlling bp and finding factor, everybody should take care of it. in this study, the data collected from patients in a dialysis center at shar hospital in sulaimani city. the two levels of bp; high and low bp from the patients (as dependent variables) and some independent variables (age, blood, urea, body mass index [bmi], and waist circumference) were taken. each patient has their own specific bp (high, low), then we could not take high and low bp separately for our study. that is why the mean arterial pressure (map) was performed. it is an average arterial pressure contains high and low bp [6]. a threshold point equal to 93.33 is determined and found by map equation [7], equal to 12/8 mmhg, which is a normal range of bp. we assumed any value lower that range is equal to zero. therefore, the tobit regression model is used because some variables are equal to zero for a number of observations. this is a phenomenon that can generally be termed censored or truncated data. after that multiple regression model performed for the same data based on ordinary least square (ols) analysis, it is found that a multiple regression model is not suitable for analysis because there are a number of observations in the dependent variable equal to zero. the use of ols models in the case of censored sample datasets and depending on the number of zeros makes ols estimated bias [8]. 2. aim of the study the aim of this study is to detect the impact of the independent variables (age, blood, urea, bmi, and waist circumference) on dependent variables (bp) in patients with renal failure putting these results in front of specialists to eliminate a problem using a statistical model (tobit model). knowing which factor in the independent variable more effect on the dependent variable also comparison between (ols) and tobit model estimation to knowing which of them are suitable models for estimation. 3. related work odah et al. [9] displayed the most significant factors affecting loans provided by iraqi banks and the best methods to estimate the data using a tobit regression model and ols method. liquidity and loan repayment were found to affect loans from the iraqi banks, while the effects of interest rate and borrowers were not statistically significant. 
the outcome of tobit and ols estimations indicate that bias will result when estimating iraqi bank loans using ols if bank loans are limited. prahutama et al. [10] used a tobit regression model to study factors that affect household expenditure on education in semarang city. the dependent variable used in this study is household expenditure for education. the independent variables used include the education of the head of the household, occupation of the head of the household, number of household members, number of working household members, the proportion of household members who attend school in junior high school, senior high school and college, and food expenditure in households and regions. based on the tobit regression analysis proportion of household members who are taking education in college is the most significant contribution to the high cost of household expenditure. ahmed [11] applied a tobit (truncated), (censored) data regression models and multiple regression with the least square method for persons whose levels exceed 120 g/dl under the risk of diabetes injure, in the sample data (n = 500) on the assumption that blood sugar (y), depends on the explanatory (age: x1, cholesterol: x2 gram/deciliter, and triglycerides: x3 gram/deciliter). the results revealed that the censored regression model was more applicable than the other regression models (truncated, and multiple regression), the two factors (age and triglycerides have highly significant effects on the blood sugar. ahmad et al. [12] used tobit regression analysis and data envelopment analysis (dea) to address some of the important working capital management policies and efficiency regarding the manufacturing sector of pakistan. to achieve that data from 37 firms have been taken for the periods 2009–2014. raz muhammad h. karim and samira muhamad: tobit model for studying factor affecting blood pressure uhd journal of science and technology | jul 2020 | vol 4 | issue 2 3 tobit regression analysis concludes that the average period has significant negative impacts on efficiency and current ratio, gross working capital turnover ratio, and financial leverage ratio that have a positive significant impact on efficiency. samsudin et al. [13] applied the tobit model and dea to examine the efficiency of public hospitals in malaysia and identify the factors affecting their performance. the study analysis was based on 25 public hospitals in the northern region of malaysia. according to the result of this study found that the daily average number of admission, the number of outpatient per doctor, and hospital classification have significant influences on hospital inefficiencies. odah et al. [14] investigated the factors affect divorce decision, and determine the most important factors causing divorce in iraq through using the tobit regression model and probit regression model. the data were collected through the application of the questionnaire. according to tobit regression analysis results, marital infidelity is the main reason for the increase in divorce cases, as well as the preoccupation of the couple with social networking sites. after using the probit model, it found that age, social media sites, and income have a significant impact on the decision to divorce. zorlutuna et al. [15] applied tobit regression analysis for the measurement of lung cancer patients. data taken from sivas cumhuriyet university faculty of medicine research and application hospital oncology center consists of 535 patients who have lung cancer. 
tobit regression results show that the phase of the patient's disease, the patient's gender, the patient's condition, and the pathological consequences of the disease were statistically significant variables. the sex of the patient has a positive effect on the stage of the disease, while the pathological condition has a negative influence. anastasopoulos et al. [16] provided a demonstration of tobit regression as a methodological approach to gain new insights into the factors that significantly influence accident rates. using 5 years of vehicle accident data from indiana, the estimation results show that many factors relating to pavement condition, roadway geometrics, and traffic characteristics significantly affect vehicle accident rates.

4. tobit model

regression analysis is one of the statistical methods used to explain the relationship between explanatory variables and the dependent variable; therefore, choosing an appropriate model for the available data is a necessity of this analysis. in many statistical analyses of individual data, the dependent variable is censored. if the dependent variable is censored, the use of a conventional regression model with this type of data will lead to bias in the estimation of the parameters; therefore, the best model for this type of data is the tobit model [17]. the tobit model family of statistical regression models defines the relationship between censored or truncated continuous dependent variables and some independent variables [18]. it has been used in many areas of application, including dental health, medical research, and economics [2]. the tobit model refers to a regression model where the range of the dependent variable is limited in some way [16]. it is a model invented by tobin in which it is supposed that the dependent variable has a number of its values clustered at a limiting value, usually zero [19]. this model was first introduced into the statistical literature in the 1950s and was called the "censored normal regression model." it has been used for health studies since the 1980s. the tobit model is an efficient method for estimating the relationship between an explanatory variable and a truncated or censored dependent variable. the origin of the tobit model lies in probit analysis and multiple regression; the benefit of this model is that it uses all the information that either probit (or logit) models or ols would use separately [20].

4.1. the structural equation model [21], [22]

the structural equation in the tobit model is

\[ y_i^* = x_i'\beta + e_i \tag{1} \]

where \( e_i \sim N(0, \sigma^2) \). y* is a latent variable that is observed for values greater than τ and censored otherwise. the observed y is defined by the following measurement equation:

\[ y_i = \begin{cases} y_i^* & \text{if } y_i^* > \tau \\ \tau_y & \text{if } y_i^* \le \tau \end{cases} \tag{2} \]

in the typical tobit model, we assume that τ = 0, i.e., the data are censored at 0. thus we have

\[ y_i = \begin{cases} y_i^* & \text{if } y_i^* > 0 \\ 0 & \text{if } y_i^* \le 0 \end{cases} \tag{3} \]

4.2. estimations

as we have seen earlier, the likelihood function for the censored normal distribution is

\[ L = \prod_{i=1}^{n} \left[ \frac{1}{\sigma}\,\phi\!\left(\frac{y_i-\mu}{\sigma}\right) \right]^{d_i} \left[ 1 - \Phi\!\left(\frac{\mu-\tau}{\sigma}\right) \right]^{1-d_i} \tag{4} \]

where τ is the censoring point. in the traditional tobit model, we set τ = 0 and parameterize μ as \( x_i'\beta \). this gives us the likelihood function for the tobit model:

\[ L = \prod_{i=1}^{n} \left[ \frac{1}{\sigma}\,\phi\!\left(\frac{y_i - x_i'\beta}{\sigma}\right) \right]^{d_i} \left[ 1 - \Phi\!\left(\frac{x_i'\beta}{\sigma}\right) \right]^{1-d_i} \tag{5} \]

the log-likelihood function for the tobit model is

\[ \ln L = \sum_{i=1}^{n} \left\{ d_i \left[ -\ln\sigma + \ln\phi\!\left(\frac{y_i - x_i'\beta}{\sigma}\right) \right] + (1-d_i)\,\ln\!\left[ 1 - \Phi\!\left(\frac{x_i'\beta}{\sigma}\right) \right] \right\} \tag{6} \]

the overall log-likelihood is made up of two parts. the first part corresponds to the classical regression for the uncensored observations, while the second part corresponds to the probabilities that an observation is censored.

5. truncation, censoring, (truncated and censored) distribution, and marginal effect

the leading causes of incompletely observed data are truncation and censoring.

5.1. truncation

truncation occurs when the observed data in the sample are drawn only from a subset of a larger population [23]. in other words, a dependent variable in a model is truncated if observations cannot be seen when they take values in a certain range; this means that both the independent and the dependent variables are not observed when the dependent variable is in that range [24]. there are two types of truncation: from below and from above (truncation from the left and truncation from the right). figs. 1 and 2 illustrate the probability distribution of truncation from below [11].

5.2. censoring

the idea of "censoring" is that some data above or below a threshold are misreported at the threshold. hence, the observed data are generated by a mixed distribution with both a continuous and a discrete component. the censoring process may be explicit in the data collection process, or it may be a by-product of economic constraints involved in constructing the data set [24]. when the dependent variable is censored, values in a certain range are all transformed to (or reported as) a single value [25]. fig. 3 illustrates the probability distribution of censoring from below [11].

5.3. (truncated and censored) distribution [21]

after formally considering the tobit model, we need some results about the truncated and censored normal distributions. these distributions are at the foundation of most models for truncation and censoring. the results are given for censoring and truncation on the left, which translate into censoring from below in the tobit model. corresponding formulas exist for censoring and truncation on the right, and for censoring and truncation on both the left and the right.

fig. 1. truncated from below, with the probability distribution (threshold = 3) [11].
fig. 2. truncated normal distribution [11].
fig. 3. censored from below, with the probability distribution (threshold = 5) [11].

5.3.1. truncated normal distribution [21]

let y denote the observed value of the dependent variable. unlike in normal regression, y is the incompletely observed value of a latent dependent variable y*. recall that with truncation, our sample data are drawn from a subset of a larger population. with truncation from below, we only observe y = y* if y* is larger than the truncation point τ; in effect, we lose the observations on y* that are smaller than or equal to τ. when this is the case, we typically assume that the variable y | y > τ follows a truncated normal distribution. thus, if a continuous random variable y has pdf f(y) and τ is a constant, then we have:

\[ f(y \mid y > \tau) = \frac{f(y)}{P(y > \tau)} \tag{7} \]

we know that

\[ P(y > \tau) = 1 - \Phi\!\left(\frac{\tau - \mu}{\sigma}\right) = 1 - \Phi(\alpha) \tag{8} \]

where \( \alpha = \frac{\tau - \mu}{\sigma} \) and Φ(·) is the standard normal cdf. the density of the truncated normal distribution is

\[ f(y \mid y > \tau) = \frac{\tfrac{1}{\sigma}\,\phi\!\left(\tfrac{y-\mu}{\sigma}\right)}{1 - \Phi\!\left(\tfrac{\tau-\mu}{\sigma}\right)} = \frac{\tfrac{1}{\sigma}\,\phi\!\left(\tfrac{y-\mu}{\sigma}\right)}{1 - \Phi(\alpha)} \tag{9} \]

where φ(·) is the standard normal pdf. the likelihood function for the truncated normal distribution is

\[ L = \prod_{i=1}^{n} \frac{f(y_i)}{1 - \Phi(\alpha)} \tag{10} \]

or, equivalently, \( \ln L = \sum_{i=1}^{n} \left[ \ln f(y_i) - \ln\!\left(1 - \Phi(\alpha)\right) \right] \).

5.3.2. censored normal distribution [21]

when a distribution is censored on the left, observations with values at or below τ are set to τ_y:

\[ y = \begin{cases} y^* & \text{if } y^* > \tau \\ \tau_y & \text{if } y^* \le \tau \end{cases} \tag{11} \]

the use of τ and τ_y is just a generalization of having τ and τ_y set to 0. if a continuous variable y has a pdf f(y) and τ is a constant, then we have

\[ f(y) = \left[ f^*(y) \right]^{d_i} \left[ F(\tau) \right]^{1-d_i} \tag{12} \]

in other words, the density of y is the same as that of y* for y > τ, and it is equal to the probability of observing y* ≤ τ if y = τ; here f*(·) and F(·) denote the pdf and cdf of y*. d is an indicator variable that equals 1 if y > τ (the observation is uncensored) and equals 0 if y = τ (the observation is censored).

\[ P(\text{censored}) = P(y^* \le \tau) = \Phi\!\left(\frac{\tau - \mu}{\sigma}\right) = 1 - \Phi\!\left(\frac{\mu - \tau}{\sigma}\right) \tag{13} \]

and

\[ P(\text{uncensored}) = 1 - \Phi\!\left(\frac{\tau - \mu}{\sigma}\right) = \Phi\!\left(\frac{\mu - \tau}{\sigma}\right) \tag{14} \]

thus, the likelihood function can be written as

\[ L = \prod_{i=1}^{n} \left[ \frac{1}{\sigma}\,\phi\!\left(\frac{y_i - \mu}{\sigma}\right) \right]^{d_i} \left[ 1 - \Phi\!\left(\frac{\mu - \tau}{\sigma}\right) \right]^{1-d_i} \tag{15} \]

5.4. the marginal effect [11]

the estimated vector (β_k) shows the effect of (x_k) on (y_i). the marginal effect of a variable is the effect of a unit change of this variable on the expected outcome, given that all other variables are held constant; in the linear regression model, the slope parameter measures this marginal effect directly. there are three possible marginal effects:

1. marginal effect on the latent dependent variable, y*:

\[ \frac{\partial E[y^*]}{\partial x_k} = \beta_k \tag{16} \]

thus, the reported tobit coefficients indicate how a one-unit change in an independent variable alters the latent dependent variable.

2. marginal effect on the expected value of y for uncensored observations:

\[ \frac{\partial E[y \mid y > 0, x]}{\partial x_k} = \beta_k \left\{ 1 - \lambda(\alpha_i)\!\left[ \alpha_i + \lambda(\alpha_i) \right] \right\}, \qquad \alpha_i = \frac{x_i'\beta}{\sigma} \tag{17} \]

where λ(·) is the inverse mills ratio.

3. marginal effect on the expected value of y (censored and uncensored):

\[ \frac{\partial E[y]}{\partial x_k} = \Phi\!\left(\frac{x_i'\beta}{\sigma}\right)\beta_k \tag{18} \]

6. results

in this part, the results of the applied side of the study are presented using the statistical package r (version 3.6.1) and the stratigraphy program version 11. table 1 shows the sample taken from (300) patients with kidney diseases in the dialysis center of shar hospital. the two levels of bp, high and low (as the dependent variable), and some independent variables (age, blood urea, bmi, and waist circumference) were taken from the patients. we found the average bp by the map equation, which combines the high and low readings; we could not take high and low bp separately, so we determined a threshold point equal to 93.33, found by the map equation and equivalent to the 12/8 mmhg normal range of bp.
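as a quick, illustrative check of this threshold (not part of the authors' analysis), the map value of a normal 120/80 mmhg reading can be computed in r with the usual approximation; the function name here is our own:

```r
# mean arterial pressure via the common approximation map = dbp + (sbp - dbp) / 3;
# a small check of the 93.33 mmhg censoring threshold, not code from the study
map <- function(sbp, dbp) dbp + (sbp - dbp) / 3

map(120, 80)   # 93.33 mmhg: the normal 120/80 reading, i.e. the censoring point used here
map(135, 85)   # 101.67 mmhg: above the threshold, so such a record would stay uncensored
```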
6.1. descriptive statistics of dependent and independent variables

table 2 shows all measures of descriptive statistics. the descriptive statistics give an overview of the minimum, maximum, mean, and median of the variables (age, blood urea, bmi, and waist circumference): the minimum values are 18, 17.40, 13.84, and 30, respectively; the maximum values are 87, 404, 42.97, and 150, respectively; and the means and medians of all independent variables are 51.39, 118.86, 25.46, 68.4 and 49.00, 117.00, 23.44, 60.0, respectively.

table 1: samples taken from (300) patients
id | y: blood pressure rate above 93.33 mmhg | x1: age | x2: blood urea | x3: bmi | x4: waist circumference
1 | 96.67 | 28 | 132 | 17.82 | 110
2 | 105 | 31 | 140 | 21.8 | 65
3 | 86.54 (0) | 35 | 50 | 35.16 | 32
4 | 104.67 | 55 | 70 | 28.72 | 102
5 | 45.66 (0) | 32 | 35 | 28.96 | 101
6 | 90.41 (0) | 40 | 38 | 20.81 | 48
7 | 126.67 | 51 | 260.6 | 42.97 | 120
8 | 123.33 | 20 | 80 | 22.23 | 90
9 | 45.15 (0) | 25 | 36 | 32.53 | 115
... | ... | ... | ... | ... | ...
98 | 96.67 | 32 | 148 | 21.91 | 60
99 | 77.88 (0) | 33 | 47 | 24.56 | 60
100 | 123.33 | 77 | 113 | 25 | 97
... | ... | ... | ... | ... | ...
197 | 100 | 40 | 172.4 | 16.89 | 100
198 | 96.67 | 35 | 147.8 | 37.5 | 60
199 | 93.33 | 73 | 195.4 | 20.9 | 60
200 | 88.99 (0) | 34 | 67 | 21.12 | 55
... | ... | ... | ... | ... | ...
299 | 93.33 | 65 | 129.7 | 20.68 | 37
300 | 146.67 | 50 | 120 | 25.71 | 112

table 2: descriptive statistics of dependent and independent variables in the study
variable | min | max | mean | median
blood pressure (y) | 46.15 | 146.67 | 82.25 | 100.00
age (x1) | 18.00 | 87.00 | 51.39 | 49.00
blood urea (x2) | 17.40 | 404.00 | 118.86 | 117.00
bmi (x3) | 13.84 | 42.97 | 25.46 | 23.44
waist circumference (x4) | 30.0 | 150.0 | 68.4 | 60.0

6.2. fitting the tobit model (censored and truncated) regression using the statistical package r

table 3: results of the censored regression model: censored (formula = y ~ x, left = 0, right = infinity, data = my data)
coefficients | estimate | std. error | t value | pr(>t)
intercept | -5.85378 | 15.93610 | -0.367 | 0.713
age | 0.80211 | 0.17584 | 4.562 | 5.08e-06 ***
blood urea | 0.48305 | 0.04437 | 10.886 | 2e-16 ***
bmi | -0.44822 | 0.45381 | -0.988 | 0.323
waist circumference | -0.06718 | 0.09777 | -0.687 | 0.492
total: n = 300 observations, left-censored = 66 observations, uncensored = 234 observations (left-censored: if y < 93.33 then y* = 0)

table 4: results of the truncated regression model
coefficients | estimate | std. error | t value | pr(>t)
intercept | 3.669531 | 14.627314 | 0.2509 | 0.8019
age | 0.754674 | 0.160378 | 4.7056 | 2.531e-06 ***
blood urea | 0.432102 | 0.039962 | 10.8128 | 2.2e-16 ***
bmi | -0.429849 | 0.417015 | -1.0308 | 0.3026
waist circumference | -0.061769 | 0.089385 | -0.6910 | 0.4895

table 3 shows that the p-value is 2.22e-16 and the log-likelihood is -1278.455 on 6 df, with a wald statistic of 173.8 on 4 df and an akaike information criterion (aic) of 2566.91, where aic = -2(log-likelihood) + 2k and k is the number of model parameters plus the intercept. the log-likelihood is a measure of model fit (the higher the value, the better the fit), and the minimum aic indicates the best model. the mean square error (mse) is 0.9305. table 4 shows that the log-likelihood is -1476.9 on 6 df, the aic is 2963.8, and the mse is 0.993.
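as an illustrative sketch (not the authors' script), fits of this kind can be produced in r with the censReg package cited in reference [17] and the truncreg package; the data frame bp_data and its column names are hypothetical stand-ins for the study data:

```r
library(censReg)    # censored (tobit) regression; see reference [17]
library(truncreg)   # truncated regression

# censored specification: readings below the 93.33 mmhg threshold were recoded to 0
cens_fit <- censReg(y ~ age + blood_urea + bmi + waist, left = 0, data = bp_data)
summary(cens_fit)            # coefficient table of the kind shown in table 3
margEff(cens_fit)            # marginal effects on the observed response, as in table 8

# truncated specification: the zero (censored) observations are dropped instead of recoded
trunc_fit <- truncreg(y ~ age + blood_urea + bmi + waist,
                      point = 0, direction = "left",
                      data = subset(bp_data, y > 0))
summary(trunc_fit)           # coefficient table of the kind shown in table 4
```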
the output in tables 5-7 shows the results of fitting a multiple linear regression model to describe the relationship between bp and the 4 independent variables. since the p-value in the anova table is <0.05, there is a statistically significant relationship between the variables at the 95.0% confidence level. table 6 shows that the r-squared statistic indicates that the model as fitted explains 39.9258% of the variability in bp. the adjusted r-squared statistic, which is more suitable for comparing models with different numbers of independent variables, is 39.1112%. the standard error of the estimate shows the standard deviation of the residuals to be 34.8559. table 7 shows the analysis of variance of the dependent and independent variables. fig. 4 shows the standardized residuals for the multiple regression model using (ols); it is clear that the (ols) method is not a suitable method when the data are censored.

table 5: fitting the multiple regression model (ols) using the stratigraphy program
model | b | std. error | beta | t | sig. | tolerance | vif
(constant) | 14.236 | 12.512 | - | 1.138 | 0.256 | - | -
age | 0.647 | 0.140 | 0.217 | 4.630 | 0.000 | 0.923 | 1.083
blood urea | 0.388 | 0.034 | 0.535 | 11.325 | 0.000 | 0.913 | 1.095
bmi | -0.315 | 0.358 | -0.040 | -0.881 | 0.379 | 0.982 | 1.019
waist circumference | -0.048 | 0.077 | -0.029 | -0.627 | 0.531 | 0.976 | 1.024
(b and std. error are the unstandardized coefficients, beta is the standardized coefficient, and tolerance and vif are the collinearity statistics)

table 6: model summary
model | r | r square | adjusted r square | std. error of the estimate | r square change | f change | df1 | df2 | sig. f change
1 | 0.632a | 0.399 | 0.391 | 34.85585 | 0.399 | 49.015 | 4 | 295 | 0.000
a. predictors: (constant), waist circumference, bmi, blood urea, age

fig. 4. standardized residuals for the multiple regression model using (ordinary least square).

7. discussion

analyzing medical data with a tobit model when there is a threshold point helps experts (doctors and medical staff) to identify the factors affecting blood pressure in patients with kidney failure. in this study, the tobit (censored and truncated) regression model and a multiple regression model with the least square method (ols) were applied to the data (n = 300) for the cases whose rates are greater than or equal to (93.33), under the hypothesis that bp (y) depends on the explanatory variables (age, blood urea, bmi, and waist circumference); comparing their results, the following important points are concluded below. the result in table 2 shows all measures of descriptive statistics. the descriptive statistics give an overview of the minimum, maximum, mean, and median of (age, blood urea, bmi, and waist circumference): the minimum values are 18, 17.40, 13.84, and 30, respectively; the maximum values are 87, 404, 42.97, and 150, respectively; and the means and medians of all independent variables are 51.39, 118.86, 25.46, 68.4 and 49.00, 117.00, 23.44, and 60.0, respectively. the results of the censored regression model analysis in table 3 show the final result with all significant variables for the phenomenon under study, the parameter estimates and t-value analysis, and the significant factors affecting bp: p = 2.22e-16, log-likelihood = -1278.455 on 6 df, wald statistic = 173.8 on 4 df, and aic = 2566.91. the log-likelihood is a measure of model fit (a higher value is a better fit), the minimum aic indicates the best model, and the mse is equal to 0.9305. we know that (β) is the relationship between the response variable and the covariates: (+β) means a positive relationship and (-β) means a negative relationship. the results in table 3 show that the relationship of the variables (age and blood urea) with the dependent variable (bp) is positive, and those variables (age and blood urea) have highly significant effects on bp.
furthermore, the relationship between (bmi) and bp is negative. if there is an increase in (bmi) by one unit the (bp) decreases by (-0.44822). the factors (bmi and waist circumference) appeared to have no significant effects on bp. from tables 3-8 show that the censored regression model for the samples is a more suitable model than other regression models (truncated, marginal, and multiple). this result found by comparing their aic, log-like values, and mse. the censored with the marginal effects from table 8 shows that the two variables (age and blood urea) have highly significant effects. the changes in years make bp significantly increasing by 0.77%. this means that the effect of age for any case in the sample with std. error is by 0.16%. furthermore, one unit of blood urea for each point increases by 0.46% with stander error (0.04). in the result of multiple regression models, using (ols) method, we detected that since the p-value in the anova table is <0.05, there is a statistically significant relationship between the variables at the %95 confidence interval the r-square statistic indicates that the model as fitted explains 0.39 of the variability bp. and theoretically, as defined, the ols (unconditional estimates) are bias. 8. conclusion in this study, both tobit regression analysis and ols analysis were used for studying factors affecting the bp. in this work, the data collected from 300 patients in a dialysis center at shar hospital in sulaimani city. the two levels of bp; high and low from the patients (as dependent variables) and some independent variables (age, blood, urea, bmi, and waist circumference) were taken. each patient has own specific bp (high and low). then, we could not take high and low bp separately for our study. that is why the map was performed. it is an average arterial pressure contains high and low bp. when studying bp as a dependent variable, we find that variable data are censored at zero. in this case, the tobit model is most suitable model to use. it was found that the two factors (age and blood urea) have highly significant effects on bp. however, the two variables (bmi and waist circumference) appeared to have no effects on the dependent variable. the comparison of the result from tobit and ols estimations shows that biased can result when estimation bp using ols if bp restricted at the threshold point references [1] t. amemiya. “tobit models: a survey”. search results journal of economics, vol. 24, no. 1-2, pp. 3-61, 1984. [2] w. wang and m. e. griswold. “natural interpretations in tobit regression models using marginal estimation methods”. statistical methods in medical research, vol. 26, no. 6, pp. 2622-2632, 2017. table 7: analysis of variance source sum of squares df mean square f-ratio p-value model 238198.378 4 59549.595 49.015 0.000b residual 358404.477 295 1214.930 total (corr.) 596602.856 299 b. dependent variable: blood pressure table 8: results of marginal effects coefficients marg. eff. std. error t value pr(>t) age 0.772056 0.168918 4.5706 7.17e-06 *** blood urea 0.464948 0.042204 11.0168 2.2e-16 *** bmi -0.431430 0.436737 -0.9878 0.3240 waist circumference -0.064662 0.094095 -0.6872 0.4925 raz muhammad h. karim and samira muhamad: tobit model for studying factor affecting blood pressure uhd journal of science and technology | jul 2020 | vol 4 | issue 2 9 [3] d. c. crews, a. k. bello and g. saadi. “2019 world kidney day editorial-burden, access, and disparities in kidney disease”. brazilian journal of nephrology, vol. 41, pp. 1-9, 2019. 
[4] r. a. preston, i. singer, and m. epstein. “renal parenchymal hypertension: current concepts of pathogenesis and management”. archives of internal medicine, vol. 156, no. 6, pp. 602-611, 1996. [5] j. a. staessen, y. li, a. hara, k. asayama, e. dolan and e. o’brien. “blood pressure measurement anno 2016”. american journal of hypertension, vol. 30, no. 5, pp. 453-463, 2017. [6] r. n. kundu, s. biswas and m. das. “mean arterial pressure classification: a better tool for statistical interpretation of blood pressure related risk covariates”. cardiology and angiology: an international journal, vol. 6, no. 1, pp. 1-7, 2017. [7] d. yu, z. zhao and d. simmons. “interaction between mean arterial pressure and hba1c in prediction of cardiovascular disease hospitalisation: a population-based case-control study”. journal of diabetes research, vol. 2016, p. 8714745, 2016. [8] c. wilson and c. a. tisdell.“ols and tobit estimates: when is substitution defensible operationally?” in: economic theory, applications and issues working papers, university of queensland, school of economics, queensland , 2002. [9] m. h. odah, a. s. m. bager and b. k. mohammed. “tobit regression analysis applied on iraqi bank loans”. american journal of applied mathematics and statistics, vol. 7, no. 4, p. 179, 2017. [10] a. prahutama, a. rusgiyono, m. a. mukid and t. widiharih. “analysis of household expenditures on education in semarang city, indonesia using tobit regression model”. in: e3s web of conferences, vol. 125, p. 9016, 2019. [11] n. m. ahmed. “limited dependent variable modelling (truncated and censored regression models) with application”. vol. 7377. cambridge university press, new york, pp. 82-96, 2018. [12] m. f. ahmad, m. ishtiaq, k. hamid, m. u. khurram and a. nawaz. “data envelopment analysis and tobit analysis for firm efficiency in perspective of working capital management in manufacturing sector of pakistan”. international journal of economics and financial issues, vol. 7, no. 2, pp. 706-713, 2017. [13] s. samsudin, a. s. jaafar, s. d. applanaidu, j. ali and r. majid. “are public hospitals in malaysia efficient? an application of dea and tobit analysis”. southeast asian journal of economics, vol. 4, no. 2, pp. 1-20, 2016. [14] m. h. odah, a. s. m. bager and b. k. mohammed. “studying the determinants of divortiality in iraq. a two-stage estimation model with tobit regression”. international journal of applied mathematics and statistics, vol. 7, no. 2, pp. 45-54, 2018. [15] p. zorlutuna, n. a. erilli and b. yücel. “lung cancer study with tobit regression analysis: sivas case”. eurasian eononometrics, statistics and emprical economics journal, vol. 3, no. 3, pp. 1322, 2016. [16] p. c. anastasopoulos, a. p. tarko and f. l. mannering. “tobit analysis of vehicle accident rates on interstate highways”. accident analysis and prevention, vol. 40, no. 2, pp. 768-775, 2008. [17] a. henningsen. “estimating censored regression models in r using the censreg package”. r packag vignettes, vol. 5. university of copenhagen, copenhagen, p. 12, 2010. [18] a. c. michalos. encyclopedia of quality of life and well-being, springer, berlin, 2014. [19] m. h. odah. “asymptotic least squares estimation of tobit regression model. an application in remittances of iraqi immigrants in romania”. international journal of applied mathematics and statistics, vol. 8, no. 2, pp. 65-71, 2018. [20] c. ekstrand and t. e. carpenter. “using a tobit regression model to analyse risk factors for foot-pad dermatitis in commercially grown broilers”. 
preventive veterinary medicine, vol. 37, no. 1-4, pp. 219228, 1998. [21] j. s. long. regression models for categorical and limited dependent variables. vol. 7. sage publications, thousand oaks, 1997. [22] a. flaih, j. guardiola, h. elsalloukh and c. akmyradov. “statistical inference on the esep tobit regression model”. j. stat. appl. probab. lett., vol. 6, pp. 1-9, 2019. [23] b. r. humphreys. “dealing with zeros in economic data”. university of alberta, department of economics, vol. 1, pp. 1-27, 2013. [24] k. a. m. gajardo. “an extension of the normal censored regression”. pontificia universidad catolica de chile, santiago, chile, 2009. [25] w. h. greene. limited dependent variables truncation, censoring and sample selection. sage, thousand oaks, ca, 2003. tx_1~abs:at/tx_2:abs~at uhd journal of science and technology | jul 2020 | vol 4 | issue 2 99 1. introduction one of the most critical and important natural resources to sustain life is water. therefore, there is an increasing awareness that assessing the quality of water is essential for abundant activities such as drinking, agricultural, industrial, hydropower generation, and recreational purposes [1], [2] and [3]. worldwide impaired quality, scarcity, and pressure on water resources due to the growing demands, requires the critical researches on quantity and quality of water [1], [4] and [5]. consequently, the necessary step in national water planning and management is assessing the quality of water [6], [7], [8] and [9]. global water resource contamination is increasing due to diverse human activities and population growth, which raised international alarm about quality of water [6]. regular monitoring the quality of water and advising ways for protecting water resources are necessary in all countries, especially in developing countries which have a rapid urbanization [10], [11] and [12]. water quality can be defined in terms of its chemical, physical, and biological characteristics which show the suitability of water for different uses such as drinking, industrial, agricultural, and domestic. observing water quality parameters for annually (or with any intervals) sampled water are the key for detecting information about the pollution level and the quality variation in water over trends of time [2], [3], [5], [7] and [8]. according to the reports conducted by the world health organization (who) in iraq show that 25% of childhood deaths are due to the water borne diseases that can be prevented. in terms of obtaining safe water and sanitation, the condition in kurdistan region is better compared with the other cities in iraq but still problems temporal variation of drinking water quality parameters for sulaimani city, kurdistan region, iraq jwan bahadeen abdullah and yaseen ahmed hamaamin* department of civil engineering, college of engineering, university of sulaimani, krg, iraq a b s t r a c t water is vital for all forms of life on earth. assessing the quality of water especially drinking water is one of the important processes worldwide which affect public health. in this study, the quality of drinking water in sulaimani city is monitored for a study period of 1 year. a total number of 78 water samples were collected and analyzed for 17 physical and chemical properties of water supply system to the city. samples of water are collected from the three main sources of drinking water for sulaimani city (sarchnar, dukan line-1, and dukan line-2) from february to august 2019. 
the results of physical and chemical parameters of collected water samples were compared with the world health organization and iraqi standards for drinking water quality. the results of this study showed that mostly all parameters were within the standards except the turbidity parameter which was exceeded the allowable standards in some cases. this research concluded that, in general, the quality of drinking water at the three main sources of sulaimani city is suitable and acceptable for drinking. index terms: seasonal, variation, water, quality, sulaimani corresponding author’s e-mail: yaseen ahmed hamaamin, department of civil engineering, college of engineering, university of sulaimani, krg, iraq. email: yassen.amin@univsul.edu.iq received: 25-08-2020 accepted: 30-10-2020 published: 05-11-2020 access this article online doi: 10.21928/uhdjst.v4n2y2020.pp99-106 e-issn: 2521-4217 p-issn: 2521-4209 copyright © 2020 abdullah and hamaamin. this is an open access article distributed under the creative commons attribution noncommercial no derivatives license 4.0 (cc by-nc-nd 4.0) o r i g i n a l re se a rc h a rt i c l e uhd journal of science and technology jwan bahadeen abdullah and yaseen ahmed hamaamin: quality of drinking water supply for sulaimani, iraq 100 uhd journal of science and technology | jul 2020 | vol 4 | issue 2 with environmental issues are existing [13]. issa and alrawi [14] developed water quality index for erbil’s three water treatment plants, they found that the quality of water fallen within the good level of water quality. according to the official directorate of water, the quality of drinking water in sulaimani city is acceptable in terms of physiochemical and biological properties, but one of the concerned problems of the city’s water supply is the yellow tint occurred in a certain time in the years, especially in spring and autumn [15]. the main sources of municipal drinking water in sulaimani city are sarchnar, dukan line-1, and dukan line-2, due to unexpected growth population in the city and economic problems, continuous water supply does not exist in the city, it has intermittent water supply of about few hours each 2 days [16]. the main objective of this study is monitoring and assessing the quality of municipal drinking water supply for the sulaimani city by evaluating the chemical and physical characteristics of water sampled from the main sources of drinking water in the city. 2. materials and methods 2.1. study area sulaimani city is located in the north part of iraq which is one of the major cities in the iraqi kurdistan region with a population of over 1 million. the city is stretched between two mountains at the intersection point of longitude 45.44312° and latitude of 35.55719° with average level of about 850 m above the sea level. the weather of the city is dry and warm in summer with an average temperature of 31.5°c, while the weather is cold and wet during winter season with average temperature of 7.6oc [16] and [17]. fig. 1 shows water treatment plant and sampling locations which are both in sulaimani governorate. 2.2. water quality parameters in this study, 17 physical and chemical water parameters were used to measure the quality of drinking water for sulaimani city. temperature, electrical conductivity (ec), total dissolved solids (tds), salinity, ph, turbidity, free chlorine, total chlorine, total alkalinity, hardness, calcium, ammonia, ammonium, chloride, fluoride, nitrate nitrogen, and nitrate parameters were measured in the study. 
temperature as one of important water quality parameters has a great effect on growth and activity of aquatic life. temperature affects the solubility of oxygen in water which consequently affects the life of aquatic organisms. furthermore, biochemical reactions of aquatic bacteria can be doubled for each 10°c, to a certain limit of temperature [18] and [19]. ec of water measures the ionic content of a water sample which indicates the range of alkalinity and hardness of water. this parameter affects the acceptability of water for drinking purpose because it impacts the taste of water significantly. the amount of conductivity in any sample of water is influenced by the amount of tds concentration which is another important parameter in monitoring the quality of water. however, lime scaling and buildup can be occurred from high alkaline water. dissolve solids can occur in water from dissolving of soil minerals while it is in touch with soil layers. runoff from residential areas and farming, leaching pollutant from soil and industrial or sewage treatment to water resources can be other sources of tds [9] and [20]. salinity is a parameter that related to the presence of salt content in water. suitability of water may be rendered by the amount of salt content in it [21], [22]and [23]. ph is a parameter that indicates whether the water is acid or basic. furthermore, it shows if the water is suitable or not for various purposes. ph has a direct relation with every phase of water treatment processes. low ph cause corrosion in water distribution systems, on the other hand, high ph affects the palatability of water [21] and [22]. fig. 1. study area. jwan bahadeen abdullah and yaseen ahmed hamaamin: quality of drinking water supply for sulaimani, iraq uhd journal of science and technology | jul 2020 | vol 4 | issue 2 101 turbidity of water is occurred by the presence of suspended solids such as clay, silt, and other microscope organisms that are the very fine hard to be removed by routine water treatment methods. this parameter affects the acceptability of consuming water for drinking or other uses especially in certain industries [19]. chlorination is the most universal method of water disinfection, which is efficient and powerful way in the treatment process. after completion of the disinfection process, extra chlorine can remain in the form of residual chlorine which is very important, because it insures protecting water from recontamination especially in old water distribution systems [21]. total alkalinity is the size of water response with h+ ions and sometimes is indicated as alkali level. alkalinity relates with the scale deposition and corrosion in distribution systems. this parameter can be attributed to carbonates and hydroxides in natural water. the unit used for expressing alkalinity is mg/l caco 3 [24]. ammonia and ammonium concentration express as mg/l n and mg/l nh 3 , respectively. these parameters indicate the possibility of pathogenic micro-organisms presence in water and sewage pollution of water. in water treatment, presence of ammonia with high level can impair the chlorination process of water [19] and [21]. the characteristic of water that improves its palatability and shows the suitability of water for drinking purpose is hardness (mg/l caco 3 ) which is the soap destroy capacity of water. the main constituents of hardness are calcium and magnesium which are the widespread abundance metals in the formation of rocks [23]. 
calcium is the most important element for ensuring the normal growth and health of the human body. water with a high amount of calcium is considered very palatable and acceptable water. calcium is the primary constituent of hardness, and it is very beneficial to health. its concentration in water is expressed as mg/l ca [20]. naturally, fluoride occurs rarely in waters; the main sources of this element in water are public water supply fluoridation and industrial discharges. fluoride should be added to the water supply because of its importance for growing children's teeth and reducing tooth decay. fluoride concentration is expressed as mg/l f. chloride (mg/l cl) is a parameter that can be considered for the palatability of water, and it does not pose a hazard to human health. a high amount of chloride raises the salty taste of water [16]. nitrate nitrogen (mg/l n) and nitrate (mg/l no3) can be found in natural water due to organic wastes, sewage discharges, farming fertilizer, runoff from surfaces, nitrogen-fixing plants, and bacterial oxidation. a high concentration of nitrate is hazardous to humans, especially infants, because of its reaction with blood hemoglobin, which causes methemoglobinemia [21].
2.3. sample collection and data analysis
water samples were collected from the three main water distribution pipes on a weekly basis and after rainfall events from the three main sources of drinking water in sulaimani city: sarchnar, dukan line 1, and dukan line 2. water samples were obtained at the same time and location from the main sarchnar control station, where the three main distribution lines that feed the city with drinking water supply are located. the date and time of samplings were altered by rainfall events and any other expected changes which can affect the quality of water. sarchnar is a natural spring water source and is pumped to the city without any treatment process except chlorination. the other two water supply main lines, dukan 1 and dukan 2, come from the city's water treatment plant-1 and plant-2, which are located in peer qurban (n 35.88808° and e 44.99651°) and receive water from the lesser zab river downstream of dukan lake. in 2018, sulaimani city received the following average amounts of water from each source: 32,535,559 m3/year from sarchnar, 16,721,500 m3/year from dukan 1, and 58,893,329 m3/year from dukan 2. water samples were collected for the winter, spring, and summer seasons between february and august 2019. the samples were preserved in clean plastic bottles and transferred to the laboratory for analyzing physical and chemical parameters according to the methods defined in the standard methods for the examination of water and wastewater [25]. physical parameters such as temperature, ph, conductivity, tds, and salinity were analyzed and recorded in situ immediately after running the water taps for about 1 min and collecting the samples, using a palintest multi-parameter pocket meter (pt162). as soon as the water samples arrived at the laboratory, turbidity, free chlorine, and total chlorine tests were conducted using a hanna turbidity meter (hi98703-02) and a palintest spectrophotometer 7500 (7500 2180072). after that, the remaining water parameters were measured using the palintest spectrophotometer 7500 (7500 2180072).
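as a minimal sketch of the kind of data analysis summarized later in table 2, the per-source descriptive statistics could be computed as follows; the csv file name and column layout here are assumptions for illustration only, not the study's actual files:

import pandas as pd

# assumed layout: one row per measurement with columns 'source', 'parameter', 'value'
samples = pd.read_csv("water_samples.csv")

# min, max, mean, median, and standard deviation per source and parameter,
# mirroring the statistics reported in table 2
summary = (samples
           .groupby(["source", "parameter"])["value"]
           .agg(["min", "max", "mean", "median", "std"]))
print(summary.round(2))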
lab results of the water sample tests were evaluated and compared with the who water standards [19] and iraqi water standards [26] for drinking water (table 1).
3. results and discussion
in this study, a total of 78 samples were collected and analyzed for 17 physical and chemical properties of water from the three main sources of drinking water in sulaimani city (sarchnar, dukan line 1, and dukan line 2) from february to august 2019. table 2 shows the descriptive statistics of all water quality parameter test results obtained in the study. in this study, the temperature of water samples collected from sarchnar was close to that of dukan 2, which ranged between 15.8 and 20.4°c. the temperature of samples from dukan 1 was lower in the winter season (14°c) and higher in the summer season (23.1°c). in general, the average temperature of collected samples during the period of study was 17.8°c for sarchnar, 19°c for dukan 1, and 18.6°c for dukan 2 (fig. 2). the sarchnar source is a spring, and the intake from this source is located at the source itself, which makes it less affected by temperature rise during the summer season compared to the other sources, which are from the lesser zab river. while there were some strong correlations between the three parameters (tds, ec, and salinity), only the tds changes are shown in fig. 3. the tds values started with the minimum value in the middle of the winter season, rose continuously until march, fell again to the minimum value at the beginning of april during the intensive rainfall that occurred in the kurdistan region, and finally rose to the maximum value in august, continuing at approximately the same level to the end of the summer season. this fluctuation of tds may be due to the contribution of groundwater to the river water during the dry season, while it was low during the flood season because of the larger rainfall runoff contribution to the river flow during the spring months. according to the who standards [19] and iraqi standards [26], the range of ph should be between 6.5 and 8.5. in this study, the minimum value of ph for all samples was 7.32 for the sarchnar source and the maximum value was 8.23 for dukan 2.
table 1: drinking water standards [19] and [26]
parameter                    who standards    iraqi standards
ph                           6.5–8.5          6.5–8.5
conductivity (µs/cm)         600              1500
tds (mg/l)                   1000             1000
alkalinity (mg/l caco3)      600              -
turbidity (ntu)              5                5
free chlorine (mg/l)         5                5
hardness (mg/l caco3)        500              500
calcium (mg/l ca)            75               50
ammonia (mg/l nh3)           1.5              -
fluoride (mg/l f)            0.7–1.5          -
chloride (mg/l cl)           250              250
nitrate nitrogen (mg/l n)    10               -
nitrate (mg/l no3)           50               50
table 2: descriptive statistics of water quality parameters for samples collected from the main sources of drinking water of sulaimani city; sarchnar, dukan line 1, and dukan line 2 (values per source: min., max., mean, med., std. dev.)
parameter: sarchnar | dukan line 1 | dukan line 2
temperature: 15.8, 19.6, 17.8, 18, 0.84 | 14, 23.1, 19.0, 19.6, 2.90 | 15.8, 20.4, 18.6, 19.15, 1.32
conductivity: 357, 562, 461, 455, 79.23 | 342, 566, 466, 452, 78.79 | 333, 537, 435, 437, 75.56
tds: 251, 398, 327, 322, 55.29 | 245, 402, 331, 321, 56.15 | 235, 381, 309, 310, 53.70
salinity: 168, 269, 220, 217, 38.15 | 163, 271, 223, 217, 39.80 | 157, 258, 208, 210, 38.01
ph: 7.32, 8.1, 7.7, 7.6, 0.21 | 7.63, 8.05, 7.66, 7.71, 0.29 | 7.44, 8.28, 7.72, 7.66, 0.24
turbidity: 0.28, 10.5, 1.9, 1.41, 2.23 | 0.26, 11.6, 1.93, 0.88, 2.69 | 0.31, 14.1, 2.43, 1.15, 3.11
free chlorine: 0.15, 0.92, 0.64, 0.69, 0.15 | 0.11, 0.53, 0.33, 0.36, 0.12 | 0.65, 1.96, 1.05, 0.96, 0.34
total chlorine: 0.17, 0.94, 0.67, 0.72, 0.14 | 0.17, 0.53, 0.35, 0.37, 0.11 | 0.66, 1.98, 1.07, 1, 0.33
alkalinity: 135, 220, 190, 203, 26.98 | 130, 235, 191, 200, 28.79 | 135, 200, 169, 173, 19.60
ammonia: 0, 0.08, 0.03, 0.02, 0.02 | 0, 0.04, 0.02, 0.02, 0.01 | 0, 0.05, 0.02, 0.02, 0.01
ammonium: 0, 0.08, 0.03, 0.03, 0.02 | 0, 0.05, 0.03, 0.03, 0.02 | 0, 0.06, 0.02, 0.03, 0.02
hardness: 121, 175, 152, 151, 17.55 | 124, 193, 153, 154, 17.39 | 128, 171, 145, 144, 12.42
calcium: 48, 70, 61, 63, 6.62 | 50, 78, 62, 62, 7.03 | 52, 68, 59, 58, 4.65
fluoride: 0, 0.61, 0.2, 0.19, 0.18 | 0.01, 0.54, 0.21, 0.18, 0.17 | 0, 0.49, 0.21, 0.2, 0.14
chloride: 0, 18, 5, 5, 3.75 | 0, 15, 4, 4, 3.13 | 0, 9, 4, 2, 2.62
nitrate (n): 2.3, 5.24, 3.5, 3.16, 1.01 | 2.34, 5.14, 3.4, 3.03, 0.9 | 2.24, 6.3, 3.6, 3.11, 1.23
nitrate (no3): 10.4, 23.6, 15.4, 14, 4.46 | 10.4, 22.8, 15.0, 13.60, 4.0 | 10, 28, 16.1, 14, 5.40
fig. 4 shows that the results of ph for the three sources were close, and the results rose during the flood events that occurred in february and march. the average value of ph for sarchnar, dukan 1, and dukan 2 was 7.7, 7.66, and 7.72, respectively. as alkalinity has a high correlation with ph through buffering water against excessive change of ph, the same trend as ph is observed in its plot against time. the maximum permissible limit of total alkalinity according to the who standards is 600 mg/l. in the current study, total alkalinity in the range of 130–235 mg/l was recorded for all water samples. the value of total alkalinity was lower in the rainy seasons than in the arid one. furthermore, the results of sarchnar water samples were close to those of dukan 1, but dukan 2 water samples recorded lower results. turbidity, as one of the important parameters for drinking water, should be under 5 ntu; the average results of turbidity were within the standards, at 1.9 ntu for sarchnar and dukan 1 and 2.43 ntu for dukan 2. however, the value of turbidity rose significantly to the highest values of 10.5, 11.6, and 14.1 ntu for sarchnar, dukan line 1, and dukan line 2, respectively, because of the storm that occurred at the beginning of april. in the summer season, the value of turbidity reached the minimum value of 0.26 ntu and remained around this value until the end of the study without significant change (fig. 5). fig. 6 shows that the amounts of free chlorine residual in the collected samples from dukan 1 had the lower values, ranging between 0 and 0.5 mg/l; samples from sarchnar ranged between 0 and 1 mg/l, and dukan 2 had the higher values (0.5–2 mg/l). because the dukan 2 source feeds water to the far neighborhoods of the city, the amount of free chlorine was higher than for the other sources, so as to ensure that safe water reaches the residents. the results of free chlorine residual concentration in the water samples ranged between 0.11 and 1.96 mg/l, which is within the recommended values of the who standards [19] and iraqi water specifications [26].
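to illustrate how individual measurements can be screened against the limits of table 1, the short sketch below flags any parameter that exceeds its upper limit; the dictionary entries are copied from table 1, the example values come from the text, and the function name is an assumption for illustration:

# upper limits taken from table 1 (single-valued limits only, for illustration)
limits = {
    "turbidity (ntu)": 5,
    "free chlorine (mg/l)": 5,
    "hardness (mg/l caco3)": 500,
    "calcium (mg/l ca)": 75,
    "chloride (mg/l cl)": 250,
    "nitrate (mg/l no3)": 50,
}

def exceedances(measurements):
    """return the parameters whose measured value exceeds the table 1 limit."""
    return {p: v for p, v in measurements.items()
            if p in limits and v > limits[p]}

# example: the april storm sample from dukan line 2 (values from the text)
print(exceedances({"turbidity (ntu)": 14.1, "chloride (mg/l cl)": 9.0}))
# expected output: {'turbidity (ntu)': 14.1}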
the mean concentrations of ammonia and ammonium in this study were very low (0.02 and 0.03 mg/l) for all water sources, which is below the standard limit (1.5 mg/l). the results of ammonia and ammonium for the three sources were close to each other during the period of the study, except for several sarchnar data points, which were higher than the others with values around 0.08 mg/l. fig. 7 shows the temporal variation of ammonia in the three sources of water.
fig. 2. temporal variation of temperature.
fig. 3. temporal variation of total dissolved solids.
fig. 4. temporal variation of ph.
fig. 5. temporal variation of turbidity.
according to the drinking water quality standards, hardness should be within the limit of 500 mg/l. the value of hardness for all water samples was between 120 and 200 mg/l caco3, which is below the permissible value. fig. 8 shows that the range of hardness for the water samples from sarchnar and dukan 1 was close, and the average was around 150 mg/l caco3. the average value of hardness in dukan 2 was 145 mg/l caco3. in this study, the average value of calcium concentration for the three sources was around 60 mg/l, which is within the recommended limit (75 mg/l). as shown in fig. 9, most of the calcium concentration results for the three sources were close, and all values during the study were below the standard limit except two data points recorded for dukan 1 water samples, which exceeded the standard limit with a value of 78 mg/l. according to the who guidelines for drinking water standards [19], fluoride concentration should be under 1.5 mg/l. in the current study, the fluoride mean value was 0.21 mg/l for all sources of drinking water. at the beginning of the study, minimum results were recorded, which were sometimes near zero. the fluoride concentration continued rising until august; this rise continued in sarchnar water samples, but dukan 1 and 2 water samples started falling until the end of the study (fig. 10). the fluoride level was below the recommended level of 0.7 mg/l for healthy teeth enamel throughout. the minimum observed chloride results for the three water sources during the study were recorded in the winter season and were between 0 and 1 mg/l. most of the chloride values ranged between 1 and 10 mg/l, but on may 12 the chloride concentration for water samples from sarchnar and dukan 1 reached maximum values of 18 mg/l and 15 mg/l, respectively (fig. 11). the average value of chloride concentration in the study was 5 mg/l, which is below the permissible value of 250 mg/l.
in this study, the mean values of nitrate nitrogen and nitrate were 3.5 mg/l n and 15.6 mg/l no3, respectively, which were below the permissible limits according to the standards (10 mg/l of nitrate nitrogen and 50 mg/l of nitrate).
fig. 10. temporal variation of fluoride.
fig. 6. temporal variation of free chlorine.
fig. 7. temporal variation of ammonium.
fig. 8. temporal variation of hardness.
fig. 9. temporal variation of calcium.
fig. 11. temporal variation of chloride.
fig. 12. temporal variation of nitrate (no3).
fig. 12 shows that the results of nitrate concentration in the three water sources are similar, starting from the highest value in the winter season and then decreasing to lower values in the summer season. the reason for the increased amount of nitrate during the rainy season may be the agricultural runoff, which is one of the main sources of nitrate in water [19] and [21].
4. conclusions
from the results of this study, it can be concluded that, in general, the quality of water at the sources of sulaimani city is suitable for drinking purposes, because the analyzed physical and chemical parameters of the water samples during the period of the study were mostly within the standards recommended by the who guidelines [19] and iraqi specifications [26]. however, there were some turbidity exceedances of the permissible level during flood events and high rainfall events. furthermore, the level of fluoride present in the water is below the standard level for healthy teeth, which is between 0.7 mg/l and 1.5 mg/l. even though the sarchnar source is a natural spring and different from the other sources, high differences were not observed in the measurement results between the sources. this may be because of the huge amount of water in the lake this year, due to the significantly high amount of rainfall during the period of study, which can dilute contaminants in the impounded water. finally, this study recommends the addition of a pretreatment unit to be used during flood events to reduce the turbidity, and the addition of fluoride to the drinking water to restore and strengthen the users' teeth enamel. furthermore, it is crucial and important for future research and water quality monitoring to include heavy metal and disinfection water quality parameters.
references
[1] c. prakirake, p. chaiprasert and s. tripetchkul. "development of specific water quality index for water supply in thailand". songklanakarin journal of science and technology, vol. 31, no. 1, pp. 91-104, 2009.
[2] a. h. m. alobaidy, h. s. abid and b. k. maulood. "application of water quality index for assessment of dokan lake ecosystem, kurdistan region, iraq". journal of water resource and protection, vol. 2, no. 9, p. 792, 2010.
[3] a. c. al-shammary, m. f. al-ali and k. h. yonuis. "assessment of al-hammar marsh water by uses canadian water quality index (wqi)". mesopotamia environmental journal, vol. 1, no. 2, pp. 26-34, 2015.
[4] e. d. ongley. water quality management: design, financing and sustainability considerations-ii. in: invited presentation at the world bank's water week conference: towards a strategy for managing water quality management. world bank, united states, 2000.
[5] a. sargaonkar and v. deshpande. "development of an overall index of pollution for surface water based on a general classification scheme in indian context". environmental monitoring and assessment, vol. 89, no. 1, pp. 43-67, 2003.
[6] k. mosimanegape. integration of physicochemical assessment of water quality with remote sensing techniques for the dikgathong dam in botswana. master's dissertation, university of zimbabwe, harare, 2016.
[7] h. j. vaux. "water quality (book review)". environment, vol. 43, no. 3, p. 39, 2001.
[8] a. parparov, k. d. hambright, l. hakanson and a. ostapenia. "water quality quantification: basics and implementation". hydrobiologia, vol. 560, no. 1, pp. 227-237, 2006.
[9] a. h. m. alobaidy, b. k. maulood and a. j. kadhem. "evaluating raw and treated water quality of tigris river within baghdad by index analysis". journal of water resource and protection, vol. 2, no. 7, p. 629, 2010.
[10] h. c. kataria, m. gupta, m. kumar, s. kushwaha, s. kashyap, s. trivedi and n. k. bandewar. "study of physico-chemical parameters of drinking water of bhopal city with reference to health impacts". current world environment, vol. 6, no. 1, pp. 95-99, 2011.
[11] h. q. khan. water quality index for municipal water supply of attock city, punjab, pakistan. in: survival and sustainability. springer, berlin, heidelberg, 2010.
[12] m. alsawalha. "assessing drinking water quality in jubail industrial city, saudi arabia". american journal of water resources, vol. 5, no. 5, pp. 142-145, 2017.
[13] n. othman, t. kane and k. hawrami. environmental health assessment in sulaymaniyah city and vicinity. tech report, united states, 2017.
[14] h. m. issa and r. a. alrwai. "long-term drinking water quality assessment using index and multivariate statistical analysis for three water treatment plants of erbil city, iraq". ukh journal of science and engineering, vol. 2, no. 2, pp. 39-48, 2018.
[15] f. salih, n. othman, f. muhidin and a. kasem. "assessment of the quality of water in sulaimaniyah city, kurdistan region: iraq". current world environment, vol. 10, no. 3, pp. 781-791, 2015.
[16] d. a. m. barznji and d. g. a. ganjo. "assessment of the chemical water quality in halabja-sulaimani, kurdistan region of iraq". asian journal of water environmental and pollution, vol. 11, no. 2, pp. 19-28, 2014.
[17] y. a. hamaamin. "developing of rainfall intensity-duration-frequency model for sulaimani city". journal of zankoy sulaimani, vol. 19, no. 3-4, p10634, 2017.
[18] s. m. razuki and m. a. al-rawi. study of some physiochemical and microbial properties of local and imported bottled water in baghdad city. iraq journal of market research and consumer protection, vol. 2, no. 3, pp. 1-7, 2010.
[19] world health organization. guidelines for drinking-water quality: first addendum. 4th ed. world health organization, geneva, 2017.
[20] c. e. boyd. water quality: an introduction. 3rd ed. springer, berlin, germany, 2015.
[21] environmental protection agency. parameters of water quality: interpretation and standards. environmental protection agency, united states, 2001.
[22] m. v. ahipathy and e. t. puttaiah. "ecological characteristics of vrishabawathi river in bangalore (india)". environmental geology, vol. 49, no. 8, pp. 1217-1222, 2006.
[23] h. boyacioglu and h. boyacioglu. "surface water quality assessment by environmetric methods". environmental monitoring and assessment, vol. 131, no. 1-3, pp. 371-376, 2007.
[24] world health organization. guidelines for drinking-water quality: incorporating first and second addenda. 3rd ed., vol. 1, world health organization, geneva, 2008.
[25] l. apha, a. clesceri and a. greenberg. standard methods for the examination of water and wastewater. 20th ed. american public health association, washington, dc, 1998.
[26] icsd, wcl. iraqi criteria and standards for drinking water, chemical limits. ics: 13.060.20, iqs: 417, 2nd update 2009 for chemical and physical limits. icsd, wcl, united states, 2009.
adaptive software-defined network controller for multipath routing based on reduction of time
hemin kamal1,2, miran taha2,3
1university of human development, college of science and technology, department of computer science, sulaymaniyah, iraq, 2college of science, department of computer, university of sulaimani, sulaymaniyah, iraq, 3integrated management coastal research institute, universitat politécnica de valencia, valencia, spain
abstract
software-defined network (sdn) is a new paradigm in networking that brings programmability and intelligence to networks. the main sdn characteristic is separating network management (the control plane) from the forwarding devices (the data plane).
sdn logically centralizes the network with a programmable controller that collects global knowledge about the network. sdns can improve the performance of routing packets in networks because of their agility and the ability to create policies for a driven network. in multipath routing, the sdn controller is responsible for calculating the optimum path and alternative paths whenever a link fails. however, a high delay time in the calculation of optimum and alternative paths in multipath routing by the sdn controller is observed in recent investigations. in this paper, we propose an efficient algorithm for an sdn multipath routing controller. the mechanism of the proposed approach calculates the best path from the source to the destination based on using adaptive packet size and observing network link capacity. the proposed algorithm considers reducing the delay time of link handling when the flow traffic switches from the main path to the recovery path. as a result, this approach is compared to some state-of-the-art methods according to the delay time of choosing the best path and alternative paths in a given network topology. the sdn based on the proposed algorithm consumed approximately 1 ms for selecting recovery routes. on the other hand, the proposed algorithm can be integrated into an sdn controller, which provides better consolidation of transmission for sensitive applications such as video streaming.
index terms: adaptive algorithm, multipath routing, mininet, software-defined network, simulation
corresponding author's e-mail: hemin kamal, university of human development, college of science and technology, department of computer science, sulaymaniyah, iraq; college of science, department of computer, university of sulaimani, sulaymaniyah, iraq. e-mail: hemin.kakahama@uhd.edu.iq
received: 05-09-2020 accepted: 05-11-2020 published: 15-11-2020
doi: 10.21928/uhdjst.v4n2y2020.pp107-116
copyright © 2020 hemn. this is an open access article distributed under the creative commons attribution non-commercial no derivatives license 4.0 (cc by-nc-nd 4.0)
1. introduction
the universal connectivity network that we use today is built on different network topologies, such as the internet of things, data center networks, and mobile cellular networks, and provides pervasive global coverage at scale [1]. in addition, a multipurpose network which carries different types of data flows, such as video streams, voice over internet protocol, user datagram protocol, and transmission control protocol traffic, is significantly more cost efficient than specialized or dedicated network solutions [2]. a large amount of data passes through the network, and routing it is a very difficult process that is hugely important in networking. data should be transferred continuously without any interruption, which is one of the essential functions of the network [3]. the modern network usually suffers link failures and topology changes. link failure is a common phenomenon in computer networks that may occur anytime due to network configuration change, network device error, power outage, or many other reasons, and results in the disruption of service distribution. this disruption results in unimaginable losses in critical networks [4]. multipath routing protocols could address this problem. thus, an efficient algorithm should be presented to handle link failures and keep the flow of interrupted networks. for handling link failure, network systems keep redundant links for each path so that, in case of a link failure, the network system may calculate a new backup path and keep the system uninterrupted. keeping redundant links may create loops in computer networks [5]. on the one hand, the main problem of multipath routing is selecting the optimal path on the network and dividing the traffic among the paths [6]. on the other hand, to overcome this limitation of traditional networks in handling link failure in critical networks, one possible solution is the software-defined network (sdn), which is a new paradigm in computer networking. sdn makes computer networks programmable and more manageable by separating the control and forwarding components of the network, where the centralized sdn controller programs and manages the network devices (e.g., switches and routers) through a secured communication channel [7].
the controller detects the link failure by receiving messages from the data plane, and it can handle link failure both proactively and reactively [8]. in the reactive approach, the sdn controller provides alternative paths on demand in the sdn switches, which is time consuming. in the proactive approach, the controller installs redundant paths in the switches initially, and the switches handle link failures locally by directing the flow to the alternative path [9]. in this paper, we propose an algorithm for the sdn controller which is based on using adaptive packets to calculate multipath routing. the algorithm chooses the optimal path from the available paths on the network. the parameters of the proposed algorithm include the shortest/longest path, link capacity (bandwidth), and the number of hops. whenever the flow of the main path is down, the controller switches the traffic from the main to the backup path in a short period of time. the structure of this paper is organized as follows. we review some related works in section 2, section 3 explains the problem statement, and section 4 presents an introduction to the sdn architecture and compares it with the traditional network architecture. section 5 presents the proposed algorithm for multipath routing based on sdn. section 6 explains the experiments and the test results. finally, the conclusion and future works are given in section 7.
2. related work
the delay of calculating routes in the sdn controller is one of the main issues. there are numerous articles that investigated how sdn controllers decide to select the optimum path and alternative paths. the issue has been investigated in both academic and industrial areas. in this section, we review some related works in detail to explain the existing gaps regarding the delay time of calculating paths by sdn controllers. sharma et al., 2011 [10], detected the decision time problem for multipath routing, namely the hard timeout and idle timeout when links fail or the topology changes. they proposed a fast link restoration mechanism on a multipath routing architecture. the proposed mechanism is based on a new address resolution protocol (arp) request, and the controller is able to restore the path to an alternative path within 12 ms regardless of any timeout of the traffic flow. this mechanism is decentralized because it is based on the arp request on the local switch. sgambelluri et al., 2013 [11], examined link failures in sdn and designed a mechanism to enhance the openflow architecture to protect segments on the network. the novel mechanism provides an effective network by providing different priorities of resource utilization for the main path and the backup path based on openflow segment protection. the presented mechanism implemented the osp scheme, enabling fast recovery in networks composed of openflow-based ethernet switches. the osp scheme does not require a full-state controller and, on failure, it does not involve the controller in the recovery process. the implementation results showed that the controller can decide to activate the backup path in around 64 ms. dorsch et al., 2016 [12], addressed the technical issue of link recovery time in dynamic topologies. they suggested a technique for improving fault tolerance to link failure based on the sdn controller. the controller uses a centralized approach for both failure detection and traffic recovery; bidirectional forwarding detection and openflow allow saving delay time when switching the connection.
the bidirectional forwarding detection protocol detects local link failure between two switches; the drawback of this technique is that the negotiation module needs to spend a period of time notifying the failure detection result. the proposed mechanism needs 4.5 ms to change the path to another path. lin et al., 2016 [13], addressed the link failure and link congestion problem in data center networks; link congestion occurs when a link is carrying too much data. a typical effect is packet loss, which causes an actual reduction in network throughput. they proposed a link handling mechanism for sdn-enabled data center networks for traffic recovery. the sdn controller calculates the main and backup paths based on a link weighting function. the experiment results showed a recovery time of <100 ms. the proposed mechanism was designed for data center networks, and they did not address the scalability issue. hwang and tang, 2016 [14], focused on controller response time under normal conditions and in the presence of an outage, which is generally impacted by network latency. they presented a mechanism to deal with link failure handling issues. based on this mechanism, the controller pre-establishes the multipath routing from the source to the destination. when the current path goes down in the sdn, the controller can switch the traffic over to another active path in 40 ms on average. aldwyan and sinnott, 2019 [15], investigated the delay issue in multipath routing mechanisms, which raises a problem for sensitive applications. they introduced new approaches to improve the responsiveness of applications by providing latency awareness of the failed link. they proposed an algorithm for path selection in the datacenter based on sdn. their work utilizes container technologies and a microservice-based application architecture. the approach autonomously generates latency-aware failover capabilities by providing deployment plans for microservices and their redundant placement across multiple datacenter networks, with the goal of minimizing the switchover time; as per the implementation results, the link handling time was reduced to 40 ms. hsieh and wang, 2019 [16], addressed the timeout issue for sdn multipath routing when the main link fails and the packet traffic does not arrive at the destination on time. the authors proposed an efficient mechanism to handle link failure detection and recovery. the proposed method is based on a multicontroller sdn architecture. the topology contains local and global controllers to decide the best path by calculating the link cost weight based on switch-controller propagation delay and load standard deviation. in the experiment results, the implemented mechanism can handle the failed link with only 7 ms of delay time. therefore, the proposed algorithm for the sdn controller is different from the aforementioned approaches: it uses adaptive packet size to select recovery routes in a given network topology, and further it reduces the delay time taken to select the paths compared to those recent works.
3. statement of the problem
routing algorithms and protocols are among the most important processes of selecting routes in static and dynamic networks.
although internet services are utilized by end users, the users suffer due to the inefficient design of routing algorithms and the quality of service parameters such as bandwidth, delay, packet loss, and jitter. further, sdn controllers can be programmed to control the packet flows, and this programmable approach is used to provide multipath routing; the controller decides to specify a path to integrate the transmission. the decision process of multipath routing takes time, called the delay time; this time can be dynamic when the controller selects the main path and the recovery path for routing packets. however, the problem becomes complex when the controller uses insufficiently designed algorithms; thus, the delay time to retrieve the paths for the end user takes longer, and end users are unsatisfied with the service. for instance, one of the issues is calculating link capacity based on packet size and the number of packets; however, insufficient packets also introduce inaccurate path selection. it is important to take a number of parameters into consideration to design an efficient algorithm for the sdn controller: how the adaptive packet can be selected to test the capacity of the links, and how long the sniffing of link capacity is needed. to answer those questions, we design an adaptive algorithm for the sdn controller which can reduce the delay time of path selection.
4. sdn architecture
in the traditional network, control plane routing protocols have been implemented together with data plane forwarding functions, monolithically, within a router, as depicted in fig. 1. in network communication, the data is transmitted from the source to the destination and vice versa. generally, forwarding and routing processes are two important functions in the network layer [17]. the local process of the router executes the delivery of arrived packets from the input interface to the output interface. this is a simple hardware-based operation and takes a short period of time, which can be finished in a few nanoseconds [18]. another function is routing, which is a network-wide process to provide end-to-end packet delivery from the source to the destination. the routing process is more complex than forwarding and takes a long time, typically on the order of seconds, and it is implemented in software [19]. the routing algorithm calculates the paths from source to destination and finds the best path to integrate the routing. conventionally, a routing algorithm runs in each and every router, together with both forwarding and routing functions. the routing algorithm function in one router communicates with the routing algorithm function in other routers to compute the values for its forwarding table. the forwarding table contains the values of the outgoing link interface to decide where each packet should be forwarded [20]. sdn makes a clear separation between the data and control planes, implementing control plane functions in a separate "controller" service that is distinct, and remote, from the forwarding components of the routers it controls. the routing device performs forwarding only, while the remote controller computes and distributes forwarding tables [21]. the routers and the remote controller communicate by exchanging messages containing forwarding tables and other pieces of routing information.
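to make this separation concrete, the following minimal python sketch shows a centralized computation of per-switch forwarding tables of the kind a remote controller could distribute; the topology, host placement, and function name are assumptions for illustration, not the paper's implementation:

import networkx as nx

# hypothetical topology: three switches and the switch each host attaches to
graph = nx.Graph()
graph.add_edges_from([("s1", "s2"), ("s2", "s3"), ("s1", "s3")])
hosts = {"h1": "s1", "h2": "s3"}

def compute_forwarding_tables(g, host_map):
    """centralized 'controller' step: for every switch, map each host to its next hop."""
    tables = {sw: {} for sw in g.nodes}
    for host, edge_switch in host_map.items():
        # shortest paths from every switch toward the host's attachment switch
        paths = nx.shortest_path(g, target=edge_switch)
        for sw, path in paths.items():
            tables[sw][host] = "local" if sw == edge_switch else path[1]
    return tables

# the controller would then push these tables to the forwarding devices
print(compute_forwarding_tables(graph, hosts))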
the control plane approach is at the heart of sdn, as shown in fig. 2, where the network is "software defined" because the controller that computes forwarding tables and interacts with routers is implemented in software [22]. the data plane is also a part of the network; its functionality only forwards the network packets as instructed by the controller. the data plane cannot take any forwarding decision [23]. the communication between the data plane and the control plane is performed by openflow messages. a secured channel is used to connect the switches to the controller [24]. fig. 3 illustrates the general description of the sdn layered architecture. the architecture generally comprises three different layers. the bottom layer is the infrastructure layer, where the network forwarding equipment resides; it relies on a new layer, the control layer, to provide it with its configuration and its forwarding instructions. the middle layer, or control layer, is responsible for configuring the infrastructure layer; it does that by receiving service requests from the third layer (the application layer) and mapping the service requests onto the infrastructure layer in the most optimal manner possible, dynamically configuring that infrastructure layer. the third layer is where cloud (internet) applications or management applications place their demands for the network on the control layer. in sdn, each of these layers, and the application program level interfaces between them, is designed to be open, and provides agility by logically centralizing the full configuration of the networks.
fig. 1. forwarding table in traditional networks.
fig. 2. forwarding table in software-defined networks.
fig. 3. architecture of software-defined networks.
5. proposed algorithm
in this section, the description of the proposed algorithm, selecting the main and the backup multipaths, and choosing an adaptive packet size to reduce the delay time of the multipath routing will be explained in detail. the decision of path selection in multipath routing is one of the most important parts for the sdn controller; the application layer multipath routing algorithm should be designed in sdn to provide adequate path selection with minimum cost. to provide the best path selection, we propose an algorithm which uses important metrics to calculate the paths. the proposed mechanism for multipath routing describes a simple method to select a path set for the multipath openflow controller (ofc), which calculates the best main and backup path from source to destination; it is based on topology discovery information from the controller, as shown in fig. 1. the sdn controller's algorithm chooses one best path as the main path; two important metrics are considered to select the main path, namely the availability of the links' bandwidth and the path length (number of hops). the mechanism calculates the link capacity by sending a series of adaptive packets with minimum cost. the controller's algorithm sets the main and backup path in the corresponding table of the individual openflow switches. if there is no failure during the transfer, the incoming packet uses the main path to transfer the flow to the destination.
when the main path fails, the ofc obtains a notification of the failure from the openflow switch, and the controller automatically switches the flow to the backup path to transfer the packets to the destination. the description of the proposed system algorithm is shown in fig. 4.
fig. 4. description of proposed algorithm function.
to manipulate and develop a multipath mechanism that can reduce network failure, the configured backup path traces the existence of the main path record in the flow table. under normal conditions, the backup path stays in standby mode. when there is no failure or any change of network circumstances, the timeout of the selected backup path automatically expires, and as soon as the record is forcibly removed from the flow table, the backup path will be in active mode and continue to run traffic, which guarantees the minimum network failure time. the mechanism is implemented using the following algorithm. the proposed algorithm uses depth-first search to achieve the feature of finding the multiple paths between two nodes. after finding all the paths between source and destination, the depth-first search path-finding algorithm explores possible vertices in a graph by finding the deepest vertex in the graph first before it backtracks to find other possible vertices using a stack, and it also evaluates those paths to find the best path for the transmission. this part is rather simple: first calculate the link cost between two nodes; logically, a packet might have more delay passing through a lower-capacity link than a higher-capacity one. respectively, it will take less time to cross a higher-bandwidth link than a lower-bandwidth link. the proposed algorithm uses this logic to calculate the cost: higher link capacity has a lower cost, and lower link capacity has a higher cost. the proposed algorithm uses the following formula to calculate the cost, as explained in equation (1).
cost = reference link capacity / interface link capacity in bps (1)
the reference (link capacity) bandwidth was defined as an arbitrary value in sdn, taking fast ethernet, which uses 100 mbps (10^8 bps), as the reference bandwidth. with this bandwidth, our equation would be (cost = 10^8 / interface bandwidth in bps). for instance, if the link capacity is 10 mbps, the link cost is 10, and if the link capacity is 100 mbps, the link cost equals 1; then the link costs along the path are added to get the path cost, and the optimal path is obtained according to the cost of the path. considering ospf, calculating the interface cost is an indication to compute the maximum number of packets across a certain interface. the cost of an interface is inversely proportional to the bandwidth of that interface. a higher bandwidth indicates a lower cost; a lower bandwidth indicates more overhead (higher cost) and time delays. after that, the controller defines properties of packets for every switch to execute specific actions such as forwarding or dropping the packets. after getting all available paths, the controller adds them into the sdn switches. the state of the path selection algorithm configuration is external to openflow. the proposed algorithm is based on bucket weights, which provide equal load sharing. when a specified port or link goes down, instead of the packets dropping, the controller decides to switch the traffic over to the backup path. this process reduces the disconnection time of a downed link or switch.
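to make the path enumeration and cost calculation described above concrete, the following minimal python sketch enumerates all loop-free paths with depth-first search and ranks them by the cost rule of equation (1); the adjacency list, the link capacities, and the function names are assumptions for illustration, not the authors' controller code:

REFERENCE_BW = 10**8  # 100 mbps reference bandwidth, as in equation (1)

# hypothetical topology: adjacency list and per-link capacities in bps
adjacency = {
    "s1": ["s4", "s6"],
    "s4": ["s1", "s5", "s7"],
    "s6": ["s1", "s5", "s7"],
    "s5": ["s4", "s6", "s8"],
    "s7": ["s4", "s6", "s8"],
    "s8": ["s5", "s7"],
}
capacity = {  # assumed capacity of each link in bps (both directions)
    ("s1", "s4"): 40e6, ("s1", "s6"): 10e6, ("s4", "s5"): 50e6,
    ("s4", "s7"): 30e6, ("s6", "s5"): 60e6, ("s6", "s7"): 20e6,
    ("s5", "s8"): 70e6, ("s7", "s8"): 10e6,
}

def link_cost(u, v):
    """equation (1): cost = reference bandwidth / interface bandwidth in bps."""
    bw = capacity.get((u, v)) or capacity.get((v, u))
    return REFERENCE_BW / bw

def find_all_paths(src, dst, path=None):
    """depth-first search that enumerates every loop-free path from src to dst."""
    path = (path or []) + [src]
    if src == dst:
        return [path]
    paths = []
    for nxt in adjacency[src]:
        if nxt not in path:  # avoid loops
            paths.extend(find_all_paths(nxt, dst, path))
    return paths

def path_cost(path):
    return sum(link_cost(u, v) for u, v in zip(path, path[1:]))

all_paths = sorted(find_all_paths("s1", "s8"), key=path_cost)
main_path, backup_paths = all_paths[0], all_paths[1:]
print(main_path, backup_paths)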
equation (2) explains the weight calculation for the proposed algorithm.
weight = (1 − pw / ∑pw) × 10 (2)
where weight represents the bucket weight and pw is the cost of a path. furthermore, the detail of the pseudo code is explained in algorithm 1.
algorithm 1: multipath routing algorithm in sdn
input: host x, y = h1, h2, h3, …, hn; data transfer between host x and host y
output: optpath
1: sdn_c → sdn controller
2: n → switch number (srcswitch, desswitch)
3: cost → 0
4: datapath → path[]
5: while sdn_c do:
6:   for j in n do
7:     j.setfirstswitch(srcswitch)
8:     while (nextswitch != desswitch)
9:       for i in range(len(path) - 1):
10:        cost += get_link_cost(path[i], path[i+1]) // calculate link cost
11:        return cost
12:      end for
13:      nextswitch = random(n)
14:      if (nextswitch in datapath)
15:        nextswitch = random(n)
16:      else
17:        j.append(nextswitch)
18:      end if
19:    end while
22:    path.append(j)
23:  end for
24:  optpath = getlestcost(datapath) // evaluate the link cost and choose best path
26: end while
27: return optpath
link failure handling with sdn is an emerging concept; different fast failover mechanisms have been proposed, as explained in the state of the art. we propose an adaptive failover mechanism for the sdn controller to make it more efficient and applicable when a delay of a few milliseconds can be a challenge. the proposed mechanism for handling path failures and restoration involves the controller calculating all available paths from the source to the destination and installing the flow entries on the switches. the controller contains information about the complete topology. in case of any path or link failure, a switch informs the controller. when the controller is notified about a path or link change, the controller checks each calculated path p to see whether it is affected by the link change or not. if any path is affected, the transmission rate is adaptively decreased by the controller; further, the controller changes the current link to the backup link considering the minimum rate. it also checks whether the flow adds to the entries in the openflow switches for the older path. if so, then the controller removes all flow entries from the openflow switches for the older path and adds them in the switches for the new path. the connection of hosts might be recovered in a short time, and controller overloading is also reduced.
6. experiments and test results
6.1. testbed description
we used different tools for conducting our experiments, creating the experimental network, implementing multipath routing mechanisms, observing the network's behavior, and calculating the delay. we installed the sdn environment using virtualization, installing ubuntu 18.04, which is one of the newest ubuntu versions, to implement the sdn scenarios [25]. mininet has been used for implementing the research in sdn and openflow [26]. we can run unmodified code on virtual hardware on a simple pc using mininet to simulate the sdn and make a centralized network with a centralized controller, which can use both a command line and an api [27]. for writing our scripts, we use the api provided by the ryu sdn controller for multipath routing. furthermore, the ping tool is used to observe the path of the packets. ping is a common and quick tool to measure the latency and how long it takes one packet to get from source to destination [28]. this time is measured by the local clock in the pinging device, from sending the request to getting the reply.
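the parameter sweep used later in the link handling tests (table 1) can be reproduced with standard ping options; a minimal python sketch, assuming h2's address 10.0.0.2 from the emulated topology and subtracting the 8-byte icmp header from the total packet sizes, is:

import subprocess

# sweep mirroring the link handling tests: total packet sizes of 32/64/128 bytes
# (the -s payload excludes the 8-byte icmp header) and intervals of 0.5/1/2 s;
# 10.0.0.2 is h2's address in the emulated topology
for payload in (24, 56, 120):
    for interval in (0.5, 1, 2):
        subprocess.run(["ping", "-c", "20", "-i", str(interval),
                        "-s", str(payload), "10.0.0.2"])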
the interval time between each packet is 1 s, and the default packet size is 56 bytes plus 8 bytes for the packet header. in addition, we use the vlc application to generate the video streaming, which is very sensitive traffic through the network [29]. to apply the tests, we use a workstation with these characteristics: hp workstation, processor: 2.4 ghz core i7, memory: 8 gb, graphics card: nvidia 1 gb, and hard disk: 500 gb.
6.2. experiments
to achieve our goal of implementing the multipath routing and reducing the link handling time in an sdn environment using the openflow protocol and flow tables, we implement the proposed algorithm as described in section 5. therefore, we apply different experiments to evaluate the proposed algorithm; we demonstrate that our system performs using proof-of-concept tests to show that the controller can find the best path among the available paths on the network and can handle the failed link in the shortest time without any packet loss. therefore, to show the accuracy and efficiency of the proposed algorithm in the test results, we provide a video streaming test; the videos are streamed over the vlc application from source to destination using multipath routing. the received video is compared to the original one to see how link handling affects the quality of the videos.
6.2.1. network topology
the scenario tests the multipath network topology. we implemented the sdn using a python script with the apis of mininet and ryu to create the data plane and the controller. the network topology consists of two hosts, h1 and h2, with the ip addresses 10.0.0.1 and 10.0.0.2, respectively. the network topology contains eight switches, s1, s2, s3, s4, s5, s6, s7, and s8. the switches are enabled with openflow version 1.3, and there is one remote sdn controller, c0. there are seven available paths between the hosts. the path capacities are 10, 20, 30, 40, 50, 60, and 70 mbps. this simple scenario is used to observe whether the presented solution works correctly, as well as to observe the effect of the parameter values on the proposed algorithm. in addition, fig. 5 depicts the general description of the proposed network topology to find and calculate the link handling time.
fig. 5. multipath network topology.
6.2.2. multipath routing based on video streaming
in this experiment, we apply a video stream test on the implemented sdn topology, not only to evaluate the network performance of different network services, but also to determine the network path, for which the controller detects the traffic of the links throughout the network. in this test, we use the vlc application to provide video streaming, with h1 as a video stream server sending the video stream (broadcast) to the network and h2 receiving the video stream as a client. the result of this experiment showed that the link handling of the streamed video traffic from the main path to the backup path did not have any effect on the video stream service, and no frames froze on the client side; the video streaming could be transferred through the network without any delay or interruption. therefore, in another test, the video stream is transmitted over the network, the main link is brought down manually, and the flow traffic is monitored to observe the controller's activity; the controller moves the traffic from the fast path to the slow path (compared with each other) without any disruption in the video streaming. fig. 6 explains the change of the traffic path in the proposed multipath network topology.
fig. 6. path failing and handling for the proposed network topology. (a) flow over s1-s6-s7-s8, (b) flow over s1-s6-s5-s8, (c) flow over s1-s4-s7-s8, (d) flow over s1-s4-s5-s8.
6.2.3. link handling time
in this scenario, we test with the icmp protocol between h1 and h2 with different packet sizes (32, 64, and 128 bytes; an 8-byte header in each packet, with the remainder being the datagram) and time intervals (0.5, 1, and 2 s). the experiment results are stored in a text file. in this experiment, the main path is brought down at a specific timestamp, and we compare the times at this timestamp. the results are presented in table 1. in the first test, the obtained result is 0.82 ms, which is the minimum delay time of link handling; the packet size for this test was 32 bytes and the time interval between two packets was 0.5 s, which is the optimal data transfer for multipath routing.
table 1: relation between packet size and link handling time
test no.  packet size (byte)  interval (s)  number of packets  link handling time (ms)
1         32                  0.5           20                 0.82
2         32                  1 (default)   20                 0.87
3         32                  2             20                 1.23
4         64 (default)        0.5           20                 1.14
5         64 (default)        1 (default)   20                 1.28
6         64 (default)        2             20                 1.11
7         128                 0.5           20                 1.23
8         128                 1 (default)   20                 1.26
9         128                 2             20                 2
the experiment results showed that the link handling time according to the proposed algorithm is approximately 1 ms, so the proposed algorithm provides the minimum link handling time compared to the papers mentioned in the related work. fig. 7 illustrates the comparison of the link handling time between the proposed and related works' algorithms.
fig. 7. comparing the link handling time on proposed and related works algorithm.
6.2.4. video stream test experiment
in this test experiment, we develop the testbed to be ready for streaming videos. the purpose of this test is to show the path recovery time of the proposed mechanism versus the traditional mechanism. we use 40 s of bigbuckbunny [30], encoded with the x.264 tool [31] to sd quality resolution; further, the average bitrate of the streamed video approximately reaches 1000 kb/s. the bitrate of the video is dynamic, which means the bitrate can be lower or higher than 1000 kb/s, depending on the size of each i-frame. therefore, the path recovery time of multipath routing is observed for streaming video between the sender and the receiver. when a path fails, the recovery paths are used. to save the experiment results, we monitor the network throughput and the bitrate of the streamed video; the information is captured at the entrance of the receiver side and extracted for analysis. as shown in fig. 8, when the process is initiated, the video is streamed to the receiver (end user); the received video at the client side is received through the vlc application. in this experiment, the current path fails at the 10th second of video time, and the programmed sdn controller takes the optimum path to integrate the transmission. the process using the proposed algorithm took 3 s, meaning that the video is frozen for 3 s at the client side; this value can also change
according to the network conditions. however, in the traditional approach, the process takes 8 s, which means the video is frozen for 8 s and the client waits for a long time until the client application's buffer has received the entire packets. this test shows that using the previous programmable sdn controller (traditional) for multipath routing leaves end users unsatisfied with sensitive applications. on the other hand, the proposed mechanism can provide a better result and less time consumption for path recovery.
fig. 8. comparing path recovery time in proposed and traditional mechanisms.
7. conclusion and future work
in this paper, we proposed a multipath routing algorithm for the programmable sdn ryu controller using the mininet network emulator. the proposed algorithm is based on important metrics such as adaptive packet size and observation of the network state. the sdn controller based on the proposed algorithm decides to switch the flow traffic from the main path to the optimal backup path when the main link is down. the proposed algorithm provided efficient results for selecting routes for critical network infrastructures as well as for any network systems intolerant of delays. as a result, the experimental results showed that the link handling time is approximately equal to 1 ms, which is the best result compared to other recent research mentioned in the related works. on the other hand, the proposed mechanism also provided better quality of experience for sensitive applications like video streaming. in future work, we plan to apply a reinforcement learning approach to the proposed algorithm to deploy an accurate decision for selecting the recovery path.
8. acknowledgments
this work has been done at the university of sulaimani and the university of human development.
9. declaration of conflicting interest
the author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
references
[1] m. amiri, a. sobhani, h. al osman and s. shirmohammadi. "sdn-enabled game-aware routing for cloud gaming datacenter network". ieee access, vol. 5, no. c, pp. 18633-18645, 2017.
[2] m. taha, l. garcia, j. m. jimenez and j. lloret. "sdn-based throughput allocation in wireless networks for heterogeneous adaptive video streaming applications". in: 2017 13th international wireless communications and mobile computing conference (iwcmc), pp. 963-968, 2017.
[3] p. faizian, m. a. mollah, z. tong, x. yuan and m. lang. "a comparative study of sdn and adaptive routing on dragonfly networks". in: proceedings of the international conference for high performance computing, networking, storage and analysis 2017, 2017.
[4] s. n. hertiana, hendrawan and a. kurniawan. "a joint approach to multipath routing and rate adaptation for congestion control in openflow software defined network". in: 2015 1st international conference on wireless and telematics (icwt), pp. 1-6, 2016.
[5] w. jiawei, q. xiuquan and n. guoshun. "dynamic and adaptive multi-path routing algorithm based on software-defined network". international journal of distributed sensor networks, vol. 14, no. 10, pp. 1-10, 2018.
[6] r. baruah, p. meher and a. k. pradhan. efficient vlsi implementation of cordic-based multiplier architecture. springer, singapore, 2019.
in: 2018 6th international conference on information and communication technology (icoict), pp. 222226, 2018. [8] m. f. ramdhani, s. n. hertiana and b. dirgantara. “multipath routing with load balancing and admission control in softwaredefined networking (sdn)”. vol. 4. in: 2016 4th international conference on information and communication technology (icoict), pp. 4-9, 2016. [9] r. wang, s. mangiante, a. davy, l. shi and b. jennings. “qosaware multipathing in datacenters using effective bandwidth estimation and sdn”. in: 2016 12th international conference on network and service management (cnsm), pp. 342-347, 2017. [10] s. sharma, d. staessens, d. colle, m. pickavet and p. demeester. enabling fast failure recovery in openflow networks. pp. 164171, 2011. [11] a. sgambelluri, a. giorgetti, f. cugini, f. paolucci and p. castoldi. “open flow-based segment protection in ethernet networks”. fig. 8. comparing path recovery time in proposed and traditional mechanisms. hemn: adaptive software-defined network controller for multipath routing based on reduction time 116 uhd journal of science and technology | jul 2020 | vol 4 | issue 2 journal of optical communications and networking, vol. 5, no. 9, pp. 1066-1075, 2013. [12] n. dorsch, f. kurtz, f. girke and c. wietfeld. enhanced fast failover for software-defined smart grid communication networks”. in: 2016 ieee global communications conference (globecom), pp. 1-6, 2016. [13] y. d. lin, h. y. teng, c. r. hsu, c. c. liao and y. c. lai. “fast failover and switchover for link failures and congestion in software defined networks”. in: 2016 ieee international conference on communications (icc), 2016. [14] r. h. hwang and y. c. tang. “fast failover mechanism for sdnenabled data centers.” international computer symposium, chiayi, taiwan, pp. 171-6, 2016. [15] y. aldwyan and r. o. sinnott. “latency-aware failover strategies for containerized web applications in distributed clouds”. future generation computing systems, vol. 101, pp. 1081-1095, 2019. [16] h. h. hsieh and k. wang. “a simulated annealing-based efficient failover mechanism for hierarchical sdn controllers”. in: ieee region 10 annual international conference, proceedings/ tencon, pp. 1483-1488, 2019. [17] f. y. okay, s. ozdemir and m. demirci. “sdn-based data forwarding in fog-enabled smart grids”. in: 2019 1st global power, energy and communication conference (gpecom), pp. 62-67, 2019. [18] a. tariq, r. a. rehman and b. s. kim. “forwarding strategies in ndn-based wireless networks: a survey”. ieee communications surveys and tutorials, vol. 22, no. 1, pp. 68-95, 2020. [19] y. zhang, z. xia, a. afanasyev and l. zhang. a note on routing scalability in named data networking.” in: 2019 ieee international conference on communications workshops (icc workshops), pp. 1-6, 2019. [20] a. jayaraman. comparative study of virtual machine software packages with real operating system, 2012. [21] j. f. kurose, r. rose. computer networking a top-down approach. 7th ed. pearson, united kingdom, 2017. [22] j. ali, s. lee and b. h. roh. “performance analysis of pox and ryu with different sdn topologies”. acm international conference proceedings series, pp. 244-249, 2018. [23] y. shi, y. cao, j. liu, and n. kato. a cross-domain sdn architecture for multi-layered space-terrestrial integrated networks. vol. 33. ieee network, piscataway, pp. 29-35, 2019. [24] m. alsaeedi, m. m. mohamad and a. a. al-roubaiey. toward adaptive and scalable open flow-sdn flow control: a survey. vol. 7. ieee access, piscataway, pp. 
107346-107379, 2019. [25] f. keti and s. askar. “emulation of software defined networks using mininet in different simulation environments”. in: 2015 6th international conference on intelligent systems, modelling and simulation, pp. 205-210, 2015. [26] s. lee, j. ali, and b. h. roh. “performance comparison of software defined networking simulators for tactical network: mininet vs. opnet”. in: 2019 international conference on computing, networking and communications (icnc), pp. 197-202, 2019. [27] c. fernandez and j. l. muñoz. software defined networking (sdn) with open flow 1.3, open v switch and ryu, pp. 183, 2016. [28] z. h. zhang, w. chu and s. y. huang. “the ping-pong tunable delay line in a super-resilient delay-locked loop”. in: 2019 56th acm/ieee design automation conference (dac), pp. 90-91, 2019. [29] m. taha, j. lloret, a. canovas and l. garcia. “survey of transportation of adaptive multimedia streaming service in internet”. network protocols and algorithms, vol. 9, no. 1-2, pp. 85, 2017. [30] available from: https://www.peach.blender.org. [last accessed 2020 jun 01]. [31] available from: https://www.videolan.org/developers/x264.html. [last accessed on 2020 jun 20]. . uhd journal of science and technology | jan 2020 | vol 4 | issue 1 59 1. introduction higher education (he) scenery all over the world is in a continuous state of influx and development, mainly as a result of essential challenges stemming from efforts in adopting new and growing technologies. using technology will improve he which will result in providing high-quality education and prepare the students to face the challenges of the 21st century [1]. kurdistan regional government (krg) could be a developing area in several faces, he has been developed in this region, there are 28 universities, according to the kurdistan ministry of he (mhe) [2]. cloud computing (cc) is a collection or group of hardware and software to human beings through the internet. cc provides many advantages such as steady, rapid, sample, suitability, and simultaneous accessibility of belongings at low cost in comparison with other techniques through the internet to the users. resources can be requested by the consumers depending on their requirements. these requirements can be storing data, communication, data processing, and calculation cycles needed for their applications [3]. each cloud has its own users. the services of the cloud can be accessed by the user to retain the increasing daily and safety systems in the cc environments. the specific role of a review study on the adoption of cloud computing for higher education in kurdistan region – iraq abbas m. ahmed1, osamah waleed allawi2 1department of business administration, sulaimani polytechnic university, sulaimani, iraq, 2department of  computer technology, al-hikma university college, baghdad, iraq a b s t r a c t cloud computing (cc) is considering as a popular computing model in the western world. it is still not well understood by much higher education (he) institutions in the developing world. cc will positively affect its consumers in executing their role in an economical way. it can be done using applications provided by the cloud specialist organizations. this study aims to evaluate the factors that influence the adoption of cc for he within the kurdistan region in iraq. the study was performed utilizing a non-experimental study exploratory research design. this exploratory study included an essential investigation into secondary data. 
the study development and modeling of secondary data to highlight the final results of the research. through reviewing the literature of the existing frameworks in cc adoption, it is showed that there are limited institutions developed over the latest years. moreover, he in kurdistan region needs continued attention to get government support and redesign the educational system to cover all the core aspects in a better way. here, at any time, there is a need to access the applications, software and hardware, platform, and infrastructure; the most required is to have the internet service. index terms: cloud computing adoption, higher education, education systems, kurdistan region – iraq, electronic learning access this article online doi: 10.21928/uhdjst.v4n1y2020.pp59-70 e-issn: 2521-4217 p-issn: 2521-4209 copyright © 2020 ahmed and allawi. this is an open access article distributed under the creative commons attribution non-commercial no derivatives license 4.0 (cc by-nc-nd 4.0) o r i g i n a l re se a rc h a rt i c l e uhd journal of science and technology corresponding author’s e-mail:  abbas m. ahmed, department of business administration, sulaimani polytechnic university, sulaimani, iraq. e-mail: abbas.ahmed@spu.edu.iq received: 16-12-2019 accepted: 18-03-2020 publishing: 20-03-2020 abbas m. ahmed and osamah waleed allawi: the possibility of cc application in higher education of northern iraq 60 uhd journal of science and technology | jan 2020 | vol 4 | issue 1 the cc is that it can be used through exploiting the internet and the pcs in the data centers. the role of cc is important in the academic and industrial domain [4]. as mentioned above, he is in continuous development according to the requirements of modern life. it is so-known that effectively using technology in he is one of the key factors for providing high-quality education. the cost is the main reason for the slow adopting of new technologies in he. the local societies and the whole world transformation demand huge funding and investment. these factors are difficult to come at the times of deep economic downturn and depleted budget reserves whether budgets of the government or the private institutions. the financial support provided to he institutes has sharply decreased in times of recession, leading to financial difficulties in he institutions (heis). to address their fiscal deficit, he institutes have recourse to a variety of cost-cutting measures, including important cuts to information technology (it) budgets [5]. however, in this paper, many studies have been reviewed, discussed, and critically analyzed to providing a solid literature review for future research. in addition, a wide range of case studies from past up to date is presented for a better understanding of the theory related to applying the cc in hes. articles, journals, books, and previous works had been listed in the following tasks. many well-established reviews and survey articles on applying cc in he available in the literature such as amron et al., qadri and qadri, rawajbeh et al., al-shqeerat et al. [6]-[9], singh and baheti [10]. amron et al. [6] reviewed three sectors in applying cc which will be the health-care sector, higher learning organization, and the public field. in the manner, qadri and quadri [7] reviewed numerous cc applications with emphases on the security aspect. in the work of al rawajbeh et al. [8], they clarify the roadmap of the successful adoption of cc in high education institutions. in the work of al-shqeerat et al. 
[9] provided baseline recommendations to avoid security risks efficiently when adopting cc in he. whereas, in the work of singh and baheti [10], they discussed the limitations and problems of the traditional education methods in additional education based on cc. fig. 1 shows the number of reviewed and discussed articles in this work based on the years, note, and * represents the number of review articles. 1.1. paraphrase a total of 39 research articles and five review articles are covered in this work. the review emphasizes the cc service and its applications in hes. furthermore, our work distinguishes itself from the previous by it is studying the possibility of applying the cc service in he of the kurdistan region – iraq. this review approach allows us to improve the scope and shape the direction of he based on it. this paper segmented into four parts starting with the section of introduction which describes the cc service and it is an application in education. furthermore, the related work has been discussed in section 2. furthermore, in section 3 an overview of the cc service and it is an application in hes. finally, section 4 presents the conclusion of this paper. 2. related work a literature search is a pre-requisite for reviewing the literature on any subject, in general, this is done by scanning some prominent journals and conferences exclusively dedicated to the subject, concentrating on limited outlets cannot be considered as enough justification for a literature review on cc in he as this is a recent phenomenon [2]. this is the reason why the publication channels till now are largely scattered for most of the concurrent phenomena, information science researchers and scholars are using online databases as their first literature collecting strategy. the cc aspect is useful when was implemented for some universities such as california university (uc). they found that implementation of cc enhanced the development. also, the implementation of software as a service (saas) applications cc which made the difference in increasing the advantages to students, such as make the exams online, have access to their exercises and solve it and resend it again, projects submitted by students, feedback facility between students and teachers. it also provides the facility of fig. 1. number of the reviewed article. abbas m. ahmed and osamah waleed allawi: the possibility of cc application in higher education of northern iraq uhd journal of science and technology | jan 2020 | vol 4 | issue 1 61 communication between students, using the applications by the students as well as the teachers without installing those applications on their computers and without store-related application files. furthermore, the ability to access any computer from anywhere and at any time when the internet service is available can be possible too. the core concern related to applying the cc in he fields is security issues. at the same time, this does not mean that there are some other obstacles related to trusting, trust, and assurance [11]. the educational cloud represents one of the most interesting applications of cc. to meet their most requirements, the private educational institutes are betaking toward using it technology. the increasing dependence on it requires the availability of the internet to students and institutes [12]. one of the main problems in iraq is the lack of network infrastructure. abdusalam et al. 
[13] mentioned useful information on the challenges and current status of the backbone infrastructure and internet in the krg which provided by private companies from several countries, namely, iraq, iran, turkey, and the others [14]. universities in krg are willing to use modern techniques in the education process and teaching methodologies. the cornerstone of the modern education system is using information and communication technologies (ict). unfortunately, the mhe in kurdistan suffers from a lack of ict infrastructure in its governorates [15], and clearly; this means that establishing ict infrastructure for universities requires extensive time, investment, and efforts. furthermore, some researchers in masud et al. [1] aimed to improve an instrument to investigate the factors of cc service based on the theory of planned behavior. some researchers focused on the materiality of each dimension and weight of each sub-dimension such as thabit and harjan [16]. the researchers checked the opinions of the avicenna center and gihan university academic staff with 40 questionnaires. the questionnaire also contains some points related to developing the activities of the avicenna center in erbil. thabit and harjan [16] concludes that the e-learning avicenna center has to develop a new department of training the staff to deal with e-learning centers and improves the university students’ skills to create a new generation compatible with e-learning technologies. in addition, riaz and muhammad [17] proposed that the limitation of the education requirements in growing countries like pakistan can be solved through the adoption of the cc. this action can guarantee that all the software resources and ict based possessions can be shared among learners. all the data confidentiality and integrity processed by the institutes can be defined as the data security according to many studies. this risk can be reduced through a model that complies with all the academic and administrative staffs’ requirements and at the same time, all the users’ devices are separated to reduce data theft chance. this action was done by controlling the data storage of each user device through different port from other users’ data storage [18]. amron et al. [6] reviewed three sectors in applying cc which will be the health-care sector, higher learning organization, and the public field. five key factors completely outclassed all three sectors; technology preparedness, human readiness, organization assistance, environment, and security, as well as privacy. factors of connection and feedback and access to the internet hereditary factors pertinent to the he community, generally the study is motivated by curiosity about the dependability of cc to become the leader in information storage technology. although several studies found the actual cc brings much more benefits than disadvantages, the particular negative effects of the applied cc should become also being noted especially in the aspect associated with safety and data personal privacy factors [6]. the study by qadri and quadri [7] has assessed the behavioral intention of the students of iraq, being in its infancy in terms of internet adoption; thus, going through the transformation of traditional modes of learning into e-learning modes. 
the study has employed the modified form of “technology acceptance model” (tam) model to assess the attitudinal behavior of the students of iraqi he toward the use of learning management system as the educational platform, the study has led to the conclusion that there exists a significant association between the variables under consideration. the standings of iraqi he are noted to be significantly improved with respect to the past statistics. besides, the study also affirms the credibility of the tam model in facilitating the assessment criteria for diverse technological deployments. cc has considerable standing in the heis worldwide and locally. as well as in saudi arabia, typically the it market is considered such as the largest sector in the gulf area. the saudi government offers allocated huge finance to improve the academic environment with the very best technological facilities. on the other hand, there are unique start-up universities in saudi arabia that absence e-learning tools in comparison to the elderly universities in sa, saudi universities even now slowly seek to embrace cc in the he atmosphere for distance studying and e-learning, while cc has been abbas m. ahmed and osamah waleed allawi: the possibility of cc application in higher education of northern iraq 62 uhd journal of science and technology | jan 2020 | vol 4 | issue 1 broadly used in universities within different countries to provide higher quality services to be able to he and also cc enables heis to deal with the needs regarding software and hardware modifications rapidly at lower expenses. therefore, the adoption involving cc into he encourages students’ academic level in addition to efficiency. the research came to the conclusion that there is a good urgent need to produce a new web software based on cc as well as cover some of the holes in existing web applications [8]. cc represents a great opportunity for universities so that you can take advantage of the actual enormous benefits involving cloud services and also resources in the educative process. however, the cloud end users remain concerned regarding security issues that symbolize the major obstacle which may prohibit the usage of cc on a large scale. typically, the limitations of cloud support models were investigated in addition to challenges as well as risks threaten cloud processing. the study demonstrates that the stakeholders are usually not familiar with feasible security risks or operations used to protect data as well as a cloud application. furthermore, this indicates that the many serious attacks may threaten cloud networks usually are denial of service and also phishing attacks [9]. the teaching materials may be made available through the cloud service workers to educate the customer on the available risk operations issues as it pertains to cloud usage. this shows how crucial it will be for educators who usually are cloud service users to be able to understand how to manage all their information used in often the cloud. to the students, this enhances their participation in studies, increases all their enthusiasm and motivation, therefore the time at that they study is raises while the cost is actually reduced. the students obtain limitless access to net-based teaching-learning sources needing little or absolutely no effort from the teacher. 
studying is gradually made electronic as educational institutions transfer their resources, students info system, learning management techniques, knowledge management techniques to the cloud, with that, students are capable to access the needed sources from anywhere in a versatile way [19]. the goal of sultana et al. [20] study is to determine the factors that will certainly influence cc adopting in university associated with dhaka of bangladesh. in this research, some significant factors possess been derived from information collection and data analysis in different functions of this university. the absence of proper infrastructure, services availability, and effectiveness in education are observed most important. some other factors are resource require, cloud control ability as well as lack of training of employees. an educational institution may focus on these elements to increase the utilization of cc technologies to provide studying to the student. singh and baheti [10] were conducted their study to overcome the limitations of traditional education and learning system, cc solutions tend to be very useful for academic institutions especially with regard to he institutes. along with the involvement associated with cc in learning system, students can obtain access to various sources (i.e., textbooks, magazines video lectures, demonstrative video, and lab facilities) which are not achievable in traditional education and learning system. teachers can assess students in a much better way; researchers can obtain all the facilities as well as infrastructures related to all their research field. definitely, not only teachers and learners but administrators should also opt for equipment for administration purposes. inside overall cc has different services that might be included in the actual traditional education and learning system. a study conducted by başaran and hama [21] to investigate university faculty members’ views toward the adoption of cc in he. the current status of the faculty on cc usage in education and regional differences was discussed. the data were collected through an adopted questionnaire based on these frameworks and demographic information was answered by 300 faculty members from the northern parts of cyprus and iraq. the results showed that faculty members agreed mostly on the opportunities followed by an awareness of potential threats and weaknesses and finally they accept the strengths of adopting cc in education. the study brought to light on the comprehension of faculty members’ views from comparative and integrated framework perspectives. in general sense, faculty members from the north part of iraq seem to be slightly more optimistic about the adoption of cc in educational settings. this might result from either they less frequently use cc services as compared to faculty members from the north part of cyprus who are younger in mean age and can be considered as being more capable consumers of cuttingedge technologies like cloud. interestingly, both parties are aware of the problems which could be resulted from adopting such innovation. with the number of works of literature reviewed above, it shows that a number of studies have been conducted on the adoption of cc at he in kurdistan. these studies are carried out in different environments, countries, and industries. abbas m. 
ahmed and osamah waleed allawi: the possibility of cc application in higher education of northern iraq uhd journal of science and technology | jan 2020 | vol 4 | issue 1 63 these studies showed that the cc is constantly evolving, and it became necessary for the different processes and activities within he universities, and this requires the universities to apply it and use it. despite the importance of cc usage and its role in activating the learning in iraq to enhance education, there was a lack of researches and studies on the factors that affect the adoption of cc in iraqi universities. as a result, there is a knowledge gap. this study aims to fill that gap. however, the analysis of the related work is shown in table 1. based on the above discussion, it is observed that most of the studies emphases on the readiness factors of technological, he, cost, educational, information security and cultural, as shown in table 2 below: table 1: the analysis and comparison of the related work. ref. type of service country employment place advantage disadvantage kadhim [11] cloud computing iraq basic education it can improve the educational sector it is suggesting the applying of decisionmaking a feature in education internalize storage for confidential work limited by geographical scaling hashim et al. [12] cloud computing north iraq higher education it gives the students an open and flexible environment by applying the vcl in bayan university the proposed system does not test yet the study is limited only on the bayan university abdusalam et al. [13] cloud computing north iraq governments organization it assists in locating missteps in the implementation stage it is focused on the information security aspect the study is limited on only three status which are dhok, erbil and sulaymaniyah al-hashimi et al. [14] cloud computing north iraq higher education the possibility of applying the cloud computing services in the north iraq universities operations have been proposed for budget reductions limited by geographical scaling abdulkadhim et al. [15] electronic document management system iraq governments organization it draws on the research results for the implications of it managerial practice it provides enhance in managing the edms implementation process in government the study focused only on government organizations asadi et al. [22] cloud computing general higher education demonstrated validity, reliability, simplicity, and functionality of the f the theory of planned behavior – cloud computing services use questionnaire tpb – ccsq the proposed system is tested only in higher education thabit and harjan [16] electronic learning north iraq higher education spread the culture of applying the e. 
learning in avicenna center of erbil develop a new department of training the staff to deal with e-learning centers the study focused only on the avicenna center of erbil riaz and muhammad [17] cloud computing pakistan higher education it presents the usability evaluation of public cloud applications across three universities in pakistan from stakeholders’ perspective, i.e., (teachers and students) they did not take into consideration the applying of google sites to find out the effects of public cloud application in the education sector nofan and sakran [18] cloud computing general education it is given a better understanding of the conception of cloud computing technology and its impact on teaching and learning in institutions they did not take into consideration the information security aspects ariwa and aiwa [19] cloud computing nigeria higher education cloud computing in nigeria will transform the traditional education model to computer-based virtual applications with a focus on e-pedagogy the study is limited only in nigeria the study is limited only in higher education sultana et al. [20] cloud computing bangladesh higher education it identified the factors that will influence cloud computing adoption in university of dhaka of bangladesh the scope of the study is limited only at the university of dhaka başaran and hama [21] cloud computing turkey, iraq higher education it offered education-specific solutions to institutions regarding cloud computing adoption they did not take into consideration the information security aspects vcl: virtual computing laboratory abbas m. ahmed and osamah waleed allawi: the possibility of cc application in higher education of northern iraq 64 uhd journal of science and technology | jan 2020 | vol 4 | issue 1 3. an overview of cc and its application in he in this section, an overview of cc and it is an application in hes is given. furthermore, it is consisting of four parts, which are cc definition, the benefits of cc, the applying of cc service in education, and the employing of cc service in he, as shown in fig. 2. 3.1. cc definition there are a set of important definitions and reviews on cc. the which is the national institute of standards and technology defined the cc as a model which enables handy, a network access when required to a shared pool of configurable computing resources, for example, networks, servers, storage, applications, and services that can be swiftly provisioned and released with the minimum effort of management or service provider interaction [11]. on a more elementary level, cc can be a systematic way for managing a set of virtual computers somewhere automatically and control them in such a simple way to create, manage, or even destroy over the network, without human action [12]. there are various security types within the cc technique, which can include networks, databases, operating systems, resource scheduling, virtualization, transaction management, concurrency control, and memory management [4]. hashim et al. [13] referred that the key benefits of the cc are the ability to allow users to access data and software anywhere whenever there is internet service available and the ability to smooth sharing of learning materials and data, while the concerns about the security and data privacy can be considered as the main obstacle in this field. the cc provides many types of services and when there is a full understanding of these services, it will be clear what this approach is all about. 
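to make the idea of managing virtual computers "automatically ... without human action" concrete, the following is a minimal sketch of programmatically launching and releasing a virtual machine; it assumes an amazon ec2 account with the boto3 sdk already configured, and the region, machine image id, and instance type are placeholders rather than values taken from any of the studies reviewed here.

```python
import boto3

# the whole life cycle of a virtual computer driven by a few api calls
ec2 = boto3.client("ec2", region_name="eu-west-1")      # region is a placeholder

def create_lab_vm():
    """launch one small vm, e.g. for a student lab session."""
    resp = ec2.run_instances(
        ImageId="ami-0123456789abcdef0",                 # placeholder machine image
        InstanceType="t3.micro",                         # placeholder instance size
        MinCount=1,
        MaxCount=1,
    )
    return resp["Instances"][0]["InstanceId"]

def destroy_lab_vm(instance_id):
    """release the vm, and its cost, once the session ends."""
    ec2.terminate_instances(InstanceIds=[instance_id])

if __name__ == "__main__":
    vm_id = create_lab_vm()
    print("provisioned", vm_id)
    destroy_lab_vm(vm_id)
```

equivalent calls exist in the sdks of other providers; the point is only that creating, managing, and destroying machines over the network needs no manual intervention.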
the main types of cloud services can be illustrated below: • what is so-called infrastructure as a service (iaas): services provided by this level include the remote delivery (through the internet) of a full computer infrastructure (e.g., virtual computers, networks, and storage devices) [14]. the perfect example of this kind of service is amazon1 which offers s3 for storage, ec2 for computing power, and simple queue service for network communication for limited businesses and individual consumers [23], just such as computer server and processing power [22]. • platform as a service (paas) which is paas: in this field, paas offered the ability to provide a software application without the need to install the software tool in consumer computers. the cloud development environment is the cc main access tool and their examples are operating systems, software testing tools [22]. • the other layer within these services is saas and under this layer, applications are delivered through table 2 : the most common readiness factors that have been used in the field of cloud computing service in higher education. readiness factors fac. technological hr cost educational info. security cultural ref. [11]  x    x [12]      x [13]       [14]    x   [15]    x  x [22]   x x  x [16]    x x x [17]     x  [18]    x  x [19]    x  x [20]   x x   [21]   x x x  fig. 2. an overview of cloud computing and its application in higher education. abbas m. ahmed and osamah waleed allawi: the possibility of cc application in higher education of northern iraq uhd journal of science and technology | jan 2020 | vol 4 | issue 1 65 the medium of the internet as a service. this type of service is running on the provider’s infrastructures and is accessed through the client’s browser (e.g., google apps and salesforce.com) [6]. to use the required software, it can be simply accessed through the internet and this action can nominate the need to install the software itself. the complete functionality of the applications is embedded within this type of cloud service and these functionalities are a variety from the productivity such as (office-type) applications to programs just like those for the management of enterprise-resource or which refers to customer relationship management, for example, billing software, image, and video editor [22], [23]. moreover, fig. 3 shows the types of cc service. 3.2. the benefits of cc in fact, there are multi benefits or advantages for the solutions provided by cc. some of these benefits over traditional technologies are illustrated below: • mobility: in general, the current orientation is to increase the dependency on the facilities provided by mobile devices. in the he field, the students harnessed the mobile devices’ facilities to access data whether these data were a textbook, researches, syllabi, or even have the privilege to do their own homework. the applications within the cloud-based classroom can be considered as the most efficient way to make the exchange between student and faculty easier [7]. • n e w s e r v i c e s : t h e c o s t o f t r ave l i n g ( f o r t h e international students), as well as other difficulties related to attendance in the classrooms, motivate the need for starting virtual classrooms through online learning and video conferencing which is provided nowadays by many colleges and universities. 
the universities which offer the facilities to enable the students to join the classrooms from anywhere around the world using their own mobiles, computers, or tablets; could not provide such a service without the cloud servers [18]. • storage: the actual usage of the cc by the universities provides them with the ability to quickly expand storage capabilities through scalable cloud storage. the data related to students are huge, starting from their own information, their marks, their medical records, and any other data. here, the core risks are the chance to have a situation where these data can overwhelm traditional storage or even lost. the scalable cloud storage property alongside with business continuity and disaster recovery can be used to avoid such situations [9]. • efficiency: in he, there is always a striving by the universities to improve their organizations. almost 55% of the higher learning institutions looking forward to increase the efficiency and they trust in cc as the best way to achieve this goal [5]. like everything in this life, there is advantages and benefits while in the other side there is risks and limitation and the cc is not an exception (table 3 shows these risks and limitations), multiple cloud cc can the he institutes choose but they have to take in their considerations the real need and the institute strategy itself [24]. 3.3. the applying of cc service in education a r udimentar y understanding of infor mation and communications technology in the education field is one of the motivations and key factors for what can be seen as fast-changing technology. it is a necessary issue for he actors to have a full understanding of how the cloud cc is adopted as well as involved. to transform the he systems to be cloud-based systems, knowledge use and creation are a critical factor to ensure the full social, economic, and cultural transformation [24]. coping with rapidly changing software and hardware needs at a lower cost in he motivates many researchers to migrate from the classical systems toward the cc technique. the he corporations planning to use 20% of the information techniques budget allocated for them, this will be done by shifting their applications toward the cloud. the challenges that would face this transition should be addressed within inclusive cc strategy; on the other hand, this step will ensure a smooth transition as well as optimal results to increase the institute’s organizational efficiency [5].fig. 3. service models of cloud computing [22]. abbas m. ahmed and osamah waleed allawi: the possibility of cc application in higher education of northern iraq 66 uhd journal of science and technology | jan 2020 | vol 4 | issue 1 in he, it is obvious that the main beneficiaries are students and all faculty staff (academic and administrative). these users have access to the data alongside the control on those data through the internet. the privileges and activities to all the users who connected to the cloud are a variety from uploading lectures, assignments, and tests (for teachers) as well as accessing those lectures, assignments, and tests (for students) re-upload the assignments and test if necessary. the main requirement to access the cloud anywhere and anytime is the availability of internet service [12], [25]-[27]. 
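a minimal sketch of the teacher-side upload step described above, using an object-storage service as the cloud back end; amazon s3 via the boto3 sdk is assumed purely for illustration, and the bucket, course, and file names are placeholders. students would open the returned link from any device with an internet connection.

```python
import boto3

s3 = boto3.client("s3")                       # credentials assumed to be configured

BUCKET = "university-course-material"         # placeholder bucket name

def publish_lecture(local_path, course, name):
    """teacher side: upload a lecture or assignment to the course folder."""
    key = f"{course}/{name}"
    s3.upload_file(local_path, BUCKET, key)
    # a time-limited link that students can use to download the material
    return s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": BUCKET, "Key": key},
        ExpiresIn=7 * 24 * 3600,              # valid for one week
    )

if __name__ == "__main__":
    url = publish_lecture("lecture01.pdf", "it101", "lecture01.pdf")
    print("share this link with students:", url)
```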
however, what so-called “the intelligent education” or the “electronic education” can get its entire requirement to be efficient like the application software itself as well as the required database alongside with email management from the saas. on the other hand, the shortage or the breakdown here is concentrated in the dependence of this technique on the internet. since that, the internet is a key factor for the permanence of the cc; all the he users (staff and students) have to ensure the continuity of the internet service as well as ensuring that the internet connection is fast enough to have the full access to the cloud services at any time [28]. moreover, in the work of al-khayat and al-othman [29] tried to make the design of the educational cc comprehensive and complete. the proposed generic cc model is to implement many frameworks for improving the quality of education of students and academic staff besides saving time. although the using of modern ict are the cornerstone of modern teaching and learning in the engineering colleges and institutes in iraq, the extensive use of educational technologies and investing time and efforts in buying and maintaining infrastructure was disrupting the aim of establishing effective teaching and learning environment. to face this big problem and to an emphasis on quality of education, there should be an awareness about the cc benefits on cost-effectively providing better education services in addition to making a real investment of cc in providing both saas and infrastructure [29]. the cc resources can be accessed whenever required and with the minimum effort needed for managing these resources. the goal of applying the cc on he is differing from one country to another and from one region within the world to another. in africa, the main goal is establishing systems that can provide students with services like e-library. from the perspective of the organizations themselves, the goal can be summarized in reducing the cost and improve their it capabilities. at the same time, fear and uncertainty still exist from applying the cc in the he system. taking into consideration the risks and struggles in adopting the cc in the he system in africa and comparing them to the benefits, the applying of cc is inevitable [13]. the education system will make it possible for teachers to highlight the weakness areas where the students used to make mistakes, this activity can be done through the students’ records analyzing. this analysis will enable teachers to improve or even change their teaching methods. applying this technique will allow the students to have access to the lectures during the classes or even at home. sharing learning materials and hardware (servers) by all the university colleges will reduce the operation cost for the universities effectively through the utilization of cloud cc systems [30]. furthermore, fig. 4 shows the users as well as the way that these users interact with the cc. 3.4. the employing of cc service in he in the kurdistan region – iraq the strategy of clouds must be related to the strategy of various academic institutions or universities. the transformation to ccbased establishments requires full knowledge about how it can function in different aspects and principles associated with organizational structure and relations between universities and institutions alongside with advantages and risks, security issues, and policies. 
recently, the cloud cc researches illustrated the best usage practices of cc contain the following phases [31], [32]. • evaluating the current level of the various institutions from the perspective of the it requirements, framework, table 3 : the benefits and limitations of applying cloud computing in higher educations [24]. benefits limitations it is available anywhere and anytime not all applications run in the cloud support for teaching and learning data protection and security issues low cost, since it is free or pay depend on the use organizational support opening to the business environment and advanced research dissemination politics, intellectual property using green technologies to protect the environment maturity of solutions increased openness of students to new technologies lack of confidence increasing functional capabilities standards adherence offline usage with further synchronization opportunities speed/lack of internet can affect work methods abbas m. ahmed and osamah waleed allawi: the possibility of cc application in higher education of northern iraq uhd journal of science and technology | jan 2020 | vol 4 | issue 1 67 and usage: this step includes the understanding of various institutions’ it structure. • experimenting with the cc solutions: the applications and projects cannot be transformed to be cloud-based applications or projects suddenly, this shifting has to be step by step starting from experimenting with the cc technique on the pilot application and then apply it on other chosen applications. to do so, setting cloud goals just such as development and testing the environment or storing some data within the cloud and continue processing the internal processes is required [33], [34]. • selecting the cc solution: within this step, determining the data and applications, structure, functions, and core processes within the academic institutions is done. they may be grouped according to teaching, research, and administrative support. it also contains the cloud model which has been chosen (private, public, community, and hybrid) for the specified processes, functions, and applications [35], [36]. however, table 4 shows the cc types. 3.5. brief history of he in kurdistan kurdistan is a federal state located in the north of iraq that has its own law and legislation, with a populace of 5.2 million and expanding the three governorates of erbil, slemani, and duhok, cover roughly 40,000 km2 [37]. krg has realized how he is important to upgrade the federal state infrastructure. krg starts allocating a big portion slightly from its budget for the education field in general and he in specific, just like in the 2013 budget where 16% of the budget was allocated to this issue [38]. although, there was only one university in kurdistan until 1992; in general, highly valued and has a special space in society. gradually, the krg policy has adopted huge investment in the h.e field which results in opening new he institutes [39]. kurdistan region oversees 33 universities. these include 14 public and 15 state-recognized private universities, two universities with different ownership and two institutions [40], [41]. some universities provide virtual computing labs which can be considered as a virtual environment where the students have the ability to reserves pcs [42]. the reserved pcs have their own specialized hardware as well as software. this environment enables the students when they have an internet service; to access those reserved computers from anywhere. 
therefore, within the suggested technique, iaas provides virtual machines (vms) when they are needed for students of the university. the main reason for using these vms is divided into two parts; the first one is to make courses and lab exercises. the other sub-reason is to build virtual labs. however, the advantages of using such an environment can be summarized in the ability of users (students and university staff) to use the resources. on the other hand, economic incomes can be achieved for the university. the work of hashim et al. [12] showed that bayan university achieved the task of using the system of cloud education through the technology of virtualization which illustrated the main point of virtual computing laboratory. this system allows the university to provide a flexible environment for their students to have access to the computers available within the university labs as well as reducing jams of using the computer hardware. a new step in developing he in iraq has done when the mhe in kurdistan applied an online registration system which enabled the students all over the republic of iraq to select their universities and colleges. this action and many other effective steps done by the kgr, raise the number of fig. 4. cloud computing service application in higher education’s [30]. table 4: cloud computing types [43]. type description public cloud services provided by organizations and customers pay for what they actually use, in terms of being cost-effective, public cloud is considered superior over the other, on the other hand, it raises other issues such as security, privacy, and levels of controls private cloud services provided to and managed by the organizations’ staff themselves or any third party vendors, this cloud service is not provided to the general public, the private cloud could be implemented locally or remotely community cloud type of cloud provided to a specific target group of people, the services are shard exclusively among the members of this group only hybrid cloud combination of two or more cloud – the private clouds, public cloud, and the community abbas m. ahmed and osamah waleed allawi: the possibility of cc application in higher education of northern iraq 68 uhd journal of science and technology | jan 2020 | vol 4 | issue 1 the he students to 94,700 according to latest statistics 48%. of this number are female students. the academic degrees offered by the universities within the kurdistan region are diploma (2 years), bachelor (4 years), master (2 years), and doctorate (ph.d.) (3-5 years) in multiple scientific and administrative academic fields and others [44]. the educational system in kurdistan can be considered as an unstructured system; alongside using, the technologies are real problems to the mhe of kurdistan. a survey has been conducted to determine using the extent of cc in kr universities; the questions within the survey (which included 222 academic staff and students from 14 universities) were varying to cover the requirements needed to apply the cc in their universities. the researchers in ahmed et al. [3] highlights the main reasons to use the cc within the kr universities just such as reducing the cost, enhancing the university structure, develop the performance, and other related issues. on the other hand, the researchers illustrated the drawback and challenges confront applying the cc in kr just like lack of ict infrastructure, security issues, privacy, and shortage of current systems, and data and documents ownership. 
to reduce the drawbacks of public cloud and get the advantages of cc heis and universities in krg have to change their strategies from using public clouds to use their owned clouds [3]. 4. conclusion the applying of cc in he is providing many advantages such as steady, rapid, sample, suitability, and simultaneous accessibility of belongings at low cost in comparison with other techniques through the internet to the users. in this study, the existence of cc adoption frameworks for universities in developing countries along with iraq has been discussed briefly. a review of these studies showed that the university within kurdistan region in iraq needs continued attention to get government support, cc within iraqi he universities has limited developed over the latest years in the private and public universities. the findings indicate that interventions designed to increase the cc adoption need to include a focus on the practice level because that is decisionmaking regarding adoption occurs, in addition, to help it, managers, within institutions to change their workflow to obtain the most services, along with addressing privacy concerns and explicitly acknowledging. in addition, the study offered a variety of university settings to ensure higher generalizability associated with the outcomes. all these results can be mainly relevant and timely concerning the decision maker who presently faces the obstacle of cc adoption in the iraqi education environment. the limitations of this study include that there was single-source bias, as the collection of information was from secondary sources only. furthermore, the study has more of a judgmental conclusion, as there is no post data assessment. he universities should figure out how to rationalize their students’ needs and priorities, applications, and their own premise information, and after that merge their framework accordingly. finally, the most related work to this study has discussed to attempt to fill a gap in the current research to develop an adoption model that can help iraqi he universities to adopt cc. therefore, it is recommended for future researchers to conduct a field survey by collecting primary data and conducting statistical tests on the variables implicated in the findings of this study. furthermore, due to the bold role of the internet and cyberspace in human life and its impact on behavior, lifestyle, it is suggested in future works monitor the role of social media in the use of cc in education. references [1] a. h. masud, x. huang and j. yong. “cloud computing for higher education: a roadmap”. in: 2020: international conference on computer supported cooperative work in design, pp. 552-557, 2012. [2] z. a. ahmed and m. i. ghareb. “an online course selection system: a proposed system for higher education in kurdistan region government”. international journal of scientific and technology research, vol. 7, no. 8, pp. 145-160, 2018. [3] z. a. ahmed, a. a. jaafar and m. i. ghareb. “the ability of implementing cloud computing in higher education-krg”. kurdistan journal of applied research, vol. 2, pp. 39-44, 2017. [4] q. k. kadhim, r. yusof, h. s. mahdi, s. s. al-shami and s. r. selamat. “a review study on cloud computing issues”. journal of physics: conference series, vol. 2018, p. 12006, 2018. [5] v. h. pardeshi. “cloud computing for higher education institutes: architecture, strategy and recommendations for effective adaptation”. procedia economics and finance, vol. 11, pp. 589599, 2014. [6] m. t. amron, r. ibrahim and s. chuprat. 
“a review on cloud computing acceptance factors”. procedia computer science, vol. 124, pp. 639-646, 2017. [7] m. n. qadri and s. quadri. “a study of mapping educational institute with cloud computing”. international journal of scientific research in computer science, engineering and information technology, vol. 2. pp. 59-66, 2017. [8] m. al rawajbeh, i. al hadid, j. aqaba and h. al-zoubi. “adoption of cloud computing in higher education sector: an overview”. indian journal of science and technology, vol. 5, no. 1, pp. 23-29, 2019. [9] k. h. al-shqeerat, f. m. al-shrouf and h. fajraoui. “cloud computing security challenges in higher educational institutions-a survey”. international journal of computer applications, vol. 161, pp. 22-299, 2017. abbas m. ahmed and osamah waleed allawi: the possibility of cc application in higher education of northern iraq uhd journal of science and technology | jan 2020 | vol 4 | issue 1 69 [10] u. singh and p. k. baheti. “role and service of cloud computing for higher education system”. international research journal of engineering and technology, vol. 9, p. 10, 2017. [11] t. a. kadhim. “development a teaching methods using a cloud computing technology in iraqi schools”. journal of university of babylon, vol. 26, pp. 18-26, 2018. [12] e. w. a. hashim, m. o. hammood and m. t. i. al-azraqe. “a cloud computing system based laborites’ learning universities: case study of bayan university’s laborites-erbil”. book of proceeding, p. 538, 2016. [13] a. s. abdusalam, d. faiq abd, z. a. hamid, z. a. kakarash and o. h. ahmed. “study of challenges and possibilities of building and efficient infrastructure for kurdistan region of iraq”. uhd journal of science and technology, vol. 2, pp. 15-20, 2018. [14] m. al-hashimi, m. shakir, m. hammood and a. eldow. “address the challenges of implementing electronic document system in iraq e-government-tikrit city as a case study”. journal of theoretical and applied information technology, vol. 95, pp. 3672-3683, 2017. [15] h. abdulkadhim, m. bahari, h. hashim and a. bakri. “prioritizing implementation factors of electronic document management system (edms) using topsis method: a case study in iraqi government organizations”. journal of theoretical and applied information technology, vol. 88, pp. 375-378, 2016. [16] t. h. thabit and s. a. harjan. “evaluate e-learning in iraq applying on avicenna center in erbil”. european scientific journal, vol. 11, pp. 1-14, 2015. [17] s. riaz and j. muhammad. “an evaluation of public cloud adoption for higher education: a case study from pakistan”. in: mathematical sciences and computing research (ismsc), international symposium, pp. 208-213, 2015. [18] m. w. nofan and a. a. sakran. “the usage of cloud computing in education”. iraqi journal for computers and informatics, vol. 42, pp. 68-73, 2016. [19] k. c. ariwa and e. aiwa. “engineering sustainability and cloud computing in higher education-a case study model in nigeria”. international journal of computing and network technology, vol. 5, pp. 65-75, 2017. [20] j. sultana, n. nipa, and f. a. mazmum. “factors affecting could computing adoption in higher education in bangladesh: a case of university of dhaka”. applied and computational mathematics, vol. 6, pp. 129-136, 2017. [21] s. başaran and g. o. hama. “exploring faculty members views on adoption of cloud computing in education. in: proceedings of the international scientific conference. vol. 5, p. 237, 2018. [22] z. asadi, m, abdekhoda and h. nadrian. 
cloud computing services adoption among higher education faculties: development of a standardized questionnaire. education and information technologies, vol. 25, no. 1. pp. 175-191, 2020. [23] p. r. maskare and s. r. sulke. “review paper on e-learning using cloud computing”. international journal of computer science and mobile computing, vol. 3, pp. 1281-1287, 2014. [24] m. m. seke. “higher education and the adoption of cloud computing technology in africa.” international journal on communications, vol. 4, p. 1, 2015. [25] m. s. abdullah and m. toycan. “analysis of the factors for the successful e-learning services adoption from education providers’ and students’ perspectives: a case study of private universities in northern iraq”. eurasia journal of mathematics, science and technology education, vol. 14, pp. 1097-1109, 2017. [26] n. sultan. “cloud computing for education: a new dawn”? international journal of information management, vol. 30, pp. 109116, 2010. [27] h. s. hashim, k. conboy and l. morgan. “factors influence the adoption of cloud computing: a comprehensive review”. international journal of education and research, vol. 3, pp. 295306, 2015. [28] a. o. akande and j. p. van belle. “cloud computing in higher education: a snapshot of software as a service”. in: adaptive science and technology, ieee 6th international conference, pp. 1-5, 2014. [29] m. s. al-khayat and m. s. al-othman. “a proposed cloud computing model for iraqi’s engineering colleges and institutes”. zanco journal of pure and applied sciences, vol. 28, pp. 1-5, 2016. [30] p. darus, r. b. rasli and n. z. gaminan. “a review on cloud computing implementation in higher educational institutions”. international journal of scientific engineering and applied science, vol. 1, pp. 459-465, 2015. [31] a. barnwal, d. kumar. “using cloud computing technology to improve education system”. asian journal of technology and management research, vol. 4, pp. 68-72, 2014. [32] d. f. fithri, a. p. utomo, and f. nugraha. “implementation of saas cloud computing services on e-learning applications (case study: pgri foundation school)”. journal of physics: conference series, vol. 1430, no. 1, p. 012049. [33] t. bozzelli. “will the public sector cloud deliver value? powering the cloud infrastructure”. available from: http://www.cisco.com/ web/strategy/docs/gov/2009_cloud_public_sector_tbozelli.pdf. [last accessed on 2010 oct 05]. [34] k. njenga, l. garg, a. k. bhardwaj, v. prakash and s. bawa. “the cloud computing adoption in higher learning institutions in kenya: hindering factors and recommendations for the way forward”. telematics and informatics, vol. 38, pp. 225-246, 2019. [35] i. arpaci. “a hybrid modeling approach for predicting the educational use of mobile cloud computing services in higher education”. computers in human behavior, vol. 90, pp. 181-187, 2019. [36] m. r. mesbahi, a. m. rahmani and m. hosseinzadeh. “reliability and high availability in cloud computing environments: a reference roadmap”. human-centric computing and information sciences, vol. 8, no. 1, pp. 20, 2018. [37] a. a. jaffar, m. i. ghareb and k. f. sharif. “the challenges of implementing e-commerce in kurdistan of iraq”. journal of university of human development, vol. 2, 2016. [38] r. avci and n. doghonadze. “the challenges of teaching efl listeningin iraqi (kurdistan region) universities”. universal journal of educational research, vol. 5, pp. 1995-2004, 2017. [39] d. s. atrushi and s. woodfield. 
an improved content based image retrieval technique by exploiting bi-layer concept
shalaw faraj salih1, alan anwer abdulla2,3
1department of information technology, technical college of informatics, sulaimani polytechnic university, sulaimani, iraq, 2department of information technology, college of commerce, university of sulaimani, sulaimani, iraq, 3department of information technology, university college of goizha, sulaimani, iraq

a b s t r a c t
applications for retrieving similar images from a large collection of images have increased significantly in various fields with the rapid advancement of digital communication technologies and the exponential evolution in the usage of the internet. content-based image retrieval (cbir) is a technique to find similar images on the basis of extracting visual features such as color, texture, and/or shape from the images themselves. during the retrieval process, features and descriptors of the query image are compared to those of the images in the database to rank each indexed image according to its distance to the query image. this paper develops a new cbir technique which entails two layers, called bi-layers. in the first layer, all images in the database are compared to the query image based on the bag of features (bof) technique, and hence, the m most similar images to the query image are retrieved. in the second layer, the m images obtained from the first layer are compared to the query image based on color, texture, and shape features to retrieve the n most similar images to the query image. the proposed technique has been evaluated using a well-known dataset of images called corel-1k. the obtained results reveal the impact of the bi-layer idea in improving the precision rate in comparison to the current state-of-the-art techniques, achieving precision rates of 82.27% and 76.13% for top-10 and top-20, respectively.

index terms: bof, cbir, gabor, hsv, zernike
corresponding author's e-mail: shalaw faraj salih, department of information technology, technical college of informatics, sulaimani polytechnic university, sulaimani, iraq. e-mail: shalaw.faraj.s@spu.edu.iq
received: 01-11-2020 accepted: 23-12-2020 published: 05-01-2021
access this article online doi: 10.21928/uhdjst.v5n1y2021.pp1-12 e-issn: 2521-4217 p-issn: 2521-4209
copyright © 2021 salih and abdulla. this is an open access article distributed under the creative commons attribution non-commercial no derivatives license 4.0 (cc by-nc-nd 4.0)
original research article | uhd journal of science and technology

1. introduction
digital image processing plays a significant role in various areas such as medical image processing [1], image inpainting [2], pattern recognition, biometrics, content-based image retrieval (cbir), image compression, information hiding [3], and multimedia security [4]. the retrieval of similar images from a large collection of images is becoming a serious challenge with the advent of digital communication technology and the growing use of the internet. several searching and retrieval utilities are essential for end users to retrieve images efficiently from different domains of image databases such as medical, education, weather forecasting, criminal investigation, advertising, social media, web, art design, and entertainment. the query information is either in text format or in image format. different techniques for image retrieval have been developed, and they are classified into two approaches: text-based image retrieval (tbir) and cbir [5]. tbir was first introduced in 1970 for searching and retrieving images from image databases [6]. in tbir, the images are annotated with text, and the text is then used to retrieve or search for the images; such a system is generally referred to as tbir. the tbir method relies on manual text search or keyword matching against the existing image keywords, and the result depends on the human labeling of the images. the tbir approach requires information such as the image keyword, image location, image tags, image name, and other information related to the image. it needs human intervention to enter the image data into the database, which is the main difficulty of the process. tbir has the following limitations: (1) it leads to inaccurate results when the dataset annotation has been done wrongly, (2) a single keyword of image information is not sufficient to convey the overall image description, and (3) it is based on manual annotation of the images, which is time consuming [5], [7].
to overcome those mentioned limitations of tbir, a new approach for image retrieval has been invented by researcher which is known as cbir. cbir can be considered as a common tool for retrieving, searching, and browsing images of a query information from a large database of digital images. in cbir, the image information, visual features such as low level features (color, texture, and/or shape), or bag of features (bof) have been extracted from the images to find similar images in the database [8]. fig. 1 shows the general block diagram of cbir approach [7]. in general, cbir entails two main steps: the feature extraction and feature matching. in the first step, features are extracted from a dataset of images and stored in a feature vector. in the second step, the extracted features from the query image are compared with the extracted features of images in the dataset using certain distance measurement. if the distance between feature vector of the query image and the image in the database is small enough, the corresponding image in the database is considered as a match/similar image to the query image. consequently, the matched images are then ranked accordingly to a similarity index from the smallest distance value to the largest one. finally, the retrieved images are specified according to the highest similarity, that is, lowest distance value [9]. the main objective of cbir techniques is to improve the efficiency of the system by increasing the performance using the combination of features [6]. image features can be classified into two types: local features and global features. local features work locally which are focused on the key point in images whereas global features extract information from the entire image [10]. when the image dataset is quite large, image relevant to the query image are very few. therefore, it is important to eliminate those irrelevant images. the main contribution of our proposed approach is filtering the images in the dataset to eliminate/minimize the most irrelevant images, then from the remaining images find the most similar/match images. in this paper, a new cbir approach based on two layers is developed. the first layer aims in filtering the images using (bof) strategy on the basis of extracting local features, while the second layer aims to retrieve similar images, from the remaining images, to the query image based on extracting global features such as color, shape, and texture. the rest of the paper is organized as follows: section 2 presents the literature review, section 3 gives the background, section 4 addresses the proposed approach in detail, section 5 illustrates the experimental results, and finally, section 6 presents the conclusion. 2. literature review there are several cbir techniques proposed for image retrieval applications using various feature extraction methods. each of these techniques competes to improve the precision rate of finding the best similar images to the query image. in general, all the cbir techniques have two main steps; the first step is feature extraction and the second step is feature matching. this section concerns the review of the most related and important existing cbir techniques. the concept of cbir was first introduced by kato in 1992 by developing a technique for sketch retrieval, similarity retrieval, and sense retrieval to support visual interaction [11]. sketch query image database of images feature extraction feature extraction feature matching retrieved images fig. 1. 
general block diagram of content-based image retrieval approach. salih and abdulla: an improved content based image retrieval uhd journal of science and technology | jul 2021 | vol 5 | issue 1 3 retrieval accepts the image data of sketches, similarity retrieval evaluates the similarity based on the personal view of each user, and sense retrieval evaluates based on the text data and the image data at content level based on the personal view. in 2009, lin et al. proposed a cbir technique depending on extracting three types of image features [12]. the first feature, color co-occurrence matrix was extracted as a color feature, while for the second feature, difference between pixels of scan pattern was used for extracting texture feature, and the third feature, color histogram for k-mean was extracted which is based on color distribution. consequently, feature selection techniques were implemented to select the optimal features not only to maximize the detection rate but also to simplify the computation of image retrieval. in addition, this proposed technique further uses sequential for-ward selection to select features with better discriminability for image retrieval and to overcome the problem of excessive features. finally, euclidean distance was used to find the similarity in the feature matching step. the results reported in this work claimed that the proposed technique reached a precision rate of 72.70% for the top-20. huang et al. proposed a new cbir technique, in 2010, in which combined/fused the gabor texture feature and hue saturation value (hsv) color moment feature [13]. furthermore, the normalized euclidean distance was used to calculate the similarity between the feature vector of the query image and the feature vector of the images in the dataset. this proposed technique achieved the precision rate of 63.6% for the top-15. in 2012, singha et al. proposed an algorithm for cbir by extracting features called wavelet based color histogram image retrieval as a color and texture features [14]. the color and texture features are extracted through color histogram as well as wavelet transformation, for the combination of these features is robust to object translation and scaling in an image. this technique was used the histogram intersection distance for feature matching purposes. the results reported in this work claimed that this technique achieved a precision rate of 76.2% for the top-10. another cbir technique was proposed by yu et al., in 2013, that aims to investigate various combinations of mid-level features to build an effective image retrieval system based on the bof model [15]. specifically, this work studies two ways of integrating: 1scale-invariant feature transform (sift) with local binary pattern (lbp) descriptors and, 2 histogram of oriented gradients with lbp descriptors. based on the qualitative and quantitative evaluations on two benchmark datasets, the integrations of these features yield complementary and substantial improvement on image retrieval even with noisy background and ambiguous objects. consequently, two integration models are proposed, the patch-based integration and the image-based integration. using a weighted k-means clustering algorithm, the image-based sift-lbp integration achieved a precision rate of 65% for the top-20. a new cbir technique was proposed by somnugpong et al., in 2016, by combining color correlograms and edge direction histogram (edh) features to give precedence for spatial information in an image [16]. 
color correlogram treats information about spatial color correlation, while edh provides the geometry information in the case of the same image but different color. evaluation is performed by simple calculation like euclidean distance between the query image and the images in the database. researchers claimed that their proposed technique achieved 65% of precision rate for the top-15. in 2018, al-jubouri et al. proposed a new cbir technique that addresses the semantic gap issue by exploiting cluster shapes [17]. the technique first extracts local color using ycbcr color space and texture feature using discrete cosine transform coefficients. the expectation-maximization gaussian mixture model clustering algorithm is then applied to the local feature vectors to obtain clusters of different shapes. to compare dissimilarity between two images, the technique uses a dissimilarity measure based on the principle of kullbackleibler divergence to compare pair-wise dissimilarity of cluster shapes. this work further investigates two respective scenarios when the number of clusters is fixed and adaptively determined according to cluster quality. the results reported in this work illustrate that the proposed retrieval mechanism based on cluster shapes increases the image discrimination, and when the number of cluster is fixed to a large number, the precision of image retrieval is better than that when the relatively small number of clusters is adaptively determined. authors claimed that their technique achieved a precision rate of 75% for the top-10. in 2018, nazir et al. proposed a new cbir technique in which used color and texture features [18]. the edge histogram descriptor is extracted as a local feature and discrete wavelet transform as well as color histogram features are extracted as global features. consequently, manhattan distance measurement was used to measure the similarity between the feature vector of the query image and the feature vector of the images in the dataset. the reported results of the work revealed that this proposed technique achieved a precision rate of 73.5% for the top-20. pradhan et al., in 2019, developed a new cbir scheme based on multi-level colored directional motif histogram [7]. the proposed scheme extracts local structural features at three different levels. the performance of this proposed scheme has been salih and abdulla: an improved content based image retrieval 4 uhd journal of science and technology | jul 2021 | vol 5 | issue 1 evaluated using different corel/natural, object, texture, and heterogeneous image datasets. regarding to the corel-1k, the precision rate of 64% and 59.6% was obtained for top-10 and top-20, respectively. qazanfari et al., in 2019, proposed a cbir technique based on hsv color space [19]. the human visual system is very sensitive to the color and edge orientation, also color histogram and color difference histogram are two kinds of low-level feature extraction which are meaningful representatives of the image color and edge orientation information. this proposed technique was used canberra distance measurement and this work has been evaluated using three standard databases corel 5k, corel 10 and ukbench and achieved 61.82%, 50,67%, and 74.77% of precision rate for the top-12, respectively. in 2019, rashno et al., proposed a new technique for cbir in which color and texture features were used. 
hsv, red, green, and blue (rgb) and norm of low frequency components were used as color features, while wavelet transformation was used to extract texture features [20]. consequently, ant colony optimization-based feature selection was used to select the most relevant features, to minimize the number of features, and to maximize f-measure in the proposed cbir system. furthermore, euclidean distance measurement was used to find the similarity between query and database images. the results reported in this work demonstrate that this approach reached the precision rate of 60.79% for the top-20. in 2019, rana et al. proposed a cbir technique by fusing parametric color and shape features with nonparametric texture feature [21]. the color moments and moment invariants which are parametric feature are extracted to describe color distribution and shapes of an image. the non-parametric ranklet transformation is performed to narrate the texture features. these parametric and non-parametric features were integrated to propose a robust and effective cbir algorithm. in this proposed work, four similarity measurements are investigated during the experiment, namely, chi-squared, manhattan distance or city block distance, euclidean distance, and canberra distance. the experimental results demonstrate that euclidean distance metric yields better precision and recall than other distance measuring criteria. authors claimed that their technique achieved a precision rate of 67.6% for the top-15 using euclidean distance. finally, sadique et al., in 2019, proposed a cbir technique in which investigates various global and local feature extraction methods for image retrieval [22]. the proposed work uses a combination of speeded up robust features (surf) detector and descriptor with color moments as local features, and modified grey level cooccurrence matrices as global features. both global and local features are used as the only local features are not suitable when the variety of images is large. finally, fast approximate nearest neighbor search was used for matching the extracted features. authors claimed that their proposed technique achieved a precision rate of 70.48% for the top-20. 3. background this section aims to present a reasonable amount of background information about useful techniques such as (surf) feature descriptor, color-based features, texturebased features, shape-based features, and feature matching techniques. 3.1. surf feature descriptor the most popular feature descriptor is surf, which is also the most important one. however, there are other available feature descriptors. surf can be considered as a local feature. local features can provide more detailed characters in an image in comparison with global features such as color, texture, and shape. it is a rotation and scale invariant descriptor that performs better with respect to distinctiveness, repeatability, and robustness. it is also photometric deformations, detection errors, geometric, and robust to noise [23]. surf is used in many applications such as bof which is used and successful in image analysis and classification [24]. in bof technique, surf descriptor is often used to extract local feature first. in the next stage, a quantization algorithm such as k-means is separately applied to the extracted surf features to reduce high dimensional feature vectors to clusters, which are also known as visual words. then, k-means clustering is used to initialize the m center point to build m visual words. 
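a minimal sketch of the bof construction described here is given below. it assumes opencv-contrib with the non-free surf module enabled and scikit-learn for the k-means vocabulary; the function names, the hessian threshold, and the default vocabulary size m=500 (the paper's best k, reported later) are illustrative assumptions rather than the authors' implementation.

```python
# illustrative bag-of-features pipeline: surf descriptors -> k-means visual words
# -> per-image histogram of visual-word occurrences.
import cv2
import numpy as np
from sklearn.cluster import KMeans

def surf_descriptors(image_paths, hessian_threshold=400):
    """collect surf descriptors from every image in the collection."""
    surf = cv2.xfeatures2d.SURF_create(hessian_threshold)  # requires opencv non-free
    per_image, stacked = [], []
    for path in image_paths:
        gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        _, desc = surf.detectAndCompute(gray, None)
        desc = desc if desc is not None else np.empty((0, 64), np.float32)
        per_image.append(desc)
        stacked.append(desc)
    return per_image, np.vstack(stacked)

def build_vocabulary(all_descriptors, m=500):
    """quantize the descriptor space into m visual words (cluster centres)."""
    return KMeans(n_clusters=m, random_state=0).fit(all_descriptors)

def bof_histogram(descriptors, kmeans):
    """represent one image as a normalized histogram of visual-word occurrences."""
    hist = np.zeros(kmeans.n_clusters, dtype=np.float32)
    if len(descriptors) > 0:
        for word in kmeans.predict(descriptors):
            hist[word] += 1
        hist /= hist.sum()
    return hist
```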
the k-means clustering algorithm takes the feature space as input and reduces it to m clusters as output. the image is then represented as a histogram of code-word occurrences by mapping its local features to the vocabulary [24]. the methodology of the image representation based on the bof model is illustrated in fig. 2.

fig. 2. methodology of the bag of features based image representation for content-based image retrieval.

3.2. color-based features extraction
color-based features have commonly been used in cbir systems because of their easy and fast computation [14]. they can be extracted using a histogram of quantized values of hue (h), saturation (s), and value (v) in the hsv color space. the hsv color space is closer to human perception than the rgb color space. because of this robustness, rgb images are first converted to the hsv color space and then uniform quantization is applied. feature vectors are generated by quantizing h into 9, s into 3, and v into 3 levels to form a feature vector of 81 (9 × 3 × 3) bins. the representation of the hsv color feature vector of an image is presented in fig. 3.

fig. 3. hue saturation value feature vector.

3.3. texture-based features extraction
like color-based features, texture-based features are powerful low-level features for image search and retrieval applications. a considerable amount of work on texture analysis, classification, and segmentation has been done over the last four decades, yet there is still no unique definition of texture features. texture is an attribute representing the spatial arrangement of the grey levels of the pixels in a region or image. the gabor filter is one of the most widely used filters for texture-based feature extraction; it is a gaussian function modulated by a complex sinusoid of a given frequency and orientation. in the proposed approach, texture features of an image are extracted using a gabor filter bank with five scales (s) and six orientations (o). the use of multiple scales and orientations makes the features rotation and scaling invariant in the texture feature space. five scales and six orientations produce thirty magnitude responses; the mean and standard deviation are then calculated for each magnitude, which produces sixty features as the texture descriptor [14].

3.4. shape-based features extraction
shape-based features are also useful to obtain more detailed characteristics of the images. shape-based features include the turn angle, central angle, distance between two feature points, and distance between the center of mass and a feature point. zernike moments (zms) are used as the shape-based feature extractor in the proposed approach. zms are invariant to rotation, translation, and scaling [25]. furthermore, zms are robust to noise and minor variations in shape and use zernike polynomials to form feature vectors that represent an image based on its shape [26]. the proposed approach uses the first 21 zms to represent the images.
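the sketch below illustrates how the three global descriptors of sections 3.2-3.4 could be computed and fused into one vector. it assumes opencv and numpy, and mahotas for the zernike moments; the gabor kernel sizes and wavelengths, the zernike radius and degree, and the choice of libraries are assumptions for illustration, not details taken from the paper.

```python
# illustrative layer-2 global descriptors: 81 hsv color bins + 60 gabor texture
# statistics + 21 zernike shape moments = 162-dimensional fused feature vector.
import cv2
import numpy as np
import mahotas

def hsv_histogram(bgr, bins=(9, 3, 3)):
    """81-bin (9x3x3) histogram over the quantized hsv color space."""
    hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1, 2], None, list(bins),
                        [0, 180, 0, 256, 0, 256]).flatten()
    return hist / (hist.sum() + 1e-9)

def gabor_features(gray, scales=5, orientations=6):
    """mean and standard deviation of 5x6 gabor magnitude responses (60 values)."""
    feats = []
    for s in range(scales):
        lambd = 4.0 * (2 ** s)                     # assumed wavelength per scale
        for o in range(orientations):
            theta = o * np.pi / orientations
            kernel = cv2.getGaborKernel((31, 31), 0.5 * lambd, theta, lambd, 0.5, 0)
            resp = cv2.filter2D(gray.astype(np.float32), cv2.CV_32F, kernel)
            feats.extend([resp.mean(), resp.std()])
    return np.array(feats, dtype=np.float32)

def zernike_features(gray, n_moments=21):
    """first 21 zernike moments (degree 8 yields 25; truncation is illustrative)."""
    radius = min(gray.shape) // 2
    zm = mahotas.features.zernike_moments(gray, radius, degree=8)
    return np.array(zm[:n_moments], dtype=np.float32)

def fused_features(path):
    """f = [c, t, s]: color, texture, and shape descriptors concatenated."""
    bgr = cv2.imread(path)
    gray = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY)
    return np.concatenate([hsv_histogram(bgr), gabor_features(gray),
                           zernike_features(gray)])
```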
feature matching
certain similarity measurements are used to compute the similarity between the query image and the images in the database. in the proposed approach, the manhattan distance is used for the bof features, see equation (1) [27], and the euclidean distance is used for the color, texture, and shape features, equation (2) [13].

md(qf, df) = \sum_{i=1}^{l} \left| qf_i - df_i \right|   (1)

ed(qf, df) = \sqrt{\sum_{i=1}^{l} \left( qf_i - df_i \right)^2}   (2)

where qf = (qf_1, qf_2, …, qf_l) is the feature vector of the query image, df = (df_1, df_2, …, df_l) is the feature vector of a database image, and l is the dimension of the image feature. the next section presents the proposed approach in detail.

4. proposed approach
this section presents the detailed steps of the proposed bi-layer approach as follows:
1) let q be the query image, and idb = {i_1, i_2, …, i_n} be the database of n images.
2) the first layer entails the following steps:
a) q_bof and i_bof represent the feature vectors of q and idb, respectively, after the bof technique is applied.
b) the manhattan similarity measurement is used to find the similarity between q_bof and i_bof and, as a result, the m most similar images to the query image are retrieved.
3) the second layer is applied to the query image q and the m most similar images m_i obtained from the first layer. it includes the following steps:
a) extract the following features from q and m_i:
• let c = {c_1, c_2, …, c_81} be the extracted 81 color-based features that represent the 81 bins of the quantized hsv color space.
• let t = {t_1, t_2, …, t_60} be the extracted 60 texture-based features using the gabor filter.
• let s = {s_1, s_2, …, s_21} be the extracted 21 shape-based features using zms.
• let f = c + t + s be the fused/concatenated feature vector of all the extracted features above.
• finally, qf and mf_i represent the fused feature vectors of q and m_i.
b) the euclidean similarity measurement is used to find the similarity between qf and mf_i to retrieve the n most similar images to the query image.
the block diagram of the proposed bi-layer approach is illustrated in fig. 4.

5. experimental results
in this section, experiments are performed comprehensively to assess the performance of the proposed approach in terms of the precision rate, the most common confusion-matrix measurement used in the cbir research area. furthermore, the proposed approach is compared to the most recent existing works.

5.1. dataset
the experiments are conducted on the public and well-known dataset called corel-1k, which contains 1000 images in ten categories; each category consists of 100 images with a resolution of (256 × 384) or (384 × 256) [28]. the categories are organized as follows: african people, beaches, buildings, buses, dinosaurs, elephants, flowers, horses, mountains, and food.

5.2. evaluation measurements
the precision confusion-matrix measurement is used to assess the performance of the proposed approach. precision determines the number of correctly retrieved images over the total number of retrieved images from the tested database, and it measures the specificity of the image retrieval system, as presented in the following equation [21]:

precision = \frac{r_c}{r_t}   (3)

where r_c represents the total number of correctly retrieved images and r_t represents the total number of retrieved
precision can also be expressed in the following equation. precision tp tp fp = +   (4) where tp represents true positive and fp represents false positive. in this work, top-10 and top-20 have been tested. top-10 means the total number of retrieved images is 10 images, and top-20 means the total number of retrieved fig. 4. block diagram of the proposed bi-layer content-based image retrieval system for top-5. query image feature color features texture features shape features . . . … query image feature bof feature extraction bof1 bof2 . . . bofn idb feature database bof feature extraction i1 bof1 bof2 . . . i2 … . … in … idb feature database color features texture features shape features i1 c1 c2 … cn c1 c2 cnt1 t2 . . . tn t1 t2 tns1 s2 … sn s1 s2 sn i2 . . im query image feature extraction feature extraction similarity evaluation using manhattan idb feature database bof feature extraction i1 bof1 bof2 . . . bofn similarity evaluation using euclidean retrieved most similar images m most similar images to the query image are retrieved feature extraction la ye r ( 1) la ye r (2 ) feature extraction bofn . … … salih and abdulla: an improved content based image retrieval 8 uhd journal of science and technology | jul 2021 | vol 5 | issue 1 table 1: precision rate of bof technique for different number of clusters for top-20 categories different number of clusters k=100 k=200 k=300 k=400 k=500 k=600 k=700 k=800 k=900 k=1000 africa 52.05 55.85 55.25 56.25 55.85 56.35 55.8 55.5 54.15 55.5 beaches 44.35 45.7 45.85 46.35 47.2 45.4 47.15 45.35 46.65 48.15 buildings 41.3 44.75 46.5 47.55 49.05 50.4 51.25 52.25 52.55 52.15 buses 83.5 86.15 85.15 86.3 86.75 85.4 84.75 84.4 84.65 83.5 dinosaur 100 100 100 100 100 99.95 99.95 100 99.95 99.95 elephant 55.85 59.95 62 61.45 60.5 59.65 58.55 57.85 57.15 58.1 roses 84.35 84.4 84.95 85.4 85.45 84.55 85.75 86.15 84.1 86.1 horses 85.9 86.4 87.9 88.6 89.35 87.85 88.9 88.15 88.55 88.7 mountains 39.5 40.7 41.9 42.5 43.1 45.95 45 46.95 45.6 45.55 food 39.45 42 41.05 40.35 41 39.3 39.35 38.2 38.45 37.25 averages 62.625 64.59 65.055 65.475 65.825 65.48 65.645 65.48 65.18 65.495 images is 20 images. figs. 5 and 6 present examples for the query image based on top-10 and top-20. 5.3. results the experiments conducted in this work involve two phases: (a) single layer cbir model and (b) bi-layer cbir model. in the first phase, the single layer model (i.e., bof technique) is evaluated alone, and on the other hand, cbir technique based on extracting shape, texture, and color features is evaluated. in the second phase, the proposed bi-layer model is evaluated. the experiments are detailed in the following steps: 1. bof-based cbir technique is tested with different number of clusters, as bof technique depends on the k-means clustering algorithm to create clusters, which is commonly called visual words. the number of clusters cannot be selected automatically; it needs manual selection. to select the proper number of clusters, (i.e., value of k-means), the different number of clusters have been tested to obtain the best precision result of bof technique. the precision results of different numbers of clusters are illustrated in the following tables. from tables 1 and 2, one can observe that the best result is obtained when k = 500 for both top-10 and top-20. 2. the proposed cbir technique that relies on extracting shape, color, and texture features has been tested, and the results are presented in table 3. 3. the proposed bi-layer approach has been tested. 
it includes two layers: first layer implements bof technique (for k=500) and m most similar images are retrieved, m is user defined. in the second layer, shape, color, and texture features are extracted from the query image and the m images, as a result, n most similar images are retrieved. the following tables investigate the best value of m. in other words, tables 4 and 5 show testing different number of m for top-20 and top-10, respectively. results in tables 4 and 5 demonstrate that the best precision results are obtained when m = 200. for this reason, different small numbers of m, in the range of m = 100 to m = 300, are investigated to gain better precision results, tables 6 and 7. from tables 6 and 7, it is quite clear that the best result is obtained when m =225 for both top-20 and top-10. table 2: precision rate of bof technique for different number of clusters for top-10 categories different number of clusters k=100 k=200 k=300 k=400 k=500 k=600 k=700 k=800 k=900 k=1000 africa 58.9 60.4 61 62.1 62.6 60.8 62.1 61.4 61.9 59 beaches 50.8 52 51.6 51.3 52.5 50.6 51.4 51.7 50.5 51.1 buildings 50.1 54.4 56.1 58.1 58.4 59.7 59.5 58.3 60.6 61 buses 89.2 89.5 89.7 89.6 89.2 88.4 88.4 88.6 87.9 87.4 dinosaur 100 100 100 100 100 100 100 100 100 100 elephant 69.1 71.3 73.2 69.8 71.4 69.6 70.5 69.7 67.3 66.6 roses 86.6 88.3 88.5 88.4 87.8 88.9 88 87.6 88.4 89.3 horses 88.8 91.2 93.3 93.2 93.7 93.9 93.7 93.1 94.4 94 mountains 46 49.1 50.6 50.2 50.4 52.4 50.6 53 52.2 52.5 food 46.7 49 50.5 49.4 48.6 47.8 48.4 46.6 44.5 45.8 averages 68.62 70.52 71.45 71.21 71.46 71.21 71.26 71 70.77 70.67 salih and abdulla: an improved content based image retrieval uhd journal of science and technology | jul 2021 | vol 5 | issue 1 9 fig. 6. query result for top-20. fig. 5. query result for top-10. table 3: precision rate for the tested feature extractors for top-20 and top-10 categories top-20 top-10 africa 70.55 75.4 beaches 36.55 43.6 buildings 42.4 50.9 buses 72.85 79.8 dinosaur 92.45 95.9 elephant 40.5 54.3 roses 58.9 72.2 horses 84.9 89.5 mountains 34.9 40.8 food 65.7 70.5 averages 59.97 67.29 the ratio of correctly retrieved images over the total number of images of the semantic class in the image database is known as recall and it measures the sensitivity of the image retrieval system, equation (5): recall r t c s =   (5) where rc is the total number of retrieved images and ts is the total number of images in the semantic class in the database. 
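an end-to-end sketch of the two-layer retrieval procedure of section 4, together with the precision measure of equation (3), is shown below. it reuses the illustrative helpers sketched earlier (bof_histogram, fused_features); m = 225 follows the paper's reported best setting, while the function names and the array layout are assumptions, not the authors' implementation.

```python
# illustrative bi-layer retrieval: layer 1 = bof + manhattan (keep m candidates),
# layer 2 = fused color/texture/shape + euclidean (rank and return top-n).
import numpy as np

def manhattan(q, d):
    return np.abs(q - d).sum(axis=-1)            # equation (1)

def euclidean(q, d):
    return np.sqrt(((q - d) ** 2).sum(axis=-1))  # equation (2)

def bilayer_retrieve(query_bof, query_fused, db_bof, db_fused, m=225, n=20):
    """db_bof: (num_images, vocab_size), db_fused: (num_images, 162)."""
    d1 = manhattan(query_bof, db_bof)            # distances to every database image
    candidates = np.argsort(d1)[:m]              # layer 1: m most similar images
    d2 = euclidean(query_fused, db_fused[candidates])
    return candidates[np.argsort(d2)[:n]]        # layer 2: indices of top-n images

def precision_at_k(retrieved, query_label, labels):
    """equation (3): correctly retrieved images over all retrieved images."""
    correct = sum(1 for idx in retrieved if labels[idx] == query_label)
    return correct / len(retrieved)
```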
more experiments have been done to compare the proposed approach with the state-of-the-art salih and abdulla: an improved content based image retrieval 10 uhd journal of science and technology | jul 2021 | vol 5 | issue 1 table 5: precision results for different number of m for top-10 categories different number of m m=100 m=200 m=300 m=400 m=500 m=600 m=700 m=800 m=900 m=1000 africa 83.7 82.4 82.8 82.5 82 82 82 81.7 81.7 81.6 beaches 63.4 64 63 62.6 61.6 61 60.4 60 59.8 59.8 buildings 68.4 67.5 66.6 66.6 66.3 66.9 67 66.6 66.8 66.8 buses 97.5 96.8 96.3 96 95.6 95.1 94.6 94.6 94.6 94.6 dinosaur 100 100 100 100 100 100 100 100 100 100 elephant 77.8 78.4 79.4 78.9 79.1 79 78.7 78.5 78.5 78.3 roses 94.4 94.2 94 93.6 93.3 93 92.9 92.6 92.5 92.5 horses 97.6 97.4 97.4 97.4 97.3 97.3 97.2 97.1 97.1 97.1 mountains 60.4 62.4 61.3 60.3 59 58.6 58.3 57.6 56.9 57 food 77.8 78.6 79 78.9 78.6 78.3 77.4 77.5 77.4 77.4 averages 82.1 82.17 81.98 81.68 81.28 81.12 80.85 80.62 80.53 80.51 table 4: precision rate for different number of m for top-20 categories different number of m m=100 m=200 m=300 m=400 m=500 m=600 m=700 m=800 m=900 m=1000 africa 78.65 77.75 76.75 76.6 76.2 76.5 76.65 76.65 76.3 76.25 beaches 55.7 56.65 56.55 56.5 55.5 54.6 54 53.85 53.7 53.55 buildings 56.1 56.05 56.35 56.35 56.15 56.65 57 56.6 56.5 56.55 buses 94.45 94.25 93.8 93.45 93.1 92.55 92.25 92.15 91.85 91.8 dinosaur 100 100 100 100 100 100 100 100 100 100 elephant 64.05 64.25 64.5 64.8 65.15 65.15 65.05 64.65 64.65 64.45 roses 91.35 91.15 90.7 89.75 89.3 88.75 88.6 88.3 88.05 87.9 horses 94.65 94.95 94.9 94.9 94.8 94.7 94.6 94.6 94.6 94.6 mountains 52 52.65 52.15 52 50.9 50.65 50 49.55 49.25 49.3 food 68.55 72.15 72.6 72.75 72.35 71.55 70.85 70.85 70.55 70.4 averages 75.55 75.985 75.83 75.71 75.345 75.11 74.9 74.72 74.545 74.48 table 6: precision results for different number of m in the range 100–300 for top-20 categories different number of m m=100 m=125 m=150 m=175 m=200 m=225 m=250 m=275 m=300 africa 78.65 78.4 78.15 77.7 77.75 78 77.4 77.2 76.75 beaches 55.7 56.35 56.75 57 56.65 57.1 56.3 56.3 56.55 buildings 56.1 56.45 56.55 56.5 56.05 56.4 56.45 56.05 56.35 buses 94.45 94.8 95.05 94.6 94.25 94.55 94.2 94.1 93.8 dinosaur 100 100 100 100 100 100 100 100 100 elephant 64.05 64.5 63.9 64.1 64.25 64.3 64.2 64.4 64.5 roses 91.35 91.4 91.3 91.5 91.15 90.95 90.9 90.6 90.7 horses 94.65 94.9 95.05 94.95 94.95 94.95 94.85 94.95 94.9 mountains 52 52.8 52.6 52.3 52.65 52.65 52.55 52.4 52.15 food 68.55 69.4 71.3 71.45 72.15 72.4 72.6 72.7 72.6 averages 75.55 75.9 76.065 76.01 75.985 76.13 75.945 75.87 75.83 techniques, table 8. in all experiments, each image in the corel-1k database is used as a query image. the retrieval performance of tested techniques is measured in terms of average retrieval precision (arp) and average retrieval recall (arr). the higher arp and arr values mean the better performance. according to the results in table 8, the best result is achieved by the proposed approach for both top-10 and top-20. all the tested state-of-the-art techniques, except technique in al-jubouri and du [7], they tested their approach either for top-10 or for top-20, and this is why in table 8 some cells do not contain the arp and arr results. 
salih and abdulla: an improved content based image retrieval uhd journal of science and technology | jul 2021 | vol 5 | issue 1 11 table 8: arp results of tested cbir techniques approaches top-10 top-20 arp arr arp arr proposed approach 82.27 8.22 76.13 15.22 [7] 64.00 6.40 59.60 11.92 [18] 73.5 14.7 [21] 67.60 6.76 [22] 70.48 14.09 6. conclusion this paper has developed an effective cbir technique for retrieving images from a wide range of images in the dataset. the proposed approach involves two layers; in the first layer, all images in the database are compared to the query image based on the bof technique, and as a result, 225 most similar images to the query image are selected. color, texture, and shape features are used in the second layer to extract significant features from the selected 225 images to retrieve the most similar images to the query image. the obtained results depicted that the proposed approach has reached an optimal average precision of 82.27% and 76.13% for top 10 and top 20, respectively. references [1] z. f. mohammed and a. a. abdulla. “thresholding-based white blood cells segmentation from microscopic blood images”. uhd journal of science and technology, vol. 4, no. 1, pp. 9-17, 2020. [2] m. w. ahmed and a. a. abdulla. “quality improvement for exemplarbased image inpainting using a modified searching mechanism”. uhd journal of science and technology, vol. 4, no. 1, pp. 1-8, 2020. [3] a. a. abdulla, h. sellahewa and s. a. jassim. “secure steganography technique based on bitplane indexes”. 2013 ieee international symposium on multimedia, united states, pp. 287291, 2013. [4] a. a. abdulla. “exploiting similarities between secret and cover images for improved embedding efficiency and security in digital steganography, phd thesis”. 2015. available from: http://www. bear.buckingham.ac.uk/149. [last accessed on 2020 dec 15]. [5] a. sarwar, z. mehmood, t. saba, k. a. qazi, a. adnan and h. jamal. “a novel method for content-based image retrieval to improve the effectiveness of the bag-of-words model using a support vector machine”. journal of information science, vol. 45, no. 1, pp. 117-135, 2019. [6] s. hossain and r. islam. “a new approach of content based image retrieval using color and texture features”. current journal of applied science and technology, vol. 51, no. 3, pp. 1-16, 2017. [7] j. pradhan, a. ajad, a. k. pal and h. banka. “multi-level colored directional motif histograms for content-based”. the visual computer, vol. 36, pp. 1-22, 2019. [8] l. k. pavithra and t. s. sharmila. “optimized feature integration and minimized search space in content based image retrieval”. procedia computer science, vol. 165, pp. 691-700, 2019. [9] h. f. atlam, g. attiya and n. el-fishawy. “comparative study on cbir based on color feature”. international journal of computer applications, vol. 78, no. 16, pp. 9-15, 2013. [10] y. d. mistry. “textural and color descriptor fusion for efficient content-based image”. iran journal of computer science, vol. 3, pp. 1-15, 2020. [11] t. kato. “database architecture for content-based image retrieval”. international society for optics and photonics, vol. 1662, pp. 112123, 1992. [12] c. h. lin, r. t. chen and y. k. chan. “a smart content-based image retrieval system based on color and texture feature”. image and vision computing, vol. 27, no. 6, pp. 658-665, 2009. [13] z. c. huang, p. p. chan, w. w. ng and d. s. yeung. “contentbased image retrieval using color moment and gabor texture feature”. 
2010 international conference on machine learning and cybernetics, vol. 2, pp. 719-724, 2010. [14] m. singha and k. hemachandran. “content based image retrieval using color and texture”. signal and image processing, vol. 3, no. 1, p. 39, 2012. [15] j. yu, z. qin, t. wan and x. zhang. “feature integration analysis of bag-of-features model for image retrieval”. neurocomputing, vol. 120, pp. 355-364, 2013. [16] s. somnugpong and k. khiewwan. “content-based image table 7: precision results for different number of m in the range 100–300 for top-10 categories different number of m m=100 m=125 m=150 m=175 m=200 m=225 m=250 m=275 m=300 africa 83.7 83.2 82.7 82.4 82.4 82.1 82.3 82.7 82.8 beaches 63.4 63.6 63.6 63.9 64 64.1 63.1 63.1 63 buildings 68.4 68.3 68.3 67.7 67.5 67.4 66.7 66.8 66.6 buses 97.5 97.3 97.2 97.2 96.8 96.9 96.6 96.6 96.3 dinosaur 100 100 100 100 100 100 100 100 100 elephant 77.8 78.6 78.4 78.3 78.4 79.1 79 79.4 79.4 roses 94.4 94.4 94.6 94.5 94.2 94.1 93.9 93.9 94 horses 97.6 97.4 97.4 97.3 97.4 97.3 97.4 97.4 97.4 mountains 60.4 61 61.5 61.7 62.4 62.7 62.5 61.8 61.3 food 77.8 77.8 78.5 77.8 78.6 79 79 78.7 79 averages 82.1 82.16 82.22 82.08 82.17 82.27 82.05 82.04 81.98 salih and abdulla: an improved content based image retrieval 12 uhd journal of science and technology | jul 2021 | vol 5 | issue 1 retrieval using a combination of color correlograms and edge direction histogram”. 2016 13th international joint conference on computer science and software engineering (jcsse), thailand, pp. 1-5, 2016. [17] h. al-jubouri and h. du. “a content-based image retrieval method by exploiting cluster shapes”. iraqi journal for electrical and electronic engineering, vol. 14, no. 2, pp. 90-102, 2018. [18] a. nazir, r. ashraf, t. hamdani and n. ali. “content based image retrieval system by using hsv color histogram, discrete wavelet transform and edge histogram descriptor”. 2018 international conference on computing, mathematics and engineering technologies (icomet), pune, pp. 1-6, 2018. [19]. h. qazanfari, h. hassanpour and k. qazanfari. “content-based image retrieval using hsv color space features”. international journal of computer and information engineering, vol. 13, no. 10, pp. 537-545, 2019. [20] a. rashno and e. rashno. “content-based image retrieval system with most relevant features among wavelet and color features”. arxiv preprint arxiv, vol. 2019, pp. 1-18. [21] s. p. rana, m. dey and p. siarry. “boosting content based image retrieval performance through integration of parametric and nonparametric approaches”. journal of visual communication and image representation, vol. 58, pp. 205-219, 2019. [22] f. sadique, b. k. biswas and s. m. haque. “unsupervised content-based image retrieval technique using global and local features”. 2019 1st international conference on advances in science, engineering and robotics technology (icasert), bangladesh, pp. 1-6, 2019. [23] s. jabeen, z. mehmood, t. mahmood, t. saba, a. rehman and m. t. mahmood. “an effective content-based image retrieval technique for image visuals representation based on the bag-ofvisual-words model”. plos one, vol. 13, no. 4, pp. 1-24, 2018. [24] j. zhou, x. liu, w. liu and j. gan. “image retrieval based on effective feature extraction and diffusion process”. multimedia tools and applications, vol. 78, no. 5, pp. 6163-6190, 2019. [25] f. rajam and s. valli. “a survey on content based image retrieval”. life science journal, vol. 10, no. 2, pp. 2475-2487, 2013. [26] j. olaleke, a. adetunmbi, b. ojokoh and i. olaronke. 
“an appraisal of content-based image retrieval (cbir) methods”. asian journal of research in computer science, vol. 3, pp. 1-15, 2019. [27] m. sharma and a. batra. “analysis of distance measures in content based image retrieval”. global journal of computer science and technology, vol. 14, no. 2, p. 7, 2014. [28] j. li and j. z. wang. “automatic linguistic indexing of pictures by a statistical modeling approach”. ieee transactions on pattern analysis and machine intelligence, vol. 25, no. 9, pp. 1075-1088, 2003. tx_1~abs:at/tx_2:abs~at uhd journal of science and technology | jul 2020 | vol 4 | issue 2 75 1. introduction in the recent years, medical imaging has become the most and widest techniques to disease diagnose of human organs and anatomic vision of body. it is a broad range in digital image processing known for its effective, easiness, and safety to diagnose and follow-up diseases. growing of huge multimodality data caused to growing of data analytics especially in medical imaging. the architecture of deep learning (dl) has depended on the neural network that includes layers to feature extraction and classification in medical image processing and includes many methods for different tasks [1]. dl has evolved in many fields such as computer-aided diagnosis (cad), radiology, and medical image analysis which can include tasks, such as finding shapes, detecting edges, removing noise, counting objects, and calculating statistics for texture analysis or image quality [2]. in such short period, dl has owned of great role in training artificial agents to replace the complicated human manually scientific works at a reasonable time in various fields related to medical image analysis depend on public and private datasets [3]. the organs of human body vary in terms of complexity; thus some organs are more affected by the review research of medical image analysis using deep learning bakhtyar ahmed mohammed1,2, muzhir shaban al-ani1 1university of human development, college of science and technology, department of computer science, sulaymaniyah, krg, iraq, 2university of sulaimani, college of science, department of computer, sulaymaniyah, krg, iraq a b s t r a c t in modern globe, medical image analysis significantly participates in diagnosis process. in general, it involves five processes, such as medical image classification, medical image detection, medical image segmentation, medical image registration, and medical image localization. medical imaging uses in diagnosis process for most of the human body organs, such as brain tumor, chest, breast, colonoscopy, retinal, and many other cases relate to medical image analysis using various modalities. multi-modality images include magnetic resonance imaging, single photon emission computed tomography (ct), positron emission tomography, optical coherence tomography, confocal laser endoscopy, magnetic resonance spectroscopy, ct, x-ray, wireless capsule endoscopy, breast cancer, papanicolaou smear, hyper spectral image, and ultrasound use to diagnose different body organs and cases. medical image analysis is appropriate environment to interact with automate intelligent system technologies. among the intelligent systems deep learning (dl) is the modern one to manipulate medical image analysis processes and processing an image into fundamental components to extract meaningful information. the best model to establish its systems is deep convolutional neural network. 
this study relied on reviewing of some of these studies because of these reasons; improvements of medical imaging increase demand on automate systems of medical image analysis using dl, in most tested cases, accuracy of intelligent methods especially dl methods higher than accuracy of hand-crafted works. furthermore, manually works need a lot of time compare to systematic diagnosis. index terms: medical image analysis, medical image modalities, deep learning, convolutional neural network corresponding author’s e-mail: university of human development, college of science and technology, department of computer science, sulaymaniyah, krg, iraq. e-mail: bakhtyar.mohammed@uhd.edu.iq received: 08-04-2020 accepted: 19-08-2020 published: 27-08-2020 access this article online doi: 10.21928/uhdjst.v4n2y2020.pp75-90 e-issn: 2521-4217 p-issn: 2521-4209 copyright © 2020 mohammed and al-ani. this is an open access article distributed under the creative commons attribution noncommercial no derivatives license 4.0 (cc by-nc-nd 4.0) o r i g i n a l re se a rc h a rt i c l e uhd journal of science and technology bakhtyar ahmed mohammed and muzhir shaban al-ani: medical image analysis using deep learning 76 uhd journal of science and technology | jul 2020 | vol 4 | issue 2 process of ionization. hence, it is important to carefully employ medical image modalities with techniques related to medical diagnosing. furthermore, the accuracy of these modalities is too important at the first step of medical image processing [4]. the accuracy depends on different sensors or medical image devices to take images according to the ray spectrums to the modality types. many spectrums are used to imaging body modality, some of them have too strong radiation as; gamma, while others have weak radiation to the human body, such; magnetic resonance imaging (mri) which uses radio frequency (rf) [5]. deep artificial neural network (deep ann) model was innovated in 2009, from the very beginning this branch is developing till now. in the present time, deep neural network types are the strongest machine learning methods to analyze various medical imaging [6]. in general, medical image analysis consists of five processes, such as medical image classification, medical image detection, medical image segmentation, medical image registration, and medical image localization. furthermore, graphical processing unit (gpu) is imperative hardware part that supports improvement and acceleration of medical imaging analysis processes, such as image segmentation, image registration, and image de-noising, based on various modalities such as x-ray, ct, positron emission tomography (pet), single photon emission computed tomography (spect), mri, functional mri (fmri), ultrasound (us), optical imaging, and microscopy images. it enables parallel acceleration medical image processing to work in harmony with dl [7]. dl is rapidly leading to enhance performance in different medical applications [8]. some important criterions have great role in the development of medical image analysis processes, such as region of interest (roi) which has great role of early detection and localization, in such processes as predicting the bounding box coordinates of optic disk (od) to diagnose of glaucoma and diabetic retinopathy diseases using dl methods [9], and colonoscopy diseases such as adenoma detection rate (adr) using convolutional neural network (cnn) [10]. 
within this process, automatic analysis supports with report generation, real-time decision support, such as localization and tracking in cataract surgery using cnn [11]. large training set is another essential element since dl methods can learn strong image features for volumetric data as 3d images for landmark detection with many good ways to train these datasets [12]. advancements in machine learning, especially in dl, can learn many medical imaging data features resulting from the processes such as identify, classify, and quantify patterns that aid of handcrafted processes for medical image modalities using dl methods to automate interpretations [8]. however, medical imaging data includes noise, missing values, and inhomogeneous roi which cause inaccurate diagnose. roi provides accurate knowledge that aids clinical decisionmaking for diagnostics, treatment planning, and accurate feature extraction process cause accurate diagnostic and increases the accuracy [13]. edge detection is another key process for medical imaging applications that can be used in image segmentation, usually according to homogeneity in the way of two criterions; classification, and detection of all pixels by cnn using filters [14]. cnn method can avail local features and more global contextual features, at the same time; regardless of the different methods adopted in the architecture of cnn [15]. the architecture of cnn capable to change such as used fully connected network fcn instead of cnn method using semantic segmentation to effectively and accurately detect brain tumor in mri images [16]. certainly, the advancement of medical image analysis is slower than medical imaging technologies, mostly because of the study of dl for components of medical image analysis and specifically cnn is a big necessity to improve accuracy of methods for components by working on lessens obstacles such as training datasets, and declining error rate. 2. medical image modalities the essentials of data types in medical image processing are medical images. there are various cases according to the body places, organs, and different disease that became physiologists to think of different techniques to show significant features related to the medical cases. most of the techniques that used in medical imaging rely on visible and non-visible radiations except mri. these techniques use various body organs based on cases. multi-variability of these modalities is necessary because of some reasons. the most significant reason is effectiveness of some of these techniques to some specific tasks, such as mri for brain and ct for lung. another reason is the impact of ionizing radiation to human body according to impacts of ionizing which damages dna atom and nonionizing rays which does not have any side effect to human body organs [5]. mri uses radiofrequency signals with a powerful magnetic field to produce images of the human tissues. mri is dominant among other modality types because of its safety and rich information [17]. usually, it is used in neurology and neurosurgery of brain and spinal. it shows human anatomy in all three planes; axial, sagittal, and coronal. it is used for quantitative analysis for most of the neurological diseases as bakhtyar ahmed mohammed and muzhir shaban al-ani: medical image analysis using deep learning uhd journal of science and technology | jul 2020 | vol 4 | issue 2 77 brain [18]. furthermore, it is able to detect streaming blood and secret vascular distortions. 
in spite mri takes priority over others because of its characteristics which are superior image quality and ionizing radiation [19]. it is beneficial to process of accuracy enhancement, reduce noise, detection speed improvement, segmentation, and classification [17]. mri of sub-cortical brain structure automatic and accurate segmentation using cnn to extract prior spatial features and train the methods on most of complicated features to improve accuracy which is effective for the processes, such as pre-operative evaluation, surgical planning, radiotherapy treatment planning, and longitudinal monitoring for disease progression [20]. it provides a wealth of imaging biomarkers for cardiovascular disease care and segmentation of cardiac structures [21]. furthermore, it provides rich information about the human tissue anatomies so as to earn soft-tissue contrast widely. it is considered as a standard technique [17]. it provides detail and enough information about different tissues inside human body with high contrast and spatial resolution subsequently. it engages broadly to anatomical auxiliar y examination of the cerebr um tissues [18]. bidani et al. (2019) showed that mri is important to diagnose dementia disease by scanning brain mri which indicates by declining memory [22]. geok et al. (2018) used mri to brain stem and anterior cingulate cortex to classify migraine and none-migraine data using dl methods [23]. another application of brain mri is early detection and classification of multi-class alzheimer’s disease [24]. suchita et al. (2013) showed complexity of mri brain diagnosis which is challengeable because of variance and complexity of tumors [25]. padrakhti et al. (2019) showed brain mri useful to age prediction, as brain age estimation [26]. during mri data acquisition group of 2-d mri images can represent as 3-d because a lot of frame numbers, like in brain. many different contrast types of mri images exist, including axial-t2 cases use to edematous regions and axial-t1 cases use to the healthy tissues and t1-gd uses to determine the tumor borders, cerebrospinal fluid (csf) uses to edematous regions in fluid-attenuated invasion recovery (flair). there are several types of contrast images such as flair, t2weighted mri (t2), t1-weighted mri (t1), and t1-gd gadolinium contrast enhancement [17]. brain mri is one of the best imaging techniques employed by researchers to detect the brain tumors in the progression phase as a model for both steps of detection and treatment [27]. it is useful to supply information about location, volume, and level of tumor malignancy [28]. talo et al. (2018) showed that traditionally the radiologists selected mri to find status of brain abnormality. the analysis of this process was time consumer and hard, to solve this problem, utilized computer-based detection aid accurately and speedy of diagnosis process [29]. magnetic resonance spectroscopy (mrs) is a specific modality for the evaluation of thyroid nodules in differentiation of benign from malignant thyroid tissues [30]. pet is a type of nuclear medicine images, as scintigraphy technique, it is a common and useful medical imaging technique that is used clinically in the field of oncology, cardiology, and neurology [7]. spect can supply actual three-dimension anatomical image using gamma ray [7]. 
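as a small practical illustration of the multi-contrast brain mri data described above (flair, t1, t1-gd, and t2 slices stacked into a 3-d volume), the sketch below loads the four contrasts as channels of a single array, the typical input layout for a cnn; it assumes nibabel and nifti files, and the file names are purely hypothetical.

```python
# minimal sketch: stack four mri contrasts into one (4, depth, height, width) array.
import numpy as np
import nibabel as nib

def load_multicontrast_volume(case_dir):
    """load flair/t1/t1-gd/t2 volumes of one case and stack them channel-wise."""
    contrasts = ["flair.nii.gz", "t1.nii.gz", "t1gd.nii.gz", "t2.nii.gz"]  # hypothetical names
    channels = []
    for name in contrasts:
        vol = nib.load(f"{case_dir}/{name}").get_fdata().astype(np.float32)
        vol = (vol - vol.mean()) / (vol.std() + 1e-8)   # per-contrast normalization
        channels.append(vol.transpose(2, 0, 1))          # put slices first: (depth, h, w)
    return np.stack(channels, axis=0)
```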
elastography is used for liver fibrosis; other related techniques include tactile imaging, photo-acoustic imaging, thermography (such as passive thermography and active thermography), and tomography (conventional tomography and computer-assisted tomography) [31]. accurate features of ct images for chest diagnosis, such as ground-glass opacity for detecting covid-19 pneumonia cases, made ct useful in the training process for improving computer-aided methods as a fast process; it also aids the clinicians, especially in the diagnosis of covid-19 infection cases [32]. optical coherence tomography (oct) uses low-coherence light to take two- and three-dimensional micrometer-resolution images from within optical scattering media. it is used for early diagnosis of retinal diseases [33]. oct images clearly show intensity variances, low-contrast regions, speckle noise, and blood vessels [34]. furthermore, the retinal image is another modality used to measure retinal vessel diameter [35]. sun et al. (2017) used another sensor, a portable fundus camera, for huge datasets of retinal image quality classification, which differs from diabetic retinopathy screening systems, using cnn algorithms [36]. the papanicolaou (pap) smear is another medical image modality used to identify cancerous variation of the uterine cervix using a learning-based method that segments separated pap-smear image cells [37]. nguyen et al. (2018) tested the microscopic image as another type of medical image modality, taken from the 2d-hela and pap-smear datasets [38]. confocal laser endoscopy (cle) is another medical image modality type that is relied on to diagnose and detect brain tumors because of its accuracy and effectiveness in carrying out automatic diagnosis [39]. it is a type of advanced optical fluorescence technology that is undergoing application assessment in brain tumor surgery, although most of the images are distorted and interpreted as non-diagnostic images [40]. for gastrointestinal diseases, a new medical imaging technique was innovated, known as wireless capsule endoscopy (wce), which records wce frame images to detect abnormal patterns [41]. it is used to diagnose gastrointestinal diseases through a sensor that is small enough to swallow and captures every scene of the anatomical parts it passes through [41]. the dermoscopic image is another useful modality, used for skin lesions [42], [43]. breast cancer (brc) imaging addresses another well-known cancer and relies on such medical image modalities as mammography, which is known as x-ray of the breast, and us, which is called sonogram [44]. furthermore, histology images are used to determine multi-size and discriminative patches to classify brc [45]. masood et al. (2012) determined fine-needle aspiration (fna) data as another way to take a breast sample [46]. the hyper-spectral image (hsi) is another new modality used for diagnosis and early detection of oral cancer using cnn before surgery [47]. dey et al. (2018) used it for early detection of oral cancer in habitual smokers [39]. the single x-ray projection is used for monitoring and radiotherapy tumor tracking to analyze tumor motion [48]. 3. medical image analysis it is the process of analyzing medical images through medical image analysis techniques. these techniques are composed of five main components, named medical image classification, medical image detection, medical image segmentation, medical image localization, and medical image registration. 3.1.
medical image classification this element of medical image analysis techniques is responsible for classifying labeled image classes based on their features. in this process, the homogeneity and heterogeneity of features determine how the classes are categorized. in traditional methods, shape, color, and texture used to be the key features for categorizing labeled image classes, whereas in modern methods, where dl is essential for labeling images, various algorithms have become fundamental tools for accurate multi-class label classification [49]. the categorization process is a technique that follows the extraction process; it runs on the selected features [27]. litjens et al. (2017) divided the classification process into two phases: image classification and object or lesion classification. image classification is the first medical image analysis process and divides the image into smaller image sizes, while object classification works on the small data identified earlier [50]. suchita et al. (2013) identified the recognition of different objects in the image as the main function of the classification technique, and accordingly classified the approaches into two main subdivisions: supervised and unsupervised [25]. in supervised learning, datasets are the most significant means of teaching the methods and increasing accuracy through the feature extraction process [22]. wong et al. (2018) showed that mri brain images are used to diagnose tumors and classify them into classes such as no tumor, low-grade gliomas, and glioblastomas. those classes can also be subdivided, as in gliomas, which are classified into grades i to iv according to the world health organization classification [51]. image quality determines the class of the examined images; low image quality is considered inappropriate for diagnosis [52]. it is worth mentioning that some researchers use synonyms for classification, such as cadx. among them, ker et al. (2017) employed different terms to represent various cnn algorithms [53]. rani (2011) explained that data mining can be performed in many ways, that all techniques are important in their own manner, and that classification is an analysis technique used to retrieve important and relevant information about data. it can be applied to micro-calcifications in mammograms, classification of chest x-rays, and tissue and vessel classification in mri. when this technique in dl counts on cnn, it can come up with valuable benefits, translated as proper working in noisy environments [54]. suzuki (2017) compared massive training artificial neural network (mtann) and cnn models, which are used to classify lung nodules and non-nodules. each has advantages that distinguish it from the other. for instance, in the classification of lesions and non-lesions in cad, mtann scored a better result in decreasing false positives. on the other hand, cnn is able to score a higher accuracy level in terms of the area under the roc curve (auc): for example, where mtann scores 0.882 for lung nodules, cnn scores 0.888 for seven tooth types under comparable circumstances in computer vision [1]. yamashita et al. (2018) explained that cad has become a part of routine clinical work for examinations of the brain, breast, eye, chest, etc. for each organ, this classification process plays a special role. for the brain, cad applies fmri in two stages to detect autism spectrum disorder (asd).
during the first stage, cad will identify the biomarkers for asd, while in the second stage, in two subdivision steps, cad depends on fmri with an accuracy of 70% to identify the anatomical structure. certainly, cnn can be used as a magic tool for classification. another advantage of cnn in this regard is using it for processing target objects separated from medical images. however, it is undeniable that this process requires a large amount of training data [55]. ruvalcaba-cardenas et al. (2018) found that 2d-cnn and 3d-cnn models are well suited for small class separation using a single-photon avalanche diode sensor in low-light indoor and outdoor daytime conditions, provided that a noise removal algorithm is used with 64 x 64-pixel resolution [56]. the process of identifying labels and lesion types requires a lot of careful work, especially to determine early treatment [14]. the whole chain of processes extracts the features for microscopic image classification [38]. table 1 illustrates some important reviews of the classification process. 3.2. medical image detection finding abnormal objects is the main goal of medical image detection. usually, detecting the abnormality happens through comparing two cases on the images. most of the time, this process takes place with the aid of computer-aided detection (cad). it starts with identifying objects on the images through the application of detector algorithms [16]. to reduce time consumption and reach efficient detection, experts have dedicated time and effort to finding faster and more accurate methods. marginal space learning is one of the significant approaches, which is more efficient and faster in function compared to traditional methods [3]. the function of cad in this process is to de-stress the radiologists who use manual diagnosis by easily selecting the abnormality on the images. from this standpoint, cad can take different forms based on its function: detection regions aided by processing techniques, a set of extracted features, and extracted features fed into a classifier [8]. diagnosing brain tumors through automatic detection may face difficulties that require smart intervention [64]. actually, mri is used for the diagnosis of many other diseases as well. alkadi et al. (2018) used it for prostate cancer diagnosis to provide information on location, volume, and level of malignancy [28]. the good thing about automated diagnosis for all medical imaging fields is the attempt to increase accuracy and reduce time consumption [65]. this holds in neurodegenerative diseases, for instance dementia, which causes losses in memory, language, and judgment [22]. such automation can boost the performance of cnn and improve detection and localization accuracy [41]. for super-pixel image analysis, different structure detection is required; this engages image augmentation to aid the cnn in extracting the features from the original dermoscopy image data [43]. the role of detection lies in identifying abnormal cases among normal ones. the whole process is called a cnn-based cad system. ker et al. (2017) employed computer-aided detection in collaboration with 2d and 3d cnns for various detection purposes; especially, it was used for lymph node detection to diagnose infection or tumor [53]. tajbakhsh et al. (2016) shed light on the detection process, which is a complicated process.
he divided into two stages, polyp detection, which works on increasing the rate of misdetection by finding perception changing features such as, color, shape, and size of colon features. however, the feature of shape is more affective compare to other features. moreover, pulmonary embolism (pe) detection, which causes of blocking pulmonary arteries because blood clots that barrier transmit blood from lower extremity source to lung using ct pulmonary angiography (ctpa) which is time consuming, and death rate of pe is 30% but it becomes 2% with right treatment with implementing the deep cnn method [52]. the advantages and disadvantages of each practical technique lie in its outcome balance between accuracy and cost of operation in edge detection. laplacian of gaussian edge detector, convolve image by filter of high pass filter to find edge pixels so as to analyze edge pixel places from both sides. canny edge detector which considers optimal edge detector so as to get the lowest error rate in the detection of real edge point and 2d gabor filter which its utilization rely on frequency and orientation representations [35]. it is agreed that cad works on roi in image analysis. meaning that, detection gathers the regions of interest in one limited area. this is can be seen in mri of brain tumor, and it determines earliest signs of abnormality. altaf et al. (2019) used 3d cnn to detect brc using automated breast us image modality using sliding window technique to extract volume of interests, then using 3d-cnn to determine the possibility of existing tumor [59]. experts and technology development have been working hard in this field to make medical image analysis more sufficient and fruitful. the attention of experts is not limited to software only, but hardware section is also bakhtyar ahmed mohammed and muzhir shaban al-ani: medical image analysis using deep learning 80 uhd journal of science and technology | jul 2020 | vol 4 | issue 2 author type of application method modality used dataset accuracy advantage mohsen et al. [27] brain classification deep neural network (dnn) brain mri 66 private mri 96.97% dnn more accurate than; knn, lda, and smo suchita et al. [25] brain classification ann brain mri 70 mri 98.60% object identification tajbaksh et al. [52] colonoscopy frame classification deep cnn colonoscopy frame images 6-colonoscopy videos; 4000 frames from imagenet reduced fp by 10%, 15%, and 20% improved cnn method ahn et al. [57] skin disease classification convolutional sparse kernel network (cskn) x-ray, and dermoscopy images irma, and isic 2017 95.30% supervised cskn classification higher than others ker et al. [53] classification cnn multimodality partitioned different public dataset different compared between them yuqian et al. [45] breast cancer classification cnn histology images breast histology dataset 88.89% image-wise classification rani [54] classification ann multilayer heart disease images heart disease dataset 94% shows advantages of cnn wu et al. [58] face skin classification different cnn algorithms clinical facial images xiangya-derm best accuracy is 92.9% compared between of; resnet-50, inception-v3, and densenet-121 suzuki [1] classification mtann lung nodule images 76 malignant and 413 benign 88.20% accuracy of cnn is higher than mtann altaf et al. 
[59] brain, breast, diabetic retinopathy, chest, abdomen, and miscellaneous classification kso, alexnet, cnn, 3d densenet (3d cnn), gans, and cnn fmri, mammogram, retinopathy, ct, and ct liver multiple datasets, 1713 of carolina breast cancer study, messidor, public; lidc-idri 68.6-85.6%, 94-95%, 90.4, and 7% improved robust results for neurological function of biomarkers yamashita et al. [55] binary classification 2d and 3d cnn ct/mri mentioned various public datasets ---showed importance of training dataset khaled et al. [17] brain tumor classification dl with traditional ml brain mri mentioned many datasets max accuracy is 100% ---muthu et al. [18] classification cnn brain mri in dicom format public datasets 100% training cnn shahin et al. [42] skin lesion classification deep nn framework rgb dermoscopic jpeg image isic 2018 up to 89.9% differentiate between seven skin lesion types nguyen et al. [38] deep learning proposed feature concatenation network microscopic image 917 images of papsmear, and 862 2d-hela 92.63±1.68%, 92.57±2.46% ---kopoulos et al. [43] dermoscopy image super pixel classification cnn rgb dermoscopic jpeg image isic 85.2% used some beneficial filtering techniques murtaza et al. [44] breast cancer classification deep neural network (dnn) mammography 55% used public and others private ---assess brc classification ken et al. [51] brain tumor classification, and cardiac classification pre-trained cnns, and vggnet brain mri, and 2d cardiac cta 191 testing and 91 training, and 263 testing and 108 training 82%, and 86% classify according to tumor stages and grades from magnetization sun et al. [36] retinal fundus image quality feature classification hybrid cnn retinal fundus image kaggle 97.12% used alexnet, google net, vgg-16, and resnet-50. hosny et al. [60] skin lesion classification alexnet transfer learning dermoscopic image derm; is and quest, med-node, isic 96.86%, 97.7%, 95.91% angle rotation with gpu bidani et al. [22] classification dcnn brain mri oasis >80% indicates importance of dataset arevalo et al. [61] mammography mass lesion classification supervised cnn3 mammography film images bcdr-fo3 86% compared between baseline and learned methods table 1: mentions the classification methods for different body organs (contd...) bakhtyar ahmed mohammed and muzhir shaban al-ani: medical image analysis using deep learning uhd journal of science and technology | jul 2020 | vol 4 | issue 2 81 author type of application method modality used dataset accuracy advantage daysi et al. [56] classification 2d and 3d-cnn spad image spad dataset 95% 3d accuracy higher than 2d fauzi et al. [62] brain classification svm and radiant basics function (rbf) t2 mri 60 original patients 65% linearly combine different groups litjens et al. [50] classification cnn mri/ct public ---classified according to; image/exam, and object or lesion classification talo et al. [29] classification resnet-34 mri 5-fold of 613 images 100% abnormality brain detection geok et al. [23] classification 3d-cnn mri migraine 198 mr images 85% deep learning methods more accurate islam and yanqing [24] multi-classification google net brain mri oasis 73.75% accuracy of google net higher than inception rajan et al. [47] oral cancer classification partitioned cnn hyperspectral medical image 500 trained patterns 94.50% compared with conventional methods sajedi et al. [26] age prediction classification 2d and 3d cnn brain mri used many datasets ---show age via mri hamad et al. 
[63] classification hybrid colon endoscopy image mentioned some datasets 96.70% ---dey et al. [39] oral cancer recurrence classification ann, svm, knn, and pnn confocal laser endoscopy (cle) oral squamous cell carcinoma (oscc) dic and pap datasets 86% any task needs specific method table 1: (continued) receiving a good share of care. every now and then, cad is witnessing development in one way or another. every trail to the purposes of reduces the errors and increases the accuracy [66]. table 2 illustrates some important reviews of detection process. 3.3. medical image segmentation it is the process of analyzing a digital image to partitioning it into multiple regions. the main purpose of segmentation is to shade lights on objects detected on the image [68]. in another definition, medical image segmentation is the process of selecting anatomy body organ outlines accurately [3]. from the given definitions, we realize that segmentation is a complicate process. therefore, researchers have been working on developing procedures to make it easier [15]. to accelerate different applications of automated segmentation process, pre-operative assessment, surgical planning, radiotherapy treatment planning, and longitudinal monitoring are added to the process [20]. improvement of medical image segmentation can happen in many manners. to improve the physical support of this process, gpu is the key answer to do so [7]. segmentation is either semantic or non-semantic. semantic segmentation links each pixel in an image to a class label, whereas, non-semantic segmentation works on the similar shapes, such as clusters [51]. in segmentation process, the methods are changeable. yet, the quality of the process will change accordingly. in medical image segmentation, mri plays significant role in quantity image analysis [16]. through mri, the image is cut into many regions sharing similar attributes [6]. dividing the images into roi means that the image is divided to sections including objects, adjacent regions, and similar region pixels [13]. through the application of cnn models, brain tumor tissues will be ready to labeling any small patch around each point. the labeling process will highlight intensity information inserted by multi-channel cnn methods [69]. certainly, a successful segmentation requires detecting object boundaries. this process is called edge detection. by looking at the name, it indicates that the process involves many other factors that affect the edge shapes including geometrical and optical properties, separation conditions, and noise, in addition use for feature detection and texture analysis [14]. within all this complicity, cnn will be able to diagnose brain tumor through mri and automatic segmentation simplifies [64]. like other medical image analysis techniques, segmentation is also a process of stages. segmentation is either organ segmentation, or lesion segmentation. the role of organ segmentation is to analyze quantity such bakhtyar ahmed mohammed and muzhir shaban al-ani: medical image analysis using deep learning 82 uhd journal of science and technology | jul 2020 | vol 4 | issue 2 table 2: mentions the detection methods for different body organs author type of application method modality used dataset accuracy advantage mair et al. [3] detection deep reinforcement learning ct ------detected via marginal space learning shen et al. 
[8] detection deep learning (cnn) multi-modality mentioned a lot varies extracted morphological digital information ouseph and shruti [64] tumor detection cnn mri private mri dataset of tumorous patients 89.21% reduced operators and errors alkadi et al. [28] prostate cancer detection deep convolutional encoder-decoder t2 mri 19 patients from public (12 cvb) 89.40% used 3d sliding window srivastava et al. [65] detection deep cnn and transfer learning gastrointestinal, and brain mri 464 high resolution images (wsls) and oasis 97.6%, and >80% detected dementia disease lan et al. [41] wce abnormal pattern detection two types of hybrid methods using cnn wce wce2017 70% got better accuracy kopoulos et al. [43] detection cnn rgb dermoscopy image isic 85.2% exhibited different filters are necessary to augmentation litjens et al. [50] detection deep learning (cnn) mri/ct public ---detection process involved localization and detection yamashita et al. [55] pulmonary tuberculosis detection on chest alexnet and google net in 2d cnn radiographs (x-ray) 1007 chest radiographs 99% detected chest pulmonary tuberculosis ker et al. [53] detection google net ct lymph node ilsvrc 2013 95% ---tajbaksh et al. [52] colonic polyp, and pulmonary embolism detection fine-tuned alexnet, and alexnet colonoscopy, and lung ctpa 40 short colonoscopy videos to frames p<0.05, %25, %50, decreased the rate of misdetection by; 4%, 12%, 25%, 10%, 50% masood et al. [46] breast cancer detection type i, and type ii cgpann fine-needle aspiration (fna) wdbc database, 200 images for each case 99%, and 99.5% used fna to feature extract suzuki [1] lymph node detection mtann and cnn ct ------mtann used to enhance lesion detection morariu et al. [35] vessels detection log, c, g filters retinopathy image 18 healthy patient image, and 12 retinopathy images varied trade-off between the processes of accuracy and cost altaf et al. [59] brain, breast, eye, chest, abdomen, miscellaneous detection inception-v4 and resnet, 3d cnn, vgg-16, cnn mri, abus, oct, cmr, wgd, and inner ear ct oasis, 171 tumors, imagenet, 8428 99%, >95%, 98.6%, 98%, 98.51% used various; methods, datasets, modalities with different accuracies summer et al. [66] lung nodules, and polyp detection deep learning (cnn) x-ray and ct miccai ---automated disease detection and organ and lesion detections carneiro et al. [67] detection deep learning (cnn) x-ray, ct, mri, and microscopy mentioned some public datasets ---indicated performance of each technique as volume and shape segmentation in clinical parameters. while lesion segmentation, combines object detection, organ, and substructure segmentation, and apply them in dl algorithms [50]. the outer look of segmentation is similar to quantitative assessment of medical meaningful pieces. actually, in some functions segmentation depends on quantitative assessment for its application within a short period of time [55]. in surgical planning, segmentation is applied on 2d image slices to determine accurate boundary of the lesions to prepare them to the operation [53]. medical image segmentation is either automatic or semi-automatic. both work on extracting roi, but for different body organs such as coronary angiograms, surgical planning, surgery simulations, tumor segmentation, and brain segmentation [70]. 
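as a hedged illustration of the simple automatic roi extraction described above, the python sketch below applies a global otsu threshold and keeps the largest connected component of a 2-d slice; it is a generic classical baseline under assumed inputs, not the method of any specific study cited in this section.

import numpy as np
from skimage.filters import threshold_otsu
from skimage.measure import label, regionprops

def largest_roi_mask(slice_2d):
    # global threshold computed from the intensity histogram
    t = threshold_otsu(slice_2d)
    mask = slice_2d > t
    # label connected components and keep only the largest region
    labels = label(mask)
    if labels.max() == 0:
        return mask
    biggest = max(regionprops(labels), key=lambda r: r.area)
    return labels == biggest.label

demo_slice = np.random.rand(128, 128)   # stand-in for a real image slice
print(int(largest_roi_mask(demo_slice).sum()), "foreground pixels")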
segmentation separates and bounds the different components of body organs, automatically or semi-automatically, into different tissue classes, pathologies, organs, and some biological criteria, according to the various body organs [69]. in short, the segmentation process aims to solve problems appearing on regions of body organs such as the brain, the skin, and so on. for this purpose, medical imaging uses mri and ct to select optimal weights [71]. another important process for medical imaging applications is edge detection, which is used in image segmentation, usually based on homogeneity, through two criteria: classification and detection of all pixels by a cnn using filters [14]. hamad et al. (2018) focused on pathology image segmentation as a prerequisite of disease diagnosis to determine features such as shape, size, and morphological appearance for cancer of nuclei, glands, and lymphocytes [63]. dey et al. (2018) shed light on subdivisions of segmentation, naming otsu thresholding, to calculate the quality of a global threshold, and the gradient vector flow active contour method, to analyze dynamic or 3d image data [39]. image quality has an impact on the segmentation process, for it has to do with feature extraction, model matching, and object recognition [72]. rupal et al. (2018) determined three soft tissues in the normal brain using the mri technique, namely gray matter (gm), white matter, and csf, and showed that both the algorithms and the gpu have a big role in speeding up this process, with many methods innovated to enhance the segmentation process [73]. besides the factors that impact the segmentation process, there are other factors that enhance segmentation, such as the body organ, the image modality, and the algorithm. on the other hand, segmentation faces challenges that hold the process back, such as large variability in the sensing modality, artifacts that vary from organ to organ, etc. ngo et al. (2017) classified segmentation into active contour models, machine learning models, and hybrid active contour and machine learning models [74]. table 3 illustrates some important reviews of the segmentation process. 3.4. medical image localization every method has a different contour to select the location of the target shapes from images. wei et al. (2019) studied tumor localization on 3d images of three patients depending on the contour, the location of the tumor centroid in 3d space, and the angle of the tumors to find the error of tumor localization at different angles. the results, across tumor motions and projection angles, exhibit that the cnn-based method was more robust and accurate in real-time tumor localization [48]. lan et al. (2019) explored how multiregional combination, such as selective search, edge boxes, and objectness, is used to improve object localization, which is essential given the non-rigid and amorphous characteristics of the targets [41]. urban et al. (2018) considered the adenoma detection rate (adr) as the aim of colonoscopy and the accuracy of colonoscopies according to adr. advancements in computer-assisted image analysis, especially dl models such as cnns, aid in making an agent perform its tasks to improve performance. any increase in accuracy over manual work matters, and the results show that real-time polyp localization and detection scored higher than hand-crafted work [10]. muthu et al.
(2019) verified that appropriate hardware is beneficial of adequately localize brain tumor to achieve high accuracy of detection and classification using cnn [18]. localization uses in every steps of applications while the radiology systems individually analyze and prepare reports without any human intrusion, especially in mri and ct modality using cnn, such as ct images of neck, lung, liver, pelvis, and legs [53]. mitra et al. (2018) improved localization process using od in color retinal fundus images predicting the bounding box coordinates which work same as roi. some methods used to renew the frames of roi as solitary regression predicament of image pixel values to roi coordinates. cnn can predict bounding boxes depending on intersection of the union. it increases the chance of recovery and strengthens the detection diagnosis accuracy [9]. oliver et al. (2018) proposed localizing multiple landmarks in medical image analysis to easily transfer our framework to new applications. it integrated various localizers, low test application algorithms, low amount of training data algorithms, and interoperability. the pros of this approach is detecting and localizing spatially correlated point landmarks [78]. localization process usually comes before the detection process. almost they are integrated together, especially in misdetection which relies on localization process [59]. zheng et al. (2017) divided localization process into two steps, in the first process the abdomen area is selecting, while the bakhtyar ahmed mohammed and muzhir shaban al-ani: medical image analysis using deep learning 84 uhd journal of science and technology | jul 2020 | vol 4 | issue 2 author type of application method modality used dataset accuracy advantage mair et al. [3] segmentation cnn mri cardiac ------selected anatomy body organ accurately havaei et al. [15] segmentation cnn mri brain brats 2013 ---accelerated segmentation process kushibar et al. [20] segmentation cnn sub-cortical brain public miccai 2012 and ibsr18 ---increased segmentation accuracy eklund et al. [7] general image segmentation cnn multiple modalities big datasets ---faster than other methods ken et al. [51] semantic segmentation deep learning brain 43 3d images ---exhibited it can track brain tumor kumar et al. [16] semantic segmentation dl and ml mri brain dataset of 15 cases 96% improved detection of mri brain tumor selvikvag et al. [6] segmentation deep learning (cnn) mri ------quantitatively analyze images berahim et al. [13] segmentation morphological ls, rg-ls multimodality both 0.03% improved segmented primary boundary accurately zikic et al. [69] segmentation cnn brain tumor brats 20132014 83.7±9.4 enhanced network architecture shen et al. [8] segmentation 3d cnn 3d brain mri mentioned a lot of datasets different accuracies used to skull extraction mohamed et al. [14] segmentation cnn ---international datasets ---beneficial to edge detections to detect edge boundaries ouseph and shruti [64] brain tumor segmentation cnn brain mri private mri from cancerous patients 89.21% improved segmentation level litjens et al. [50] cardiac and brain segmentation cnn ct/mri public ---useful for substructure from lesion segmentation mohsen et al. [27] segmentation fuzzy c-mean and cnn brain mri private dataset of 66 brain mri ---improved progressing process yamashita et al. [55] uterus malignant tumor segmentation cnn mri isbi ---quantitative assessment ker et al. 
[53] tumor segmentation 3d cnn brain 22 pre-term, 35 adults 82-87% assisted surgical planning sajedi et al. [26] segmentation multiple methods brain mri oasis ---useful to age prediction tajbaksh et al. [52] intima-media boundary segmentation carotid alexnet (cimt) cardiac image 121 ctpa datasets with 326 pes p < 0.0001 segmentation error used to intima-media boundary (cimt) risk stratification nourouzi et al. [70] knee bone segmentation multi-method mri ------classified segmentation process according to types bernal et al. [75] segmentation fcnn brain mri ibsr18, miccai2012, and iseg2017 improved accuracy by 1% compared between 2d and 3d suzuki [1] segmentation dl based method lung tissue different dataset 82-95% neural edge enhancer altaf et al. [59] brain, breast, eye, chest, abdomen, miscellaneous segmentation compnet (brain) normal mri image oasis 98% compared different accuracies dey et al. [71] segmentation metaheuristic; ga, pso, aco, abco mri ------determined suspicious various regions hamad et al. [63] segmentation nn based method pathology mentioned a lot of datasets ---segmentation improved accuracy table 3: mentions the segmentation methods for different body organs (contd...) bakhtyar ahmed mohammed and muzhir shaban al-ani: medical image analysis using deep learning uhd journal of science and technology | jul 2020 | vol 4 | issue 2 85 author type of application method modality used dataset accuracy advantage jeylan et al. [76] mass segmentation kapur-scpso and otsu-pso mammogram 10 benchmark images ---it assisted detection process padmusini et al. [72] binary retinal vascular segmentation oct multiple modalities of retinal images mentioned a lot of datasets 96% segmenting abnormal images rajan et al. [47] segmentation partitioned cnn hyper-spectral machine learning ---94.50% showed reasons of segmentation process ngo et al. [74] segmentation deep belief networks cardiac mri miccai and jsrt 10 times faster selected endocardium dhungel et al. [77] mass segmentation crf, cnn, and dbn mammogram ddsm-bcrp for breast 95% indicated obstacles of mammogram mass segmentation zheng et al. [12] pathology kidney segmentation msl ct 370 ct scans mean segmentation error is 2.6 determined amount of abnormality of chronic kidney disease table 3: (continued) second process is detecting and localizing the kidneys places. according to this, the body consists of three parts; above abdomen; head and thorax, abdomen, and legs. diaphragm separates abdomen and thorax and an optimal slice index maximizing separation between abdomen and legs, second step is kidney which localize same as abdomen detection by axial image to determine the place of kidneys which use surrounding organs to determine the location of kidneys because kidney place is next to liver and spleen but the position of abdomen organs is not fixed, same as abdomen localization [12]. banerjee et al. (2019) designed a framework that consists of cnn methods which implement to enhance the performance of localization, detection, and annotation of surgical tools. the proposed method can learn most of the features [11]. table 4 illustrates some important reviews of localization process. 3.5. medical image registration imag e registration involves deter mining a spatial transformation or mapping that relates positions in one image to corresponding positions in one or more other images. image registration is transformation an image to the same digital image form according to the mapping points. 
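the definition above can be made concrete with a small python sketch of intensity-based rigid registration: a candidate rotation and translation are applied to the moving image and scored against the fixed image with a similarity measure, which a real registration framework would iterate with an optimizer; the images and the brute-force search below are synthetic stand-ins, not part of the cited registration methods.

import numpy as np
from scipy.ndimage import rotate, shift

def rigid_transform(moving, angle_deg, shift_yx):
    # rotation followed by translation (a rigid mapping)
    rotated = rotate(moving, angle_deg, reshape=False, order=1)
    return shift(rotated, shift_yx, order=1)

def mean_squared_difference(fixed, warped):
    return float(np.mean((fixed - warped) ** 2))

fixed = np.zeros((64, 64))
fixed[20:40, 20:40] = 1.0
moving = rotate(fixed, 10, reshape=False, order=1)   # deliberately misaligned copy

# brute-force search over rotation angles stands in for a real optimizer
best_angle = min(range(-15, 16),
                 key=lambda a: mean_squared_difference(fixed, rigid_transform(moving, a, (0, 0))))
print("recovered rotation (degrees):", best_angle)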
rigid is known as image coordinate transformation and only involves translation and rotation processes. transformation maps parallel lines fixed with parallel lines is affine for map lines onto maps is projective and map lines on curves is curved or elastic [72]. the purpose of developing medical image modalities is to get higher resolutions and implementing multi-parametric tissue information at proper accuracy and time. it causes increasing the visually of image registration. nowadays, it is very common to improve accuracy and speeding up in dl [6]. it involves two forms; mono-modal inside same device or multimodal inside different devices. in general, it consists of four steps; feature detection, feature matching, transform model estimation, and image resampling and transformation [13]. registration known as common image analysis task, its form of working is iterative framework. dl can properly increase registration performance and especially using deep regression networks to direct transformation [50]. ker et al. (2017) exhibited that another benefit of medical image registration has indicated in neurosurgery or spinal surgery, to select the place of the mass or destination landmark, and to obtain systematic operation [53]. the most necessity to transmit from source to destination using appropriate method rely on; selecting modality into spatial alignment, and the fusion that is necessary for showing integrated data [72]. marstal et al. created collaborative platform to registration process, as an open source for the medical algorithms which is the continuous registration challenge (crc) that involves eight common datasets [79]. ramamoorthy et al. (2019) showed that polycystic ovary syndrome is another women disease made from imbalance hormone of follicle stimulating hormone, and monitoring of cysts grow up by registration technique which apply through these steps; first step, is initial registration which inputs pre-processed us images. second step is similarity measure–implement correlation coefficient on reference and source image. third step is image transformation which monitors the growth of the cyst at initial stage and periodic checkups. fourth step is final registration alignment. it bakhtyar ahmed mohammed and muzhir shaban al-ani: medical image analysis using deep learning 86 uhd journal of science and technology | jul 2020 | vol 4 | issue 2 table 4: mentions the localization methods for different body organs author type of application method modality used dataset accuracy advantage wei et al. [48] lung tumor realtime localization cnn with mm and mm-fd with pca x-ray projection 3d took from three patients <1 mm more robust and accurate lan et al. [41] localization mrc wce wce 2017 70.30% improved object localization and rpn urban et al. [10] polyp localization cnn colonoscopy images 8641 images 96.4% improved accuracy of cnn muthu et al. [18] brain tumor localization cnn mri private dataset ---showed localization is important to detection ker et al. [53] localization cnn axial ct 4000 99.80% it decreased error rate after augmentation mitra et al. [9] distance metric localization various augmenting techniques used to amplify dataset color retinal fundus images messidor for test set, and kaggle 98.78, 99.05% coordination bounding box as roi oliver et al. [78] localization lower limbs, spine, thorax general framework of cre topology x-ray, and ct 2d, 660, 302 public 3d dataset 94.3, and 84.1% localized spatially correlated landmarks altaf et al. 
[59] brain, breast, eye, chest, abdomen, and miscellaneous tumor localization inception-v4 and resnet, 3d cnn, fine-tuned vgg-16, cnn mri, abus, oct, cmr, wgd, and inner ear ct oasis, 171 tumors, imagenet, and 8428 maximum 99%, >95%, 98.6%, 98%, and 98.51% used various; methods, datasets, modalities with different accuracies zheng et al. [12] abdomen and kidney localization caffe cnn ct 370 ct scans mean segmentation error 1.7 mm used local context to localize kidney banerjee et al. [11] localization alexnet, vggnet, and resnet-18/50/152 medical image frame 8 videos from imagenet dataset 82% improved the efficiency of surgical tools alkadi et al. [28] prostate cancer localization mono-modal deep learning t2 mri 12cvb 89.40% showed its importance of treatment planning table 5: mentions the registration methods for different body organs author type of application method modality used dataset accuracy advantage selvikvag et al. [6] registration deep learning and deep neural network mri ------improved accuracy and speed up berahim et al. [13] registration mutual information (mi) multimodality ------made mapping point to transform images litjens et al. [50] registration deep learning ct/mri public ------ker et al. [53] registration lddmm mri brain oasis ---improved the process in computational time padmusini et al. [72] registration ransac method sdoct retinal images ------measured the size of geographic lesions marstal et al. [79] registration continuous registration challenge (crc) lung ct, and brain mri popi and dirlab ---appropriately take new datasets ramamoorthy et al. [80] registration image registration techniques ultrasound abdomen scan image doppler scan, pandiyan scan, devaki scan 93.00% monitored pcos using registration during reproductive cycle altaf et al. [59] registration deep cnn, fcn brain mri, oct, ct lung, 3d mri, and mr-trus abide, dirlab, creatis 1.5% enhanced registered 2d and 3d, and speeded up reconstruction maier et al. [3] deformable registration deep learning ---------used non-rigid registration, and point-based registration is either mono-modal image or multi-modal images. last step is optimization that optimizing the spatial information which is executed by changing affine point optimizer radius at various appointments that determine by gynecologists in addition to correlation coefficient similarly metrics and affine transformation [80]. bakhtyar ahmed mohammed and muzhir shaban al-ani: medical image analysis using deep learning uhd journal of science and technology | jul 2020 | vol 4 | issue 2 87 in addition, registration process goes through the following steps; first is initial registration which feed them preprocessed images, second is similarity measurement in the way of correlation coefficient of reference and source image, third is image transformation which involve monitoring growth of the cyst monitoring at initial stage and periodic checkups, fourth is final registration, and fifth is optimization [80]. table 5 illustrates some important reviews of registration process. 4. discussion and conclusion this study is a review over medical image modalities and most significant types. in this regard, the study focuses on medical image analysis and its components using dl. medical image modalities clearly show how much the techniques or devices are important for medical image processing tasks, especially for medical image analysis. 
for a better approach, the study demonstrates the tremendous role of the modalities used in medical image processing by mentioning the most common ones, such as mri, spect, pet, oct, cle, mrs, ct, x-ray, wce, brc, pap smear, hsi, and us. furthermore, it exhibits how imperative the modalities are for extracting significant features from medical image values. some significant diseases are reviewed after being diagnosed using specific modalities. this is very beneficial for motivating the improvement of these tasks and their automatic implementation using different approaches. regarding medical image analysis, both medical image analysis and its components are properly introduced: the study enumerates the components, which are medical image classification, medical image detection, medical image segmentation, medical image localization, and medical image registration, and defines them. for the sake of accurate results, the study reviewed some research performed with each modality in various cases. localization of anatomical structures is a prerequisite for many tasks in medical image analysis [81]. medical image segmentation is defined in many ways according to how it is understood; in simple words, image segmentation is the process of partitioning medical images into smaller parts [82]. medical image detection is the process of localizing and detecting important desired items inside medical images, such as object detection, edge detection, and boundary detection [83]. medical image classification is the process of distinguishing different cases according to their similar features and selecting classes for them; it plays an essential role in clinical treatment and teaching tasks [84]. there are more than 120 types of brain and central nervous system tumors, which are classified as less aggressive, such as benign grades i and ii, or aggressive, such as malignant grades iii and iv, in addition to tumors of the skull [73]. early diagnosis of tumors has a significant role in increasing treatment possibilities. the main aim of this survey study is to discuss the processing of medical image analysis and its components, such as medical image classification, medical image detection, medical image segmentation, medical image localization, and medical image registration, based on dl methods. in particular, cnn is the dominant model for computer vision, which involves such algorithms as alexnet, densenet, resnet-18/34/50/152, vggnet, google net, inception-v3, pre-trained cnn, hybrid cnn, vgg-16, inception-v4, fine-tuned vgg-16, carotid alexnet, 3d cnn, and caffe cnn. the study shows a comparison between different methods that used many public and private datasets for the different medical image analysis components with different accuracies, and it presents tables of medical image analysis components that summarize many proposed methods and their advantages. these approaches are used for various human body organs, and over time they indicate that cnn-based algorithms are preferred and have optimum accuracies compared to other dl methods for medical imaging. most of the studies depend on using different medical image modalities and different public and private datasets of various types and sizes. the most accurate approaches were those that used brain mri with cnn, which implies that the implemented approaches used for brain tumors were preferred.
it looks the strong points, such as working on declining error rate and making strong training dataset for cnn because it is supervised learning method of these approaches and what are the weak points and how dl improved in medical image analysis. references [1] k. suzuki. “overview of deep learning in medical imaging”. radiological physics and technology, vol. 10, no. 3, pp. 257-273, 2017. [2] d. ravi, c. wong, f. deligianni, m. berthelot and j. andreau-perez. “deep learning for health informatics”. ieee journal of biomedical and health informatics, vol. 21, no. 1, pp. 4-21, 2017. [3] a. maier, c. syben, t. lasser and c. riess. “a gentle introduction to deep learning in medical image processing”. zeitschrift für medizinische physik, vol. 29, no. 2, pp. 86-101, 2019. [4] j. k. han. terahertz medical imaging. in: “convergence of terahertz sciences in biomedical systems”. springer, netherlands, pp. 351371, 2012. [5] j. o’doherty, b. rojas-fisher and s. o’doherty. “real-life radioactive men: the advantages and disadvantages of radiation exposure”. superhero science and technology, vol. 1, no. 1, p. bakhtyar ahmed mohammed and muzhir shaban al-ani: medical image analysis using deep learning 88 uhd journal of science and technology | jul 2020 | vol 4 | issue 2 2928, 2018. [6] a. s. lundervold and a. lundervold. “an overview of deep learning in medical imaging focusing on mri”. zeitschrift für medizinische physik, vol. 29, no. 2, pp. 102-127, 2019. [7] a. eklund, p. dufort, d. forsberg and s. m. laconte. “medical image processing on the gpu-past, present and future”. medical image analysis, vol. 17, no. 8, pp. 1073-1094, 2013. [8] d. shen, g. wu and h. suk. “deep learning in medical image analysis”. review in advance, vol. 19, pp. 221-248, 2017. [9] a. mitra, p. s. banerjee, s, roy, s. roy and s. k. setua. “the region of interest localization for glaucoma analysis from retinal fundus image using deep learning”. computer methods and programs in biomedicine, vol. 165, pp. 25-35, 2018. [10] g. urban, p. tripathi, t. alkayali, m. mittal, f. jalali, w. karnes and p. baldi. “deep learning localizes and identifies polyps in real time with 96% accuracy in screening colonoscopy”. gastroenterology, vol. 155, no. 4, pp. 1069-1078.e8, 2018. [11] n. banerjee, r. sathish and d. sheet. “deep neural architecture for localization and tracking of surgical tools in cataract surgery”. computer aided intervention and diagnostics in clinical and medical images, vol. 31, pp. 31-38, 2019. [12] y. zheng, d. liu, b. georgescu, d. xu and d. comaniciu. “deep learning based automatic segmentation of pathological kidney in ct: local versus global image context”. springer, cham, switzerland, pp. 241-255, 2017. [13] m. berahim, n. a. samsudin and s. s. nathan. “a review: image analysis techniques to improve labeling accuracy of medical image classification”. advances in intelligent systems and computing, vol. 700, pp. 1-11, 2018. [14] m. a. el-sayed, y. a. estaitia and m. a. khafagy. “automated edge detection using convolutional neural network”. international journal of advanced computer science and applications, vol. 4, no. 10, p. 11, 2013. [15] m. havaei, a. davy, d. warde-farley, a. biard, a. courville, y. bengio, c. pal, p. m. jodoin and h. larochelle. “brain tumor segmentation with deep neural networks”. medical image analysis, vol. 35, pp. 18-31, 2017. [16] s. kumar, a. negi, j. n. singh, h. verman. “a deep learning for brain tumor mri images semantic segmentation using fcn”. 
in: 2018 4th international conference on computing communication and automation, greater noida, india, india, 14-15 dec 2018, 2018. [17] m. k. abd-ellah, a. i. awad, a. a. m. khalafd and h. f. a. hamed. “a review on brain tumor diagnosis from mri images: practical implications, key achievements, and lessons learned”. magnetic resonance imaging, vol. 61, pp. 300-318, 2019. [18] r. p. m. krishnammal and s. selvakumar. “convolutional neural network based image classification and detection of abnormalities in mri brain images”. in: 2019 international conference on communication and signal processing, chennai, india, india, 4-6 april, 2019. [19] h. h. sultan, n. m. salem and w. al-atabany. “multi-classification of brain tumor images using deep neural network”. ieee access, vol. 1, pp. 1-11, 2019. [20] k. kushibar, s. valverde, s. gonzalez-villa, j. bernal, m. cabezas, a. oliver and x. liado. “automated sub-cortical brain structure segmentation combining spatial and deep convolutional features”. medical image analysis, vol. 48, pp. 177-186, 2018. [21] f. guo, m. ng, m. goubran, s. e. petersen, s. k. piechnik, s. n. bauerd and g. wright. “improving cardiac mri convolutional neural network segmentation on small training datasets and dataset shift: a continuous kernel cut approach”. medical image analysis, vol. 61, p. 101636, 2020. [22] a. bidani, m. s. gouider and c. m. traviesco-gonzalez. “dementia detection and classification from mri images using deep neural networks and transfer learning”. in: international workconference on artificial neural networks iwann 2019, vol. 11506, pp. 925-933, 2019. [23] h. n. g. geok, m. kerzel, j. mehnert, a. may and s. wermter. “classification of mri migraine medical data using 3d convolutional neural network”. icann 2018, vol. 11141, pp. 300309, 2018. [24] z. j. islam and y. yanqing. “a novel deep learning based multiclass classification method for alzheimer’s disease detection using brain mri data. in: international conference, bi 2017, beijing, china, november 16-18, 2017, beijing, china, 2017. [25] s. goswami and l. k. p. bhaiya. “brain tumor detection using unsupervised learning based neural network”. in: 2013 international conference on communication systems and network technologies, gwalior, india, 6-8 april 2013. [26] h. s. a. pardakhti. “age prediction based on brain mri image: a survey”. journal of medical systems, vol. 43(8), p. 279, 2019. [27] h. mohsen, a. e. s. a. el-dahshan, e. s. m. el-horbaty, a. b. m. salem. “classification using deep learning neural networks for brain tumors”. future computing and informatics journal, vol. 3, no. 1, pp. 68-71, 2018. [28] r. alkadi, f. taher, a. el-baz and n. werghi. “a deep learningbased approach for the detection and localization of prostate cancer in t2 magnetic resonance images”. journal of digital imaging, vol. 32, no. 12, pp. 793-807, 2018. [29] m. talo, u. b. baloglu, o. yildirim and u. r. acharya. “application of deep transfer learning for automated brain abnormality classification using mr images”. cognitive systems research, vol. 54, pp. 176-188, 2018. [30] l. aghaghazvini, p. pirouzi, h. sharifian, n. yazdani, s. kooraki, a. ghadiri and m. assadi. “3t magnetic resonance spectroscopy as a powerful diagnostic modality for assessment of thyroid nodules”. scielo analytics, vol. 62, no. 5, pp. 2359-4292, 2018. [31] a. elangovan and t. jeyaseelan. “medical imaging modalities: a survey. in: “2016 international conference on emerging trends in engineering, technology and science”. 
a review study for electrocardiogram signal classification
lana abdulrazaq abdullah1,2, muzhit shaban al-ani3
1department of computer science, college of science and technology, university of human development, sulaymaniyah, krg, iraq, 2department of computer, college of science, university of sulaimani, sulaymaniyah, krg, iraq, 3department of information technology, college of science and technology, university of human development, sulaymaniyah, krg, iraq
review article
abstract
an electrocardiogram (ecg) signal is a recording of the electrical activity generated by the heart. the analysis of the ecg signal has attracted interest for more than a decade, with the aim of building models for automatic ecg classification. the main goal of this work is to study and review the classification methods that have been used recently, such as the artificial neural network, convolution neural network (cnn), discrete wavelet transform, support vector machine (svm), and k-nearest neighbor. detailed comparisons are given in the results in terms of classification method, feature extraction technique, dataset, contribution, and some other aspects. the results also show that the cnn has been the most widely used method for ecg classification, as it can obtain a higher success rate than the rest of the classification approaches.
index terms: artificial neural network, convolution neural network, discrete wavelet transform, support vector machine, k-nearest neighbor
corresponding author's e-mail: lana abdulrazaq abdullah, department of computer science, college of science and technology, university of human development, sulaymaniyah, krg, iraq; department of computer, college of science, university of sulaimani, sulaymaniyah, krg, iraq. e-mail: lana.abdulla@uhd.edu.iq
received: 05-02-2020 accepted: 12-06-2020 published: 29-06-2020
access this article online: doi: 10.21928/uhdjst.v4n1y2020.pp103-117, e-issn: 2521-4217, p-issn: 2521-4209
copyright © 2020 abdullah and al-ani. this is an open access article distributed under the creative commons attribution non-commercial no derivatives license 4.0 (cc by-nc-nd 4.0)
1. introduction
an electrocardiogram (ecg) is simply a recording of the electrical activity generated by the heart [1]. the heart produces electrical activity that is measured by a medical test called an ecg, which identifies cardiac abnormality [2]. the heart produces tiny electrical impulses that spread through the heart muscle [3]. an ecg machine records all data about the electrical activity of the heart and displays them on paper [4]. a medical practitioner then interprets these data; ecg leads help to find the cause of symptoms such as chest pain and to detect abnormal heart rhythms [5]. an ecg signal has a total of five primary deflections, counting the p, q, r, s, and t waves; the depolarization of the atria causes a small deflection (the p wave) before atrial contraction as the activation (depolarization) wave-front propagates from the sinoatrial node through the atria [6]. the q wave is a downward deflection after the p wave [7]. the r wave follows as an upward deflection, and the s wave is a downward deflection following the r wave [8]. the q, r, and s waves together indicate a single event [9]; hence, they are usually considered as the qrs complex, as shown in fig. 1 [10], [11].
the features based on the qrs complex are among the most powerful features for ecg analysis [13]. the qrs complex is caused by currents that are generated when the ventricles depolarize before their contraction [14]. although atrial depolarization occurs before ventricular depolarization, the latter waveform (i.e., the qrs complex) has a much higher amplitude, and atrial repolarization is therefore not seen on an ecg. the t wave, which follows the s wave, represents ventricular repolarization, where the heart muscle prepares for the next ecg cycle [15]. finally, the u wave is a small deflection that immediately follows the t wave and is usually in the same direction as the t wave [16].
there are different kinds of arrhythmias, and each kind is associated with a pattern; as such, it is possible to recognize and classify them [17]. arrhythmias can be categorized into two major classes: the first class consists of arrhythmias formed by a single irregular ecg beat, herein called morphological arrhythmias; the other consists of arrhythmias formed by a set of irregular heartbeats, herein called rhythmic arrhythmias [18].
the main problem in the process of identifying and classifying arrhythmias in ecgs is that an ecg signal can vary for each person, and sometimes different patients have distinct ecg morphologies for the same disease [19]. moreover, two different diseases could have approximately the same appearance on an ecg signal [20]. these problems cause difficulties in the diagnosis of heart disease [21]. furthermore, the analysis of ecg records is tiring for a human; an alternative is automatic classification by computerized techniques [22]. arrhythmia classification from the signal received by an ecg device requires an automated system that can be divided into three main steps: first, pre-processing; next, feature extraction; and finally, classification, as shown in fig. 2 [23].
ecg signals may contain several kinds of noise, which can affect the extraction of the features used for classification; therefore, the pre-processing step is necessary to remove the noise [24]. researchers have applied different pre-processing techniques for ecg classification. for noise removal, techniques such as low-pass linear phase filters and linear phase high-pass filters are used [25]. some methods, such as the median filter, linear phase high-pass filter, and mean-median filter, are used for baseline adjustment [26]. after the pre-processing step, different ecg features are extracted and then used as inputs to the classification model [27]. feature extraction techniques used by researchers are the discrete wavelet transform (dwt), continuous wavelet transform, discrete cosine transform (dct), discrete fourier transform, principal component analysis (pca), the pan-tompkins algorithm, and independent component analysis (ica) [28]. when the set of features has been defined from the heartbeats, models can be built from these data using artificial intelligence algorithms from the machine learning and data mining domains for arrhythmia heartbeat classification. the most popular techniques employed for this task and found in the literature are artificial neural networks (ann), the convolution neural network (cnn), dwt, support vector machines (svm), decision trees (dt), bayesian, fuzzy, linear discriminant analysis (lda), and k-nearest neighbors (knn) [29]. a minimal sketch of this pipeline is given below.
fig. 1. a typical electrocardiogram signal [12].
fig. 2. general diagram of electrocardiogram classification.
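to make the three-stage structure concrete, the following is a minimal illustrative sketch in python of the pre-processing and feature extraction stages; the band-pass cut-offs, the window length around each r peak, and the simple time-domain statistics are assumptions chosen for illustration and are not taken from any of the reviewed papers.

```python
import numpy as np
from scipy.signal import butter, filtfilt

FS = 360  # sampling rate in hz (mit-bih records are sampled at 360 hz)

def denoise(ecg, fs=FS):
    # pre-processing: band-pass filter to suppress baseline wander and
    # high-frequency noise (0.5-40 hz is an illustrative choice)
    b, a = butter(3, [0.5, 40.0], btype="bandpass", fs=fs)
    return filtfilt(b, a, ecg)

def segment_beats(ecg, r_peaks, fs=FS, half_window=0.35):
    # cut a fixed-length window around each detected r peak
    half = int(half_window * fs)
    return [ecg[p - half:p + half] for p in r_peaks
            if p - half >= 0 and p + half <= len(ecg)]

def beat_features(beat):
    # feature extraction: a few simple time-domain statistics per beat
    # (real systems typically use dwt, pca, ica, or pan-tompkins features)
    return np.array([beat.mean(), beat.std(), beat.min(), beat.max(),
                     np.sum(np.abs(np.diff(beat)))])

# usage: X = np.array([beat_features(b) for b in segment_beats(denoise(sig), peaks)])
# the feature matrix X would then be fed to any of the classifiers reviewed below
```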
many surveys on ecg analysis and classification have been published. karpagachelvi [30] surveyed the most effective features for ecg analysis and classification, as ecg features play a significant role in diagnosing most cardiac diseases. nasehi and pourghassem [31] provided a survey of various types of seizure detection algorithms and their potential diagnostic role. various machine-learning approaches for ecg analysis and classification were reviewed by roopa and harish [23]. a comprehensive review was published in 2018, covering the literature on ecg analysis mostly from the past decade and addressing most of the major aspects of ecg analysis, such as pre-processing, denoising, feature extraction, and classification methods [16].
the main purpose of this work is to review most of the common techniques that have been used, mostly from the past 5 years. moreover, the paper can be useful for other researchers in identifying open issues in ecg classification and analyzing the research area, as many aspects of the methods are addressed. the rest of this paper is ordered as follows: section 2 contains classification techniques, section 3 provides a discussion, and section 4 presents the conclusion.
2. classification
a lot of pathological information about a patient's heart processes can be obtained by studying the ecg signal [32]. many approaches have been developed to classify heartbeats, as this is essential for the detection of an arrhythmia [33]. arrhythmias can be divided into two groups, life-threatening and non-life-threatening; the diagnosis of non-life-threatening arrhythmias requires long-term ecg classification, which can be time-consuming and impractical, so automatic algorithms are a great aid. consequently, automatic ecg classification of arrhythmias is one of the topics most worth studying [34]. various classifiers have been used for the ecg classification task. in this paper, the most common ecg classification methods proposed between 2016 and 2020 are reviewed; these methods can be clustered, based on the classifier, into several categories such as ann, cnn, knn, svm, and dwt. all of the reviewed papers were accessed through three well-known publishers: ieee, sciencedirect, and springer. different types of classification techniques are studied to classify ecg data under various features, as there are plenty of features in the ecg signal that can be extracted. some of the classification methods are addressed below.
2.1. ann
the ann is an adaptive system with useful properties such as the ability to adapt, learn, and generalize; its parallel processing, self-organizing, fault-tolerant, and adaptive capabilities make it capable of solving many complex problems, and the ann is also very accurate in classification and prediction [35]. a neural network (nn) consists of a number of layers; the initial layer has a connection to the system input, and the final layer gives the output of the network [36]. nns with hidden layers and sufficient neurons can be applied to any limited input-output mapping problem [37]. the nn model consists of an input layer, a hidden layer, and an output layer, as shown in fig. 3 [38].
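a minimal sketch of such a feed-forward network, using scikit-learn's mlpclassifier; the single hidden layer of 30 neurons and the synthetic feature matrix are illustrative assumptions rather than any architecture from the works summarized below.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# synthetic stand-in for a beat-feature matrix (rows = beats, columns = features)
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 2, size=1000)          # 0 = normal, 1 = abnormal (illustrative)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
scaler = StandardScaler().fit(X_tr)

# input layer -> one hidden layer of 30 neurons -> output layer
clf = MLPClassifier(hidden_layer_sizes=(30,), max_iter=500, random_state=0)
clf.fit(scaler.transform(X_tr), y_tr)
print("test accuracy:", clf.score(scaler.transform(X_te), y_te))
```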
many works have been published on ecg classification based on anns; some recent approaches follow.
chen et al. (2016) proposed a wavelet-based ann (w-ann) method built on the wavelet transform. the results showed that the w-ann provides a cleaner ecg input signal and lower computing time, with a reduction of 49%. the method was applied to the mit-bih arrhythmia database and to real ecg signal measurements [39]. boussaa et al. (2016) presented the design of a cardiac pathology detection system with high precision of calculation and decision, consisting of the mel-frequency cepstrum coefficient algorithm as a fingerprint (feature) extractor for the cardiac signal and an ann multilayer perceptron (mlp) classifier that sorts the extracted fingerprints into two classes: normal or abnormal. the design and testing of the proposed system were performed on two types of data extracted from the mit-bih database: a learning base containing labeled data (normal and abnormal ecg) and a test base containing unlabeled data. the experimental results showed that the proposed system combines the respective advantages of the mel-frequency cepstrum coefficient descriptor and the mlp classifier [40].
fig. 3. artificial neural network.
savalia et al. (2017) distinguished between normal and abnormal ecg data using the signal processing and nn toolboxes in matlab. data downloaded from an ecg database, physiobank, were used for training the nn. a feature extraction method was also used to identify various heart diseases such as bradycardia, tachycardia, first-degree atrioventricular (av) block, and second-degree av block. since ecg signals are very noisy, signal processing techniques were applied to remove the noise contamination. the heart rate of each signal was calculated by finding the distance between the r-r intervals of the signal, and the qrs complex was used to detect av blocks. the results showed that the algorithm strongly distinguished between normal and abnormal data as well as identifying the type of disease [41]. wess et al. (2017) presented field-programmable gate array (fpga)-based ecg arrhythmia detection using an ann. the objective was to implement an nn-based machine-learning algorithm on an fpga to detect anomalies in ecg signals, with better performance and accuracy (acc) compared with statistical methods. an implementation with pca for feature reduction and an mlp for classification proved superior to other algorithms. for the fpga implementation, the effects of several parameters and simplifications on performance, acc, and power consumption were studied; piecewise linear approximation of the activation functions and a fixed-point implementation were effective in reducing the required resources. the resulting nn, with 12 inputs and six neurons in the hidden layer, achieved, in spite of the simplifications, the same overall acc as simulations with floating-point number representation; an acc of 99.82% was achieved on average for the mit-bih database [42]. pandey et al. (2018) compared three different ann models for classifying normal and abnormal signals using the university of california, irvine 12-lead ecg signal data. this work used the back propagation (bp) network, radial basis function (rbf) network, and recurrent neural network (rnn). the rnn model showed the best analysis results, with a testing classification acc of 83.1%; this result was better than some work using the same database [43]. sannino and pietro (2018) proposed an approach based on a deep neural network (dnn) for the automatic classification of abnormal ecg beats, differentiated from normal ones. the dnn was developed using the tensorflow framework and was composed of only seven hidden layers, with 5, 10, 30, 50, 30, 10, and 5 neurons, respectively. comparisons were made between the proposed model and 11 other well-known classifiers, and the numerical results showed the effectiveness of the approach, especially in terms of acc [44].
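several of these works (savalia et al. above, debnath et al. below) derive heart rate from the distance between successive r peaks; a minimal sketch, assuming the r-peak sample indices have already been detected (e.g., by a pan-tompkins-style detector) and using the usual clinical cut-offs of 60 and 100 beats per minute.

```python
import numpy as np

FS = 360  # sampling rate in hz (assumption)

def heart_rate(r_peaks, fs=FS):
    # r-r intervals in seconds, then beats per minute per interval
    rr = np.diff(r_peaks) / fs
    return 60.0 / rr

def rhythm_label(bpm):
    # <60 bpm -> bradycardia, >100 bpm -> tachycardia, otherwise normal
    mean_bpm = np.mean(bpm)
    if mean_bpm < 60:
        return "bradycardia"
    if mean_bpm > 100:
        return "tachycardia"
    return "normal"

# example: r peaks roughly 0.8 s apart -> ~75 bpm -> "normal"
print(rhythm_label(heart_rate(np.array([100, 388, 676, 964]))))
```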
debnath et al. (2019) proposed two schemes. first, the qrs components were extracted from the noisy ecg signal by rejecting the background noise, using the pan-tompkins algorithm; the second task involved the calculation of heart rate and the detection of tachycardia, bradycardia, asystole, and second-degree av block from the detected qrs peaks using matlab. the results showed that, from the detected qrs peaks, arrhythmias based on an increase or decrease in the number of qrs peaks, or on the absence of a qrs peak, could be diagnosed. the final task was to classify the heart abnormalities according to the previously extracted features; a bp-trained feed-forward nn was selected for this purpose, and the data used for the analysis of the ecg signals are from the mit database [45]. abdalla et al. (2019) presented an approach developed around non-linear and non-stationary decomposition methods, owing to the nature of the ecg signal. complete ensemble empirical mode decomposition with adaptive noise (ceemdan) was used to obtain intrinsic mode functions (imfs). based on those imfs, four parameters were computed to construct the feature vector: average power, coefficient of dispersion, sample entropy, and singular values were calculated from the first six imfs. an ann was then adopted to classify, using this feature vector, five different arrhythmia heartbeats downloaded from the mit-bih database on physionet. the performance of ceemdan and ann was better than all existing methods, with a sensitivity (sen) of 99.7%, specificity (spe) of 99.9%, acc of 99.9%, and a receiver operating characteristic (roc) value of 1.0 [46].
2.2. convolutional neural network (cnn)
the cnn is the most common technique used to classify the ecg. a cnn is mainly composed of two parts, feature extraction and classification [47]. the feature extraction part is responsible for extracting effective features from the ecg signals automatically, while the classification part is in charge of classifying signals accurately by making use of the extracted features, as shown in fig. 4 [48].
fig. 4. typical convolution neural network structure.
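a minimal sketch of a 1-d cnn of this kind in pytorch; the 360-sample window (one second at the mit-bih sampling rate), the number of filters, and the five output classes are illustrative assumptions and do not reproduce any specific architecture from the works summarized below.

```python
import torch
import torch.nn as nn

class BeatCNN(nn.Module):
    def __init__(self, n_classes=5, beat_len=360):
        super().__init__()
        # feature-extraction part: stacked 1-d convolutions and pooling
        self.features = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=7, padding=3), nn.ReLU(), nn.MaxPool1d(2),
            nn.Conv1d(16, 32, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool1d(2),
        )
        # classification part: fully connected layer over the flattened feature maps
        self.classifier = nn.Linear(32 * (beat_len // 4), n_classes)

    def forward(self, x):                    # x: (batch, 1, beat_len)
        z = self.features(x)
        return self.classifier(z.flatten(1))

model = BeatCNN()
scores = model(torch.randn(8, 1, 360))       # raw beats in, class scores out
print(scores.shape)                          # torch.Size([8, 5])
```

note that, unlike the feature-based classifiers in the other subsections, the raw beat itself is the input here, which is what makes the cnn a "feature-less" technique.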
many approaches to ecg classification based on cnns have been published; some recent works follow.
zubair et al. (2016) proposed a model integrating two main parts, feature extraction and classification. the model automatically learns a suitable feature representation from raw ecg data and thus negates the need for hand-crafted features. using small and patient-specific training data, the proposed classification system efficiently classified ecg beats into five different classes. ecg signals from 44 recordings of the mit-bih database were used to assess the classification performance, and the results demonstrate that the proposed approach achieves a significant classification acc and superior computational efficiency compared with most state-of-the-art methods for ecg signal classification [49]. yin et al. (2016) proposed a system that uses impulse radio ultra-wideband radar data as additional information to assist the arrhythmia classification of ecg recordings in the slight motion state. the system employs a cascaded cnn to achieve an integrated analysis of ecg recordings and radar data. the experiments were implemented on the caffe platform, and the result reaches an acc of 88.89% in the slight motion state; it turns out that this system keeps a stable classification acc for normal and abnormal heartbeats in the slight motion state [50]. oh et al. (2017) designed a nine-layer deep cnn (dcnn) to identify five different categories of heartbeats in ecg signals automatically. the test was applied to original ecg signals and to signals derived from the available database; the set was artificially augmented and high-frequency noise was removed. the cnn model was trained on the augmented data and obtained accuracies of 93.47% and 94.03% in the identification of heartbeats in noise-free and original ecgs [51]. zhai and tin (2018) proposed an approach based on a cnn model with a different structure. the model improved the sen and positive predictive rate for s beats by more than 12.2% and 11.9%, respectively. the system provides a fully automatic and reliable tool to detect arrhythmia heartbeats without any manual feature extraction or expert assistance [52]. zhang et al. (2019) introduced a new pattern recognition method for ecg data using a dcnn. differently from past methods that learn features from the raw signal domain or use hand-crafted features, the proposed method learns the features and classifiers from the time-frequency domain. first, the ecg wave signal is transformed into the time-frequency domain using the short-time fourier transform; then, several scale-specific dcnn models are trained on ecg samples of a specific length; eventually, an online decision fusion method fuses decisions at different scales into a more accurate and stable one [53]. wang (2020) proposed a novel approach for automated atrial fibrillation (af) detection based on an 11-layer dnn whose structure combines a modified elman neural network (menn) and a cnn. ten-fold cross-validation was conducted to evaluate the classification performance of the model on the mit-bih af database. the results confirmed that the model yields excellent classification performance, with acc, sen, and spe of 97.4%, 97.9%, and 97.1%, respectively [54]. yao et al. (2020) designed an attention-based time-incremental cnn (ati-cnn), a dnn model that obtains both spatial and temporal fusion of information from ecg signals by integrating a cnn. its features are flexible input length, a halved parameter count, and more than 90% computation reduction in real-time processing. the experimental results showed that ati-cnn achieved an overall classification rate of 81.2%; compared with vggnet, a classical 16-layer cnn, ati-cnn achieved acc increases of 7.7% on average and up to 26.8% in detecting paroxysmal arrhythmias [55].
2.3. dwt
the dwt is used to recognize and diagnose ecg signals and is widely used in signal processing [56]. good time resolution is a main advantage of the dwt [57]; it provides good frequency resolution at low frequencies and good time resolution at high frequencies [58]. the dwt can reveal the local characteristics of the input signal because of this great time and frequency localization ability [59].
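a minimal sketch of dwt-based feature extraction with pywavelets; the db4 wavelet and five decomposition levels mirror the kind of decomposition described by saraswat below, while the summary statistics taken from each sub-band are an illustrative assumption.

```python
import numpy as np
import pywt

def dwt_features(beat, wavelet="db4", level=5):
    # decompose the beat into one approximation and five detail sub-bands
    coeffs = pywt.wavedec(beat, wavelet, level=level)
    feats = []
    for c in coeffs:
        # summarize each sub-band with a few statistics
        feats += [np.mean(c), np.std(c), np.sum(c ** 2)]
    return np.array(feats)

# example on a synthetic one-second "beat" (360 samples)
beat = np.sin(np.linspace(0, 2 * np.pi, 360))
print(dwt_features(beat).shape)   # (18,) -> 6 sub-bands x 3 statistics each
```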
many works on ecg classification based on the dwt have been published; some recent approaches follow.
desai et al. (2015) described a machine learning-based approach for detecting five classes of ecg arrhythmia beats based on dwt features. ica was used for dimensionality reduction, the anova approach was used to select significant features, and ten-fold cross-validation was used with an svm. the experiment was conducted on the mit-bih arrhythmia database, grouped into five classes of arrhythmia beats, namely non-ectopic (n), ventricular ectopic (v), supraventricular ectopic (s), fusion (f), and unknown (u); an svm with a quadratic kernel classified the ecg features with an overall average acc of 98.49% [60]. saraswat (2016) explored diverse possibilities of decomposition using the dwt to classify wolff-parkinson-white syndrome ecg signals. in this work, ecg signals are decomposed down to the 5th resolution level of the decomposition tree using the dwt with the daubechies wavelet of order 4 (db4), which helps in smoothing the features and makes them more appropriate for detecting changes in the signals. the mit-bih database was used for the experimental results [61]. alickovic and subasi (2016) noted that random forest (rf) classifiers achieved superior performance compared with dt methods using ten-fold cross-validation on the ecg datasets, and the results suggested that further significant improvements in classification acc could be accomplished by the proposed classification system; accurate ecg signal classification is the major requirement for the detection of all arrhythmia types. the performance of the proposed system was evaluated on two different databases, namely the mit-bih database and the st. petersburg institute of cardiological technics 12-lead arrhythmia database. for the mit-bih database, the rf classifier generated an overall acc of 99.33% against 98.44% and 98.67% for the c4.5 and cart classifiers, respectively; for the st. petersburg 12-lead arrhythmia database, the rf classifier yielded an overall acc of 99.95% against 99.80% for both the c4.5 and cart classifiers. the merged model with multiscale pca de-noising, dwt, and an rf classifier also achieves good performance, with area under the roc curve (auc) and f-measure equal to 0.999 and 0.993 for the mit-bih database and 1 and 0.999 for the st. petersburg 12-lead arrhythmia database, respectively. the results demonstrated that the proposed system can classify ecg signals reliably and help clinicians make an accurate diagnosis of cardiovascular disorders (cvds) [62]. pan et al. (2017) proposed a comprehensive approach based on random forest techniques and the discrete wavelet transform for arrhythmia diagnosis. specifically, the dwt was used to remove high-frequency noise and baseline drift, while the dwt, autocorrelation, pca, variances, and other mathematical methods were used to extract frequency-domain, time-domain, and morphology features. moreover, an arrhythmia classification system was developed and its availability verified, so the proposed scheme can be used for guidance and reference in clinical automatic arrhythmia classification [63].
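random forests appear in several of the dwt-based studies in this subsection (alickovic and subasi, pan et al., and others below); a minimal scikit-learn sketch, where the synthetic feature matrix stands in for wavelet-derived features and the parameter choices are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 18))            # e.g. 18 wavelet statistics per beat
y = rng.integers(0, 5, size=2000)          # five beat classes (n, s, v, f, u)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)
forest = RandomForestClassifier(n_estimators=100, random_state=1)
forest.fit(X_tr, y_tr)
print("test accuracy:", forest.score(X_te, y_te))
```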
sahoo (2017) proposed an improved algorithm to find qrs complex features based on the wavelet transform to classify four kinds of ecg beats: normal (n), left bundle branch block (lbbb), right bundle branch block (rbbb), and paced beats (p), using nn and svm classifiers. model performance was evaluated in terms of sen, spe, and acc for 48 recorded ecg signals obtained from the mit-bih arrhythmia database. the proposed procedure achieved high detection efficiency, with a low error rate of 0.42% when detecting the qrs complex. the classifiers confirmed their superiority with average accuracies of 96.67% for the nn and 98.39% for the svm; the classification acc of the svm approach thus proves superior to that of the nn classifier with the extracted parameters in detecting ecg arrhythmia beats [64]. ceylan (2018) studied a model based on sparse coefficients of the signals, obtained using sparse representation algorithms and dictionary learning. the obtained coefficients were utilized in the weight update process of three different classification approaches, created using svm, adaboost, and lda algorithms. in the first step, the proposed dictionary learning (dl)-based adaboost classifiers isolated the ecg signals; then, feature selection was applied to the ecg signals, and six different feature subsets were obtained by dwt, t-test, bhattacharyya, first order statistics (fos), wilcoxon test, and entropy methods. the combination of subsets was used as a new dataset. the classification process was performed according to the proposed method, and satisfactory results were obtained; the best classification acc of 99.75% was achieved using the proposed method, called dl-adaboost-svm, for the subset of attributes obtained using the dwt and wilcoxon test methods [65]. tea and vladan (2018) proposed a novel framework that combines the theory of compressive sensing and random forests to achieve reliable automatic cardiac arrhythmia detection. moreover, they evaluated the characterization power of dct, dwt, and fft data transformations to extract significant features that can bring an additional boost to the classification performance. in experiments conducted on the mit-bih benchmark arrhythmia database, the results demonstrated that dwt-based features perform better than the other feature extraction techniques for a relatively small number of randomly projected coefficients; furthermore, due to its low complexity, the proposed model could be implemented in practical applications of real-time ecg monitoring [66]. zhang et al. (2019) proposed a lightweight approach to classify five types of cardiac arrhythmia, namely normal beat (n), premature ventricular contraction (pvc) (v), atrial premature contraction (apc) (a), rbbb beat (r), and lbbb beat (l). a mixed method of frequency analysis and shannon entropy was applied to extract appropriate statistical features, and the information gain criterion was used for selecting features. the selected features were then fed to random forest, knn, and j48 classifiers; ten-fold cross-validation was used to verify the effectiveness of the method. experimental results showed that the random forest classifier demonstrates significant performance, with a spe of 99.5%, the highest sen of 98.1%, and an acc of 98.08%, outperforming other representative approaches for automated cardiac arrhythmia classification [67].
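most of the studies above report accuracy, sensitivity, and specificity estimated with ten-fold cross-validation; a minimal sketch of how those figures can be computed for a binary (normal/abnormal) problem, using an illustrative classifier and synthetic labels rather than any data from the reviewed papers.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 10))
y = rng.integers(0, 2, size=500)           # 1 = abnormal beat, 0 = normal beat

sens, specs, accs = [], [], []
for tr, te in StratifiedKFold(n_splits=10, shuffle=True, random_state=2).split(X, y):
    clf = LogisticRegression(max_iter=1000).fit(X[tr], y[tr])
    tn, fp, fn, tp = confusion_matrix(y[te], clf.predict(X[te])).ravel()
    sens.append(tp / (tp + fn))                        # sensitivity (sen)
    specs.append(tn / (tn + fp))                       # specificity (spe)
    accs.append((tp + tn) / (tp + tn + fp + fn))       # accuracy (acc)

print(np.mean(sens), np.mean(specs), np.mean(accs))
```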
kora et al. (2019) developed an algorithm to detect atrial fibrillation (af) in the ecg signal. for correct detection of af, pre-processing and feature extraction of the ecg signal are performed before detection. after taking the ecg signal from the database, denoising is carried out in the pre-processing stage to obtain a clean ecg signal. after pre-processing and before feature extraction, r-peak detection is carried out; since the r peak has the highest amplitude, it is detected first, and the locations of the other peaks of the ecg signal are then found. after pre-processing and feature extraction, the dwt is applied based on inverted t wave logic and st-segment elevation. the classification algorithm was demonstrated to successfully acquire, analyze, and interpret ecgs for the presence of af, indicating its potential to support m-health diagnosis, monitoring, and management of therapy in af patients [68].
2.4. svm
the svm is a learning algorithm with many good properties. it is associated with data analysis and pattern recognition. the svm uses a linear discriminant function for classification; however, non-linear classification can also be performed if a non-linear kernel is used [69]. the svm performs well in real-time situations and is robust and easy to understand compared with other classifiers [30]. a classification task typically requires knowledge about the data to be classified; hence, the classifier must be trained before classifying any data [70]. one of the main advantages of the svm classifier is that it automatically finds the support vectors for better classification [71]. in every case, the performance of the svm depends on the selected kernel function [72].
many studies on ecg classification based on the svm have been published; some recent ones follow.
elhaj et al. (2016) investigated a combination of linear and non-linear features to improve the classification of ecg data. in the study, five types of arrhythmia beat classes, as recommended by the association for the advancement of medical instrumentation, are analyzed: non-ectopic beats (n), supraventricular ectopic beats (s), ventricular ectopic beats (v), fusion beats (f), and unclassifiable and paced beats (u). the characterization ability of non-linear features, such as high-order statistics and cumulants, and non-linear feature reduction methods, such as ica, is combined with linear features, namely the pca of dwt coefficients. the features are tested for their ability to differentiate different classes of data using different classifiers, namely the svm and nn methods, with ten-fold cross-validation; this method can classify the n, s, v, f, and u arrhythmia classes with high acc (98.91%) using a combined svm and rbf method [73]. arjunan (2016) reported that statistical features can be useful for categorizing ecg signals. first, the signal is passed through a de-noising process as pre-processing; then, statistical features such as the mean, variance, standard deviation, and skewness are extracted from the signal. an svm was implemented to classify the ecg signal into two categories, normal or abnormal, and the results show that the system classifies the given ecg signal with 90% sen and spe [74]. smíšek et al. (2017) proposed a method for automatic ecg classification into four classes (normal rhythm [n], af [a], other rhythm [o], and noisy records [p]). the svm approach was involved at two different stages of the model.
in the first stage, an svm was used to extract global features from the entire ecg signal; in the second stage, the features from the previous step were used to train a second svm classifier. the cross-validation technique was used to evaluate both classifiers. the results showed that, in phase ii of the challenge, the total f1 score of the method was 0.81 on the hidden challenge dataset and 0.84 on the training set [75]. wu et al. (2017) developed a system for identifying excessive alcohol consumption. three sensors were used to acquire signals: ecg, intoxilyzers, and photoplethysmography (ppg). intoxilyzers were used to determine the alcohol consumption levels of participants before and after drinking. the signals were pre-processed, segmented, and subjected to feature extraction using specific algorithms to produce ecg and ppg training and test data. using the ecg, ppg, and alcohol consumption data, the developed model provided fast and accurate identification with the svm algorithm; the training data were used for training, and the test data were applied to confirm the recognition performance of the trained svms. the identification performance of the proposed classifiers reached 95% on average. in the approach, different feature combinations were tested to select the optimum configuration; because the ppg and ecg features produce identical classification performance and the ppg features are more convenient to acquire, the technical setting based on ppg is preferable for developing smart and wearable devices for the identification of driving under the influence [76]. venkatesan et al. (2018) performed ecg signal pre-processing and svm-based arrhythmic beat classification to categorize subjects as normal or abnormal. in the ecg signal pre-processing, a delayed error normalized lms adaptive filter is used to achieve a high-speed and low-latency design with fewer computational elements; since the signal processing technique is developed for remote healthcare systems, white noise removal is the main focus. the dwt is applied to the pre-processed signal for hrv feature extraction, and machine-learning techniques are used to perform arrhythmic beat classification. in this work, the svm classifier and other popular classifiers were applied to the noise-removed, feature-extracted signal for beat classification, and the results show that the svm classifier performs better than the other machine learning-based classifiers [77]. liu et al. (2019) proposed an ecg arrhythmia classification algorithm based on a cnn and compared the cnn models with a combination of linear discriminant analysis (lda) and svm. all cardiac arrhythmia beats were derived from the mit-bih arrhythmia database and grouped into five classes according to the standard developed by the association for the advancement of medical instrumentation (aami); the training set and the testing set come from different people, and the classification accuracy is above 90% [78].
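a minimal sketch of the rbf-kernel svm setup used by several of the studies in this subsection; feature scaling, the kernel parameters, and the synthetic feature matrix are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 12))            # stand-in for per-beat features
y = rng.integers(0, 2, size=1000)          # normal vs. abnormal (illustrative)

# scale the features, then fit an svm with a radial basis function kernel
svm_rbf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
print("10-fold accuracy:", cross_val_score(svm_rbf, X, y, cv=10).mean())
```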
2.5. knn
the knn algorithm is a simple machine-learning algorithm compared with similar machine-learning approaches [79], and many machine-learning approaches build on it [80]. the knn classifier is an instance-based learning method that stores all training sample vectors [81]. it is a very simple and effective method, especially for high-dimensional problems [82]. it classifies new unknown test samples based on similar training samples [83], and the similarity measure is usually the euclidean distance [84]. the knn classifier is based on grouping the closest training data points in the considered feature space, and the class is decided by a majority vote among the nearest neighbor points [85].
many works on ecg classification based on knn have been published; some recent ones follow.
faziludeen and sankaran (2016) presented a method for automatic ecg classification into two classes, normal and pvc. the evidential k-nearest neighbors (eknn) classifier, based on dempster-shafer theory, was used for classifying the ecg beats, with rr interval features. the analysis was performed on the mit-bih database, and the performance of eknn was compared with the traditional knn (maximum voting) approach; the effect of training data size was assessed using training sets of varying sizes, and the eknn-based classification system was shown to consistently outperform the knn-based classification system [86]. bouaziz et al. (2018) implemented an automatic ecg heartbeat classifier based on knn, with segmentation of the ecg signals performed by the dwt. the considered categories of beats are normal (n), pvc, apc, rbbb, and lbbb. the validation of the presented knn-based classifier was achieved using ecg data from the mit-bih arrhythmia database; they obtained excellent classification performance in terms of the calculated spe and sen of the classifier for several pathological heartbeats, with a global classification rate equal to 98.71% [87]. khatibi and rabinezhadsadatmahaleh (2019) proposed a novel feature engineering method based on deep learning and k-nns. the extracted features were classified with different classifiers such as dts, svms with different kernels, and random forests. the method has good performance for beat classification and achieves an average acc of 99.77%, auc of 99.99%, precision of 99.75%, and recall of 99.30% using a five-fold cross-validation strategy. the main advantage of the proposed method is its low computational time compared with training deep learning models from scratch and its high acc compared with traditional machine learning models; the strength and suitability of the proposed method for feature extraction are shown by the high balance between sen and spe [88].
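a minimal sketch of the distance-based classification described at the start of this subsection, using scikit-learn's knn with the euclidean distance; the value of k and the synthetic data are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(4)
X = rng.normal(size=(800, 8))              # e.g. rr-interval and morphology features
y = rng.integers(0, 5, size=800)           # five beat classes (illustrative)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=4)
# each test beat is assigned the majority class of its k nearest training beats
knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn.fit(X_tr, y_tr)
print("test accuracy:", knn.score(X_te, y_te))
```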
3. discussion
ecg classification, which shows the status of the heart and the cardiovascular condition, is essential to improve the patient's quality of life. the main purpose of this work is to review the main techniques of ecg signal classification. in general, any ecg classification structure can be divided into four stages. the first is the pre-processing step, which is a crucial step in ecg signal classification; for that reason, the most well-known techniques are reviewed in this paper, and the idea of using the pre-processing step, and combinations of pre-processing techniques, is to improve the performance of the model. the second step, called feature extraction, extracts the most relevant information from the ecg signal, which represents the heart status; a vital challenge is to extract efficient information that can discriminate between the various states of the ecg signal. the success rate of the model indicates whether the features contain valuable knowledge about the signal or not. the third step is the feature selection step: the execution time of the model is a crucial aspect and can be reduced by using optimal features from the feature space, and many techniques have been adopted for reducing the dimensionality of the features, some inspired by nature and others based on mathematical rules. the final and most important step is selecting a machine-learning algorithm to classify the ecg features; plenty of approaches have been used for this purpose. most of the classifier methods are fed with features, but the cnn is supplied with the raw signal, as the cnn is a feature-less technique. ann, cnn, dwt, knn, and svm methods are reviewed here. all reviewed articles were downloaded from three trusted sources, ieee, sciencedirect, and springer, for 2015-2020. tables 1-5 summarize all the reviewed articles in terms of which kind of machine learning was used, how effective the method was for ecg classification, and which kind of ecg dataset was used.
table 1: heartbeat classification methods based on ann
author (year) | dataset | purpose | methods | result | remarks
chen et al. (2016) | mit-bih arrhythmia dataset | reduce the computing time by a simple method | wavelet artificial neural network (w-ann) | average computing time reduced by 49% | use in mobile real-time applications to classify ecg
boussaa et al. (2016) | mit-bih arrhythmia dataset | record, process, and classify the ecg signal | mel-frequency coefficient cepstrum (mfcc)+ann | provides a robust and quick classification system | builds a system to classify ecg by a combination of signal processing algorithms
savalia et al. (2017) | mit-bih normal sinus database and mit-bih arrhythmia dataset | distinguish normal and abnormal ecg | ann | accuracy=86% | abnormal ecg is used to identify specific heart diseases
wess et al. (2017) | mit-bih arrhythmia dataset | present fpga-based ecg arrhythmia detection | pca+ann | accuracy=99.82% | increased the number of inputs, hidden layers, and fixed point
pandey et al. (2018) | uci arrhythmia dataset | early and correct identification of cardiac disease | rnn, rbf and bpa | accuracy: rnn=83.05%, rbf=75.25%, bpa=74.35% | accuracy of rnn is better than the two other ann models
sannino and pietro (2018) | mit-bih arrhythmia dataset | automatic recognition of abnormal beats | dnn | accuracy=99.68% | the model is competitive in sensitivity and specificity
debnath et al. (2019) | mit-bih arrhythmia dataset | analyze and predict heart abnormality | ann | accuracy: normal=97.46%, bradycardia=87.20%, tachycardia=99.97%, block=66.72% | input is noisy ecg signals
abdalla et al. (2019) | mit-bih arrhythmia dataset | distinguish between different types of ecg arrhythmia | ceemdan+ann | accuracy=99.9% | the performance of ceemdan and ann is better than all existing methods
table 2: heartbeat classification methods based on cnn
author (year) | dataset | purpose | methods | result | remarks
zubair et al. (2016) | mit-bih arrhythmia database | learn features from raw ecg | 1d-cnn | accuracy=92.7% | the model avoids the need for hand-crafted features
yin et al. (2016) | data built on the ecg sensor chip bmd101 and a bluetooth module | monitor and classify ecg signals and radar signals | cascade cnn | accuracy=88.89% | the system can achieve stable performance in the slight motion state
oh et al. (2017) | mit-bih arrhythmia database | automatically identify five different categories of ecg | 9-layer cnn | accuracy with noise=94.03%, without noise=93.47% | generated synthetic data to overcome imbalance problems
zhai and tin (2018) | mit-bih arrhythmia database | implement the model on a portable device for long-term monitoring | cnn | accuracy>97% | the model does not need manual feature extraction or expert assistance
zhang et al. (2019) | synthetic and real-world ecg datasets | learn features and classifiers from the time-frequency domain | dcnn | accuracy=99% | the model can be integrated into a portable ecg monitor with limited resources
wang (2020) | mit-bih af dataset | automated af detection | cnn+menn | accuracy=97.4% | the model has great potential to assist physicians and reduce mortality
yao et al. (2020) | china physiological signal challenge 2018 database | classify varied-length ecg signals | attention-based time-incremental (ati)-cnn | accuracy=81.2% | compared with vggnet, the model increases accuracy
table 3: heartbeat classification methods based on dwt
author (year) | dataset | purpose | methods | result | remarks
desai et al. (2015) | mit-bih arrhythmia dataset | detect five classes of ecg arrhythmia | dwt+ica+svm | accuracy=98.49% | efficient system in healthcare diagnosis
saraswat et al. (2016) | mit-bih arrhythmia dataset | present a clear difference between normal and abnormal ecg | dwt | provides min and max values of normal and abnormal ecg | detects changes in signals, leading to smoother features
alickovic and subasi (2016) | mit-bih arrhythmia and st. petersburg institute of cardiological technics arrhythmia databases | automated system for the classification of ecg | dwt+c4.5+cart | accuracy: c4.5=99.95%, cart=99.80% | efficient system for cardiac arrhythmia detection
pan et al. (2017) | mit-bih arrhythmia dataset | developed a system for clinical arrhythmia classification | dwt+random forest | accuracy=99.77% | the system improves classification accuracy and speed
sahoo et al. (2017) | mit-bih arrhythmia dataset | improved algorithm to detect qrs complex features to classify four types of ecg | multiresolution wt+nn+svm | accuracy: nn=96.67%, svm=98.39% | extracted features are acceptable for classifying ecg by svm
ceylan (2018) | mit-bih arrhythmia dataset | system for signal compression, noise elimination, and classification | dwt+adaboost+svm+lda+dl | accuracy>99% | the best classification accuracy was obtained by dl-adaboost-svm
tea and vladan (2018) | mit-bih benchmark arrhythmia dataset | monitor ecg in real time | fft+dct+dwt+random forests | accuracy=97.33% | dwt provides the best performance in comparison with fft and dct
zhang et al. (2019) | mit-bih arrhythmia dataset | diagnosis with a low-cost wearable ecg device | dwt+rf+knn+j48 | accuracy=98.08% | reduces the computational cost and improves the classification efficiency
kora et al. (2019) | mit-bih arrhythmia dataset | detect atrial fibrillation in the ecg signal | dwt+knn+svm | accuracy: dwt+svm=94.07%, dwt+knn=99.5% | dwt represents the essential characteristics of the ecg
table 4: heartbeat classification methods based on svm
author (year) | dataset | purpose | methods | result | remarks
elhaj et al. (2016) | mit-bih arrhythmia database | classify the ecg signal with high accuracy | pca+dwt+ica+hos+nn+svm-rbf | accuracy: svm-rbf=98.91%, nn=98.90% | both classifiers provide equal average accuracy, sensitivity, and specificity
arjunan (2016) | mit-bih arrhythmia database | categorize ecg by an automated system | svm | accuracy=90% | mean, variance, standard deviation, and skewness are used for feature extraction
smíšek et al. (2017) | hidden dataset of the 2017 physionet/cinc challenge | advanced method for automatic ecg classification | svm-rbf | f1-measure=0.81 | quite high performance was achieved even for a low number of training records
wu et al. (2017) | data collected by sensors | recognize drunk driving by ecg and ppg | svm | accuracy=95% | smart and wearable sensing devices offer a suitable solution for detecting drunk driving
venkatesan et al. (2018) | mit-bih arrhythmia database | classifier with low computational complexity | svm | accuracy=96% | svm is better than various classification techniques
liu et al. (2019) | mit-bih arrhythmia database | robust and efficient model to achieve real-time ecg analysis | svm+cnn+lda | accuracy>90% | sometimes there is no need to extract complex features of the ecg
table 5: heartbeat classification methods based on knn
author (year) | dataset | purpose | methods | result | remarks
faziludeen and sankaran (2016) | mit-bih arrhythmia database | classify ecg beats into two classes | knn+eknn | lower error rates | an increase in training size is shown to lower the error rates
bouaziz et al. (2018) | mit-bih arrhythmia database | implement an automatic ecg heartbeat classifier | knn | accuracy=98.71% | knn is an important and significant tool for ecg recognition
khatibi and rabinezhadsadatmahaleh (2019) | mit-bih arrhythmia database | classify ecg for arrhythmia detection | cnn+dt+svm+rf+k-nns | accuracy=99.77% | the method has low computational time
some important points observed in ecg classification are highlighted below.
according to the previous works based on the ann algorithm for heartbeat classification, the ann is trained using the polyspectrum patterns and features extracted from the higher-order spectral analysis of normal and abnormal ecg signals. the ann is used as a classifier to help knowledge management and decision-making systems improve classification acc. the results show that the ann with pca obtains the lowest error rate when classifying the ecg signal, and the performance of ceemdan with ann is better than all existing and previous algorithms (table 1).
the cnn is straightforward to apply, as the cnn is a feature-less technique; hence, the researcher does not need to be concerned about features, meaning that no handcrafted features are required in the cnn model. both 1d and 2d cnns have been adopted; according to the observed results, the 1d cnn outperforms the 2d cnn and is less complex in terms of computational steps. the cnn can also be integrated with the menn to improve the classification acc (table 2).
the dwt is applied to each heartbeat to obtain morphological features. it provides better time and frequency resolution of the ecg signal, is a powerful tool for ecg classification, and is straightforward to implement; moreover, the dwt assists clinicians in making an accurate diagnosis of cvds. based on the summarized works, integrating the dwt with a random forest can achieve 99.77% acc (table 3).
the svm is widely used for pattern recognition.
an svm model with a weighted kernel function method significantly recognizes the q wave, r wave, and s wave in the input ecg signal to categorize the heartbeat. the svm is also a powerful tool for ecg classification; however, the cnn has outperformed the svm. moreover, the time consumed in implementing the svm is higher than for the knn model and lower than for the cnn model. the svm-rbf classifier classifies 95% of the given ecg signals correctly with simple statistical features (table 4).
the lowest computational cost for diagnosing arrhythmia can be achieved by applying knn, as the knn algorithm does not require a training stage. the role of the handcrafted features is vital to the knn model as long as the dimensionality of the obtained features is low, because the knn model works based on distance. time-domain and frequency-domain features are applied to the knn classifier for ecg classification, which is simpler than other machine-learning approaches (table 5).
4. conclusion
the classification of ecg signals plays an important role in recognizing normal and abnormal heartbeats, and increasing the acc of ecg classification is a challenging problem. it has attracted interest for more than a decade, and for this reason many approaches have been developed. in this paper, the most recent approaches are reviewed in terms of aspects such as method, dataset, contribution, and success rate, and tables 1-5 summarize the various approaches to ecg signal analysis. we suggest using a hybrid model based on a cnn with long short-term memory (lstm): the cnn part can extract features from the raw signal, which can be temporal features depending on how many convolution layers are used, and the lstm can learn the pattern in the temporal features, as the lstm is more suitable for time-series data; the model can then predict unknown ecg signals. we will tune the filters in the cnn model and the layers in the lstm model to increase the classification rate.
references
[1] a. alberdi, a. aztiria and a. basarab. “towards an automatic early stress recognition system for office environments based on multimodal measurements: a review”. journal of biomedical informatics, vol. 59, pp. 49-75, 2016.
[2] m. s. al-ani. “electrocardiogram waveform classification based on p-qrs-t wave recognition”. uhd journal of science and technology, vol. 2, no. 2, pp. 7-14, 2018.
[3] m. al-ani. “a rule-based expert system for automated ecg diagnosis”. international journal of advances in engineering and technology, vol. 6, no. 4, pp. 1480-1492, 2014.
[4] m. s. al-ani and a. a. rawi. “ecg beat diagnosis approach for ecg printout based on expert system”. international journal of emerging technology and advanced engineering, vol. 3, no. 4, pp. 797-807, 2013.
[5] s. h. jambukia, v. k. dabhi and h. b. prajapati. “classification of ecg signals using machine learning techniques: a survey”. in: conference proceeding 2015 international conference on advances in computer engineering and applications, pp. 714-721, 2015.
[6] j. li, y. si, t. xu and s. jiang. “deep convolutional neural network based ecg classification system using information fusion and one-hot encoding techniques”. mathematical problems in engineering, vol. 2018, p. 7354081, 2018.
review study on sciencedirect library based on coronavirus covid-19
muzhir shaban al-ani1, dimah mezher shaban al-ani2
1department of information technology, college of science and technology, university of human development, sulaymaniyah, iraq, 2department of pharmacy, college of pharmacy, university of philadelphia, amman, jordan
original research article
abstract
several years ago, china and the united states of america began experimenting with the coronavirus, which lives in the bat. it is not known until now how the virus spread and how it extended to all countries of the world. however, it is certain that this virus first appeared and spread at the end of 2019 in the chinese city of wuhan, especially in markets close to laboratories that are working on this virus. at the beginning of the year 2020, this virus began to spread very widely all over the world and began killing thousands of people every day. the world economy was severely damaged, and the world health organization considered the disease a pandemic. as for the research aspect, researchers started working on this pandemic from many aspects, including medical, statistical, managerial, healthcare, and others. a statistical analysis depends on many key factors that have been studied. this study was conducted on april 11, 2020, when a large number of research papers were downloaded using the keyword coronavirus disease (covid)-19 applied in the sciencedirect library, of which only 100 research papers were examined. the obtained results indicated that most of the research papers that worked on the subject of covid-19 confirmed that this virus infects the human respiratory system, which in turn leads to shortness of breath and death. here, it must be noted that the human immune system has a major role in the process of overcoming this virus and gradual recovery. the obtained analysis indicated that the main fields of coronavirus research are: medicine 42%, statistics 21%, healthcare 19%, and management 18%. through this study, it became clear that china is the first country in terms of the number of researchers and also in terms of the number of research papers related to covid-19.
index terms: coronavirus, coronavirus disease-19, diagnosis, human immune system, coronavirus disease-19 published papers
corresponding author's e-mail: muzhir shaban al-ani, department of information technology, college of science and technology, university of human development, sulaymaniyah, iraq. e-mail: muzhir.al-ani@uhd.edu.iq
received: 26-06-2020 accepted: 28-07-2020 published: 06-08-2020
doi: 10.21928/uhdjst.v4n2y2020.pp46-55 e-issn: 2521-4217 p-issn: 2521-4209
copyright © 2020 al-ani and al-ani. this is an open access article distributed under the creative commons attribution non-commercial no derivatives license 4.0 (cc by-nc-nd 4.0)
1. introduction
coronaviruses are a large group of viruses that may cause disease in animals and humans. it is known that a number of coronaviruses cause human respiratory infections that range from common colds to more severe diseases such as the middle east respiratory syndrome and severe acute respiratory syndrome (sars). coronavirus disease (covid)-19 is an infectious disease caused by the newly discovered coronavirus. there was no knowledge of this virus and this emerging disease before its outbreak in the chinese city of wuhan in december 2019. the most common symptoms of covid-19 disease are fever, fatigue, and dry cough. some patients may experience pain and aches, nasal congestion, cold, sore throat, or diarrhea. these symptoms are usually mild and begin gradually. some people become infected without showing any symptoms and without feeling ill. moreover, the severity of the disease intensifies in approximately one person out of every six people who develop covid-19 infection, who suffer from difficulty breathing. the risk is severe for the elderly and people with underlying medical problems such as high blood pressure, heart disease, or diabetes. about 2% of people who have contracted the disease have died. people with fever, cough, and difficulty breathing should seek medical care. the term "incubation period" refers to the period from infection with the virus to the onset of symptoms of the disease. most estimates of the incubation period for covid-19 disease range from 1 to 14 days, usually lasting 5 days. these estimates will be updated as more data become available. a group of kingston university's microbiologists has concluded that the coronavirus attacks two specific groups of cells in the lungs. one of these cell types is called a goblet cell, and the other is called a ciliated cell. they explain that the goblet cells produce the mucus that forms a moisturizing layer on the respiratory canal, which is important to help maintain the moisture of the lungs, and thus maintain health. the ciliated cells are cells with hairs that point upward, and their function is to shovel any harmful substance suspended in mucus, such as bacteria, viruses, and dust particles, toward the throat to get rid of them. the coronavirus, on the other hand, infects these two groups of cells, which is also observed with sars, professor felder says. felder added that the coronavirus infects these cells and begins to kill them, their tissues begin to fall and collect in the lungs, and the lungs begin to become obstructed, which means that the patient has pneumonia. then, the body's immune system tries to respond because it realizes that the body is under attack, and this may lead to an overload of immunity; the immune system then makes a major attack that damages healthy tissues in the lung, which also may make breathing more difficult. the virus not only attacks the lungs but also the kidneys, which may lead to kidney failure and later death. the world health organization considered covid-19 a pandemic disease, which means an epidemic that has spread over a large area, prevalent through an entire country, continent, or the whole world. this work aims to introduce a brief study of covid-19, in addition to analyzing the published papers in this field.
2. materials and methods
this section includes: the covid-19 mechanism, published papers on covid-19, and results and analysis.
2.1. covid-19 mechanism
the covid-19 virus gets into the human body via several routes: the eyes, the nose, and the mouth (fig. 1):
• via the eyes: in this case, the virus has two main pathways; it either exits through tears or enters the lacrimal sac, then reaches the nasal cavity and mouth, then goes to the stomach where it ends with the gastric juices, or it enters the respiratory system.
• via the nose: in this case, the virus is either expelled through the mucus or enters the respiratory system.
• via the mouth: in this case, the virus either enters the stomach, where the gastric juices kill it, or enters the respiratory system.
when the virus enters the respiratory system, it settles in the lung, where it begins to attack the cells there. at that time, the immune system begins to attack the virus, and this leads to many accumulations in the lung that cause a deficiency in the work of the lung. this means shortness of breath and sometimes leads to death.
fig. 1. coronavirus disease-19 virus entering the human body.
2.2. published papers on covid-19
the covid-19 keyword was applied on the sciencedirect library on april 11, 2020, which returned 1741 published papers, all of them published in 2020. for this work, 100 papers were selected to be studied, as shown in fig. 2. to analyze these hundred published papers, the first look at them was as follows: eighty-nine (89%) papers were downloaded and studied, eight (8%) papers could not be downloaded, two (2%) papers were duplicated, and finally one (1%) paper was repeated. in general, 11% of the papers were canceled and 89% of the papers were considered for study.
3. results and analysis
table 1 includes data on the published research papers that have been studied. this table was divided into several fields to clarify the quality of the research and the topic on which it was focused; these fields are paper id, topic of the paper, applied method, and applied database, in addition to the country in which the study was conducted and the country to which the researcher belongs. whatever the area of interest was, they are all related to the medical side of covid-19. the field of research according to the score is divided into four main categories: medicine 37 (42%), statistics 19 (21%), healthcare 17 (19%), and management 16 (18%). the medicine category includes: treatment 22 (60%), diagnosis 6 (16%), clinical 5 (14%), emergency 1 (2.5%), hepatic 1 (2.5%), surgical 1 (2.5%), and arrhythmia 1 (2.5%). the statistical category includes many fields of statistical analysis. the health-care category includes: health approach 5 (30%), health workers 3 (17%), personal health 3 (17%), community pharmacies 2 (12%), health-care provider 2 (12%), global health 1 (6%), and national care 1 (6%). the management category includes: recommendation 6 (38%), patient management 3 (19%), control 2 (13%), pharmaceutical care management 1 (6%), internet hospital 1 (6%), emergency 1 (6%), clinical management 1 (6%), and reorganization 1 (6%). a minimal tally reproducing the four main-category percentages is sketched below.
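the four main-category percentages quoted above follow directly from the per-category counts out of the 89 studied papers; the short sketch below only reproduces that arithmetic (the counts are the ones reported in this section, and the code itself is illustrative rather than part of the original analysis).

```python
# reproduces the main-category percentages reported in the text
# (counts taken from this section: 89 studied papers in total).
counts = {"medicine": 37, "statistics": 19, "healthcare": 17, "management": 16}
total = sum(counts.values())  # 89 papers that were downloaded and studied

for field, n in counts.items():
    print(f"{field}: {n} papers, {n / total:.0%}")
# medicine: 37 papers, 42%
# statistics: 19 papers, 21%
# healthcare: 17 papers, 19%
# management: 16 papers, 18%
```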
the most important categories are related to the country in which the study was conducted and the country to which the researcher belongs. the country to which the researcher belongs covers 25 countries, and china settles on top, as shown in fig. 3. the score of the researchers by country related to covid-19, sorted top to bottom, is: china 37%, usa 12%, france 7%, taiwan 7%, canada 5%, uk 5%, italy 5%, india 4%, and spain 2%, and the rest of the countries are 1% each. the country in which the study was conducted covers 20 countries, and china also settles on top, as shown in fig. 4. the score of the published research papers by country related to covid-19, sorted top to bottom, is: china 37%, global 27%, usa 12%, france 7%, taiwan 7%, canada 5%, uk 5%, italy 5%, india 4%, and spain 2%, and the rest of the countries are 1% each. as a result of the previous research and studies obtained, it was found that the coronavirus affects various places in the human body, and this virus has many different effects, including its effect on the lung and respiratory system, its effect on heart murmurs as well as on microvascular clots, in addition to other effects such as high temperatures, intestinal disturbances, vomiting, and other disorders.
fig. 2. published papers on coronavirus disease-19.
fig. 3. country to which the researcher belongs (no. of papers and % of papers per country).
fig. 4. country in which the research was conducted (no. of papers and % of papers per country).
table 1: search applied in sciencedirect library at april 7, 2020, listing for each of the studied papers [1]-[89] the paper id, topic, applied method, applied database, researcher by country, and search by country.
we know that the coronavirus (covid-19) started in china, so most of the research must begin and continue in this country to find a treatment for this disease. in spite of all the ongoing research papers on this subject, so far there is no effective treatment for this disease; on the other hand, all efforts are combined to achieve this goal and overcome this pandemic.
4. conclusions
december 2019 was the start of an outbreak of coronavirus (covid-19) in wuhan, china. then, at the beginning of the year 2020, this virus began to spread across the countries of the world gradually, as it appeared in all countries within only 2 months. during the spread of the coronavirus, researchers began working on this virus in different aspects and disciplines. this paper is based on a number of key factors that have been studied, such as: topic, applied method, applied database, researcher by country, and search by country. this search was applied on the sciencedirect library and conducted on april 11, 2020. in this work, a focus was placed on a hundred research papers, including 89 that were downloaded and 11 that could not be used, so this work is focused only on the research papers that were obtained. as a conclusion of this work, a huge number of research papers were published in these 2 months, and these research papers have been divided into four main categories: medicine 42%, statistics 21%, healthcare 19%, and management 18%.
each of these categories also was divided into many sub-categories related to narrower fields. furthermore, it is clear that china settles on top both as the country in which the studies were conducted and as the country to which the researchers belong.
5. acknowledgment
on this occasion, we would like to extend our sincere thanks and appreciation to the sciencedirect database that contributed to supporting the research of covid-19 by providing research papers without fees.
references
[1] a. a. jafari and s. ghasemi. “the possible of immunotherapy for covid-19: a systematic review”. international immunopharmacology, vol. 83, p. 106455, 2020. [2] k. liu, y. chen, d. wu, r. lin and l. pan. “effects of progressive muscle relaxation on anxiety and sleep quality in patients with covid-19”. complementary therapies in clinical practice, vol. 39, p. 101132, 2020. [3] t. xu, c. chen, z. zhu, m. cui and y. xue. “clinical features and dynamics of viral load in imported and non-imported patients with covid-19”. international journal of infectious diseases, vol. 94, pp. 68-71, 2020. [4] s. musa. “hepatic and gastrointestinal involvement in coronavirus disease 2019 (covid-19): what do we know till now”? arab journal of gastroenterology, vol. 21, no. 1, pp. 3-8, 2020. [5] r. djalante, r. shaw and a. dewit.
“building resilience against biological hazards and pandemics: covid-19 and its implications for the sendai framework”. progress in disaster science, vol. 6, p. 100080, 2020. [6] z. li, j. ge, m. yang, j. feng and c. yang. “vicarious traumatization in the general public, members, and non-members of medical teams aiding in covid-19 control”. brain, behavior, and immunity, vol. 88, pp. 916-919, 2020. [7] a. t. day, d. j. sher, r. c. lee, j. m. truelson and e. a. gordin. “head and neck oncology during the covid-19 pandemic: reconsidering traditional treatment paradigms in light of new surgical and other multilevel risks”. oral oncology, vol. 105, p. 104684, 2020. [8] c. i. wu, p. g. postema, e. arbelo, e. r. behr and a. a. m. wilde. “sars-cov-2, covid-19 and inherited arrhythmia syndromes”. heart rhythm, vol. 2020, p. 24, 2020. [9] v. lópez, t. vázquez, j. alonso-titos, m. “cabello and grupo de estudio great. “recommendations on management of the sarscov-2 coronavirus pandemic (covid-19) in kidney transplant patients”. nefrología, vol. 40, no. 3, pp. 265-271, 2020. [10] j. chen, t. qi, l. liu, y. ling and h. lu. “clinical progression of patients with covid-19 in shanghai, china”. journal of infection, vol. 80, no. 5, pp. e1-e6, 2020. [11] r. c. pérez, s. álvarez, l. llanos, a. n. ares and d, díaz-pérez. “recomendaciones de consenso separ y aeer sobre el uso de la broncoscopia y la toma de muestras de la via respiratoria en pacientes con sospecha o con infeccion confirmada por covid-19”. archivos de bronconeumología, vol. 56, no. 2, pp. 19-26, 2020. [12] j. p. escalera-antezana, n. f. lizon-ferrufino, a. maldonadoalanoca, g. alarcón-de-la-vega and lancovid. “clinical features of cases and a cluster of coronavirus disease 2019 (covid-19) in bolivia imported from italy and spain”. travel medicine and infectious disease, vol. 35, p. 101653, 2020. [13] s. ma, z. yuan, y. peng, j. chen and g. luo. “experience and suggestion of medical practices for burns during the outbreak of covid-19”. burns, vol. 46, no. 4, pp. 749-755, 2020. [14] l. bai, d. yang, x. wang, l. tong and f. tan. “chinese experts’ consensus on the internet of things-aided diagnosis and treatment of coronavirus disease 2019 (covid-19)”. clinical ehealth, vol. 3, pp. 7-15, 2020. [15] a. mejean, m. rouprêt, f. rozet, k. bensalah and le comité de cancérologie de l’association française d’urologie. “recommandations ccafu sur la prise en charge des cancers de l’appareil urogénital en période d’épidémie au coronavirus covid-19”. progrès en urologie, vol. 30, no. 5, pp. 221-231, 2020. [16] l. s. wang, y. r. wang, d. w. ye and q. q. liu. “a review of the 2019 novel coronavirus (covid-19) based on current evidence”. international journal of antimicrobial agents, vol. 55, no. 6, p. 105948, 2020. [17] k. liu, w. zhang, y. yang, j. zhang and y. chen. “respiratory rehabilitation in elderly patients with covid-19: a randomized controlled study”. complementary therapies, vol. 39, p. 101166, 2020. [18] c. bao, x. liu, h. zhang, y. li and j. liu. “covid-19 computed tomography findings: a systematic review and meta-analysis”. journal of the american college of radiology, vol. 17, pp. 701-709, 2020. [19] a. d. choi, s. abbara, k. r. branch, g. m. feuchtner and r. blankstein. “society of cardiovascular computed tomography guidance for use of cardiac computed tomography amidst the covid-19 pandemic”. journal of cardiovascular computed tomography, vol. 14, no. 1, pp. 101-104, 2020. [20] p. bansal, t. a. bingemann, m. greenhawt, g. mosnaim and m. shaker. 
clinician wellness during the covid-19 pandemic: extraordinary times and unusual challenges for the allergist/ immunologist. the journal of allergy and clinical immunology, vol. 8, no. 6, pp. 1781-1790, 2020. [21] x. jin, b. pang, j. zhang, q. liu and b. zhang. “core outcome set for clinical trials on coronavirus disease 2019 (cos-covid)”. engineering, vol. 1, pp. 2-6, 2020. [22] d. benvenuto, m. giovanetti, l. vassallo, s. angeletti and m. ciccozzi. “application of the arima model on the covid-2019 epidemic dataset”. data in brief, vol. 29, p. 105340, 2020. [23] z. chen, h. fan, j. cai, y. li and j. sun. “high-resolution computed tomography manifestations of covid-19 infections in patients of different ages”. european journal of radiology, vol. 126, p. 108972, 2020. [24] j. yang, y. zheng, x. gou, k. pu and y. zhou. “prevalence of comorbidities in the novel wuhan coronavirus (covid-19) infection: a systematic review and meta-analysis”. international journal of infectious diseases, vol. 94, pp. 91-95, 2020. [25] q. a. ahmed and z. a. memish. “the cancellation of mass gatherings (mgs)? decision making in the time of covid-19”. travel medicine and infectious disease, vol. 34, p. 101631, 2020. [26] s. m. schneider, v. albert, n. barbier, d. barnoud and p. déchelotte. “adaptations de la prise en charge des patients en nutrition artificielle a domicile au cours de l’épidémie virale covid-19 en france: avis du comite de nutrition a domicile de la societe francophone de nutrition clinique et metabolisme (sfncm)”. nutrition clinique et métabolisme, vol. 34, no. 2, pp. 105-107, 2020. [27] w. cao, z. fang, g. hou, m. han and j. zheng. “the psychological impact of the covid-19 epidemic on college students in china”. psychiatry research, vol. 287, p. 112934, 2020. [28] s. s. jean, p. i. lee and p. r. hsueh. “treatment options for covid-19: the reality and challenges”. journal of microbiology, immunology and infection, vol. 53, no, 3, pp. 436-443, 2020. [29] c. a. devaux, j. m. rolain, p. colson and d. raoult. “new insights on the antiviral effects of chloroquine against coronavirus: what to expect for covid-19”? international journal of antimicrobial agents, vol. 55, no. 5, p. 105938, 2020. [30] a. viswanath and p. monga. “working through the covid-19 outbreak: rapid review and recommendations for msk and allied heath personnel”. journal of clinical orthopaedics and trauma, vol. 11, no. 3, pp. 500-503, 2020. [31] x. yin, l. dong, y. zhang, w. bian and h. li. “a mild type of childhood covid-19 a case report”. radiology of infectious diseases, in press, 2020. [32] s. c. cheng, y. c. chang, y. l. fan chiang, y. c. chien and y. n. hsu. “first case of coronavirus disease 2019 (covid-19) pneumonia in taiwan”. journal of the formosan medical association, vol. 119, no. 3, pp. 747-751, 2020. [33] q. lin, s. zhao, d. gao, y. lou and d. he. “a conceptual model for the coronavirus disease 2019 (covid-19) outbreak in wuhan, china with individual reaction and governmental action”. international journal of infectious diseases, vol. 93, pp. 211-216, 2020. [34] k. liu, y. chen, r. lin and k. han. “clinical features of covid-19 in elderly patients: a comparison with young and middle-aged patients”. journal of infection, vol. 80, no. 6, pp. e14-e18, 2020. [35] p. gautret, j. c. lagier, p. parola, v. t. hoang and d. raoult.
comparison of different ensemble methods in credit card default prediction
azhi abdalmohammed faraj1,2, didam ahmed mahmud1, bilal najmaddin rashid1
1department of information technology, college of commerce, university of sulaimani, sulaimani, iraq, 2department of computer engineering, college of engineering, dokuz eylül üniversitesi, i̇zmir, turkey
a b s t r a c t
credit card defaults pose a business-critical threat in banking systems; thus, prompt detection of defaulters is a crucial and challenging research problem. machine learning algorithms must deal with a heavily skewed dataset since the ratio of defaulters to non-defaulters is very small. the purpose of this research is to apply different ensemble methods and compare their performance in detecting the probability of default on customers' credit card payments in taiwan, using data from the uci machine learning repository. this is done on both the original skewed dataset and then on a balanced dataset. several studies have shown the superiority of neural networks as compared to traditional machine learning algorithms; the results of our study show that ensemble methods consistently outperform neural networks and other machine learning algorithms in terms of f1 score and area under the receiver operating characteristic curve, regardless of whether the dataset is balanced or the imbalance is ignored.
index terms: ensemble methods, credit card default prediction, balanced and imbalanced dataset, stacking and xgboosting, neural networks
corresponding author's e-mail: azhi abdalmohammed faraj, department of information technology, college of commerce, university of sulaimani, sulaimani, iraq/department of computer engineering, college of engineering, dokuz eylül üniversitesi, i̇zmir, turkey. e-mail: azhi.faraj@univsul.edu.iq
received: 13-03-2021 accepted: 26-06-2021 published: 19-07-2021
doi: 10.21928/uhdjst.v5n2y2021.pp20-25 e-issn: 2521-4217 p-issn: 2521-4209
copyright © 2021 faraj, et al. this is an open access article distributed under the creative commons attribution non-commercial no derivatives license 4.0 (cc by-nc-nd 4.0)
1. introduction
in the aftermath of the global financial crisis of 2008–2009, many who took mortgages defaulted when they could not pay, leading many credit card issuers to routinely encounter a credit debt crisis. numerous occasions of over-issuing credit cards to unfit candidates have raised concerns. concurrently, a considerable percentage of cardholders, regardless of their repayment capabilities, relied heavily on credit cards and accumulated heavy credit debts. this has negatively affected banks and consumer confidence. the problem of credit card defaulting is a binary classification problem: applicants will either default or repay their credit debts; however, determining the probability of defaulting offers more value from the perspective of risk management than a plain binary classification result [1]. improving the detection of fraudulent activities by only one percent can have a major impact on reducing the losses of financial institutions [2]. the aim of a credit default detection model is to solve the problem of categorizing loan customers into two groups: good customers (those who are expected to pay off their full loans in an already agreed-upon time period) and bad customers (those who might default on their payments).
customers who pay their bills on time are more likely to repay their loans on time, which benefits banks. bad customers, on the other hand, can cost banks money. as a result, banks and financial institutions are increasingly focusing on the development of credit scoring models, as even a 1% improvement in identifying bad credit applicants will result in substantial potential savings for financial institutions. therefore, organizations and scholars have conducted extensive research on credit score models, which is a significant financial management practice. several studies have discussed the superiority of ensemble learning, and as new machine learning models are proposed, ensemble learning has been incorporated into the application of credit scoring [3]. ensemble learning is a machine learning technique in which several machine learning algorithms are trained and combined to generate a final output that is superior to the individual algorithm outputs. ensemble learning strategies are divided into two types: homogeneous and heterogeneous ensembles. in the heterogeneous ensemble technique, each base learner is built in a different way using various machine learning techniques.
the final forecast and the same dataset are generated by statistically combining each individual base learner prediction. each base learner is used on different subsets of the entire training dataset in homogeneous ensemble techniques. to satisfy requirements and achieve a good ensemble, two necessary and critical conditions must be met: diversity and accuracy [4]. this research aims to answer three questions, first how well ensemble methods work on credit default predictions? second how do they compare to nn and other traditional algorithms when used on skewed datasets? third how does balancing the dataset affect the relative performance gain in ensemble methods? the ensemble techniques used in this research are bagging, boosting (adaboosting and xgboosting), voting, and random forests (rf). 1.1. related work advances in technology and the availability of big data have helped researcher improve results on machine learning in credit scoring, default prediction, and risk evaluation. since the purpose of credit management is to improve the business performance and decrease the associated risk, rules must be established to make credit decisions. hence, clustering algorithm is widely used in the credit default detection systems in the early stage. for instance, william and huang combined the k-means clustering method with the supervision method for insurance risk identification [5]. researchers in saia et al. [6] performed credit scoring to detect defaults using the wavelet transform combined with three metrics three different datasets were used in their experimentation the authors compared their results with rf and improved on rf; however, state of the art results is achieved using neural networks and to get a better perspective neural networks approach needed to be included. the work in saia and carta [7] transformed the canonical time domain representation to the frequency domain, by comparing differences of magnitudes after fourier transform conversion of time-series data. the authors in ceronmani sharmila et al. [8] applied an outlier-based score for each transaction, together with an isolation forest classifier to improve default detection. authors of zhang et al. [9] used data preprocessing and a rf optimized through a grid search step, the feature selection step while preparing the data helped to improve the accuracy of rf. in zhu et al. [10], deep learning was utilized for the 1st time by applying convolutional neural networks (cnn) approach through the transformation of features to gray scale images, their r-cnn model improved on the area under curve (auc) of rf and logistic regression (lr) by around 10%. a thorough analysis of different neural networks, such as multilayer perceptron and cnns for credit defaulting can be found in neagoe et al. [11]. ensemble learning techniques have previously been applied in different credit-related topics for example [12] used rf and majority voting to classify transactions by european cardholders in september 2013 [13], used majoring voting by combining support vector machine (svms) and lr, to validate a feature selection approach, called group penalty function the research mainly focuses on robustness of the models. wang et al. [14] used bagging and boosting for credit scoring, ghodselahi [15] used a hybrid svm ensemble for binary classification of credit default predictions. the work in zhang et al. 
[16] ensembles five classifiers (lr, svms, neural network, gradient boosting decision tree, and rf) using a genetic algorithm and fuzzy assignment. in feng et al. [17], a set of classifiers is joined in an ensemble according to their soft probabilities. in tripathi et al. [18], an ensemble is used with a feature selection step based on feature clustering, and the final result is a weighted voting approach.
1.2. overview of ensemble learning
ensemble methods seek to enhance model predictability by integrating several models to create one stable model. by training several models to train a meta-estimator, ensemble learning aims to enhance predictive efficiency. base estimators, or base learners, are the component models of an ensemble. the strategies of the ensemble exploit the influence of "the wisdom of crowds," which is focused on the idea that a community's collective judgment is more powerful than that of any person in the group. ensemble techniques are widely used in various fields of application, including economic and business analytics, medicine and health insurance, information security, education, industrial production, predictive analytics, entertainment, and many more. many machine-learning algorithms deal with a tradeoff of fit versus uncertainty (also known as bias-variance), which affects their ability to generalize potential knowledge accurately. to solve this tradeoff, ensemble approaches use multiple models. two essential components are required for an effective ensemble: (1) ensemble diversity and (2) model aggregation for the final predictions [19], [20].
1.3. bagging
bagging, the short form for bootstrap aggregation, is primarily used in classification and regression. by utilizing decision trees, it improves the precision of models and, to a large degree, decreases uncertainty. the reduction of variance increases accuracy, hence reducing overfitting, which is a challenge to many predictive models [19]. diversity in bagging is acquired using bootstrapped replicas of the training data: different training data subsets are randomly drawn, with replacement, from the entire training data. each training data subset is used to train a different base learner of the same type. the combination strategy of the base learners for bagging is the majority vote. simple as it is, when combined with the base learner generation strategies, this strategy can decrease variance. bagging is particularly attractive when the available data are limited in size. relatively large portions of the samples (75–100%) are drawn into each subset to ensure that there are sufficient training samples in each subset. this causes a significant overlap of individual training subsets, with many of the same instances appearing in most subsets, and some instances appearing in a given subset multiple times. a relatively unstable base learner is used to ensure diversity under this scenario, so that sufficiently different decision boundaries can be obtained for small disturbances in different training datasets [21].
1.4. boosting and rf
boosting is a form of machine learning as well. whereas bagging and rf use autonomous learning, boosting uses sequential learning. in the boosting method, the simple concept is to improve the precision of a poor classification method by integrating multiple instances into a more reliable estimation [22].
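to make the distinction between these homogeneous ensemble families concrete, the sketch below (not from the paper; scikit-learn is assumed, and all hyper-parameter values are illustrative guesses rather than the authors' settings) shows how bagging, boosting, and a random forest of the kinds described in sections 1.3 and 1.4 can be instantiated.

```python
# minimal sketch, assuming scikit-learn; hyper-parameters are illustrative only
from sklearn.ensemble import (BaggingClassifier, AdaBoostClassifier,
                              RandomForestClassifier)
from sklearn.tree import DecisionTreeClassifier

# bagging: bootstrapped subsets, majority vote over relatively unstable base learners
bagging = BaggingClassifier(
    DecisionTreeClassifier(),   # unstable base learner, as discussed in section 1.3
    n_estimators=100,
    max_samples=0.8,            # ~75-100% of the training data drawn per subset
    bootstrap=True)

# boosting: base learners are trained sequentially, each refining the previous estimate
boosting = AdaBoostClassifier(n_estimators=100, learning_rate=0.5)

# random forest: bagging of trees plus random sub-sampling of variables at each split
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt")
```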
rf is a decision tree-based ensemble learning algorithm. it is simple to implement and can be used for both regression and classification tasks. the bootstrap method is used by rf to collect samples from the original data. every tree assigns a classification, and the forest selects the classification that receives the most votes among all trees. the degree of randomness is determined by the parameter m, which is the number of decision trees. the borrower is presumed to have d attributes in the rf [23]. in rf, the effects of the classifications produced from multiple training datasets are organized and combined to improve the accuracy of the prediction. however, while bagging uses all input variables to build each decision tree, rf uses subsets of randomly sampled variables to create each decision tree. this means that the randomness of the forest makes it better adapted to high-dimensional data processing than bagging [24].
1.5. stacking
stacking, another ensemble tactic, is also known as stacked generalization. this approach works by training a combiner algorithm to put together the predictions of several other related learning algorithms. stacking has been widely applied to regression, density estimation, distance learning, and classification. it may also be used during bagging to calculate the error rate involved [25].
2. materials and methodology
2.1. the dataset
the dataset contains information about 30,000 consumers; for each consumer, 23 attributes marked x1 to x23 [table 1] are stored. the dependent variable represents whether a customer has defaulted (1) or repaid (0). all the clients' data were recorded in september 2005 in taiwan. as with all types of risk assessment datasets, the ratio of positive to negative samples causes a major imbalance in the dataset; in this dataset, only 22% of the clients have defaulted. there are no missing values in the dataset; however, there are 35 duplicated rows, and these have been removed (a short loading and balancing sketch is shown after the attribute list below).
• x1: amount of the given credit (nt dollar): it includes both the individual consumer credit and his/her family (supplementary) credit
• x2: gender (1 = male; 2 = female)
• x3: education (1 = graduate school; 2 = university; 3 = high school; 4 = others)
• x4: marital status (1 = married; 2 = single; 3 = others)
• x5: age (year)
• x6–x11: history of past payment. we tracked the past monthly payment records (from april to september 2005), as follows: x6 = the repayment status in september 2005; x7 = the repayment status in august 2005; …; x11 = the repayment status in april 2005. the measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for 1 month; 2 = payment delay for 2 months; …; 8 = payment delay for 8 months; 9 = payment delay for 9 months and above
• x12–x17: amount of bill statement (nt dollar). x12 = amount of bill statement in september 2005; x13 = amount of bill statement in august 2005; …; x17 = amount of bill statement in april 2005
• x18–x23: amount of previous payment (nt dollar). x18 = amount paid in september 2005; x19 = amount paid in august 2005; …; x23 = amount paid in april 2005
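the following sketch is not the authors' code; it is a minimal illustration of the data preparation described above: loading the uci "default of credit card clients" file, dropping the duplicated rows, checking the class ratio, and down-sampling the majority class as done later in section 3.2. the file name and column labels are assumptions about a local copy of the dataset.

```python
# minimal data-preparation sketch; file name and column names are assumptions
import pandas as pd

df = pd.read_excel("default_of_credit_card_clients.xls", header=1)  # assumed local copy
df = df.drop_duplicates()                       # the paper reports 35 duplicated rows

y = df["default payment next month"]            # 1 = defaulted, 0 = repaid
X = df.drop(columns=["ID", "default payment next month"])
print(y.value_counts(normalize=True))            # roughly 22% positive samples

# down-sample the majority (non-default) class to the size of the minority class
minority = df[y == 1]
majority = df[y == 0].sample(n=len(minority), random_state=0)
balanced = pd.concat([minority, majority]).sample(frac=1, random_state=0)  # shuffle
```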
2.2. evaluation metrics
the dataset used in this research is imbalanced; if this is not handled, then accuracy will not provide a meaningful result, because even if the model only predicts the output to be 0 it will still get 78% accuracy regardless of the dependent features. it can be presumed that those responsible for issuing these credit cards believed that each cardholder would not default, otherwise the card would not have been issued in the first place; thus, we can conclude that the human-level accuracy for this dataset is approximately 78%, and this is an example of where machine learning performs better than humans. it should be noted that misclassifying a positive example as negative has a higher cost and causes more damage than predicting a negative class to be positive. this means that the model with better performance on the positive cases should be preferred. some of the common metrics for classification include accuracy, precision, recall, the receiver operating characteristic (roc), and the auc [6]. all these common metrics will be presented for each model. in the context of credit card default, recall means: out of all defaulters, how many did the model get correct, while precision measures the correctness of the model based on its predictions. the f1 score is the harmonic mean of recall and precision. in this research, all the common metrics will be presented; however, for the assessment of ensemble methods we will focus on the f1 score.

accuracy = \frac{tp + tn}{tp + fp + tn + fn}

precision = \frac{tp}{tp + fp}

recall = \frac{tp}{tp + fn}

f1 = \frac{2 \times precision \times recall}{precision + recall}

2.3. methodology
we used the steps shown in fig. 1 for each of the ensemble methods mentioned in section 1. in addition, lr and decision trees were also used; however, because their results were outperformed by neural networks, we opted not to include them in the results section and decided to use the neural network (nn) as a benchmark for performance comparison. to measure the effects of imbalance on the data, all algorithms have also been tested after down-sampling of the dataset; their results are included in subsequent sections (a compact code sketch of this pipeline follows fig. 1).
fig. 1. proposed method.
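the sketch below is an illustration under stated assumptions (xgboost installed, X and y taken from the loading sketch above, illustrative hyper-parameters) rather than the authors' implementation: it scales the data with a min-max scaler, fits the compared ensembles plus an nn benchmark, and reports accuracy, precision, recall, f1, and roc auc. running the same loop on the down-sampled `balanced` frame reproduces the second trial.

```python
# minimal training/evaluation sketch; hyper-parameters are illustrative guesses
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import (BaggingClassifier, AdaBoostClassifier,
                              VotingClassifier, StackingClassifier)
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)
from xgboost import XGBClassifier

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

scaler = MinMaxScaler()                       # data scaled with a min-max scaler
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

base = [("bag", BaggingClassifier(n_estimators=100)),
        ("ada", AdaBoostClassifier(n_estimators=100)),
        ("xgb", XGBClassifier(n_estimators=200, eval_metric="logloss"))]

models = dict(base)
models["voting"] = VotingClassifier(estimators=base, voting="soft")
models["stacking"] = StackingClassifier(estimators=base,
                                        final_estimator=LogisticRegression())
models["nn"] = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500)

for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    proba = model.predict_proba(X_test)[:, 1]
    print(name,
          round(accuracy_score(y_test, pred), 4),
          round(precision_score(y_test, pred), 4),
          round(recall_score(y_test, pred), 4),
          round(f1_score(y_test, pred), 4),
          round(roc_auc_score(y_test, proba), 4))
```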
3. results and discussion
the results of the ensemble learning were recorded in two separate trials, first with the original imbalanced dataset and second after the imbalance aspect was eliminated. to compare the performance of ensemble methods against regular prediction methods, a variety of other methods were also tested, such as knn, lr, decision trees, and neural networks. neural networks performed the best, as also confirmed in cheng yeh and lien [1] and hand and henley [2]. therefore, for comparison purposes, the results of the artificial neural network are also presented alongside the ensemble methods for both cases. fig. 2 shows the structure of the voting ensemble used in this study; additionally, for the stacking ensemble, the same algorithms were used in the first level and lr was later applied as the final estimator. in both cases, the data were scaled using a min-max scaler.
fig. 2. voting ensemble used in this research.
3.1. the imbalanced dataset
when the ratio of positive samples to negative samples is approximately 82%, accuracy cannot be used as a reliable measure: as shown in table 1, all models achieve an accuracy of around 80%, which is equivalent to predicting "not default" for everyone and is consequently the same as the human-level error. a better metric is the f1 score, which is the harmonic mean of recall and precision (fig. 3). stacking produced the best result, 47.56, marginally better than the 47.41 of neural networks. in terms of area under the roc curve, stacking and xgboosting produced the best results.
table 1: results of the imbalanced dataset
ensemble methods | accuracy | precision | recall | f1
neural network | 82.01 | 66.85 | 36.73 | 47.41
bagging | 79.43 | 55.45 | 34.62 | 42.62
ada boost | 81.83 | 68.06 | 33.33 | 44.75
xgboosting | 82.11 | 68.16 | 35.6 | 46.77
voting ensemble | 81.88 | 68.32 | 33.41 | 44.87
stacking | 81.86 | 65.73 | 37.26 | 47.56
fig. 3. receiver operating characteristic curve for the imbalanced dataset.
3.2. the balanced dataset
for balancing the dataset, down-sampling was used: since there are 6630 positive samples, the same number of negative samples was kept and the rest was discarded. the samples were randomly shuffled before feeding them to the ann and the ensemble methods. since the dataset is now balanced, accuracy can also be taken into account; as shown in table 2, xgboosting slightly outperforms all the others in all the metrics. xgboosting is also the fastest in terms of time consumption. fig. 4 shows the roc curves for the balanced dataset, in which xgboosting produced the best result.
table 2: results of the balanced dataset
ensemble methods | accuracy | precision | recall | f1
neural network | 68.53 | 71.9 | 59.86 | 65.33
bagging | 64.84 | 65.79 | 60.49 | 63.02
ada boost | 68.49 | 71.47 | 60.56 | 65.57
xgboosting | 68.76 | 71.42 | 61.58 | 66.13
voting ensemble | 67.75 | 72.57 | 56.1 | 63.28
stacking | 68.22 | 72.58 | 57.59 | 64.22
fig. 4. receiver operating characteristic curve for the balanced dataset.
4. conclusion
credit default prediction using ml algorithms has a crucial role in many financial situations, including personal loans, insurance policies, etc. however, establishing a model that improves on previous rule-based predictions is weakened by the data imbalance problem in datasets, where the number of unreliable cases is much smaller than the number of reliable cases. in this paper, we examined different ensemble methods for credit card default prediction on an imbalanced dataset and compared the results with neural networks. most research in the literature has focused on either a balanced dataset or a skewed one; however, we have included both scenarios to provide a better perspective on the performance of each algorithm used. we tested the results first without altering the imbalance aspect of the dataset, in which case we used auc as a metric and ignored accuracy, and later by down-sampling the majority class. our experiments show that xgboosting performs better in both cases as compared to the other ensemble methods and also better than neural networks.
references
[1] i. cheng yeh and c. h. lien. "the comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients". expert systems with applications, vol. 36, no. 2, pp. 2473-2480, 2009.
[2] d. j. hand and w. e. henley. "statistical classification methods in consumer credit scoring: a review". journal of the royal statistical society, vol. 160, no. 3, pp. 523-541, 1997.
[3] y. li and w. chen. "a comparative performance assessment of ensemble learning for credit scoring". mathematics, vol. 8, no. 10, p. 1756, 2020.
[4] m. akour, i. alsmadi and i. alazzam. "software fault proneness prediction: a comparative study between bagging, boosting, and stacking ensemble and base learner methods". international journal of data analysis techniques and strategies, vol. 9, no. 1, pp. 1-16, 2017.
[5] g. williams and z. huang. "mining the knowledge mine: the hot spots methodology for mining large real world databases". in: proceedings of the 10th australian joint conference on artificial intelligence, perth, australia, 1997.
[6] r.
saia, s. carta and g. fenu. “a wavelet-based data analysis to credit scoring”. in: icdsp 2018: proceedings of the 2nd international conference on digital signal processing, acm, 2018, pp. 176180, 2018. [7] r. saia and s. carta. “a fourier spectral pattern analysis to design credit scoring models”. in: proceedings of the 1st international conference on internet of things and machine learning, acm, p. 18, 2017. [8] v. ceronmani sharmila, k. k. r., s. r., s. d. and h. r. “credit card fraud detection using anomaly techniques”. in: 2019 1st international conference on innovations in information and communication technology (iciict), pp. 1-6, 2019. [9] x. zhang, y. yang and z. zhou. “a novel credit scoring model based on optimized random forest”. in: 2018 ieee 8th annual computing and communication workshop and conference (ccwc), pp. 6065, 2018. [10] b. zhu, w. yang, h. wang and y. yuan. “a hybrid deep learning model for consumer credit scoring”. in: 2018 international conference on artificial intelligence and big data (icaibd), pp. 205-208, 2018. [11] v. neagoe, a. ciotec and g. cucu. “deep convolutional neural networks versus multilayer perceptron for financial prediction”. in: 2018 international conference on communications (comm), pp. 201-206, 2018. [12] i. sohony, r. pratap and u. nambiar. “ensemble learning for credit card fraud detection”. in: proceedings of the acm india joint international conference on data science and management of data, 2018. [13] j. lpez and s. maldonado. “profit-based credit scoring based on robust optimization and feature selection”. information sciences, vol. 500, pp. 190-202, 2019. [14] g. wang, j. hao, j. ma and h. jiang. “a comparative assessment of ensemble learning for credit scoring”. expert systems with applications, vol. 38, no. 1, pp. 223-230, 2011. [15] a. ghodselahi. “a hybrid support vector machine ensemble model for credit scoring”. international journal of computer applications, vol. 17, no. 5, pp. 1-5, 2011. [16] h. zhang, h. he and w. zhang. “classifier selection and clustering with fuzzy assignment in ensemble model for credit scoring”. neurocomputing, vol. 316, pp. 210-221, 2018. [17] x. feng, z. xiao, b. zhong, j. qiu and y. dong. “dynamic ensemble classification for credit scoring using soft probability”. applied soft computing, vol. 65, pp. 139-151, 2018. [18] d. tripathi, d. r. edla, v. kuppili, a. bablani and r. dharavath. “credit scoring model based on weighted voting and cluster based feature selection”. procedia computer science, vol. 132, pp. 2231, 2018. [19] p. bühlmann. “bagging, boosting and ensemble methods”. in: j. gentle, w. härdle and y. mori, (eds.), handbook of computational statistics. springer handbooks of computational statistics. springer, berlin, heidelberg, 2012. [20] g. kunapuli. “ensemble methods for machine learning”. meap publication, shelter island, new work, 2020. [21] s. hamori, m. kawai, t. kume, y. murakami and c. watanabe. “ensemble learning or deep learning? application to default risk analysis”. journal of risk and financial management, vol. 11, p. 12, 2018. [22] r. e. schapire and y. freund. “boosting: foundations and algorithms”. kybernetes, vol. 42, no. 1, pp. 164-166, 2013. [23] b. niu, j. ren and x. li. “credit scoring using machine learning by combing social network information: evidence from peer-to-peer lending”. information, vol. 10, p. 397, 2019. [24] a. mayr, h. binder, o. gefeller and m. schmid. “the evolution of boosting algorithms. from machine learning to statistical modeling”. 
methods of information in medicine, vol. 53, no. 6, pp. 419-427, 2014.
[25] r. sikora and o. h. al-laymoun. "a modified stacking ensemble machine learning algorithm using genetic algorithms". journal of international technology and information management, vol. 23, p. 1, 2014.
an efficient two-layer based technique for content-based image retrieval
fawzi abdul azeez salih1, alan anwer abdulla2,3*
1department of computer science, college of science, university of sulaimani, sulaimani, iraq, 2department of information technology, college of commerce, university of sulaimani, sulaimani, iraq, 3department of information technology, university college of goizha, sulaimani, iraq
a b s t r a c t
the rapid advancement and exponential evolution of multimedia applications have raised research attention on content-based image retrieval (cbir). the technique has a significant role in searching for and finding images similar to a query image by extracting visual features. in this paper, an approach with two layers of search has been developed, known as two-layer based cbir. the first layer is concerned with comparing the query image to all images in the dataset based on extracting local features using the bag of features (bof) mechanism, which leads to retrieving a certain number of images most similar to the query image. in other words, the first step aims to eliminate the images most dissimilar to the query image, to reduce the range of search in the dataset of images. in the second layer, the query image is compared to the images obtained in the first layer based on extracting (texture and color)-based features. the discrete wavelet transform (dwt) and local binary pattern (lbp) were used as texture features, while for the color features three different color spaces were used, namely rgb, hsv, and ycbcr. the color spaces are utilized by calculating the mean and entropy for each channel separately. corel-1k was used for evaluating the proposed approach.
the experimental results prove the superior performance of the proposed two-layer concept over the current state-of-the-art techniques in terms of precision rate, achieving 82.15% and 77.27% for the top-10 and top-20, respectively.
index terms: cbir, feature extraction, color descriptor, dwt, lbp
corresponding author's e-mail: alan anwer abdulla, department of information technology, college of commerce, university of sulaimani, sulaimani, iraq; department of information technology, university college of goizha, sulaimani, iraq. e-mail: alan.abdulla@univsul.edu.iq
received: 07-01-2021 accepted: 30-03-2021 published: 05-04-2021
doi: 10.21928/uhdjst.v5n1y2021.pp28-40 e-issn: 2521-4217 p-issn: 2521-4209
copyright © 2021 fawzi. this is an open access article distributed under the creative commons attribution non-commercial no derivatives license 4.0 (cc by-nc-nd 4.0)
1. introduction
beside content-based image retrieval (cbir), digital image processing plays a vital role in numerous areas such as processing and analyzing medical images [1], image inpainting [2], pattern recognition [3], biometrics [4], multimedia security [5], and information hiding [6]. in the area of image processing and computer vision, cbir has increasingly grown into an advanced research topic. cbir refers to a system which retrieves images similar to a query image from a dataset of images without any help from captions and/or descriptions of the images [7]. there are two main mechanisms for image retrieval, which are text-based image retrieval (tbir) and cbir [8]. tbir was first introduced in 1970 to search for and retrieve images from an image dataset [9]. in such an image retrieval mechanism, the images are denoted by text and then the text is used to retrieve or search for the images. the tbir method depends on manual text search or keyword matching of the existing image keywords, and the result relies on the human labeling of the images. the tbir approach requires information such as image keywords, image location, image tags, image name, and other information related to the image. human involvement is needed in the challenging process of entering the information of the images into the dataset. the drawbacks of tbir are as follows: 1) it leads to inaccurate results if humans have done the dataset annotation process incorrectly, 2) a single keyword of image information is not effective to convey the overall image description, and 3) it is based on manual annotation of the images, which is time consuming [10]. researchers introduced cbir as a new mechanism for image retrieval to overcome the above-mentioned limitations of tbir. it is considered a popular technique to retrieve, search, and browse images related to query information from a broad dataset of images. in cbir, image information, visual features such as low-level features (color, texture, and/or shape), or bag of features (bof) are extracted from the images to find the most similar images in the dataset [11]. fig. 1 illustrates the general block diagram of the cbir mechanism [12]. fig. 1 shows the block diagram of a basic cbir system that involves two phases: feature extraction and feature matching. the first phase involves extracting the image features while the second phase involves matching these features [13]. feature extraction is the process of extracting features from the dataset of images, storing them in feature vectors, and also extracting features from the query image. on the other hand, feature matching is the process of comparing the features extracted from the query image to the features extracted from the images in the dataset using a similarity distance measurement. the corresponding image in the dataset is considered a match/similar image to the query image if the distance between the feature vector of the query image and that of the image in the dataset is small enough. thus, the matched images are then ranked based on the similarity index from the smallest distance value to the largest one. eventually, the retrieved images are selected according to the lowest distance values. the essential objective of cbir systems is improving the efficiency of the system by increasing the performance using a combination of features [9]. image features can be categorized into two types: global features and local features. global features extract information from the entire image, while local features work locally and are focused on the key points in images [14]. for a large image dataset, the images relevant to the query image are very few. therefore, the elimination of irrelevant images is important.
the main contribution of this research is first eliminating the irrelevant images in the dataset and then finds the most similar/matches images from the rest of the remained images. the reminder of the paper is organized as follows: section 2 discusses the related work. section 3 introduces a background about the techniques used in the proposed approach. section 4 presents the proposed cbir approach. section 5 shows the experimental results. finally, section 6 gives the conclusions. 2. related work studies related to the developed techniques of cbir have been researched a lot and they mainly focused on analyzing and investigating the interest points/areas such as corners, edges, contours, maxima shapes, ridges, and global features [15]. some of those developed approaches are concerned on combining/ fusing certain types of the extracted features, since such kind of strategy has an impact on describing the image content efficiently [13], [16]. this section reviews the most important and relevant existing works on cbir. the main competition in this research area is increasing the precision rate that refers to the efficiency of retrieving the most similar images correctly. kato et al. were first to investigate this field of study in 1992, who developed a technique for sketch retrieval, similarity retrieval, and sense retrieval to support visual interaction [17]. sketch retrieval accepts the image data of sketches, similarity retrieval evaluates the similarity based on the personal view of each user, and sense retrieval evaluates based on the text data and the image data at content level based on the personal view. yu et al., in 2013, proposed an effective image retrieval system based on the (bof) model, depending on two ways of integrating [18]. scale-invariant feature transform (sift) and local binary pattern (lbp) descriptors were integrated in one hand, and histogram of oriented gradients (hog) and lbp descriptors were integrated on the other hand. the first proposed integration, namely sift-lbp, provides better precision rate in which reached 65% for top-20 using jaccard similarity measurement. fig. 1. general block diagram of cbir mechanism. salih and abdulla: an efficient two-layer based technique for cbir 30 uhd journal of science and technology | jan 2021 | vol 5 | issue 1 shrivastava et al., in 2014, introduced a new scheme for cbir based on region of interest roi codes and an effective feature set consisting of a dominant color and lbp were extracted from both query image and dataset of images. this technique achieved, using euclidean distance measurement, a precision rate of 76.9% for top-20 [19]. dwt as a global feature, and gray level co-occurrence matrix (glcm), as local feature, was extracted and fused in the algorithm introduced by gupta et al., in 2015, and as a result, a precision rate of 72.1 % was obtained for top-20 using euclidean distance [20]. another technique was introduced by navabi et al., in 2017, for cbir which based on extracting color and texture features. the technique used color histogram and color moment as color feature. the principal component analysis (pca) statistical method was applied for the dimension’s reduction. finally, minkowski distance measurement was used to find most similar images. as reported, this technique achieved the precision rate of 62.4% for top-20 [21]. nazir et al., in 2018, proposed a new cbir technique by fusing the extracted color and texture features [22]. 
color histogram (ch) was used to extract a color information, and dwt as well as edge histogram descriptor (edh) were used to extract texture features. as authors claimed, this technique achieved a precision rate of 73.5% for top-20 using manhattan distance measurement. pradhan et al., in 2019, developed a new cbir scheme based on multi-level colored directional motif histogram (mlcdmh) [23]. this scheme extracts local structural features at three different levels. the image retrieval performance of this proposed scheme has been evaluated using different corel/natural, object, texture, and heterogeneous image datasets. for the corel-1k, the precision rate of 64% and 59% was obtained for top-10 and top-20, respectively. recently, sadique et al., in 2019, developed a new cbir technique by extracting global and local features [7]. a combination of speeded up robust features (surf) descriptor with color moments, as local feature, and modified glcm, as global feature, leads this technique to obtain 70.48% of the precision rate for top-20 using manhattan similarity measurement. continuously, in 2019, khawaja et al. proposed another technique for cbir using object and color features [24]. authors claimed that this technique outperformed in certain categories of the benchmark datasets caltech-101 and corel-1000, and it gained 76.5% of the precision rate for top-20 using euclidean distance. different from the previous techniques discussed above, qazanfari et al., in 2019, investigated hsv color space for developing cbir technique [25]. as reported in this work, the human visual system is very sensitive to the color as well as edge orientation, and also color histogram and color difference histogram (cdh) are two kinds of lowlevel feature extraction which are meaningful representatives of the image color and edge orientation information. this proposed technique used canberra distance measurement to measure the similarity between the extracted feature of the both query image and images in the dataset. this technique achieved 74.77% of the precision rate for the top-20 using euclidean distance similarity measurement. rashno et al., in 2019, developed an algorithm in which hsv, rgb, and norm of low frequency components were used to extract color features, and dwt was used to extract texture features [26]. accordingly, ant colony optimization (aco) feature selection technique was used to select the most relevant features. eventually, euclidian distance measurement was used to measure the similarity between query and images in the dataset. the results reported in this work showed that this approach reached the precision rate of 60.79% using euclidean distance for the top-20. finally, aiswarya et al., in 2020, proposed a cbir technique which uses a multi-level stacked autoencoders for feature selection and dimensionality reduction [27]. a query image space is created first before the actual retrieval process by combining the query image as well as similar images from the local image dataset (images in device gallery) to maintain the image saliency in the visual contents. the features corresponding to the query image space elements are searched against the characteristics of images in a global dataset. this technique achieved the precision rate of 67% for top-10. 3. 
background this section aims to provide detailed background information about important techniques, used in the proposed approach presented in this paper, such as surf feature descriptor, color-based features, texture-based features, and feature matching techniques. 3.1. surf feature descriptor there are many feature descriptors available and surf is one of the most common and significant feature descriptors in which can be considered as a local feature. in comparison with global features such as color, texture, and shape; local features can provide more detailed characters in an image. the rotation and scale invariant descriptor can perform better in terms of distinctiveness, repeatability, and robustness [12]. surf is used in many applications such as bag of feature (bof) which is used and success in image analysis and classification [28]. in the bof technique, the surf descriptor is sometimes used first to extract local features. then k-means clustering is used to initialize m center point to create m visual words. the k-means clustering algorithm takes the feature space as input and reduces it to the m cluster as output. then, the image is represented as a code word histogram by mapping the local features into salih and abdulla: an efficient two-layer based technique for cbir uhd journal of science and technology | jan 2021 | vol 5 | issue 1 31 a vocabulary [28]. fig. 2 illustrates the methodology of the image representation based on the bof model. surf features are extracted from database images, then the k-means clustering algorithm takes feature space as input and reduces it into clusters as output. the center of each cluster is called a visual word and the combination of visual words formulates the dictionary, which is also known as codebook or vocabulary. finally, using these visual words of the dictionary, the histogram is constructed using visual words of each image. the histogram of v visual words is formed from each image. after that, resultant information in the form of histograms is added to the inverted index of the bof model [25]. 3.2. texture-based features extraction texture-based features can be considered as a powerful lowlevel feature for image search and retrieval applications. there are many works have been developed on texture analysis, classification, and segmentation for the last four decades. yet, there is no unique definition for the texture-based features. texture is an attribute representing the spatial arrangement of the gray levels of the pixels in a region or image. in other words, texture-based features can be used to separate and extract prominent regions of interest in an image and apply to the visual patterns that have properties of homogeneity independent of a single color or intensity [9]. texture analysis methods can be categorized into statistical, structural, and spectral [15]. dwt and lbp are the two methods of texture feature extraction used in this work. 3.2.1. discrete wavelet transform (dwt) the dwt is considered to be an efficient multiresolution technique and it is easy to compute [29]. the signal for each level decomposed into four frequency sub-bands which are: low of low (ll), low of high (lh), high of low (hl), and high of high (hh) [30]. dwt is used to change an image from the spatial domain into the frequency domain, the structure of the dwt is illustrated in fig. 3 [22], [26]. wavelet transform could be applied to images as 2-dimensional signals. 
to refract an image into k levels, first the transform is applied on all rows up to level k while the columns of the image are kept unchanged. then, the same task is applied on the columns while keeping the rows unchanged. in this manner, the frequency components of the image are obtained up to level k. these frequency components at various levels let us better analyze the original image or signal [26]. for more details about dwt, see [31].
fig. 2. methodology of the bof technique for representing an image in cbir.
fig. 3. dwt sub-bands.
3.2.2. local binary pattern (lbp)
the concept of lbp was originally proposed by ojala et al. in 1996 [29], [32]. lbp can be considered as a texture analysis approach unifying structural and statistical models. the characteristic of lbp is that the lbp operator is invariant to monotonic gray-level changes [33]. in the process of lbp calculation, firstly a 3 × 3 grid of the image is selected, and then the lbp code of the center pixel is computed from the intensity values of its neighboring pixels based on the following equations [34]:

lbp = \sum_{k=1}^{n} f(i_k - i_c) \times 2^{k-1}   (1)

f(a - b) = \begin{cases} 1, & a - b \ge 0 \\ 0, & \text{otherwise} \end{cases}   (2)

where n is the number of neighboring pixels around the center pixel, i_k is the intensity value of the k-th neighboring pixel, and i_c is the intensity value of the center pixel. an example of lbp is presented in fig. 4 [34].
fig. 4. an example of lbp operator.
fig. 4 shows the lbp spectrum of the lena image with different circular domain radii and sampling points. correspondingly, the fineness of the texture information in the obtained lbp spectrum is different. taking the lena image as an example, with the increase of the sampling radius, the gray scale statistical value of the lbp map becomes sparser [34]. in the proposed approach presented in this paper, after lbp is applied on the lh and hl sub-bands of the dwt, 512 features are extracted to represent the image.
3.3. color-based features extraction
color is considered a basic feature observed when viewing an image, revealing a variety of information [12]. color is a widely used feature for image retrieval techniques [35], [36]. color points create a color space, and various color spaces based on perceptual concepts are used for color illustration [23]. among all color spaces, ycbcr and hsv have the mentioned perceptual characteristic. in ycbcr, the y represents the luminance while the color is represented by cb and cr [37]. in our proposed approach, the mean and entropy of each component of the color spaces rgb, hsv, and ycbcr have been calculated as color features; in total, 18 color-based features are extracted.
3.4. feature matching
there is a variety of similarity measurements used to determine the similarity between the query image and the images in the dataset [9]. manhattan distance is used as the similarity measurement for both layers of the proposed approach in this work, equation (3) [36]:

md(x, y) = \sum_{i=1}^{k} |x_i - y_i|   (3)

where x is the feature vector of the query image, y is the feature vector of an image in the dataset, and k is the dimension of the image feature. manhattan distance is also known as city block distance. in general, the manhattan distance is non-negative, where zero defines an identical point and other values mean little similarity [9].
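as an illustration of how the second-layer features and the manhattan matching described in sections 3.2–3.4 fit together, the sketch below is a minimal reading of the text, not the authors' implementation; the wavelet family, lbp parameters, and library choices are assumptions. it builds the 530-dimensional vector from lbp histograms on the lh and hl sub-bands plus the 18 color statistics.

```python
# minimal feature-extraction sketch; "haar" wavelet and lbp (P=8, R=1) are assumptions
import numpy as np
import pywt
from skimage import color
from skimage.feature import local_binary_pattern

def channel_entropy(c):
    hist, _ = np.histogram(c, bins=256, range=(0, 256), density=True)
    hist = hist[hist > 0]
    return float(-np.sum(hist * np.log2(hist)))

def extract_features(rgb):                       # rgb: uint8 array of shape (h, w, 3)
    gray = color.rgb2gray(rgb)
    ll, (lh, hl, hh) = pywt.dwt2(gray, "haar")   # one-level dwt; ll and hh are ignored

    texture = []
    for band in (lh, hl):                        # 256 lbp-histogram features per sub-band
        codes = local_binary_pattern(band, P=8, R=1, method="default")
        hist, _ = np.histogram(codes, bins=256, range=(0, 256))
        texture.extend(hist / max(hist.sum(), 1))

    col = []
    for space in (rgb, color.rgb2hsv(rgb) * 255, color.rgb2ycbcr(rgb)):
        for ch in range(3):                      # mean + entropy per channel (18 in total)
            c = space[..., ch].astype(np.float64)
            col.extend([c.mean(),
                        channel_entropy(np.clip(c, 0, 255).astype(np.uint8))])

    return np.asarray(texture + col)             # 512 + 18 = 530 features

def manhattan(x, y):                             # equation (3): city block distance
    return float(np.abs(x - y).sum())
```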
3.4.1. proposed approach
this section describes the details of the proposed two-layer approach in the following steps:
1. let the query image be denoted by q, and let i = {i1, i2, …, in} refer to the dataset, which consists of n images.
2. the first layer of the proposed approach involves the following steps:
a. qbof and ibof represent the feature vectors of q and i, respectively, after the bof technique is applied to them.
b. to find the similarity between qbof and ibof, the manhattan similarity measurement is used, and as a result, the m images most similar to the query image are retrieved.
3. the second layer of the proposed approach, which includes the following steps, is applied to the query image q as well as the m most similar images mi obtained in the first layer:
a. extracting the following features from q and mi:
• let l = {l1, l2, …, l512} be the vector of 512 extracted texture-based features after lbp is applied on the lh and hl sub-bands of the dwt; 256 features are extracted from each sub-band.
• let c = {c1, c2, …, c18} be the 18 extracted color-based features that represent the mean and entropy of the three components of the rgb, hsv, and ycbcr color spaces; 6 features are extracted from each of the mentioned color spaces.
• let f = l + c represent the fused feature vector of all 530 features extracted in the previous steps.
• finally, qf and mfi represent the fused feature vectors of q and mi, respectively.
b. to find the similarity between qf and mfi, the manhattan similarity measurement is used to retrieve the images most similar to the query image.
the block diagram of the proposed two-layer approach is illustrated in fig. 5; a minimal code sketch of the two-layer search and its precision evaluation is given at the end of section 4.2.
4. experimental results
experiments are conducted comprehensively in this section to evaluate the performance of the proposed approach in terms of precision rate, the most common confusion matrix measurement used in the research area of cbir. in addition, the proposed approach is compared to the current existing works.
4.1. dataset
the corel-1k dataset of images has been used, which is a public and well-known dataset that contains 1000 images in the form of 10 categories; each category consists of 100 images with resolution sizes of (256 × 384) or (384 × 256) [37], [38]. the categories are arranged as follows: african people, beaches, buildings, buses, dinosaurs, elephants, flowers, horses, mountains, and foods [37].
4.2. evaluation measurements
to evaluate the performance of the proposed approach, the precision confusion matrix measurement has been used, which determines the ratio of correctly retrieved images to the total number of retrieved images from the tested dataset of images. meanwhile, it measures the specificity of the image retrieval system based on the following equation [38], [39]:

precision = \frac{r_c}{r_t}   (4)

where r_c represents the total number of correctly retrieved images and r_t represents the total number of retrieved images. in this study, top-10 and top-20 have been tested. top-10 indicates that the total number of retrieved images is 10 images, and top-20 indicates that the total number of retrieved images is 20 images.
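the sketch below is a minimal, assumption-laden illustration of the two-layer search of section 3.4.1 and the precision measure of equation (4): orb stands in for surf (surf requires a non-free opencv-contrib build), `extract_features` and the manhattan distance come from the previous sketch, `images` is a list of rgb arrays with a parallel numpy array `labels`, and the vocabulary size (k = 500) and shortlist size m are working guesses rather than the authors' exact settings.

```python
# minimal two-layer retrieval sketch; orb replaces surf, k and m are assumptions
import numpy as np
import cv2
from sklearn.cluster import MiniBatchKMeans

orb = cv2.ORB_create()

def local_descriptors(rgb):
    gray = cv2.cvtColor(rgb, cv2.COLOR_RGB2GRAY)
    _, desc = orb.detectAndCompute(gray, None)
    return desc if desc is not None else np.zeros((1, 32), np.uint8)

def bof_histogram(rgb, kmeans):
    words = kmeans.predict(local_descriptors(rgb).astype(np.float32))
    hist = np.bincount(words, minlength=kmeans.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)

def two_layer_retrieval(query, images, kmeans, feats_530, m=50, top=10):
    # layer 1: bof histograms + manhattan distance -> shortlist of m images
    q_hist = bof_histogram(query, kmeans)
    d1 = [np.abs(q_hist - bof_histogram(img, kmeans)).sum() for img in images]
    shortlist = np.argsort(d1)[:m]
    # layer 2: fused 530-d texture/color features re-rank the shortlist
    q_feat = extract_features(query)
    d2 = [np.abs(q_feat - feats_530[i]).sum() for i in shortlist]
    return shortlist[np.argsort(d2)[:top]]

# visual vocabulary: k-means over all local descriptors (k = 500 visual words)
all_desc = np.vstack([local_descriptors(img) for img in images]).astype(np.float32)
kmeans = MiniBatchKMeans(n_clusters=500, random_state=0).fit(all_desc)
feats_530 = [extract_features(img) for img in images]

retrieved = two_layer_retrieval(images[0], images, kmeans, feats_530, top=10)
precision_at_10 = np.mean(labels[retrieved] == labels[0])   # equation (4), top-10
```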
fig. 5. block diagram of the proposed two-layer cbir approach.

4.3. results
the experiments carried out in this work include two parts: (a) a single-layer cbir model and (b) the two-layer cbir model. the first part evaluates the single-layer model, i.e., the bof technique alone, and, separately, a cbir technique based on extracting texture and color features. in the second part, the proposed two-layer model is assessed. the experiments are detailed in the following steps:

1. the bof-based cbir technique is tested using different numbers of clusters, as the bof technique relies on the k-means clustering algorithm to create the clusters, which are commonly called visual words. the number of clusters cannot be selected automatically; manual selection is needed. to select the proper number of clusters (i.e., the value of k in k-means), different numbers of clusters have been tested to obtain the best precision result of the bof technique. the precision results for different numbers of clusters are illustrated in the following tables. from tables 1 and 2, it is quite obvious that the best result is achieved when k = 500 for both top-10 and top-20.

2. the dwt sub-bands, and the concatenations of the sub-bands, have been tested as texture features, as presented in the following tables. from tables 3 and 4, one can observe that the best result is obtained when the lh and hl sub-bands are concatenated, for both top-10 and top-20.

3. in this step, lbp is implemented on the dwt sub-bands, in particular on the lh and hl sub-bands. in the proposed method, lbp is extracted from the dwt sub-bands to form a local feature descriptor. to achieve this, we perform the dwt decomposition and consider the high-frequency sub-bands hl and lh. these sub-bands contain the edge and contour details of the image, which are significant for extracting pose- and expression-relevant features with the aid of lbp. we ignore the low-frequency ll sub-band and the high-frequency hh sub-band, as the hh sub-band mostly contains noise with negligible feature detail. to preserve the spatial characteristics and to form a robust local feature descriptor, multi-region lbp pattern-based features [4] are obtained from non-overlapping regions of the dwt sub-bands {hl, lh}; they are statistically significant and offer reduced dimensionality with increased robustness to noise. each of the sub-bands {hl, lh} is equally divided into m non-overlapping rectangular regions r0, r1, …, rm, each of size (x, y) pixels. from each of these m regions, we extract lbp local features, each with 256 labels, separately. local features from successive regions are concatenated, and combining the descriptors of the two sub-bands yields one vector with 512 features; the results are given in tables 5 and 6. from tables 5 and 6, one can observe that the best result is obtained when the lbp features of the lh sub-band and the hl sub-band are concatenated.
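step 3 can be sketched as follows, assuming pywt and scikit-image; here a single 256-bin lbp histogram is built from each of the lh and hl sub-bands and the two are concatenated into the 512-dimensional texture descriptor, while the multi-region variant would additionally split each sub-band into non-overlapping regions before building the histograms. the lbp parameters (8 neighbors, radius 1) and the mapping of pywt's detail coefficients to the lh/hl names are assumptions.

```python
# sketch of step 3: a 256-bin lbp histogram from each of the lh and hl sub-bands,
# concatenated into a 512-dimensional texture descriptor. assumes pywt and
# scikit-image; the lbp parameters are illustrative.
import numpy as np
import pywt
from skimage.feature import local_binary_pattern

def lbp_histogram(subband, bins=256):
    codes = local_binary_pattern(subband, P=8, R=1, method="default")   # codes in 0..255
    hist, _ = np.histogram(codes, bins=bins, range=(0, bins))
    return hist / max(hist.sum(), 1)                                    # normalised histogram

def texture_descriptor(gray_image):
    # pywt returns (approx, (horizontal, vertical, diagonal)); named lh/hl/hh here
    # following the paper's convention. ll and hh are discarded as in step 3.
    _, (lh, hl, hh) = pywt.dwt2(gray_image, "haar")
    return np.concatenate([lbp_histogram(lh), lbp_histogram(hl)])       # 512 features
```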
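the bof representation tested in step 1 can be sketched as below; the use of orb local descriptors and the scikit-learn k-means implementation are assumptions made for illustration, since only the number of visual words (k = 500 performed best in tables 1 and 2) is fixed by the experiments.

```python
# sketch of a bag-of-features representation: local descriptors are quantised against
# a k-means vocabulary and each image becomes a k-bin histogram of visual words.
# orb descriptors and scikit-learn k-means are illustrative assumptions.
import cv2
import numpy as np
from sklearn.cluster import KMeans

def local_descriptors(gray_image):
    orb = cv2.ORB_create()
    _, desc = orb.detectAndCompute(gray_image, None)
    return desc if desc is not None else np.empty((0, 32), dtype=np.uint8)

def build_vocabulary(training_images, k=500):
    all_desc = np.vstack([local_descriptors(img) for img in training_images])
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit(all_desc.astype(np.float32))

def bof_histogram(gray_image, vocabulary):
    desc = local_descriptors(gray_image).astype(np.float32)
    if len(desc) == 0:
        return np.zeros(vocabulary.n_clusters)
    words = vocabulary.predict(desc)                     # nearest visual word for each descriptor
    hist = np.bincount(words, minlength=vocabulary.n_clusters).astype(float)
    return hist / max(hist.sum(), 1)                     # normalised visual-word histogram
```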
table 2: precision rate of bof technique for different number of clusters for top‑20 different number of clusters categories africa beaches buildings buses dinosaur elephant roses horses mountains food average k=100 54.75 46.5 42.4 84.5 100 58.55 84.9 82 39.9 40.15 63.365 k=200 56.95 48.65 42.5 86.35 100 59.85 85.5 87.05 41 42.3 65.015 k=300 58.55 48.55 45 85.7 100 61.05 85.5 88.4 42.45 41.95 65.715 k=400 58.45 48.25 47.3 87.35 100 59.2 85.65 87.65 44.65 41.35 65.985 k=500 60.5 48.75 50.3 85.6 99.95 59.05 85 87.9 47.15 40.55 66.475 k=600 60.2 48.5 50 84.8 99.95 58.85 84.7 87.7 47.1 40.05 66.185 k=700 57.95 48.5 51.6 84.85 99.95 57.5 84.75 88.45 44.75 39.65 65.795 k=800 58.3 48.7 50.55 84.3 99.95 58.95 85.95 89.3 47.35 39.15 66.25 k=900 57.65 47.95 51.7 84.35 99.95 57.6 85.05 88.75 47.7 39.3 66 k=1000 56.8 48.2 52.95 82.6 99.85 56.95 85.15 88.4 46.15 37.55 65.46 table 1: precision rate of bof technique for different number of clusters for top‑10 different number of clusters categories africa beaches buildings buses dinosaur elephant roses horses mountains food average k=100 61.1 47.2 50.1 88.9 100 68.9 87.3 89.6 46.4 53.8 71.056 k=200 65.7 49.3 55.2 88.9 100 70.9 87 91.8 49.1 53.3 73.1 k=300 65.2 49.9 57.1 89.2 100 72.3 87.4 93.1 48.4 54.4 73.622 k=400 65.5 47.9 56.7 90.9 100 72.1 87.6 92.4 48.9 53.4 73.556 k=500 67.1 49.3 60.4 89.2 100 70.1 87.7 93.6 51.5 53.9 74.322 k=600 64.9 47.9 59.1 89.4 100 70.6 88.6 93.6 51.7 53.4 73.978 k=700 67.1 49.3 60.4 88.2 100 70.1 86.7 93.6 51.5 53.9 74.1 k=800 64.6 48 59.4 89.2 100 68.2 88.3 94.2 53.5 51 73.933 k=900 64.7 46.2 62 88.1 100 68 87.5 93.4 55.6 52.7 73.944 k=1000 64.5 47 61.5 87.6 100 66.9 88.3 94.2 53.4 50.9 73.711 salih and abdulla: an efficient two-layer based technique for cbir 36 uhd journal of science and technology | jan 2021 | vol 5 | issue 1 table 4: precision rate for the dwt sub‑bands for top‑20 dwt sub-bands categories africa beaches buildings buses dinosaur elephant roses horses mountains food average ll 15.2 29.7 26.75 22.85 94.1 29.4 27.25 16.5 26.15 20.25 30.82 lh 20.05 23.5 22.9 15.3 97.35 21.85 19.05 15 23.95 22.95 28.19 hl 11.55 12.9 18.2 13.15 93.3 13.2 14.25 11.25 11.65 15.7 21.52 hh 17.55 8.9 14.2 19.15 89.3 19.2 10.25 17.25 7.65 11.7 21.52 ll_lh 16.55 29.5 26.15 23.25 94.35 30.85 30.55 16.9 26.95 20.3 31.54 ll_hl 25.95 28.65 25.85 22.55 95.3 29.1 30.2 16.9 26.85 20.6 32.2 ll_hh 26.45 28.7 25.65 22.5 94.25 29.35 30.05 17.3 26.55 20.65 32.15 lh_hl 22.3 29.5 29.9 29.55 94.4 23.65 34.9 21.2 22 19.2 32.66 lh_hh 11.85 19.1 18.7 12.1 94.4 17.75 15.5 11.95 17.8 19 23.82 hl_hh 7.85 15.1 14.7 8.1 92.4 13.75 11.5 7.95 13.8 15 20.02 ll_lh_hh 11.85 19.1 18.7 12.1 96.4 17.75 15.5 11.95 17.8 19 24.02 ll_hl_hh 15.95 29.15 26.4 22 95.35 29.2 30.3 16 27.35 20.35 31.21 ll_lh_hh 17.35 30.55 27.8 23.4 95.75 30.6 31.7 17.4 28.75 21.75 32.51 ll_lh_hl_hh 11.95 25.4 23.3 18.55 95.2 27.05 27.2 11.65 23.95 15.85 28.01 table 3: precision rate for the dwt sub‑bands for top‑10 dwt sub-bands categories africa beaches buildings buses dinosaur elephant roses horses mountains food average ll 17.2 48.4 44.6 24.85 95.1 29.4 27.25 16.5 26.15 20.25 34.97 lh 25.05 36.05 31.8 24.3 97.35 29.85 29.05 22 27.95 27.95 35.14 hl 11.55 21.55 22.55 23.15 93.3 23.2 24.25 21.25 21.65 19.7 28.22 hh 21.2 12.55 17.85 22.8 92.95 22.85 13.9 20.9 11.3 15.35 25.17 ll_lh 20.2 33.15 29.8 26.9 98 34.5 34.2 20.55 30.6 23.95 35.19 ll_hl 30.2 32.9 30.1 26.8 98.55 33.35 34.45 21.15 31.1 24.85 36.35 ll_hh 29 31.25 28.2 25.05 96.8 31.9 32.6 19.85 29.1 23.2 34.7 lh_hl 26.85 34.05 34.05 
33.7 98.55 27.8 39.05 25.2 26.15 23.35 36.88 lh_hh 15.61 22.86 22.46 15.86 98.16 21.51 19.26 15.71 21.56 22.76 27.58 hl_hh 11.85 27.8 22.5 8.1 92.4 23.75 21.5 17.05 15.8 19 25.98 ll_lh_hh 16.05 32 26.7 12.3 96.6 27.95 25.7 21.25 20 23.2 30.18 ll_hl_hh 19.85 33.05 30.3 25.9 99.25 33.1 34.2 19.9 31.25 24.25 35.11 ll_lh_hh 24.66 33.86 32.11 30.71 96.2 31.91 31.01 20.71 30.06 22.06 35.329 ll_lh_hl_hh 17.6 29.2 29.5 24.3 99.7 31.8 30.6 15 30.3 23.4 33.14 table 5: precision rate for lbp for top‑10 dwt sub-bands categories africa beaches buildings buses dinosaur elephant roses horses mountains food average ll 37.35 50.55 47.8 43.4 115.75 50.6 51.7 37.4 48.75 41.75 52.51 lh 56.8 49.5 54.1 85.2 99.5 45.8 91.5 70.2 39.8 47.8 64.02 hl 74.1 54.2 58 92.9 99.9 56.3 89.1 78.1 46.5 63.8 71.29 hh 53.5 40.9 54.9 90.1 98.6 39.8 92.3 57.1 37.8 46.8 61.18 ll_lh 63.4 50.4 61.8 92.3 98.7 44.6 92.9 74.6 39.1 51.4 66.92 ll_hl 62.64 49.54 61.02 91.74 98.18 43.7 92.34 73.91 38.17 50.55 66.18 ll_hh 61.19 48.19 59.59 90.09 96.49 42.39 90.69 72.39 36.89 49.19 64.71 lh_hl 70.7 61.4 69.7 95.5 99.2 55.1 93.4 83 46.4 65 73.94 lh_hh 64.45 51.45 62.85 93.35 99.75 45.65 93.95 75.65 40.15 52.45 67.97 hl_hh 63.45 50.45 61.85 92.35 98.75 44.65 92.95 74.65 39.15 51.45 66.97 ll_lh_hh 64.55 51.75 63.05 92.95 99.75 45.95 94.05 75.45 40.15 52.45 68.01 ll_hl_hh 62.75 49.95 61.25 91.15 97.95 44.15 92.25 73.65 38.35 50.65 66.21 ll_lh_hh 74.35 57.95 67.75 95.75 99 54.05 94.15 81.85 47.55 65.05 73.75 ll_lh_hl_hh 72 55.8 65.5 93.6 98.3 52 91.9 79.7 45.5 62.9 71.72 4. before selecting an appropriate color description, selection of color space is important and needs to choose a color model for color feature extraction process [35]. this step evaluates the impact of extracting the color salih and abdulla: an efficient two-layer based technique for cbir uhd journal of science and technology | jan 2021 | vol 5 | issue 1 37 table 6: precision rate for lbp for top‑20 dwt sub-bands categories africa beaches buildings buses dinosaur elephant roses horses mountains food average ll 35.15 48.35 45.6 41.2 113.55 48.4 49.5 35.2 46.55 39.55 50.31 lh 48.95 40.7 45.25 79.65 96.6 36.85 87.8 64.75 32.85 40.75 57.42 hl 67.55 48.7 49.2 89.85 98.4 44.7 86.05 68.9 40.7 57.35 65.14 hh 44.8 36 46.25 84.4 98.1 32.1 89.9 49.9 30.75 38.95 55.12 ll_lh 53 43.3 51 87.95 98.1 35.95 90.5 62.85 34.3 42.15 59.91 ll_hl 52.16 42.39 50.15 87.36 97.58 34.99 89.93 62.08 33.33 41.24 59.12 ll_hh 50.79 41.09 48.79 85.74 95.89 33.74 88.29 60.64 32.09 39.94 57.7 lh_hl 63.8 54.45 59.85 93.2 99.15 43.75 90.65 72.75 41 58.2 67.68 lh_hh 54.05 44.35 52.05 89 99.15 37 91.55 63.9 35.35 43.2 60.96 hl_hh 53.05 43.35 51.05 88 98.15 36 90.55 62.9 34.35 42.2 59.96 ll_lh_hh 54.1 44.65 52.15 88.65 99.15 36.9 91.55 63.8 35.4 43.35 60.97 ll_hl_hh 52.3 42.85 50.35 86.85 97.35 35.1 89.75 62 33.6 41.55 59.17 ll_lh_hh 64.5 50.6 57.8 93.3 98.9 43.35 92.45 69.75 40.7 56 66.74 ll_lh_hl_hh 62.25 48.4 55.9 90.9 97.65 41.2 90.2 67.6 38.45 53.7 64.625 table 7: precision rate for color‑based features for top‑10 color spaces categories africa beaches buildings buses dinosaur elephant roses horses mountains food average rgb 45.8 35.5 33.2 41.6 99.4 49.6 62.6 83.7 39.4 56.4 54.72 hsv 21.1 22.65 32.8 26.7 39.5 23.9 19.9 29.45 20 22.1 25.81 ycbcr 47.6 35.9 40.1 46.9 100 53.4 65.7 56.6 39.3 51 53.65 rgb_hsv 48 37.3 41.7 48.8 99.3 52.3 60.3 84 38.5 56.7 56.69 rgb_ycbcr 50.7 37.4 38.5 50.8 99.7 55.2 68 86.4 38.8 55.6 58.11 hsv_ycbcr 41.5 33.5 41.1 46.8 99.6 48.65 59.65 56.1 35.2 44.2 50.63 rgb_hsv_ ycbcr 47.8 38.9 47.6 
60.3 99.7 52.4 63.9 82.2 44.6 56.8 59.42 feature by testing different color space components such as: rgb, ycbcr, and hsv. in other words, mean and entropy for each color components have been calculated and the results are presented in the following tables. the results presented in tables 7 and 8 demonstrates that combining the extracted features of all the tested color spaces provides best precision rate. 5. finally, all the extracted features are fused. meanwhile, by concatenating the extracted lbp in step 3 and the extracted color feature in step 4, table 9. 6. eventually, the proposed two-layer approach has been tested. it includes two layers: the first layer implements bof technique (for k=500) and m most similar images are retrieved, m is user defined. in the second layer, color and texture features are extracted from the query image and the m remained images, as a result, n most similar images are retrieved. the following tables investigate the best value of m. in other words, tables 10 and 11 show investigating different number of m for top-10 and top-20, respectively. results in tables 10 and 11 demonstrate that the best precision results are obtained for m = 100 and m=200. for table 8: precision rate for color‑based features for top‑20 color spaces categories africa beaches buildings buses dinosaur elephant roses horses mountains food average rgb 40 28.8 27.55 34.5 99.5 45.45 54.6 77.7 34.5 48.15 49.075 hsv 27.1 29.3 37.7 31.2 50.4 28.1 26.3 37.1 25.2 27.5 31.99 ycbcr 39.2 30.55 34.5 39.65 99.85 47.8 60.25 49.55 35.1 44 48.045 rgb_hsv 41.05 30.85 34.55 41.05 99.15 46.15 51.8 77.25 36.25 48.2 50.63 rgb_ycbcr 42.55 30.3 31 42.25 99.55 49 58.9 78.45 33.85 48.05 51.39 hsv_ycbcr 44.4 33.4 45.5 48.9 95.2 49.8 60.5 61.4 35.3 47.2 52.16 rgb_hsv_ ycbcr 41.9 33.3 37.15 55.35 99.5 47.25 56.05 73.85 38.55 48.25 53.115 salih and abdulla: an efficient two-layer based technique for cbir 38 uhd journal of science and technology | jan 2021 | vol 5 | issue 1 table 9: concatenation of texture‑based feature and color‑based feature top image retrieved categories africa beaches buildings buses dinosaur elephant roses horses mountains food average top-20 67.45 49.1 60.65 88.8 99.45 61.7 90.5 81.8 49 61.55 71 top-10 67.45 56.25 69.5 88.8 99.45 61.7 90.5 81.8 49 61.55 72.6 table 10: precision rate for different number of m for top‑10 different number of m categories africa beaches buildings buses dinosaur elephant roses horses mountains food average m=100 81.65 63 71.9 95.95 99.85 77.6 96.35 92.75 67.55 73.45 82.005 m=200 78.85 62.2 73.65 95.7 99.85 76.6 95.95 91.6 66.65 75.1 81.615 m=300 77.05 61.1 74.65 94.8 99.85 75.4 96.05 90.65 65.4 75.35 81.03 m=400 76.6 60.1 74.7 94.65 99.65 74.8 95.95 90.3 64.45 75.15 80.635 m=500 76.5 60.2 74.75 94.35 99.65 74.2 95.85 90.25 63.45 74.85 80.405 m=600 76.2 60.1 74.3 94 99.65 74 95.65 90 63.3 74.75 80.195 m=700 76.15 60 74.1 93.85 99.65 73.85 95.45 90 63.15 74.55 80.075 m=800 76 59.4 74 93.6 99.2 73.25 95.15 89.4 63 74.2 79.72 m=900 76.35 59.9 75.2 94.15 99.65 73.4 95.85 90 62.1 75.05 80.165 m=1000 76.3 59.3 75.2 94.1 99.25 73.2 95.55 89.7 62 74.85 79.945 table 11: precision results for different number of m for top‑20 different number of m categories africa beaches buildings buses dinosaur elephant roses horses mountains food average m=100 76.35 58.8 61.85 94.95 99.9 70.35 94.1 90.3 61.45 64.15 77.22 m=200 75.25 57.45 63.85 94.35 99.6 70.1 93.95 87.45 60.6 66.25 76.885 m=300 73.1 55.5 64.4 92.9 99.55 68.9 93.8 86.35 58.85 66.75 76.01 m=400 72.1 54.47 65.4 92.4 
99.55 68.15 93.6 85.95 57.75 66.5 75.587 m=500 72 54.45 65.4 92 99.55 67.85 93.55 85.75 57.25 66.3 75.41 m=600 71.7 54 65.1 91.6 99.55 67.75 93.3 85.65 57.15 66.2 75.2 m=700 71.6 53.8 65 91.4 99.45 67.6 93 85.4 57.05 66.2 75.05 m=800 71.5 53.5 64.8 91.2 99 67.1 93 85.1 56.8 66 74.8 m=900 71.7 54.05 65.5 91.45 99.55 66.75 93.45 85.5 54.05 66.45 74.845 m=1000 71.2 54.05 65.2 91.15 99.05 66.65 93.15 85.4 53.85 66.35 74.605 this reason, different numbers of m in the range of m=100 to m=200 have also been investigated to gain better precision result, tables 12 and 13. from tables 12 and 13, it is quite clear that the best result is obtained when m =110 for both top-10 and top-20. more experiments have been done to compare the proposed approach with the state-of-the-art techniques, table 14. according to the results presented in table 14, the best performance (precision rate) is achieved by the proposed approach for both top-10 and top-20. all the tested table 12: precision rate of the proposed approach for different number of m for top‑10 different number of m categories africa beaches buildings buses dinosaur elephant roses horses mountains food average m=30 82.89 64.64 70.39 95.44 99.74 76.79 96.24 94.19 68.94 69.94 81.92 m=50 83.15 64.9 70.65 95.7 100 77.05 96.5 94.45 69.2 70.2 82.18 m=70 81.9 63.95 70.85 95.6 100 77.2 96.65 93.45 68.5 72.65 82.08 m=90 81.55 62.65 71.75 95.9 99.95 77.85 96.6 93 68.45 72.65 82.04 m=110 80.95 64.25 72.6 96 99.95 77.75 96.35 92.55 68 73.05 82.15 m=130 81.85 63.8 71.05 95.8 99.92 77.65 96.4 92.4 68.2 72 81.91 m=150 81.49 63.38 71.86 95.8 99.9 77.5 96.4 92.55 68 72.9 81.98 m=170 81.1 62.25 71.29 95.33 99.58 77.3 96 92.2 67.7 72 81.48 m=190 80 62.6 70.9 94.6 99.2 76.45 95.3 91.8 67 71.7 80.96 salih and abdulla: an efficient two-layer based technique for cbir uhd journal of science and technology | jan 2021 | vol 5 | issue 1 39 table 14: precision rate of the tested cbir techniques approaches top-10 top-20 proposed approach 82.15 77.27 atif nazir, 2018 73.5 pradhan jitesh, 2019 64.00 59.60 khawaja and ahmed , 2019 76.5 hamed qazanfari, 2019 74.77 aiswarya, 2020 67 table 13: precision rate of the proposed approach for different number of m for top‑20 different number of m categories africa beaches buildings buses dinosaur elephant roses horses mountains food average m=30 75.14 57.84 59.19 94.09 99.74 69.29 93.29 90.64 58.54 59.24 75.70 m=50 75.40 58.10 59.45 94.35 100.00 69.55 93.55 90.90 58.80 59.50 75.96 m=70 76.80 59.00 60.65 94.45 99.95 69.80 94.65 90.65 59.65 62.10 76.77 m=90 76.95 58.70 61.35 94.85 99.95 69.65 94.15 90.45 61.00 63.70 77.08 m=110 76.00 59.10 62.35 95.00 99.90 70.35 93.85 90.20 61.55 64.40 77.27 m=130 76.35 58.6 60.7 94.74 99.2 69.45 94.1 90.4 61 62.2 76.674 m=150 76.4 58.9 61.5 94.7 99.35 70 94.3 90.1 61.1 61.58 76.793 m=170 76.38 58 61 94.4 99 69.4 93.2 89.9 61.69 61.99 76.496 m=190 75.25 57.6 60.5 93.7 98.93 69.33 93.1 89 60.87 61.93 76.021 state-of-the-art techniques, except technique in ahmed and naqvi [24], they tested their approach either for top-10 or for top-20, and this is why in table 14 some cells are not contained the precision rate. 5. conclusion the two-layer based cbir approach for filtering and minimizing the dissimilar images in the dataset of images to the query image has been developed in this study. in the first layer, the bof has been used and as a result, m most similar images are remained for the next layer. 
meanwhile, the most dissimilar images are eliminated and, hence, the range of search is narrowed for the next step. the second layer concentrated on concatenating the extracted both (texture and color)-based features. the results obtained by the proposed approach devoted the impact of exploring the concept of two-layer in improving the precision rate compared to the existing works. the proposed approach has been evaluated using corel-1k dataset and the precision rate of 82.15% and 77.27% for top-10 and top-20 is achieved, respectively. in the future, certain feature extractors need to be investigated as well as feature selection techniques need to be added to select the most important feature which reflects increasing precision rate. references [1] z. f. mohammed and a. a. abdulla. “thresholding-based white blood cells segmentation from microscopic blood images”. uhd journal of science and technology, vol. 4, no. 1, p. 9, 2020. [2] m. w. ahmed and a. a. abdulla. “quality improvement for exemplarbased image inpainting using a modified searching mechanism”. uhd journal of science and technology, vol. 4, no. 1, p. 1, 2020. [3] h. liu, j. yin, x. luo and s. zhang. “foreword to the special issue on recent advances on pattern recognition and artificial intelligence”. springer, berlin, p. 1, 2018. [4] a. wojciechowska, m. choraś and r. kozik. “evaluation of the pre-processing methods in image-based palmprint biometrics”. springer, international conference on image processing and communications, p. 1, 2017. [5] a. a. abdulla, s. a. jassim and h. sellahewa. “secure steganography technique based on bitplane indexes”. 2013 ieee international symposium on multimedia, 2013. [6] a. a. abdulla. “exploiting similarities between secret and cover images for improved embedding efficiency and security in digital steganography”. department of applied computing, the university of buckingham, united kingdom, pp. 1-235, 2015. [7] s. farhan, b. k. biswas and r. haque. “unsupervised contentbased image retrieval technique using global and local features”. international conference on advances in science, engineering and robotics technology, p. 2, 2019. [8] r. s. patil, a. j. agrawal. “content-based image retrieval systems: a survey”. advances in computational sciences and technology, vol. 10, 9, pp. 2773-2788, 2017. [9] h. shahadat and r. islam. “a new approach of content based image retrieval using color and texture features”. current journal of applied science and technology, vol. 21, no. 1, pp. 1-16, 2017. [10] a. sarwar, z. mehmood, t. saba, k. a. qazi, a. adnan and h. jamal. “a novel method for content-based image retrieval to improve the effectiveness of the bag-of-words model using a support vector machine”. journal of information, vol. 45, pp. 117135, 2019. [11] l. k. paovthra and s. t. sharmila. “optimized feature integration and minimized search space in content based image retrieval”. procedia computer science, vol. 165, pp. 691-700, 2019. [12] a. masood, m. a. shahid and m. sharif. “content-based image retrieval features: a survey”. the international journal of advanced networking and applications, vol. 10, no. 1, pp. 37413757, 2018. [13] s. singh and s. batra. “an efficient bi-layer content based image salih and abdulla: an efficient two-layer based technique for cbir 40 uhd journal of science and technology | jan 2021 | vol 5 | issue 1 retrieval system”. springer, berlin, p. 3, 2020. [14] y. d. mistry. “textural and color descriptor fusion for efficient content-based image”. 
iran journal of computer science, vol. 3, pp. 1-15, 2020. [15] k. t. ahmed, a. irtaza, m. a. iqbal. “fusion of local and global features for effective image extraction”. elsevier, amsterdam, netherlands, vol. 51, pp. 76-99, 2019. [16] m. o. divya and e. r. vimina. “maximal multi-channel local binary pattern with colour information for cbir”. springer, berlin, p. 2, 2020. [17] t. kato. “database architecture for content-based image retrieval”. international society for optics and photonics, vol. 1662. pp. 112-123, 1992. [18] j. yu, z. qin, t. wan and x. zhang. “feature integration analysis of bag-of-features model for image retrieval”. neurocomputing, vol. 120, pp. 355-364, 2013. [19] n. shrivastava. “content-based image retrieval based on relative locations of multiple regions of interest using selective regions matching”. elsevier, amsterdam, netherlands, vol. 259, pp. 212224, 2014. [20] e. gupta and r. s. kushwah. “combination of global and local features using dwt with svm for cbir in reliability”. infocom technologies and optimization (icrito) trends and future directions, 2015. [21] e. gupta and r. s. kushwah. “content-based image retrieval through combined data of color moment and texture”. international journal of computer science and network security, vol. 17, pp. 94-97, 2017. [22] a. nazir, r. ashraf, t. hamdani and n. ali. “content based image retrieval system by using hsv color histogram, discrete wavelet transform and edge histogram descriptor”. 2018 international conference on computing, mathematics and engineering technologies, p. 4, 2018. [23] p. jitesh, a. ashok, p. a. kumarand and b. haider. “multi-level colored directional motif histograms for content-based”. the visual computer, vol. 36, pp. 1847-1868, 2020. [24] k. t. ahmed and s. h. naqvi. “convolution, approximation and spatial information based object and color signatures for content based image retrieval”. 2019 international conference on computer and information sciences, 2019. [25] h. qazanfari, h. hassanpour and k. qazanfari. “content-based image retrieval using hsv color space features”. international journal of computer and information engineering, vol. 13, no. 10, pp. 537-545, 2019. [26] e. rashno. “content-based image retrieval system with most relevant features among wavelet and color features”. iran university of science and technology, vol. pp. 1-18, 2019. [27] k. s. aiswarya, n. santhi and k. ramar. “content-based image retrieval for mobile devices using multi-stage autoencoders”. journal of critical reviews, vol. 7, pp. 63-69, 2020. [28] j. zhou, x. liu, w. liu and j. gan. “image retrieval based on effective feature extraction and diffusion process”. multimedia tools and applications, vol. 78, no. 5, pp. 6163-6190, 2019. [29] p. srivastava. “content-based image retrieval using multiresolution feature descriptors”. springer, berlin, pp. 211-235, 2019. [30] i. a. saad. “an efficient classification algorithms for image retrieval based color and texture features”. journal of al-qadisiyah for computer science and mathematics, vol. 10, no. 1, pp. 42-53, 2018. [31] m. s. haji. “content-based image retrieval: a deep look at features prospectus”. international journal of computational vision and robotics, vol. 9, no. 1, pp. 14-37, 2019. [32] v. geetha, v. anbumani, s. sasikala and l. murali. “efficient hybrid multi-level matching with diverse set of features for image retrieval”. springer, berlin, pp. 12267-12288, 2020. [33] r. boukerma, s. bougueroua and b. boucheham. 
“a local patterns weighting approach for optimizing content-based image retrieval using a differential evolution algorithm”. 2019 international conference on theoretical and applicative aspects of computer science, 2019. [34] y. cai, g. xu, a. li and x. wang. “a novel improved local binary pattern and its application to the fault diagnosis of diesel engine”. shock and vibration, vol. 2020, p. 9830162, 2020. [35] g. xie, b. guo, z. huang, y. zheng and y. yan. “combination of dominant color descriptor and hu moments in consistent zone for content based image retrieval”. ieee access, vol. 8, pp. 146284-146299, 2020. [36] a. c. nehal and m. varma. “evaluation of distance measures in content based image retrieval”. 2019 3rd international conference on electronics, communication and aerospace technology, pp. 696-701, 2019. [37] s, bhardwaj, g. pandove and p. k. dahiya. “a futuristic hybrid image retrieval system based on an effective indexing approach for swift image retrieval”. international journal of computer information systems and industrial management applications, vol. 12, pp. 1-13, 2020. [38] s. p. rana, m. dey and p. siarry. “boosting content based image retrieval performance through integration of parametric and nonparametric approaches”. journal of visual communication and image representation, vol. 58, pp. 205-219, 2019. [39] m. k. alsmadi. “content-based image retrieval using color, shape and texture descriptors and features”. springer, berlin, pp. 1-14, 2020. . uhd journal of science and technology | april 2017 | vol 1 | issue 1 27 1. introduction the iraqi population consists of several ethnic gatherings, including arab muslim shiite, arab muslim sunnis, kurds, assyrian, turkoman, chaldean, armenian, yazidi, sabean, and jews. arabic is the language used in many regions and kurdish is the official language in kurdistan [1], and overall designing a curcuilm of using technology will help each ethnic group to express their culture globally with using technology; the key is to be a technology-based classroom context where each individual can work and develop ideas. thus, technology-based classroom can bring together this melting pot in a diverse country such as iraq with diverse students in the university. mit professor seymour papert had presented among great debate that the kids were to use computer in their learning. what was all the more shocking was the progressive route in which he led his classes, in open investigation of learning and information that exemplified the general population’s pc development of the 1960s. this scene summed up the significance of the change of training and a move in pondering training that fits so well for a fates approach [2]. thus, could a similar case be witnessed in kurdistan and/or iraq? lei and different researchers demonstrate the significance of utilizing computer technology in learning and profitability. furthermore, innovation is utilized as a part of instructing and learning techniques in many nations. however, not all the university organizations in iraq utilize technologies in their classrooms. 
this review has taken a subjective research approach to examine this issue and recognize the reasons of why the technological tools are not utilized to encourage the procedure of correspondence, the future of technology-based classroom mazen ismaeel ghareb1 and saman ali mohammed2 1department of computer science, college of science and technology, university of human development, iraq 2department of english, college of language, university of human development, iraq a b s t r a c t in recent years, information technology has turned out to be a standout among the most essential social practices of the iraqi society. technology has developed in many areas in iraq, yet the internet and computer technology is not implemented in every classroom in most of iraqi universities. the aim of this paper is to question: why technology is not executed in the classroom of the universities? the point of this contextual analysis is to recognize the difficulties that face university organizations from executing computer technology in every classroom with support of ministry of higher education. another question is: how popular or effective could the application of computer communication or computer application in the classrooms of our graduate students be when it comes to education? this study wants to uncover the underlying issues and reasons for the in capabilities in using technology. to examine this, the researchers inspected cases of teaching where technologies used in the classrooms as a lived experience and as an apparatus for encouraging direction while instructing in the universities. the outcomes show that the country’s current educational and telecommunications infrastructure is weak and the ministry of higher education and universities need to manufacture the framework limit of schools, and furthermore to enhance educator preparing programs at all levels. index terms: e-classroom, information technology, internet in education, iraq education, online learning applications corresponding author’s e-mail: mazen.ismaeel@uhd.edu.iq received: 20-03-2017 accepted: 30-03-2017 published: 12-04-2017 access this article online doi: 10.21928/uhdjst.v1n1y2017.pp27-32 e-issn: 2521-4217 p-issn: 2521-4209 copyright © 2017 ghareb and mohammed. this is an open access article distributed under the creative commons attribution noncommercial no derivatives license 4.0 (cc by-nc-nd 4.0) o r i g i n a l re se a rc h a rt i c l e uhd journal of science and technology mazen ismaeel ghareb and saman ali mohammed: the future of technology-based classroom 28 uhd journal of science and technology | april 2017 | vol 1 | issue 1 improve the strategies for instructing, and enhance the students’ learning skills in iraqi colleges and personnel [3]. 2. literature review a. technology and information age during the previous 50 years, electronic devices (radio, tv, pcs, satellite, and so forth) were the focal instruments and correspondence advances in assisting with transmitting the information to the individuals [4]. the information age has brought new challenges since 1950 where individuals might want to have multimedia sources accessible for them to utilize. the term is used to describe a cybernetic society, which relies upon computer and information transmission. the commonplace casing of understanding a modern culture depends on the human work and the machines they use to deliver products. 
in the light of the consistent changes over decades, geographic hindrances are being broken down, and the connection between the employees and their working environment is evolving quickly. new data information technology (it) and types of correspondence have developed to take care of issues and set new bearings for issues that have been around for quite a while. if we take education, for example, we will see that individuals can read, compose, sort, and print by utilizing computer literacy [5]. in many societies, the nature and capacity of technology have changed using computerized advancements in correspondence and in keeping people in general educated on matters of open significance [6]. it is the reason for changes for the dominant part of businesses. it is a strategic tool, and without information and technologies, changes are unrealistic especially in a modern age. in the recent years, a large portion of the enterprises everywhere worldwide utilized telecommunicated systems of computers at the center of information systems and communication processes. the development of new technologies makes educating and correspondence all the more effective and less demanding. technologies do not take care of the social issues, but rather a crucial instrument for improvement and creativity in education in society [7]. research has demonstrated that the utilization of innovation serves massively crucial by encouraging the learning process where it associates workforce and students with the world. in addition, technology can offer assets and encounters that books are not ready to offer. computer technology can likewise help with data analysis. they can encourage the administration of vast volumes of information and empower staff, students, or analysts to find, mark, and gather distinctive mixes of portions of literary information [8]. b. the futures of educations the use investigating the morally stable points of unesco’s educating process for a sustainable future [9], or the global challenges training and learning uncovers the principal and effectiveness of human association with instruction and the importance of training as an extensive advancement apparatus for social monetary change, that is repeated solidly [10]. training in this regard is one of the problems needs to be addressed when tending to change and what’s to come. it is clear that the educational systems do not set us up for the rising pluralistic, interconnected, and complex world. they unquestionably do not set us up for apparently interminable change, insecurity, or more all, instability [11]. educators and education structures, then need to address this issue of managing ceaseless change and its complexities, new establishments for training. hicks and gidley maintain that if teachers do not help youngsters to feel engaged in connection to their future and future change, then, it can be contended, they will have bombed in their obligation to the present era of learners [12]. thus, these are heated words that place the future eras at the wheel of education and its key approach, as the experience of instruction will shape them, in their expert capacity as well as in their exceptional character, as the finnish instruction service shows that training basically constructs personality. the moral and good obligation in view of future lives and personalities then could not be as vital or as esteem driven. c. education in iraq iraq government requires all qualified matured kids to go to grade school. 
this helped establish the framework for a standout among the most created and proficient populaces in the area. amid the times of 1970-1984, a period some recognize as the country’s “brilliant years,” education system increased huge initiative acknowledgment in middle east. today, the ministry of education (moe) and ministry of higher education and scientific research (mohesr) and ministry of higher education of kurdistan region government (moh-krg) deal with the iraqi instruction framework [13]-[15]. d. computers in the classroom in a presentation on education and learning in 1990, seymour paper questioned what specialists, futurists, and prophets are asking what will it resemble? what kind of a world will it be? furthermore, he states that these could be wrong mazen ismaeel ghareb and saman ali mohammed: the future of technology-based classroom uhd journal of science and technology | april 2017 | vol 1 | issue 1 29 sort of things to ask, as they are all wrong, he said that “the question is definitely not what will the computer do to us?” the question is “the thing that we will make of the computer?” the fact of the matter is defined not to foresee the computer future [16]. the substance of this is the vital position in which training ought to be viewed as proactive and authoritative, that diagram two distinctive perspectives of education, one that is repetitive knowledge based, and the other that is an experience based learning. that the adjustments in our condition do not simply happen to us and impact us, however, we likewise play a part in influencing them. this fits extremely well for a prospect’s point of view and helps in exploring moral issues managing technology and education. on account of this we can get better engagement with kids as dynamic learners and their enabling perspective of training that is exploratory in nature. the landing of the computer in the classroom, blended up with solid moral inquiries regarding the significance of information, educating, and discovering till the present time serves as a key issue that raises key inquiries, for and against technology. we can see this sort of viewpoint today in moral discourses about kids bringing individual innovations such as mobile smart phones into the classroom. one proposed enactment for instance, announced in the news as of late, recommended that they ought to confiscate smart-phones since they would be a diversion to learn, or that students were tricked in some way, which in particular moral cases are genuine issues confronted for conventional education learning settings [16]. embracing childhood and youth as engaged through digital media, and consequently for instructive arrangement reflects their strengthening. it recognizes the requirement for a “quiet revolution” in education that saddles the impressive limits of youngsters and youngsters to take part in and coconstruct conceivable future [17]. e. internet revolution in iraq in recent years, the internet technology as methods for communication and its users have been growing drastically. the number of users in iraq increased after the 2003 war because there were rules feeding using internet [18].currently, the internet circumstance is enhanced significantly as there are presently there are many [19]. the last information gathered in regards to non-military personnel access to the internet was in 2017 and around then roughly 6.381 million iraqis had home web, positioning 87th for internet access globally [19]. 
while this is a an improvement, individuals get to independent companies, educational institutions, and government framework and keep on crippling iraq in turning out to be globally competitive. the united states (us) government welcomed the financial specialists and enormous organization to invest in iraq to begin the procedure of the country’s reconstruction. in 2004, the united nations development group iraq trust fund (undg itf) began in iraq. undg itf is one of the international reconstruction fund facility for iraq (irffi). the iraqi government counseled with the united nations and the world bank to plan the irffi and open the entryways for contributors and global sources to bolster iraq’s reconstruction activities including technology implementation projects [20]. f. social media in iraq researchers have been inspecting the part that web-based social networking plays in the higher education classroom. a portion of the work has highlighted the full of feeling results of social network sites coordination. a couple thinks about examined learning results and understudy accomplishment in relating to the instructive utilization of online networking in school courses. while the lion’s share of studies detailed positive evidence, there was proof of downsides too. education system does tend toward investigating and developing technologies as new or enhanced apparatuses to upgrade instruction and learning. web-based social networking has developed as a profoundly valuable individual correspondence technology. although the foundation to bolster web-based social networking’s nearness exists in many colleges today, teachers have been moderate in receiving the instrument as an instructive one. a few, obviously, are not willing to acknowledge the apparatus unlimited power favoring theoretical or pragmatic reasons for an implementation [21]. adding to that there are many students and undergraduates who spent a lot of time on social network, making many groups of study. many lecturers also have been interacting about the subjects, assignments, and courses on social media guiding and giving the students instructions [22]. 3. methodology the aim of this research is to recognize the difficulties that face the universities administrations from implementing the computer technology in every classroom. this study attempts to answer the accompanying research question: why the technology is not implemented yet in the classroom settings in the iraqi and krg colleges? this review takes after a qualitative research way to examine the issue. mazen ismaeel ghareb and saman ali mohammed: the future of technology-based classroom 30 uhd journal of science and technology | april 2017 | vol 1 | issue 1 g. the iraqi education system the models in country risk analysis, for example, the economic intelligence unit, business environment risk intelligence, and euro money help organizations and their administrations see well which nation is a decent market for speculation and which one is not to maintain a strategic distance from misfortunes to their organizations. these models function as a hazard appraisal to help firms in settling on market passage choices, operations principles, and leave procedure choices. these models and strategies incorporate essential data on iraq’s political circumstance, legal, investment overview, social, cultural, environmental, and geological dangers of the nation. 
the reports on iraq from these models were not positive which make it harder for the iraqi government to move quickly with the innovation execution extends in the scholarly organizations. in spite of the fact that the utilization of computer technology in classroom settings can help in encouraging the educating and enhancing the student’s learning aptitudes, the classrooms in iraqi universities still do not have the accessibility of this technology because of a few difficulties that the nation faces. these difficulties will be clarified in the accompanying sections. 4. results and discussions we as the researchers of this paper, have been observing some english and computer teaching classes in some of kurdistan universities such as university of human development as a private one and sulimanya university as the state university in iraq. the english training in iraq depends on conventional instructional strategy and generally the class size is extensive. the quantity of students is around 30-50 in each class. the technique in which the understudies learn english is an educator focused. perusing understanding takes a substantial piece of class time and linguistic use is essential. talking and listening appear to be disregarded in showing english as an outside dialect in iraq. moreover, little gathering work, which is common with the communicative teaching method (the strategy more generally acknowledged by remote language instructors these days), is disregarded in the iraqi classroom, which is also teacher-centered instead of student-centered. regarding the programming teaching, mazen experienced the lack of infrastructure for students to publish their software product and no support from government as small business projects [23]. as interview with several professors, they do not use the computer technologies to encourage learning through online talk bunches, watching the recordings and clasps inside the classroom, talking and recordings, and so on because of the absence of language laboratories or computer technology in classrooms. this issue is as yet present in iraq till now the same number of my associates who still sit-in iraq told me of these ongoing problems. getting to online instructional materials and the utilization of blackboard apparatuses in iraqi organizations is as yet not accessible to all public and private universities in iraq. however, university of human development uses moodle tool. moodel is used as an open source for our teaching using online forums, assignments, and online quizzes. in addition, we have experience using social media for group working as an effective method in learning process [24]. the iraqi and the us governments have been attempting to modernize the instructional framework and to improve online instructive libraries. this incorporates building instructive organizations among iraqi and us schools, trade projects and going by researchers, and supporting iraqi understudies and instructor to examine or to get prepared in the us foundations. in spite of the fact that this support is exceptionally useful, the overall progress is very slow [25]. 5. conclusions and recommendations the research findings show that if country risk analysts contrast iraq’s education, technological status with other developed countries such as the usa, they can see that getting to online instructional materials in iraqi establishments is still not of that high quality, and the nation’s current educational and media communications infrastructure is weak. 
in addition, iraq confronts various difficulties after the 2003 war. one of these difficulties was to recover the unmistakable quality that its instruction framework once held in the middle east, the need to construct the foundation limit of schools, and furthermore to enhance instructor preparing programs at all levels [26]. in view of the unsecure circumstance and the political instability in iraq, it is difficult to execute technology in all schools in a short time. therefore, the alternatives recognized are to have a pc laboratory, technology tools, and the internet in each school as an initial step to encourage learning and instructing. having such a lap is an essential official unit unlike smartphone. having internet café undergrad and graduate students to use for their class study and research. since so many students have smartphone and tablets, the use of internet and the technology-based classroom becomes easy if there are adopted mechanisms. the recommendations outlined in this research are as follows: the universities must mazen ismaeel ghareb and saman ali mohammed: the future of technology-based classroom uhd journal of science and technology | april 2017 | vol 1 | issue 1 31 have a correct time span for finishing the outline and the cost of their technology-based systems, classrooms, and project. it ought to likewise determine which schools that can profit by the technology implementation. there ought to likewise be a political strength and security in the nation to ensure the specialists and associations that will take the venture and play out the business. the universities have likewise to empower iraqi lecturers and students to go to workshops and gatherings to acquire information and improvement. moe and mohesr may contract worldwide staff and personnel to help with the educational development process and technology used in classrooms. sending iraqi students to consider in foreign countries will be an advantage for the country once they come back to exchange with the nearby staff their own experiences and knowledge. we recommend the other public and private university using new technological tools for their educational process. for instance, university of human development has used moodle tool as an open source for their students which has positively affected the capabilities of interact with the students and teaching activities in university and later outside the university [24]. this system is upgrade to google applications and all the students have their own university account and it will be configured to google class next year. furthermore, students can use many social media applications for educational communication such as facebook, viber, and many more. 6. acknowledgment we would like thanks university of human development for usual support to conduct this research, and for all students and lectuere that support us during analysis data of this study. references [1] j. ainsworth, ed. sociology of education: an a-to-z guide. thousand oaks, ca: sage publications, 2013. [2] d. c. hoyles and r. noss. “visions for mathematical learning: the inspirational legacy of seymour papert (1928-2016)." ems newsletter, vol. 3, no. 103, pp. 34-36, 2017. [3] j. lei, j. shen and l. johnson. “digital technologies and assessment in the twenty-first-century schooling.” in assessing schools for generation r (responsibility), netherlands: springer, 2014, pp. 185-200. [4] m. m. teixeira, c. d. de aquino, m. b. c. leão, h. v. l. souza, r. f. f. de oliveira, l. miranda, r. neves and e. 
medeiros. “mass media in teaching and learning: circumstances in higher education.” in information systems and technologies (cisti), 2016 11th iberian conference on ieee, 2016, jun. pp. 1-5. [5] t. gillespie, p. j. boczkowski and k. a. foot. media technologies: essays on communication, materiality, and society. cambridge, massachusetts: mit press, 2014. [6] j. jordan-meier. “the four stages of highly effective crisis management: how to manage the media in the digital age.” boca raton: crc press, 2016. [7] l. s. sidhu and j. sharma. “information, globalization through research and social development.” compusoft, an international journal of advanced computer technology, vol. 4, no. 5, pp. 1760, 2015. [8] t. a. schwandt. “the sage dictionary of qualitative inquiry.” thousand oaks, ca: sage publications, 2014. [9] a. e. wals. “sustainability in higher education in the context of the un desd: a review of learning and institutionalization processes.” journal of cleaner production, vol. 62, pp. 8-15, 2014. [10] e. unterhalter. “measuring education for the millennium development goals: reflections on targets, indicators, and a post2015 framework.” journal of human development and capabilities, vol. 15, no. 2-3, pp. 176-187, 2014. [11] a. montuori. “creative inquiry: confronting the challenges of scholarship in the 21st century. futures, vol. 44, no. 1, pp. 64-70, 2012. [12] d. hicks and j. gidley. “futures education: case studies, theories and transformative speculations.” futures, vol. 44, no. 1, pp. 1-3, 2012. [13] j. h. issa and h. jamil. “overview of the education system in contemporary iraq.” european journal of social sciences, vol. 14, no. 3, pp. 360-386, 2010. [14] s. al-husseini and i. elbeltagi. “knowledge sharing practices as a basis of product innovation: a case of higher education in iraq.” international journal of social science and humanity, vol. 5, no. 2, pp. 182, 2015. [15] n. kaghed and a. dezaye. “quality assurance strategies of higher education in iraq and kurdistan: a case study.” quality in higher education, vol.15, no. 1, pp. 71-77, 2009. [16] a. taylor. “technology in the classroom-the ethical futures of a transforming education.” coolest student papers at finland futures research centre, vol. 2015-2016, pp. 15, 2016. [17] a. craft. “childhood in a digital age: creative challenges for educational futures.” london review of education, vol. 10, no. 2, pp. 173-190, 2012. [18] k. m. wagner and j. gainous, j. “digital uprising: the internet revolution in the middle east.” journal of information technology and politics, vol. 10, no. 3, pp. 261-275, 2013. [19] r. w. abedalla, l. s. escobar and d. a. al-quraishi. accessing information technology-social media in iraq.” international journal of scientific and research publications, vol. 4, pp. 32, 2014. [20] unesco institute for statistics. “global education digest 2010: comparing education statistics across the world.” montreal, quebec: unesco-uis, 2010. [21] p. a. tess. “the role of social media in higher education classes (real and virtual)-a literature review.” computers in human behavior, vol. 29, no. 5, pp. a60-a68, 2013. [22] m. i. ghareb and h. o. sharif. “facebook effect on academic performance and social life for undergraduate students of university of human developments.” international journal of multidisciplinary and current research, vol. 3, pp. 811-820, 2015. [23] m. i. ghareb and s. a. mohammed. 
“the effect of e-learning and the role of new technology at university of human development.” international journal of multidisciplinary and current research, vol. 4, pp. 299-307, 2016. mazen ismaeel ghareb and saman ali mohammed: the future of technology-based classroom 32 uhd journal of science and technology | april 2017 | vol 1 | issue 1 [24] m. i. ghareb and s. a. mohammed. “the role of e-learning in producing independent students with critical thinking.” international journal of multidisciplinary and current research, vol. 4, pp. 299307, 2016. [25] a. d. andrade and b. doolin. “information and communication technology and the social inclusion of refugees.” mis quarterly, vol. 40, no. 2, pp. 405-416, 2016. [26] a. g. atiya and m. vrazhnova. “higher education funding in iraq in terms of the experience of particular developed countries.” international journal of advanced studies, vol. 6, no. 1, pp. 8-17, 2017. tx_1~abs:at/tx_2:abs~at uhd journal of science and technology | jan 2021 | vol 5 | issue 1 41 1. introduction cancer is a category of diseases that includes cell growth which is irregular with the ability to spread to other areas of the body. physicists have concentrated on continuous advancement in imaging methods over the past decades, enabling radiologists to improve cancer detection and diagnosis. however, the human diagnosis still suffers from poor repeatability, associated with false identification or perception in clinical decisions of anomalies. two factors influence these inaccuracies: the ability to observe is limited, for example, perception of human vision is constrained, fatigue duty, or confusion, and the second factor is the clinical case complexity, for instance, unbalanced data which are the mean number of healthy cases are more than a malignant case. different machine learning-based techniques for cancer detection and classification have introduced a new area of research for early cancer detection. the researches will lead to the ability to reduce the manual system impairments [1]. another reason, modality that has various analysis techniques such as inappropriate diagnostics, handling, and complicated history is leading to increasing mortality [2]. in the past decades, the field of digital pathology has dramatically developed due to the improvement of algorithms in image processing, machine learning, and advancements in computational power. within this sector, countless approaches have been suggested to analyze and classify automated pathological images. at present, many a state-of-the-art review on machine learning-based methods for prostate cancer diagnosis ari mohammed ali ahmed1, aree ali mohammed2 1department of information technology, technical college of informatics, sulaimani polytechnic university, krg, sulaimani, iraq, 2department of computer science, college of science, university of sulaimani, sulaymaniyah, iraq a b s t r a c t prostate cancer can be viewed as the second most dangerous and diagnosed cancer of men all over the world. in the past decade, machine and deep learning methods play a significant role in improving the accuracy of classification for both binary and multi classifications. this review is aimed at providing a comprehensive survey of the state of the art in the past 5 years from 2015 to 2020, focusing on different datasets and machine learning techniques. moreover, a comparison between studies and a discussion about the potential future researches is described. 
first, an investigation about the datasets used by the researchers and the number of samples associated with each patient is performed. then, the accurate detection of each research study based on various machine learning methods is given. finally, an evaluation of five techniques based on the receiver operating characteristic curve has been presented to show the accuracy of the best technique according to the area under curve (auc) value. conducted results indicate that the inception-v3 classifier has the highest score for auc, which is 0.91. index terms: prostate cancer, machine learning, deep learning, algorithm, and datasets corresponding author’s e-mail:  ari.m.ali@spu.edu.iq received: 19-10-2020 accepted: 27-03-2021 published: 31-03-2021 access this article online doi: 10.21928/uhdjst.v5n1y2021.pp41-47 e-issn: 2521-4217 p-issn: 2521-4209 copyright © 2021 ahmed and mohammed. this is an open access article distributed under the creative commons attribution noncommercial no derivatives license 4.0 (cc by-nc-nd 4.0) o r i g i n a l re se a rc h a rt i c l e uhd journal of science and technology ahmed and mohammed: prostate cancer diagnosis 42 uhd journal of science and technology | jan 2021 | vol 5 | issue 1 smart and powerful features are added to the microscope and digital images to convert slides of stained tissues into entire digital images. these facilities make a more efficient computer diagnosis system to analyze histopathology and helping early diagnosis. moreover, they treat cancer by avoiding the increase of cancer cells and easily controlling the tumors from spreading to other parts of the body [3]. in addition, analysis of medical imaging could be significantly involved in identifying defects in various body organs, such as prostate cancer (pca), blood cancer, skin cancer, breast cancer, brain cancer, and lung cancer. the abnormality of the organ is mainly the result of rapid tumor development, which is the world’s leading cause of death. as mentioned by globocan statistics, around 18.1 million new cancer cases have appeared in 2018 that gave rise to 9.6 million cancer deaths [2]. pca is considered the most dangerous disease type of cancer, and it is viewed as the second most commonly diagnosed cancer [3], [4]. the most ubiquitous form of cancer in men is pca and it has been reported to be the second leading cause of death in men [5]. in the usa, the occurrence of pca ranks first in men whereas in south korea, is the fifth most common cancer among males, and the expected cancer deaths in 2018 were 82,155 [3]. pca is the most leading cancer among men, after lung cancer. it is estimated that about 174,650 new cases and 31,620 pcarelated deaths were recorded in the united states in 2019. pca considers about 1 in 5 new cancer diagnoses among men. one of the difficulties of pca is grading that can be considered as a part of the classification problem. therefore, accurate prediction of pca grade is crucial to guarantee the quick treatment of malignancy [6]. furthermore, early diagnosis and treatment planning can significantly reduce the mortality rate due to pca [6], [7]. technologies lead to having a crucial role in helping the medical community to diagnose cancer quickly [8]. on the one hand, there are many differences between images attained with modalities of analytic imaging and other image types that related to features and management of procedures. 
on the other hand, challenges are arising from the use of the different types of scanners, protocols of imaging, variety of noising, and other issues related to image attainment [5]. different computer-aided techniques have been proposed using a radiomics method or deep learning network to accurately classify the pca on magnetic resonance imaging (mri) images [8]. several studies have shown that computeraided systems have a remarkable role in pca detection and diagnostic evaluation. the methods proposed so far are based on handcrafted features, using a classifier on top to determine whether a pca lesion is present or to assess its severity by assigning a specific class label. recently, different techniques such as convolutional neural networks (cnn), support vector machine (svm), iterative random (random forest [rf]), and j48 in the field of machine learning are proposed for locating and identifying cancer cells and normal cells. they have shown an impressive performance in various computer vision tasks following training with large image databases [5], [9]. this paper aims to propose a state-of-the-art review that surveys several techniques for pca diagnosis, moreover, the techniques which are mostly based on machine learning are comparing in terms of performance accuracy. the structure of the paper is as follows: in section ii, a review of some related works is represented while in section iii, the methodology of the literature review is described. section iv shows a comparison among the aforementioned methods. finally, a conclusion and future direction of the research survey are given in section v. 2. survey of pca techniques several techniques have been suggested by many researchers for improving and developing pca detection. in this survey, we mainly focus on the researcher’s techniques that have been implemented with the machine learning field between 2015 and 2020. sammouda et al., 2015, worked on malignant pca cells using near-infrared optical imaging technique that uses the high absorption of hemoglobin in pca cells. two algorithms (k-mean and fuzzy clustering mean) are used to segment and extract the cancer region in the prostate’s infrared images. using the student’s t-test to measure the accuracy between these two clusters, p value of k-means “cluster 3” is < 0.0001, and the standard error = 0.0002 is less than p value of ferric carboxymaltose (fcm)<0.0252 and standard error = 0.004. as the result, the k-mean is more accurate than fcm based on statistical analysis [10]. mohapatra and chakravarty, 2015, suggested a model using three classifiers svm, naive bayes, and knn to classify pca. in this model, microarray is used as a dataset. the area under the curve (auc) and accuracy have been measured to compare and evaluate the performances of these classifiers, ahmed and mohammed: prostate cancer diagnosis uhd journal of science and technology | jan 2021 | vol 5 | issue 1 43 with taken the entire datasets and selected optimal features separately as the input to the classifiers one by one. as the result, the svm technique performs more efficacy with higher accuracy of 95.5% [11]. in the same year, bouazza et al., 2015, proposed a classification method that performed a comparative study of four feature selection methods fisher, t-statistics, snr, and relieff, using two classifiers k-nearest neighbors and svm. test results indicated that the best classification accuracy is obtained with svm classifier and snr method [12]. 
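to make the classifier comparisons in these studies concrete, the following is a minimal, hedged sketch (not the authors' code) of how svm, naive bayes, and k-nearest neighbors could be compared by cross-validated accuracy and auc on a small gene-expression-style matrix; the synthetic data and parameter choices are illustrative assumptions only.

```python
# illustrative sketch only: comparing svm, naive bayes, and knn by accuracy and auc,
# in the spirit of the studies surveyed above; data and settings are hypothetical.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# stand-in for a microarray-style dataset (e.g., 136 samples, many features)
X, y = make_classification(n_samples=136, n_features=500, n_informative=20,
                           random_state=0)

classifiers = {
    "svm": SVC(kernel="linear"),
    "naive bayes": GaussianNB(),
    "knn": KNeighborsClassifier(n_neighbors=5),
}

for name, clf in classifiers.items():
    acc = cross_val_score(clf, X, y, cv=5, scoring="accuracy").mean()
    auc = cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name}: accuracy={acc:.3f}, auc={auc:.3f}")
```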
dash et al., 2016, worked on the microarray medical datasets and, two variations of kernel ridge regression (krr) are used which are wkrr and rkrr to classify the datasets. to achieve a high rate of accuracy, this model is comparing the accuracy test among several techniques such as krr, svm-rbf, svm-poly, and rf. as the result, krr (wkrr and rkrr) has been higher than all of them, especially rkrr which has an accuracy rate of 97%. however, this model has some drawbacks related to feature extraction which are ignoring the interaction with the classifier, features are considered independently which is mean ignoring these features which are dependencies. another difficulty is related to determine the point of threshold to rank the features [13]. imani et al., 2016, proposed an approach to integrate mpmri with temporal ultrasound for pca classification, in vivo. cnn technique has been utilized in this approach. a combination of mp-mri and temporal ultrasound is used to reduce the missing regions of tumors. the auc of 0.89 has been achieved for the classification of cancer with higher grades. despite the importance of this model, there are some drawbacks because of the heterogeneous of pca and it is difficult to determine tissue signature consistently [14]. ram et al., 2017, proposed an iterative rf (irf) algorithm as a classifier model to separate cancer from the controlled samples of pca. the method worked on microarray and next-generation sequencing (ngs) data. however, having a large number of gene expression data make it difficult of how to identify the biomarkers related to cancer. the rf has been used to select the genes which can diagnose and treat cancer effectively. rf method is used to extract very small sets of genes while it is taken predictive performance. genes of snrpa1 are selected for pca with the obtained accuracy of 73.33% [15]. sun et al., 2017, suggested a model investigate the performance of svm algorithm and to predict the prostate tumor location using multiparametric mri data. the capability of best predictive is achieved by optimizing model parameters using leave-one-out cross-validation. a binary svm classifier utilizes to find a plane in feature space, frequently identified as a decision boundary, which splits the data into two parts. furthermore, this algorithm is used to search for a decision boundary that maximizes the margin between the two groups. the final model gives results of classification by predicting the higher accuracy of 80.5%. however, only signal intensities and values from both t2-weighted (t2w) images and parametric maps are incorporated as features, respectively [16]. liu and an, 2017, suggested a model based on deep learning and cnn for image classification of pca, they used diffusionweighted mri (dwi) images that are selected images from a number of patients including positive and negative images. however, a small dataset makes a difficult for training a model that achieves higher accuracy. the proposed model has yielded an accuracy of 78.15% [17]. reda et al., 2018, presented a model using cnn based on computer-aided design (cad) system for early diagnosis of pca from dwi. they achieved accuracy rate of 95.65% [18]. bhattacharjee et al., 2019, developed a system for digitized histopathology images using a supervised learning method. svm has been presented and used to classify malignant and benign pca grade 3, achieved accuracy was 88.7%. in svm classification, 2-fold cross-validation has been used to train the model. 
both of linear and gaussian kernel are used for classifying samples as benign and malignant. furthermore, a binary classification approach has been used which divides the multitype classification into two-category groups. each partition characterizes distinct and independent classifications which are malignant and benign [3]. yoo et al., 2019, proposed a model of cad system based on cnn and rf techniques for mri (dwi) images. five individually trained cnns have been used to categorized dwi slices to extract the features and rf classifier has been used to classify patients into two groups patient with pca and without pca with achieved 0.84 as an auc. the main limitation of this model is intrinsically biased which is mean these patients take mris who have symptoms of pca. on the other hand, by depending on the reports of radiology, these slices with no biopsy consider as a negative sample while slices with biopsy consider as a positive sample based on pathology reports [19]. ahmed and mohammed: prostate cancer diagnosis 44 uhd journal of science and technology | jan 2021 | vol 5 | issue 1 cahyaningrum et al., 2020, proposed a method of artificial neural network (ann) that optimized by genetic algorithm (ga) for pca detection. this approach gives 76.4% of accurate detection. ann has some limitations because of involving a huge number of parameters. consequently, there have been many efforts to fix some of these limitations by joining ann with another algorithm to address this problem. ga is an algorithm that is compatible and adapted with the ann algorithm [20]. besides, duran-lopez et al., 2020, presented a novel cad system based on a deep learning algorithm (cnn) for distinguishing between malignant and normal tumors in whole slide images (wsis). cross-validation technique has been used with patches extracted from ws images. in this approach, the higher accuracy rate has achieved 99.98% [21]. liu et al., 2020, stated a model of deep learning that integrates s-mask r-cnn with inception-v3 in ultrasound images to diagnose pca. furthermore, the auc for inception-v3 is 0.91. according to this model, there is a lot of traditional classifiers that can be used such as svm and k-nearest neighbor. due to minor variation between the ultrasonic pca images and serious noise interface, some miss classifications might happen. therefore, the cnn was presented in deep learning to achieve the best improvement of the classification accuracy in ultrasound images of the prostate without needing to describe the features manually and target image extraction [22]. 3. methodology this review paper has conducted various studies in the field of pca that is based on machine and deep learning techniques. first, the datasets that have been used by the researchers are described in subsection a. while in subsection b, the methodologies that are related to the performance accuracy are explained. 3.1. datasets different datasets that have been used by researchers were investigated. table 1 shows the modality of the dataset types that have been used by the authors in this survey. moreover, sample numbers associated with each patient were given. 3.2. performance parameter criteria in the classification process, performance measurement is very important and essential, which determines the accuracy of the model. for this purpose, receiver operating table 1: datasets types with number of samples authors modality no. of samples mohapatra and chakravarty[11] microarray 136 dash et al. [13] microarray 136 ram et al. 
[15] microarray gene expression omnibus (gse71783) 30 sun et al. [16] mri (t2w, dwi and dce) 5 liu and an [17] mri (dwi) 200 reda et al. [18] mri (dwi) 23 bhattacharjee et al. [3] microscopic tissue images 400 cahyaningrum et al. [20] microarray gene expression 102 duran-lopez et al. [21] wsi (whole slide image) 97
characteristic (roc) curve and auc are proposed as effective evaluation metrics for the performance of a classification model. in statistics, a roc curve can be defined as a graphical plot that illustrates the performance of a binary classifier as its discrimination threshold is varied. the curve is created by plotting the true-positive rate (tpr) against the false-positive rate (fpr) at varied threshold settings. in machine learning, the terms recall, sensitivity, and detection probability have the same meaning as tpr, while the terms fallout and false alarm probability have the same meaning as fpr, which can be calculated as (1 − specificity) [23]. fig. 1 illustrates the relation between the roc and the auc. furthermore, the roc is the probability curve, whereas the auc is the degree of separability of the classes. the roc indicates how capable the model is of distinguishing amongst classes, and a higher auc value (between 0 and 1) indicates a more accurate model.
this survey compares the techniques based on the accuracy of the proposed methods. a confusion matrix (cm) with performance metrics such as specificity and sensitivity is used to evaluate the proposed models [24]. the cm output can be either binary or multiclass. it is a table of the four different combinations of actual and predicted values, where predicted values are produced by the model while actual values are those found in the dataset. fig. 2 shows the cm relations. the following formulas describe the performance accuracy metrics based on tp, tn, fp, and fn, according to the cm.
tp – values that are actually positive and predicted positive.
fp – values that are actually negative but predicted positive.
tn – values that are actually negative and predicted negative.
fn – values that are actually positive but predicted negative.
tpr (sensitivity or recall) = tp / (tp + fn) (1)
specificity = tn / (tn + fp) (2)
fpr = 1 − specificity = fp / (fp + tn) (3)
accuracy = (tp + tn) / (tp + tn + fp + fn) (4)
tp and tn represent the numbers of correctly predicted positive and negative samples, while fp and fn represent the numbers of incorrectly predicted positive and negative samples [25].
4. comparison and discussion
many different techniques have been used by researchers, and each technique used a particular type of dataset. here, we compare the methods based on their accuracy together with the dataset types and the year of publication, as shown in table 2. fig. 3 shows the accuracy of each technique separately. finally, an evaluation of five techniques based on the auc has been performed to show the accuracy of the best technique, as depicted in fig. 4. as a result, according to the auc measurements, the inception-v3 classifier has the highest score for auc, which table 2: techniques with accuracy s. no. references year methods accuracy percent 1 mohapatra and chakravarty [11] 2015 svm 95.5 2 dash et al. [13] 2016 wkrr 97 3 ram et al. [15] 2017 irf 73.3 4 sun et al. [16] 2017 svm 80.5 5 liu and an [17] 2017 cnn 78.15 6 reda et al. [18] 2018 cnn 95.65 7 bhattacharjee et al. [3] 2019 cnn 88.7 8 cahyaningrum et al.
[20] 2020 ann-ga 76.4 9 duran-lopez et al. [21] 2020 cnn 99.98 fig. 2. confusion matrix combinations. fig. 1. area under the curve-receiver operating characteristic curve. fig. 3. accuracy comparison of each technique. ahmed and mohammed: prostate cancer diagnosis 46 uhd journal of science and technology | jan 2021 | vol 5 | issue 1 is 0.91, although the type and quality of the dataset affect the ratio of the auc scale. 5. conclusion this paper has introduced a comparison of classification methods based on machine learning techniques of the research related to pca using various datasets including (microar ray, microar ray gene expression omnibus (gse71783), mri (t2w, dwi, and dce), microscopic tissue images, and wsi. in addition, the methods used in the literature have been reviewed along with the available results of the performance accuracy. the higher value of the auc is identified amongst most five recent papers and it is 0.91. references [1] lemaître, r. martí, j. freixenet, j. c. vilanova, p. m. walker and f. meriaudeau. “computer-aided detection and diagnosis for prostate cancer based on mono and multi-parametric mri: a review”. computers in biology and medicine, vol. 60, pp. 8-31, 2015. [2] t. saba. “recent advancement in cancer detection using machine learning: systematic survey of decades, comparisons and challenges”. journal of infection and public health, vol. 13, no. 9, pp. 1274-1289, 2020. [3] s. bhattacharjee, h. g. park, c. h. kim, d. prakash, n. madusanka, j. h. so, n. h. cho and h. k. choi. “quantitative analysis of benign and malignant tumors in histopathology: predicting prostate cancer grading using svm”. applied sciences, vol. 9, no. 15, 2019. [4] s. liu, h. zheng, y. feng and w. li. “prostate cancer diagnosis using deep learning with 3d multiparametric mri”. spie proceedings, vol. 10134, pp. 3-6, 2017. [5] n. aldoj, s. lukas, m. dewey and t. penzkofer. “semi-automatic classification of prostate cancer on multi-parametric mr imaging using a multi-channel 3d convolutional neural network”. european radiology, vol. 30, no. 2, pp. 1243-1253, 2020. [6] b. abraham and m. s. nair. “automated grading of prostate cancer using convolutional neural network and ordinal class classifier”. informatics in medicine unlocked, vol. 17, p. 100256, 2019. [7] l. a. torre, b. trabert, c. e. desantis, k. d. miller, g. samimi, c. d. runowicz, m. m. gaudet, a. jemal, r. l. siegel. “ovarian cancer statistics, 2018”. ca: a cancer journal for clinicians, vol. 68, no. 4, pp. 284-296, 2018. [8] m. arif, i. g. schoots, j. c. tovar, c. h. bangma, g. p. krestin, m. j. roobol, w. niessen and j. f. veenland. “clinically significant prostate cancer detection and segmentation in low-risk patients using a convolutional neural network on multi-parametric mri”. european radiology, vol. 30, pp. 6582-6592, 2020. [9] l. brunese, f. mercaldo, a. reginelli and a. santone. “formal methods for prostate cancer gleason score and treatment prediction using radiomic biomarkers”. magnetic resonance imaging, vol. 66, pp. 165-175, 2020. [10] r. sammouda, h. aboalsamh and f. saeed. “comparison between k mean and fuzzy c-mean methods for segmentation of near infrared fluorescent image for diagnosing prostate cancer”. international conference on computer vision and image analysis applications, 2015. [11] p. mohapatra and s. chakravarty. “modified pso based feature selection for microarray data classification”. 2015 ieee power, communication and information technology conference, pp. 703709, 2015. [12] s. h. bouazza, n. 
hamdi, a. zeroual and k. auhmani. “geneexpression-based cancer classification through feature selection with knn and svm classifiers”. 2015 intelligent systems and computer vision, 2015. [13] p. mohapatra, s. chakravarty and p. k. dash. “microarray medical data classification using kernel ridge regression and modified cat swarm optimization based gene selection system”. swarm and evolutionary computation, vol. 28, pp. 144-160, 2016. [14] f. imani, s. ghavidel, p. abolmaesumi, s. khallaghi, e. gibson, a. khojaste, m. gaed, m. moussa, j. a. gomez, c. romagnoli, d. w. cool, m. bastian-jordan, z. kassam, d. r. siemens, m. leveridge, s. chang, a. fenster, a. d. ward and p. mousavi. “fusion of multi-parametric mri and temporal ultrasound for characterization of prostate cancer: in vivo feasibility study”. medical imaging 2016: computer-aided diagnosis, vol. 9785, p. 97851k, 2016. [15] m. ram, a. najafi and m. t. shakeri. “classification and biomarker genes selection for cancer gene expression data using random forest”. the iranian journal of pathology, vol. 12, no. 4, pp. 339347, 2017. [16] y. sun, h. reynolds, d. wraith, s. williams, m. e. finnegan, c. mitchell, d. murphy, m. a. ebert and a. haworth. “predicting prostate tumour location from multiparametric mri using gaussian kernel support vector machines: a preliminary study”. physical and engineering sciences in medicine, vol. 40, no. 1, pp. 39-49, 2017. [17] y. liu and x. an. “a classification model for the prostate cancer based on deep learning,” proceedings of the 10th international congress on image and signal processing, biomedical engineering and informatics, cisp-bmei 2017, pp. 1-6, 2018. [18] i. reda, a. shalaby, m. elmogy, a. a. elfotouh, f. kahalifa, m. a. el-ghar, e. hosseini-asl, g. gimel’farb, n. werghi and a. elbaz. “a new cnn-based system for early diagnosis of prostate cancer”. proceedings international symposium on biomedical imaging, pp. 207-210, 2018. [19] s. yoo, i. gujrathi, m. a. haider and f. khalvati. “prostate cancer fig. 4. receiver operating characteristic curve for five classifiers. ahmed and mohammed: prostate cancer diagnosis uhd journal of science and technology | jan 2021 | vol 5 | issue 1 47 detection using deep convolutional neural networks”. scientific reports, vol. 9, no. 1, pp. 1-10, 2019. [20] k. cahyaningrum, adiwijaya and w. astuti. “microarray gene expression classification for cancer detection using artificial neural networks and genetic algorithm hybrid intelligence,” 2020 international conference on data science and its applications, 2020. [21] l. duran-lopez, j. p. dominguez-morales, a. f. conde-martin, s. vicente-diaz and a. linares-barranco. “prometeo: a cnnbased computer-aided diagnosis system for wsi prostate cancer detection”. ieee access, vol. 8, pp. 128613-128628, 2020. [22] . liu, c. yang, j. huang, s. liu, y. zhuo and x. lu. “deep learning framework based on integration of s-mask r-cnn and inception-v3 for ultrasound image-aided diagnosis of prostate cancer”. future generation computer systems, vol. 114, pp. 358-367, 2021. [23] a. z. shirazi, s. j. s. mahdavi chabok and z. mohammadi. “a novel and reliable computational intelligence system for breast cancer detection”. medical and biological engineering and computing, vol. 56, no. 5, pp. 721-732, 2018. [24] m. nour, z. cömert and k. polat. “a novel medical diagnosis model for covid-19 infection detection based on deep features and bayesian optimization”. applied soft computing, vol. 97, pp. 1-13, 2020. [25] y. celik, m. talo, o. 
yildirim, m. karabatak and u. r. acharya. “automated invasive ductal carcinoma detection based using deep transfer learning with whole-slide images”. pattern recognition letters, vol. 133, pp. 232-239, 2020. tx_1:abs~at/tx_2:abs~at uhd journal of science and technology | april 2017 | vol 1 | issue 1 33 1. introduction ultrasound imaging regarded as a boon to study the internal tissues of a human body for different purposes, especially for the pregnant women because of its several advantages as comparing with computed tomography (ct), magnetic resonance imaging (mri), and positron emission technology (ransley 1990). however, it is consider as one of the best methods of diagnostic scanning tools but the presence of multiplicative speckle noise which is difficult to model in real time that affects the visual quality of the ultrasound images, especially in two dimensional (2d) ultrasound. the extensive research done by the researchers at device level led to the introduction of the three dimensional and four dimensional [3]. ultrasound scanning is one of the best methods for detecting many diseases related to the internal organs, one of the most important defects that can be diagnosed clearly by ultrasound scanning is pelviureteric junction (puj) dilatation. congenital and acquired (puj) obstruction can be treated with balloon dilatation, using a fogarty/gruntzig catheter introduced through the cystoscope in children. mahant et al. [4] stated that puj dysfunction is one of the common causes of renal hydronephrosis. other causes, which are usually associated with hydroureter as well hydronephrosis, include bladder pathology vesicoureteric advanced methods for detecting pelviureteric junction dilatation by two dimensional ultrasound heamn noori abduljabbar and sardar yaba perxdr department of physics, college of education, shaqlawa slahaddin university, erbil, iraq a b s t r a c t pelviureteric junction (puj) obstruction is a condition frequently encountered in both adult and pediatric patients. congenital abnormalities and crossing lower-pole renal vessels are the most common underlying pathologies in both men and women. there are different methods for detecting it the most usual, safe, and easy one is by ultrasound scanning. the aim of this study is how to improve the image quality of two dimensional (2d) ultrasound screening of detecting puj dilatation using image processing software, image enhancement, and different types of filters, then making comparison which filter is the best to improve the image quality, that helps the medical doctors, and sonographers to make the correct decision and diagnosis. 1357 patients scanned by ultrasound in harer general hospital for general abdominal scanning, 987 cases among them have detected as urinary tract infection cases among this 987 case there were 73 case of them with puj dilatation. the 2d ultrasound images saved, after making image enhancement and using different types of filters (he, adh, clahe, and wiener) to enhance four 2d ultrasound images of abnormal kidneys, the result was in each type of filters there were some advantages and disadvantages, so that the best type of filters are (he and adh) because the puj and pelvis is much more clear and easy to define after using these kinds of filters. 
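for readers who want to reproduce this kind of filter comparison, the following is a small, hedged sketch (not the authors' code) of applying histogram equalization, adaptive histogram equalization, clahe, and a wiener filter to a grayscale ultrasound image with scikit-image and scipy; the file name and parameter values are assumptions for illustration only.

```python
# illustrative sketch only: the four enhancement filters compared in this study,
# approximated with scikit-image and scipy; file name and parameters are assumptions.
from skimage import io, img_as_float, exposure
from scipy.signal import wiener

img = img_as_float(io.imread("kidney_2d_ultrasound.png", as_gray=True))   # hypothetical file

he = exposure.equalize_hist(img)                            # he: global histogram equalization
adh = exposure.equalize_adapthist(img, clip_limit=1.0)      # adh: adaptive he (clipping effectively off)
clahe = exposure.equalize_adapthist(img, clip_limit=0.03)   # clahe: contrast-limited adaptive he
wnr = wiener(img, (5, 5))                                    # wiener: adaptive noise-reduction filter

for name, out in [("he", he), ("adh", adh), ("clahe", clahe), ("wiener", wnr)]:
    print(name, float(out.min()), float(out.max()))
```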
index terms: adh, clahe, he, image enhancement, image processing, pelviureteric junction, wiener corresponding author’s e-mail: heamn.abduljabbar@su.edu.krd received: 10-03-2017 accepted: 25-03-2017 published: 12-04-2017 access this article online doi: 10.21928/uhdjst.v1n1y2017.pp33-37 e-issn: 2521-4217 p-issn: 2521-4209 copyright © 2017 abduljabbar and perxdr. this is an open access article distributed under the creative commons attribution non-commercial no derivatives license 4.0 (cc by-nc-nd 4.0) o r i g i n a l re se a rc h a rt i c l e uhd journal of science and technology heamn noori abduljabbar and sardar yaba perxdr: advanced methods for detecting puj dilatation by 2d ultrasound 34 uhd journal of science and technology | april 2017 | vol 1 | issue 1 obstruction and vesicoureteric reflux. common causes of puj dysfunction include intrinsic stenosis is indicated in patients with pain, infection or hematuria, fever, back pain, dysuria, and hypertension in severe cases. surgery may be performed open or laparoscopically. 2. anatomy a. location kidneys are located on the posterior abdominal wall, with one on either side of the vertebral column and perirenal space. kidneys normally are lie on the quadratus lumborum muscles, the long axis of the kidney is parallel to the lateral border of the psoas muscle, and can be visualized in the first trimester by transabdominal scan at 12-13 weeks. in addition, the kidneys lie at an oblique angle, that is the superior pole is more medial and anterior than inferior renal pole. right kidney usually lies slightly lower than left kidney due to the right lobe of the liver (bems 2014). b. structure renal shape looks like a bean-shaped with a superior and an inferior pole. the mid portion of the kidney is often called the midpole. in adults, each kidney is normally weighs 150-260 g and 10-15 cm in length, 3-5 cm in width. the left kidney is usually slightly larger than right. moreover, kidney has a fibrous capsule, which is surrounded by pararenal fat. the kidney itself can be divided into renal parenchyma, consisting of renal cortex and medulla, and the renal sinus containing renal pelvis, calyces, renal vessels, nerves, lymphatics, and perirenal fat. renal parenchyma consists of two layers: cortex and medulla. renal cortex lies peripherally under the capsule while the renal medulla consists of 10-14 renal pyramids (renal filters), which are separated from each other by an extension of renal cortex called renal columns. urine is produced in the renal lobes, which consists of the renal pyramid with associated overlying renal cortex and adjacent renal columns. each renal lobe drains at a papilla into a minor calyx, four or five of these unite to form a major calyx. each kidney normally has two or three major calyxes, which unite to form the renal pelvis, furthermore proximal ureter is connecte or started from renal pelvis which make a puj. the renal hilum is the entry to the renal sinus and lies vertically at the anteromedial aspect of the kidney. it contains the renal vessels and nerves, fat and the renal pelvis, which typically emerges posterior to the renal vessels, with the renal vein being anterior to the renal artery. renal function is removing excess water, salts, and wastes of protein metabolism from the blood. diagnostic imaging of kidney: • x-ray • mri • ct scan • ultrasound. 3. 
methodology ultrasound is used to detect urinary tract infections (uti), one of the most important renal diseases is puj dilatation which can be defined by ultrasound clearly if the image quality is high, in this study 2d ultrasound used for scanning, the scanning procedure starts using convex probe (transducer), with 3.5 mhz frequency, time gain compensation total gain compensator should be adjusted according to each patient, then applying translucent gel to the patient abdominal skin who requested to be in supine position. scanning point for the kidneys starts from both sides of the abdomen, which called also upper loin, scanning windows should be from three points as general, anterioposterily, lateromedially, and posteriomedially. with oreintations and angulations to obtain the best image in which puj should be clearly defined. 1357 patients scanned by ultrasound in harer general hospital/erbil/kurdistan/iraq for general abdominal scanning, 987 cases among them have detected as uti cases among this 987 case there were 73 case of them with puj dilatation. the 2d ultrasound images saved and transmitted to a computer to be processed by applying different types of filters to improve image quality, and finally comparing between filters types to choose the best type with best image quality of detecting puj dilatation. different types of filters (he, adh, clahe, and wiener) used to enhance 2d ultrasound images of (4) abnormal kidneys making image enhancement. the contrast enhances techniques performed through some operations such as point operations are referred to as graylevel transformations or spatial transformations. they can be expressed as: g(x,y) = t[f(x,y)] (1) where g(x,y) is the processed image, f(x,y) is the original image, and t is an operator on f(x,y). since the actual coordinates do not play any role in the way the transformation function processes the original image, equation (1) can be rewritten as: s = t[r] (2) heamn noori abduljabbar and sardar yaba perxdr: advanced methods for detecting puj dilatation by 2d ultrasound uhd journal of science and technology | april 2017 | vol 1 | issue 1 35 where r is the original gray level of a pixel and s is the resulting gray level after processing. point transformations may be linear (e.g., negative), piecewise linear (e.g., gray-level slicing), or nonlinear (e.g., gamma correction). contrast adjustment is one of the most common applications of point transformation functions (also known by many other names such as contrast stretching, gray-level stretching, contrast adjustment, and amplitude scaling). one of the most useful variants of contrast adjustment functions is the automatic contrast adjustment (or simply autocontrast), a point transformation that—for images of class uint 8 in matlab—maps the darkest pixel value in the input image to 0 and the brightest pixel value to 255 and redistributes the intermediate values linearly. the autocontrast function can be described as follows: min max min l-1 s= ( r r ) r r − − (3) where r is the pixel value in the original image (in the [0, 255] range), r max and r min are the values of its brightest and darkest pixels, respectively, s is the resulting pixel value, and l-1 is the highest gray value in the input image (usually l = 256). matlab has a built-in function imadjust to perform contrast adjustments. 
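equation (3) above is the usual min-max contrast stretch; a minimal numpy sketch of that autocontrast mapping, offered as an illustration rather than the authors' implementation, is given below (it assumes an 8-bit grayscale image, so l = 256).

```python
# minimal sketch of the autocontrast (min-max stretch) mapping of equation (3);
# assumes an 8-bit grayscale image, so l = 256.
import numpy as np

def autocontrast(img, levels=256):
    r = img.astype(np.float64)
    r_min, r_max = r.min(), r.max()
    if r_max == r_min:                      # flat image: nothing to stretch
        return img.copy()
    s = (levels - 1) * (r - r_min) / (r_max - r_min)
    return np.round(s).astype(np.uint8)
```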
power law (gamma) transformations is given by the following transformation function: s = c • rγ (4) where r is the original pixel value, s is the resulting pixel value, c is a scaling constant, and γ is a positive value. histog ram equalization is one of the well-known enhancement techniques. in histogram equalization, the dynamic range and contrast of an image is modified by altering the image such that its intensity histogram has a desired shape. this is achieved using cumulative distribution function as the mapping function. the intensity levels are changed such that the peaks of the histogram are stretched and the troughs are compressed. if a digital image has n pixels distributed in l discrete intensity levels and n k is the number of pixels with intensity level i k and then the probability density function of the image is given by equation (5). the cumulative density function is defined in equation (6): k i k n f ( i )= n (5) k k k i jj=0 f ( i ) f ( i )=∑ (6) although this method is simple, since the gray values are physically far apart from each other in the image. due to this reason, histogram equalization gives very poor result images (sasi and jayasree, 2013). adaptive histogram equalization (ahe) is a method of contrast enhancement. it is different from ordinary histogram equalization. in adaptive method, many histograms are computed where each histogram corresponds to a different section of image. hence, ahe improves the local contrast of an image and more details can be observed. with this method, information of all intensity ranges of the image can be viewed simultaneously. there are many ordinary display devices that are not able to depict the full dynamic intensity range. this method brings a solution to this problem. other advantages include that it is automatic (i.e., no manual intervention is required) and reproducible from study to study [doi, kunio 1996] (doi 1996). apart from the advantage of local enhancement, ahe method has some limitations also. this method works too slowly on a general purpose computer although it works correctly. as enhancement is carried out in a local area, ahe tends to over enhance the noise content (gupta and kaur, 2014). 4. result and discussion four different types of filters used to enhance the images which are (he, adh, clahe, and wiener) after image processing completed results showed that (he, and adh) is the best method to be used for detecting this renal problem as in (he) filter puj, renal cortex, and renal pelvis is very clear to be defined in all of the images, furthermore using (adh) puj, renal cortex and renal pyramids can be clearly defined. while pictures that used (clahe), and (wiener) types of filter shows poor quality of the puj borders, and somewhat blurring with more noises compared with (he and adh). as the figures show in fig. 1-8. 5. conclusion image processing is a revolution in imaging as general, especially in the medical imaging due to the options that heamn noori abduljabbar and sardar yaba perxdr: advanced methods for detecting puj dilatation by 2d ultrasound 36 uhd journal of science and technology | april 2017 | vol 1 | issue 1 fig. 1. original two-dimensional ultrasound image shows simple (mild) pelviureteric junction dilatation fig. 3. original two-dimensional ultrasound image shows mild–moderate dilatation in pelviureteric junction fig. 5. original two-dimensional ultrasound image shows moderate dilatation of pelviureteric junction fig. 4. (a-d) image treated with the four filter types dc ba fig. 2. 
(a-d) image treated with the four filter types dc ba fig. 6. (a-d) image treated with the four filter types dc ba heamn noori abduljabbar and sardar yaba perxdr: advanced methods for detecting puj dilatation by 2d ultrasound uhd journal of science and technology | april 2017 | vol 1 | issue 1 37 provides to medical doctors and sonographers, so that in this study after saving 2d ultrasound images, different types of filters used to enhance those images, and improve its quality to be easier for diagnosis four different abnormal images of puj dilatation selected from all the patients who scanned, each case with different level of dilation from mild to moderate and sever. references [1] j. r. lindner, j. song, f. xu, a. l. klibanov, k. singbartl, k. ley and s. kaul. “noninvasive ultrasound imaging of inflammation using microbubbles targeted to activated leukocytes.” circulation, vol. 102. no. 22, pp. 2745-2750, 2000. [2] p. g. ransley, h. k. dhillon, i. gordon, p. g. duffy, m. j. dillon and t. m. barratt. “the postnatal management of hydronephrosis diagnosed by prenatal ultrasound.” the journal of urology, vol. 144, pp. 584-587, 1990. [3] t. d. brandt, h. l. neiman, m. j. dragowski, w. bulawa and g. claykamp. “ultrasound assessment of normal renal dimensions.” journal of ultrasound in medicine, vol. 1. no. 2, pp. 49-52, 1982. [4] s. mahant, j. friedman and c. macarthur. “renal ultrasound findings and vesicoureteral reflux in children hospitalised with urinary tract infection.” archives of disease in childhood, vol. 86. no. 6, pp. 419-420, 2002. [5] j. s. berns, d. h. ellison, s. l. linas and m. h. rosner. “training the next generation’s nephrology workforce.” clinical journal of the american society of nephrology, vol. 9. no. 9, pp. 1639-1644, 2014. [6] a. p. barker, m. m. cave, d. f. thomas, r. j. lilford, h. c. irving, r. j. arthur and s. e. smith. “fetal pelvi-ureteric junction obstruction: predictors of outcome.” british journal of urology, vol. 76. no. 5, pp. 649-652, 1995. [7] r. aviram, a. pomeranz, r. sharony, y. beyth, v. rathaus and r. tepper. “the increase of renal pelvis dilatation in the fetus and its significance.” ultrasound in obstetrics and gynecology, vol. 16. no. 1, pp. 60-62, 2000. [8] h. p. duong, a. piepsz, k. khelif, f. collier, k. de man, n. damry, f. janssen, m. hall and k. ismaili. “transverse comparisons between ultrasound and radionuclide parameters in children with presumed antenatally detected pelvi-ureteric junction obstruction.” european journal of nuclear medicine and molecular imaging, vol. 42. no. 6, pp. 940-946, 2015. [9] p. masson, g. de luca, n. tapia, c. le pommelet, a. es sathi, k. touati, a tizeggaghine and p. quetin. “postnatal investigation and outcome of isolated fetal renal pelvis dilatation.” archives de pediatrie, vol. 16. no. 8, pp. 1103-1110, 2009. [10] c. hong-lin. “surgical indications for unilateral neonatal hydronephrosis in considering ureteropelvic junction obstruction.” urological science, vol. 25. no. 3, pp. 73-76, 2014. [11] n. a. patel and p. p. suthar. “ultrasound appearance of congenital renal disease: pictorial review.” the egyptian journal of radiology and nuclear medicine, vol. 45. no. 4, pp. 1255-1264, 2014. [12] k. doi. “current status and future potential of computer-aided diagnosis in medical imaging.” the british journal of radiology, vol. 78. no. 1, pp. s3-s19, 2014. [13] k. doi. 
“current research and future potential of computer-aided diagnosis (cad) in radiology: introduction to overviews at five institutions around the world.” medical imaging technology, vol. 14. no. 6, pp. 621, 1996. fig. 7. original two-dimensional ultrasound image shows sever dilatation in the pelviureteric junction fig. 8. (a-d) image treated with the four filter types dc ba tx_1~abs:at/tx_2:abs~at 56 uhd journal of science and technology | jan 2021 | vol 5 | issue 1 1. introduction optical character recognition (ocr) is an application for image recognition that can read text from images. this can be achieved by taking an image that includes text, written in a specific language to be understood by the computer, and get the final computer representation for the text. ocr techniques may vary according to the nature of the language and the purpose of the ocr application [1]. emulating the human ability in associating symbolic identities with images of characters at a fast rate is the main goal of ocr. kurdish is one of the languages that present many challenges to ocr. the main challenge in kurdish is that it is mostly cursive. kurdish is written by connecting characters together to produce words or parts of words, as shown in fig. 1. kurdish text is written from right to left. the kurdish language has 34 basic characters, of which 14 have from one to three dots and four of them have diacritics. kurdish characters have many shapes and sizes depending on their position in the word. for example, the character is written in the form of ”ک“ “ ,at the start ”ک“ ” in the middle, and “ ” at the end of a word but the separated form of this character is “ ک.” kurdish characters vary with relevance to their position in the word, representing a great challenge for ocr [1]. based on the nature of kurdish fonts, characters may overlap vertically to produce certain compounds of letters at certain positions of the word segments, such as “ال, and نج,” which can be represented by single atomic graphemes called ligatures [3], [4]. kurdish text segmentation using projectionbased approaches tofiq ahmed tofiq, jamal ali hussien department of computer science, college of science, university of sulaimani, sulaimani, iraq a b s t r a c t an optical character recognition (ocr) system may be the solution to data entry problems for saving the printed document as a soft copy of them. therefore, ocr systems are being developed for all languages, and kurdish is no exception. kurdish is one of the languages that present special challenges to ocr. the main challenge of kurdish is that it is mostly cursive. therefore, a segmentation process must be able to specify the beginning and end of the characters. this step is important for character recognition. this paper presents an algorithm for kurdish character segmentation. the proposed algorithm uses the projection-based approach concepts to separate lines, words, and characters. the algorithm works through the vertical projection of a word and then identifies the splitting areas of the word characters. then, a post-processing stage is used to handle the over-segmentation problems that occur in the initial segmentation stage. the proposed method is tested using a data set consisting of images of texts that vary in font size, type, and style of more than 63,000 characters. the experiments show that the proposed algorithm can segment kurdish words with an average accuracy of 98.6%. 
index terms: optical character recognition, character segmentation, kurdish text segmentation, projection-based approach, and cursive writing optical character recognition corresponding author’s e-mail:  tofiq ahmed tofiq, department of computer science, college of science, university of sulaimani, sulaimani, iraq. e-mail: tofiq.ahmad@univsul.edu.iq received: 23-01-2021 accepted: 12-05-2021 published: 16-05-2021 access this article online doi: 10.21928/uhdjst.v5n1y2021.pp56-65 e-issn: 2521-4217 p-issn: 2521-4209 copyright © 2021 al-janabi, et al. this is an open access article distributed under the creative commons attribution non-commercial no derivatives license 4.0 (cc by-nc-nd 4.0) o r i g i n a l re se a rc h a rt i c l e uhd journal of science and technology tofiq ahmed tofiq and jamal ali hussien: kurdish text segmentation using projection-based approaches uhd journal of science and technology | jan 2021 | vol 5 | issue 1 57 some kurdish letters have single dots such as ج ,ن and ب, other letters have double dots, such as ش and ق and other have triple dots, such as ش and ڤ. furthermore, some letters have diacritics such as ڵ ,ۆ, and ێ. besides, the same shape with different dots and diacritics is used to represent different letters, such as, ۆ ,ل,ڵ ,چ ,ج ,ح, and و. the doted characters and letters with diacritics present a big challenge while being processed. this paper presents a kurdish text segmentation algorithm. the proposed algorithm uses the projection-based approach concepts to separate lines, words, and letters. the rest of the paper is organized as follows: section 2 segmentation-based methods, section 3 related works with different segmentation techniques. section 4 presents the proposed algorithm. section 5 demonstrates the results and performance analysis. section 6 concludes this paper. 2. segmentation-based methods in this part, the proposed algorithm for the segmentation of cursive text such as arabic, persian, and english handwriting text is discussed. the segmentation-based methods can be classified into: a. projection profile b. character skeleton based c. contour tracing based d. template matching based e. morphological operations based f. segmentation based on neural networks (nns) [2], [5]. projection profiles methods [9]-[11] are usually used to aim for lines, words, sub-words, and characters segmentation specifically when there is a certain gap between lines, words, sub-words, and characters. indeed, horizontal projection is used for line segmentation, while vertical projection is usually used for word, sub-word, and character segmentation. in the skeleton method, different thinning techniques are engaged for this goal [7], [12]. in many cases, the shape of the characters is different from the main character after thinning, making the splitting process more difficult. in contour tracking [13]-[16] methods, pixels that represent the external shape of a character or word are extracted. researchers used many methods to determine the cut points in the contour. in general, the contour-based technique avoids the issues that seem once applying to the thinning methods because they depend on extracting the structure of the word, which provides an obvious description of it. this type of method is sensitive to noise, which needs one to perform some preprocessing steps. morphological methods [17]-[19] use a set of morphological operations for segmentation. in general, closing and opening operations are applied. 
these methods are dependent, meaning that other techniques must be used in addition to segmentation. template matching methods [20], [21] often apply a sliding window over baselines. if any match is found, then the center pixel in the sliding window is considered as a cutting point. the main limitation of this method is when the cutting point locates under the baseline. finally, in nns segmentation, nns are used to verify the valid segmentation points by training the nns over manually classified valid segmentation points from the database of scanned images using features such as black pixel density and holes [22]. 3. related works zheng et al. [10] proposed a machine-printed arabic character segmentation algorithm that uses a vertical projection method and some rules or features, such as structural characteristics between background area and character components and the specification of isolated arabic characters to find segmentation points. cheung et al. [6] proposed a segmentation algorithm that uses a technique wherein the overlapping arabic words/ sub-words are horizontally separated, extensively utilizing a feedback loop among the character segmentation and final recognition stages. in the segmentation stage, a series of experimental lines have been produced in two processes, the first process uses amin’s character segmentation algorithm [21] and the second procedure use the convex dominant points detection algorithm developed by bennamoun and boashash [23]. shaikh et al. [7] propound an algorithm for sindhi text segmentation. the height profile vector (hpv) was used for the character extraction. more analysis was done over hpv to detect the locations of the possible segmentation points (psps), in some cases, the algorithm failed by performing under or over-segmentation. yeganeh et al. [13] introduced a segmentation algorithm for up and down contours based on qualified labeling, and the algorithm was developed for multifont farsi/arabic fig. 1. the character connectivity of kurdish text. tofiq ahmed tofiq and jamal ali hussien: kurdish text segmentation using projection-based approaches 58 uhd journal of science and technology | jan 2021 | vol 5 | issue 1 texts. the contour of the sub-word is measured using a convolution kernel with a laplacian edge recognition-based segmentation detection method. the algorithm goes through several stages including contour labeling of each sub-word, contour curvature grouping to improve the segmentation results, character segmentation, adaptive local baseline, and post-processing, the results showed that 97% of characters of the printed farsi texts were segmented correctly. mostafa [24] proposed a segmentation method for printed arabic text, especially for “simplified arabic” font with different sizes. most characters start with and end before a t-junction on the baseline, that is, the main rule used in this. this rule was fine for most characters, except for some special characters such as “س” and “ش,” which had a special solution. the algorithm was tested and achieved a 96.5% of good segmentation accuracy. alipour [8] presents an improved segmentation technique for persian text where a few structural features were used to regulate the relevant segment to increase the quality of segmentation. 
the vertical projection was used to extract the word segments over the baseline; dots and diacritics were not considered. the segments were then adjusted in an additional step by merging small fragments. this step was needed in the cases where one character is split into more than one segment, such as "س" and "ش."
4. the proposed algorithm
our technique is a segmentation-based approach, which contains three main segmentation stages, as shown in fig. 2. the proposed method takes a binary image that has multiple lines of text and executes several image processing methods to finally segment the characters in the image. in this method, the segmentation is performed at three levels: line segmentation, word/sub-word segmentation, and character segmentation. this work focused only on the line, word/sub-word, and character segmentation steps, considering that the input image has been preprocessed (by applying operations such as binarization, noise removal, and separating text from non-text regions). the image binarization used a global thresholding rule: the binary pixel g(x, y) is assigned one binary level when f(x, y) < t and the other binary level when f(x, y) ≥ t, where (x, y) is the coordinate of the pixel, t represents the threshold value, g(x, y) represents the binary image pixels, and f(x, y) represents the gray-level image pixels.
4.1. line segmentation
line segmentation is achieved by the horizontal projection of the binarized image. the horizontal projection i_j is calculated by summing up the pixel values p(i, j) along the x-axis for each y value, as shown in equation (1):
i_j = Σ_{i=0}^{n−1} p(i, j), for each j ∈ {0, …, m−1} (1)
where n and m are, respectively, the width and height of the image and p(i, j) is the pixel's value at the index (i, j). this projection carries information about the text lines, which are indicated by areas of white intensity, as shown in fig. 3. white intensities indicate the text areas that contain text, and black intensities show the gaps between the text lines.
fig. 2. the major steps of the proposed segmentation approach.
fig. 3. horizontal projection of an input image that contains text lines.
fig. 3 shows the line segmentation method that accepts an image of text written in the kurdish language and separates its lines. the horizontal projection technique does this in two stages: the first is to locate each group of white intensities in the horizontal projection, and the second is to find the line position that separates the lines from each other. to find the indicator line's position, we use the index of the last row of one group of white intensities (point 1) and the index of the first row of the next group of white intensities (point 2) and calculate the distance between these two points. the line position is in the middle of these two points (point 1 and point 2).
4.2. word/sub-word segmentation
after the line segmentation stage, the subsequent stage is word segmentation. the method used for word segmentation is based on the connected component method. the algorithm takes a binary text line image of kurdish text without dots and diacritics as input, and the result is a word/sub-word segmented image as output. the steps of word/sub-word segmentation are described in detail below.
4.2.1. find the connected components
in this step, the text line image from the previous section is used to find the connected components.
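for illustration, the binarization and horizontal-projection steps of section 4.1 can be sketched in a few lines of numpy; the 0/1 pixel convention and the helper names below are assumptions, not the authors' implementation.

```python
# illustrative sketch of section 4.1: horizontal projection and line splitting;
# assumes text pixels are 1 and background pixels are 0 after binarization.
import numpy as np

def binarize(gray, t=128):
    # g(x, y) compared against the threshold t; the 0/1 assignment is an assumption
    return (gray < t).astype(np.uint8)

def horizontal_projection(binary):
    # i_j = sum over i of p(i, j): one value per image row (equation (1))
    return binary.sum(axis=1)

def line_boundaries(binary):
    """return (start_row, end_row) for each group of non-zero projection rows (text lines);
    cut positions can then be taken midway between consecutive groups."""
    proj = horizontal_projection(binary)
    rows = np.flatnonzero(proj > 0)
    if rows.size == 0:
        return []
    breaks = np.flatnonzero(np.diff(rows) > 1)          # gaps between groups of white intensities
    starts = np.concatenate(([rows[0]], rows[breaks + 1]))
    ends = np.concatenate((rows[breaks], [rows[-1]]))
    return list(zip(starts.tolist(), ends.tolist()))
```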
a connected component is every component that has adjacent pixels that are connected. fig. 4(a) shows an example of connected components with boundaries. in the first version of the connected component result, all components are selected but for better word/sub-word extraction dots and diacritics must be ignored. to do so, the baseline of text lines must be found. a baseline is a fictitious line that follows and joins the lower and upper parts of the character bodies [35]. the baseline is the maximum value from the horizontal projection. the index of the max value determines the location of the baseline in the text line image. fig. 4(a) shows an example of baseline detection that shows by the horizontal line that crosses the whole word. in continuation, using the baseline, the connected components are filtered based on whether these components are intersected with the baseline. usually, dots and diacritics location are above or below the baseline, so we can ignore connected component that is not on the baseline. fig. (4a) shows the original image after determining the connected components and the baseline. in fig. (4b), the dots above the baseline are ignored. 4.2.2. word/sub-words extraction for the kurdish script, a connected component either represents a word or a sub-word. this means that a single word may consist of one or more connected components. for extraction, we applied vertical projection to find the space between the words/sub-words (places that the projection is zero). projection along the vertical axis is called the vertical projection. vertical projection is calculated for every column as the sum of all row pixel values inside the column, as shown in equation 2. for each i ∈ 0.n-1 − − = ∑ 1 0 ( , ) m i j i p i j (2) after finding gaps between components, the task is to decide if these gaps are spaces between words or between sub-words. in other words, what is the correct threshold to decide whether the separation space between two sub-words is a gap inside the same word or space between two different words? although these gaps are not constant and can be vary based on the font sizes, font type, or style. to deal with this issue, first, we find the pen size, which is the pen thickness used for the writing of the adjacent two sequential words/sub-words, and evaluate it with the distance between the current consecutive words/ sub-words. calculating the pen size can handle by taking the most frequent value in the vertical projection applied for each sub-word. however, taking the maximum common value from the vertical projection of some single characters like “ا” gives a wrong conjecture of the pen size. therefore, the pen size is calculated by considering the most common value calculated from the horizontal projection. hence, if the maximum frequent value computed from the horizontal projection is fig. 4. (a) all connected component. (b) binary version of ignored dots and diacritics. a b tofiq ahmed tofiq and jamal ali hussien: kurdish text segmentation using projection-based approaches 60 uhd journal of science and technology | jan 2021 | vol 5 | issue 1 greater than the maximum frequent value calculated from the vertical projection, then the pen size is equal to the maximum frequent value calculated from the vertical projection. 
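as a rough illustration of this filtering step (the pen-size rule itself is formalized immediately after this sketch), the code below labels the connected components with scipy and keeps only those whose bounding box crosses the baseline row, that is, the row with the maximum horizontal projection; it is an assumption-level sketch, not the authors' code.

```python
# illustrative sketch of section 4.2.1: drop dots/diacritics by keeping only the
# connected components that intersect the baseline (row of maximum horizontal projection).
import numpy as np
from scipy import ndimage

def components_on_baseline(line_img):
    """line_img: binary text-line image with text pixels = 1 (assumed convention)."""
    baseline_row = int(np.argmax(line_img.sum(axis=1)))    # baseline = peak of horizontal projection
    labels, count = ndimage.label(line_img)
    kept = np.zeros_like(line_img)
    for lab, comp_slice in enumerate(ndimage.find_objects(labels), start=1):
        if comp_slice is None:
            continue
        rows = comp_slice[0]
        if rows.start <= baseline_row < rows.stop:          # bounding box crosses the baseline row
            kept[labels == lab] = 1
    return kept
```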
pen size calculation is formally defined as:
ps(sw) = mfv[vp(sw)] if mfv[hp(sw)] > mfv[vp(sw)], and ps(sw) = mfv[hp(sw)] otherwise
where ps is the pen size, sw indicates the sub-word, hp is the horizontal projection, vp is the vertical projection, and mfv represents the most frequent value. fig. 5 shows an example of a pen size calculation for two cases. after finding the pen size for each sub-word, it is compared with the spaces between the components. if the gap between two adjacent components (ss) is greater than the mean of the ps values of these two adjacent components, then this gap is a space between two separate words (ws); otherwise, this gap is a space between two sub-words (sws) that belong to the same word. fig. 6 shows an example of determining the type of gap between two words/sub-words in the same line. this rule is defined formally as:
sr[sw(i), sw(i+1)] = ws if ss > (ps(i) + ps(i+1)) / 2, and sws otherwise
where sr, ws, sws, ss, and sw are the separation region, the word separation, the sub-word separation, the separation space, and the sub-word, respectively.
fig. 5. finding the most frequent value in both vp and hp; this value is the sub-word pen size.
fig. 6. calculating the distance between two sub-words in the same word and in different words.
4.3. character segmentation
the proposed algorithm for character segmentation is based on the vertical projection. the algorithm consists of two stages. it takes a binary image of a word/sub-word and returns a binary image of segmented characters. each step is explained in detail below.
4.3.1. word/sub-word vertical projection
vertical projection can provide a better definition of a letter's shape, and this method can give us an accurate result. at this stage, we once again find the vertical projection of the word. this technique reveals all the convexities and dents in the word. fig. 7 shows an example of the vertical projection stage.
4.3.2. segmentation areas identification
in this step, the vertical projection, as shown in fig. 8, is examined to identify the segmentation (splitting) areas between letters. the segmentation area between two letters is an area where one letter ends and another letter starts, or where the word ends. we know that the baseline is shared between all letters in a word; that is, the letters are joined by the baseline, so if we find the baseline pixels, we can find the start and the end of a letter, or the area between letters. for different font sizes and font styles, the baseline height, or the pen size, can be found by the most frequent data (mfd) in the vertical projection, as discussed previously. the process starts with finding the mfd in the vertical projection array. after this, we compare the other data in the vp array with the mfd. if the difference is less than 1, we change the data at this index to the mfd, but if the difference is more than 1, we change the value of this cell to zero, because this index is part of the character and not of the splitting area. now, we have an array consisting of mfd and zero values. therefore, the mfd data in the array represent the splitting areas between letters. by finding the start and end index of each group of such data in the array, the beginning and end of the splitting area can be specified.
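a compact sketch of this search over the vertical projection is given below (illustrative only; the variable names and run-detection details are assumptions, not the authors' code); it also places a separator line in the middle of each splitting area, as described in the next paragraph.

```python
# illustrative sketch of section 4.3.2: locate splitting areas from the vertical projection.
import numpy as np

def splitting_separators(word_img):
    """word_img: binary word/sub-word image with text pixels = 1 (assumed convention).
    returns the x-positions of separator lines between candidate characters."""
    vp = word_img.sum(axis=0)
    nonzero = vp[vp > 0]
    if nonzero.size == 0:
        return []
    mfd = int(np.bincount(nonzero.astype(np.int64)).argmax())   # most frequent data = pen thickness
    snapped = np.where(np.abs(vp - mfd) < 1, mfd, 0)             # keep baseline-thick columns, zero the rest

    separators = []
    in_run, start = False, 0
    for x, v in enumerate(snapped):
        if v == mfd and not in_run:
            in_run, start = True, x                      # a splitting area begins
        elif v != mfd and in_run:
            in_run = False
            separators.append((start + x - 1) // 2)      # separator line in the middle of the area
    if in_run:
        separators.append((start + len(snapped) - 1) // 2)
    return separators
```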
after that, the first and last points of each region are taken as the splitting-area reference points. these start and end points can be used to separate the letters: by adding the middle line between the two points (the separator line), together with the start and end lines of the connected components found in the previous steps, we obtain a separator line between the characters. fig. 8 shows an example of finding the splitting areas and reference points. after the splitting areas are identified, each character is located between two consecutive splitting areas. fig. 9 shows the separator lines over the splitting areas for some characters. however, there are some special cases in which a splitting area lies inside a character, such as the letters “س” and “ش” in all of their forms. in addition, some splitting areas lie inside a character when the character is at the end of a word or occurs independently in the text, such as “ب” and “ی”. hence, a post-processing step for character segmentation is necessary to ignore these spurious splitting areas. in this paper, we do not handle “س” and “ش”; we only handle “ب” and “ی”. fig. 10 shows some examples of these cases.
fig. 7. vertical projection example.
fig. 8. splitting areas with reference points.
fig. 9. splitting regions for some regular characters.
fig. 10. examples of special cases that must be ignored.
4.3.3. post-processing
in the post-processing phase, some of these special cases are eliminated using the baseline found earlier and the separator lines found in the previous step. for every separator line, we find its intersection point with the baseline and check the value of that point in the binary image data. if the value is zero, the point is not on a letter, whereas a true separation line between adjacent letters must lie on the letters; such separator lines are therefore removed from the list. in this way, some of these situations can be resolved. fig. 11 shows examples before and after applying the post-processing step. a short sketch of these two steps follows.
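the separator-line construction and the post-processing check of sections 4.3.2 and 4.3.3 can be sketched as follows, reusing the helpers from the previous sketch. the handling of the component's first and last columns and the use of a baseline row index relative to the word crop are our assumptions, since the paper describes these details only at a high level.

```python
import numpy as np

def separator_lines(word_img, areas):
    # one separator line per splitting area: the column halfway between its
    # reference points, plus the first and last ink columns of the component
    lines = [(start + end) // 2 for start, end in areas]
    cols = np.where(word_img.sum(axis=0) > 0)[0]
    if cols.size:
        lines = [int(cols[0])] + lines + [int(cols[-1])]
    return sorted(set(lines))

def post_process(word_img, lines, baseline):
    # section 4.3.3: a valid separator must cross ink at the baseline row
    # (baseline = row index of the baseline within this word/sub-word crop);
    # separators whose intersection with the baseline is background are dropped
    return [x for x in lines if word_img[baseline, x] != 0]
```

each pair of consecutive surviving separator lines then delimits one character image.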
5. performance analysis
in this section, the results of testing our approach on a collection of images containing kurdish text are shown. we use the python programming language to implement and test the proposed segmentation algorithms, since it is a widely known high-level language that provides well-implemented packages for image processing. the performance of line segmentation is measured as the ratio of the number of lines that are correctly segmented to the total number of input lines; the same measure is used for word and character segmentation:

accuracy = (number of correctly segmented units) / (total number of units)

the proposed algorithms (line, word, and character segmentation) were evaluated using a manually created dataset. we developed software to generate a dataset with ground truth from random kurdish text that we collected. to make the dataset generic and comprehensive, it includes text content from different sources (e.g., books, magazines, reports, and papers) and topics (e.g., religious, sport, and poetry texts), in addition to considerable variation in font type, size, and style. these texts are converted to images word by word, some noise is added to every image, and all images are saved.
fig. 11. example of post-processing before and after applying.
the proposed line segmentation method was tested on 6099 lines and reported excellent results, with an average line segmentation ratio of 99.9%. table 1 shows the results of the testing process for different font types, styles, and sizes.

table 1: line segmentation results for different font styles and types
font style | font type | total number of input lines | correctly segmented lines | accuracy (%)
plain | uniqaidar_blawkrawe 004 | 497 | 497 | 100
plain | unikurd web | 507 | 506 | 99.9
plain | noto naskh arabic ui | 525 | 524 | 99.9
plain | uniqaidar_jwneyd | 504 | 504 | 100
bold | uniqaidar_blawkrawe 004 | 497 | 497 | 100
bold | unikurd web | 507 | 506 | 99.9
bold | noto naskh arabic ui | 525 | 524 | 99.9
bold | uniqaidar_jwneyd | 504 | 504 | 100
italic | uniqaidar_blawkrawe 004 | 497 | 497 | 100
italic | unikurd web | 507 | 506 | 99.9
italic | noto naskh arabic ui | 525 | 524 | 99.9
italic | uniqaidar_jwneyd | 504 | 504 | 100
total |  | 6099 | 6093 | 99.9

the results of the word segmentation stage, in terms of word segmentation ratio, are reported in table 2. the proposed word segmentation method was tested on about 8872 words with four font types (noto naskh arabic ui, unikurd web, uniqaidar_jwneyd, and uniqaidar_blawkrawe 004) and five font sizes (24, 26, 28, 36, and 48 points), with an average accuracy of 96.5%. the results show that the algorithm has almost the same performance when the font size changes.

table 2: word segmentation results for different font types and sizes between 24 and 48
font style | font type | total number of input words | correctly segmented words | accuracy (%)
plain | uniqaidar_blawkrawe 004 | 2218 | 2180 | 98.28
plain | unikurd web | 2218 | 2112 | 95.22
plain | noto naskh arabic ui | 2218 | 2150 | 96.93
plain | uniqaidar_jwneyd | 2218 | 2119 | 95.53
total |  | 8872 | 8561 | 96.5

furthermore, we evaluated the character segmentation stage on different font types and sizes, on about 63,548 characters. table 3 shows the performance of the proposed algorithm, with an average accuracy of 98.6%. the results show that the algorithm performs almost the same regardless of the font type, style, and size.

table 3: character segmentation results for different font types and font sizes between 24 and 48
font style | font type | total number of input characters | correctly segmented characters | accuracy (%)
plain | uniqaidar_blawkrawe 004 | 15,887 | 15,842 | 99.7
plain | unikurd web | 15,887 | 16,101 | 98.7
plain | noto naskh arabic ui | 15,887 | 15,636 | 98.4
plain | uniqaidar_jwneyd | 15,887 | 15,495 | 97.5
total |  | 63,548 | 63,074 | 98.6
table 4: comparing with other related work
article | year | segmentation method | dataset | font type | font size | font style | average accuracy (%)
zheng et al. [10] | 2004 | vertical histogram and some structural characteristic rules | 500 samples of arabic text | simplified arabic and arabic transparent | 12, 14, 16, 18, 20, and 22 | plain | 94.8
javed et al. [27] | 2010 | pattern matching techniques | 1282 unique ligatures extracted from the 5000 high-frequency words in a corpus-based dictionary | noori nastalique | 36 | plain | 92
saabni [26] | 2014 | partial segmentation and hausdorff distance | apti | different fonts covering different complexities of printed arabic character shapes | 10 different sizes | plain | 96.8
anwar et al. [28] | 2015 | projection-based | 127 sentences composed of 1061 letters | traditional arabic | 70 | plain | 97.5
amara et al. [29] | 2016 | histogram and contextual properties | apti | different font types | different sizes | plain, italic, and bold | 85.6
radwan et al. [32] | 2016 | multichannel neural networks | apti | arial, tahoma, thuluth, and damas | 18 | plain | 95.5
qomariyah et al. [33] | 2017 | interest points, contour-based | 10 lines of 30 sub-words | not reported | not reported | plain | 86.5
fakhry [30] | 2017 | connected component | 5 lines, 15 words | not reported | not reported | plain | 80.2
amara et al. [31] | 2017 | projection profile, svm | apti | advertising bold | 6, 8, 10, 12 | plain, italic, and bold | 98.24
zoizou et al. [34] | 2018 | contour-based and template matching | 83 lines of 984 words | 34 different fonts | different font sizes | plain | 94.7
mohammad et al. [25] | 2019 | contour-based method | 1500 lines (23,350 words) | advertising bold, simplified arabic, arial, traditional arabic, and times new roman | 8, 9, 10, 12, 14, 16, 18, and 24 | plain, italic, and bold | 98.5
our approach | 2020 | projection-based | 6099 lines (63,548 letters) | uniqaidar_004, unikurd_web, noto_naskh arabic ui, uniqaidar_jwneyd | 24, 26, 28, 38, and 48 | plain | 98.6

table 4 shows our results compared with some previous related works. as shown in the table, the proposed algorithm performs better than the other related works in that it (i) uses more font types, sizes, and styles than the other approaches and (ii) records higher average accuracies.
6. conclusion
in this paper, line, word, and character segmentation algorithms based on projection methods are proposed for kurdish printed text. the proposed algorithm can segment the characters of words and can also handle certain complex cases that occur due to over-segmentation problems. we tested the algorithm on the manually created dataset by creating different versions of the same text using different font types, styles, and sizes. experimental results show the reliability of our algorithm, which correctly segmented 63,074 out of 63,548 characters, excluding the letters س and ش. the segmentation of kurdish text is prone to errors, which leads to classification errors; the proposed segmentation algorithms are capable of minimizing these errors and maximizing the classification rate. an advanced method is proposed for word/sub-word segmentation: horizontal and vertical projections are used to distinguish between words and sub-words based on the size of the gaps that separate the connected components in comparison to the pen size. for the character segmentation step, an advanced projection-based algorithm is proposed. the proposed algorithm is simple to build, is reliable, and fits a variety of fonts and styles; the character segmentation algorithm shows good results of up to 98.6%. for future work, we plan to find the correct segmentation for characters such as “س” and “ش” by ignoring the over-segmentation that occurs in these two special cases. furthermore, we want to extend the work to extract all characters more accurately to facilitate the recognition stage.
references
[1] h. althobaiti and c. lu. “a survey on arabic optical character recognition and an isolated handwritten arabic character recognition algorithm using encoded freeman chain code”.
2017 51st annual conference on information sciences and systems (ciss), baltimore, md, pp. 1-6, 2017. [2] a. lawgali. “a survey on arabic character recognition”. international journal of signal processing, image processing and pattern recognition, vol. 8, no. 2, pp. 401-426, 2015. [3] s. elaiwat and m. a. abu-zanona. “a three stages segmentation model for a higher accurate off-line arabic handwriting recognition. world of computer science and information technology journal, vol. 2, no. 3, pp. 98-104, 2012. [4] m. a. abdullah, l. m. al-harigy and h. h. al-fraidi. “off-line arabic handwriting character recognition using word segmentation”. journal of computing, vol. 4, pp. 40-44, 2012. [5] y. m. alginahi. “a survey on arabic character segmentation”. international journal on document analysis and recognition, vol. 16, no. 2, pp. 105-126, 2013. [6] a. cheung, m. bennamoun and n. w. bergmann. “an arabic optical character recognition system using recognition-based segmentation”. pattern recognition, vol. 34, no. 2, pp. 215-233, 2001. [7] n. a. shaikh, g. a. mallah and z. a. shaikh. “character segmentation of sindhi, an arabic style scripting language, using height profile vector”. australian journal of basic and applied sciences, vol. 3, no. 4, pp. 4160-4169, 2009. [8] m. m. alipour. “a new approach to segmentation of persian cursive script based on adjustment the fragments”. international journal of computers and applications, vol. 64, no. 11, pp. 21-26, 2013. [9] s. n. nawaz, m. sarfraz, a. zidouri and. w. g. ai-khatib. “an approach to offline arabic character recognition using neural networks”. in: 10th ieee the ieee international conference on electronics, circuits, and systems, ieee, vol. 3, pp. 1328-1331, 2003. [10] l. zheng, a. h. hassin and x. tang. “a new algorithm for machine printed arabic character segmentation”. pattern recognition letters, vol. 25, no. 15, pp. 1723-1729, 2004. [11] a. zidouri and k. nayebi. “adaptive dissection based subword segmentation of printed arabic text”. in: 9th international conference on information visualisation (iv), ieee, pp. 239-243, 2005. [12] j. ahmad. “optical character recognition system for arabic text using cursive multi-directional approach”. journal of computational science, vol. 3, pp. 549-555, 2007. [13] m. omidyeganeh, k. nayebi. “a new segmentation technique for multi font farsi/arabic texts”. in: ieee international conference on acoustics speech, and signal process., ieee, vol. 2, 2005. [14] t. sari, l. souici, and m. sellami. “off-line handwritten arabic character segmentation algorithm: acsa”. in: proceeding 8th international workshop front handwriting recognit., ieee, pp. 452-457, 2002. [15] r. mehran, h. pirsiavash and f. razzazi. “a front-end ocr for omni-font persian/arabic cursive printed documents”. in: digital image computing: techniques and applications (dicta), ieee, pp. 56-56, 2005. [16] a. al-nassiri, s. abdulla and r. salam. “the segmentation of offline arabic characters, categorization and review”. international journal on media technology, vol. 1, no. 1, pp. 25-34, 2017. [17] m. m. altuwaijri and m. a. bayoumi. “a thinning algorithm for arabic characters using art2 neural network”. ieee transactions on circuits and systems, vol. 45, no. 2, pp. 260-264, 1998. [18] a. a. a. ali and m. suresha. survey on segmentation and recognition of handwritten arabic script. sn computer science, vol. 1, p. 192, 2020. [19] i. aljarrah, o. al-khaleel, k. mhaidat, m. alrefai, a. alzu’bi and m. rabab’ah. 2012. 
automated system for arabic optical character tofiq ahmed tofiq and jamal ali hussien: kurdish text segmentation using projection-based approaches uhd journal of science and technology | jan 2021 | vol 5 | issue 1 65 recognition. in: proceedings of the 3rd international conference on information and communication systems(icics’12). [20] y. alginahi. “a survey on arabic character segmentation”. international journal on document analysis and recognition, vol. 16, pp. 105-126, 2013. [21] y. zhang, z. q. zha and l. f. bai. “a license plate character segmentation method based on character contour and template matching”. applied mechanics and materials, vol. 333, pp. 974979, 2013. [22] i. ahmed, m. sabri and p. mohammad. printed arabic text recognition. guide to ocr for arabic scripts, 2012. [23] m. bennamoun and b. boashash. “a structural-description-based vision system for automatic object recognition”. ieee transactions on systems, man, and cybernetics, vol. 27, no. 6, pp. 893-906, 1997. [24] m. mostafa.“an adaptive algorithm for the automatic segmentation of printed arabic text”. in: 17th national computer conference, international society for optics and photonics, saudi arabia, pp. 437-444, 2004. [25] k. mohammad, a. qaroush, m. ayesh, m. washha, a. alsadeh and s. agaian. contour-based character segmentation for printed arabic text with diacritics. journal of electronic imaging, vol. 28, no. 4, p. 1, 2019. [26] r. saabni. “efficient recognition of machine printed arabic text using partial segmentation and hausdorff distance”. in: 6th international conference soft computing and pattern recognition (socpar), pp. 284-289, 2014. [27] s. t. javed, s. hussain, a. maqbool, s. asloob, s. jamil and h. moin. “segmentation free nastalique urdu ocr”. world academy of science, engineering and technology, vol. 4, no. 10, pp. 456461, 2010. [28] k. anwar, adiwijaya and h. nugroho. “a segmentation scheme of arabic words with harakat”. in: ieee international conference on communications, networks and satellite (comnestat), pp. 111114, 2015. [29] m. amaram, k. zidi, g. ghedira and s. zidi. “new rules to enhance the performances of histogram projection for segmenting smallsized arabic words,” in: international conference on hybrid intelligent systems, 2016. [30] f. i. firdaus, a. khumaini and f. utaminingrum. “arabic letter segmentation using modified connected component labeling”. in: international conference on sustainable information engineering and technology (siet), pp. 392-397, 2017. [31] m. amara, k. zidi and k. ghedira. “an efficient and flexible knowledgebased arabic text segmentation approach”. the international journal of computer science and information security, vol. 15, no. 7, pp. 25-35, 2017. [32] m. a. radwan, m. i. khalil and h. m. abbas. “predictive segmentation using multichannel neural networks in arabic ocr system”. lecture notes in computer science, vol. 9896, pp. 233245, 2016. [33] f. qomariyah, f. utaminingrum and w. f. mahmudy. “the segmentation of printed arabic characters based on interest point”. the journal of telecommunication, electronic and computer engineering, vol. 9, no. 2-8, pp. 19-24, 2017. [34] a. zoizou, a. zarghili and i. chaker. “a new hybrid method for arabic multi-font text segmentation, and a reference corpus construction”. journal of king saud university computer and information sciences, vol. 32, no. 5, pp. 576-582, 2018. [35] a. fawzi, m. pastor and c. d. martínez-hinarejos. “baseline detection on arabic handwritten documents”. 
p proceedings of the 2017 acm symposium on document engineering, pp. 193196, 2017. tx_1~abs:at/tx_2:abs~at uhd journal of science and technology | jan 2022 | vol 6 | issue 1 1 1. introduction influenza viruses which are enveloped ssrna viruses can cause annual epidemics and pandemics with serious consequences for public health and the global economy, assessed with 1 billion cases, including 3-5 million severe cases, and 290 000-650 000 influenza-related respiratory deaths worldwide [1]. influenza a virus (iav) is due to the family orthomyxoviridae which possess a segmented, single-stranded, negative-sense rna genome. this family consists of five genera: influenzavirus a, b, and c, togavirus [2]. the virus is with a pleomorphic morphology, characterized by spherical, elongated, or filamentous particles [3]. in 2009 a pandemic influenza infection was caused by a subtype known as swine flu (h1n1) virus with genes that originate from human and avian influenza virus [4]. humans can be infected with h1n1, h1n2, or h3n2 through direct contact with infected animals or contaminated surroundings. the pandemic strain contains genes from four different flu viruses including two swine strains, one human strain, and one avian [5]. enveloped viruses have a matrix that interacted with the viral glycoproteins and nucleocapsid that can play an essential role in the gathering of the viral proteins and budding of the progeny virions [6]. novel re-assorted influenza h1n1 virus produced by reassortment between the viral genome segments and it was behind the pandemic h1n1 in 2009 [7]. during the past 100 years, five pandemic influenza outbreaks have occurred spanish flu (h1n1) in 1918, asian flu (h2n2) in 1957, hong kong flu (h3n2) in 1968, russian flu (h1n1) 1977, and swine flu seroprevalence and molecular detection of influenza a virus (h1n1) in sulaimani governorate-iraq kaziwa ahmad kaka alla1, salih ahmed hama1,2 1department of biology, college of science, university of sulaimani, kurdistan region, sulaymaniyah, iraq, 2department of medical laboratory science, college of health sciences, university of human development, kurdistan region, sulaymaniyah, iraq a b s t r a c t influenza a (h1n1) virus is now rapidly scattering across the world. early detection is one of the most effective measures to stop the further spread of the virus. the current study was aimed to detect influenza a (h1n1) serologically and by polymerase chain reaction (pcr) techniques. from september 2020 to june 2021, three hundred nasopharyngeal swabs and blood samples were collected from hiwa and shahid tahir hospitals in sulaimani city. obtained results revealed that 23.3% of the tested patients were seropositive anti-igg for influenza a, while 13.3% showed anti-igm seropositive results although 10% of the tested cases were with both anti-igg and anti-igm seropositive results. gender, residency, and flu symptoms showed no significant relations with seropositive results (p<0.05) whereas valuable relations were found between seropositive observations and smoking, the previous history of chronic diseases as well as employment status (p<0.05). it was concluded that hematologic investigations (cbc) were not dependable if h1n1 diagnosis and detection. only 1% of the tested samples showed positive results for influenza a (h1n1) rna using reverse transcription-pcr. 
index terms: influenza a, h1n1, anti-igg, anti-igm, reverse transcription-polymerase chain reaction, ssrna corresponding author’s e-mail:  kaziwaahmad91@gmail.com received: 14-08-2021 accepted: 28-12-2021 published: 02-01-2022 access this article online doi: 10.21928/uhdjst.v6n1y2022.pp1-6 e-issn: 2521-4217 p-issn: 2521-4209 copyright © 2022 alla and hama this is an open access article distributed under the creative commons attribution non-commercial no derivatives license 4.0 (cc by-nc-nd 4.0) o r i g i n a l r e s e a r c h a r t i c l e uhd journal of science and technology kaziwa ahmad kaka alla and salih ahmed hama: seroprevalence and molecular detection of influenza a virus 2 uhd journal of science and technology | jan 2022 | vol 6 | issue 1 (h1n1) in 2009. in particular, the 1918 influenza pandemic affected almost 30% of the global population and is believed to have killed over 50 million people [8]. multiple one-step real-time reverse transcription-polymerase chain reaction (rtpcr) assays can simultaneously detect and discriminate flu a subtypes with dependable sensitivity and specificity, which is required for the early clinical diagnosis and viral surveillance of patients with flu a infection [9]. serological techniques commonly can be depended on for detection of influenza a infections through anti-influenza immunoglobulin g (igg_ and igm detection by elisa technique, especially igg and igm against hemagglutinin [10]. the aims of the current study are; serologic detection of influenza anti-igg and anti-igm and molecular detection of influenza rna using rt-pcr. 2. materials and methods 2.1. study population the study’s population included people visiting hiwa and shahid tahir hospitals in sulaimani city from september 2020 to june 2021, difficulties were found during sample collection due to the negative view of the patients. all tested patients were suffered from flue signs and symptoms, including fever, chills, cough, muscle or body aches, runny or stuffy nose, sneeze, headaches, fatigue, sore throat, and sweating. the sample size was 300 patients included 163 males and 137 females. 2.2. sample collection from each tested patient nasopharyngeal swabs were collected as well as 5 ml fresh venous blood was taken aseptically and divided into two parts; one for serum preparation and the rest for hematologic investigations. the collected samples were stored according to their uses as following: the blood samples were stored in 4oc till hematological investigations were done. the serum samples were divided into two parts; one for serology and stored in −20oc, while the other part of the separated serum was stored in −80oc (for molecular tests). 2.3. anti-influenza virus antibody detection by elisa indirect-elisa method was depended to detect anti-influenza virus a antibody igg and igm using a special elisa kit (cusabio/whan-china, elab-science/korea, novalisa®/ germany). the microtiter plate wells were precoated with recombinant influenza antigens. all preserved sera samples were transferred to room temperature for about 30 min. 100 μl of each diluted sample, standard, and blank were added to the desired wells for igm (for igg 200 μl of diluted sample was added). the plate was incubated for 30 min but (an hour for igm according to supplied company instructions) at 37°c in shading light. 
each well was washed and aspirated with 350 μl of washing buffer five times for igm using an elisa washer (for igg, 300 μl of washing buffer was used four times, as directed by the supplying company). about 100 μl of hrp conjugate was added to each well except the blank, and the plate was incubated for 30 min at 37°c in shading light. the washing and aspiration process was repeated five times for igm and four times for igg. to each well, 50 μl of substrate reagent a and 50 μl of substrate reagent b were added and mixed, then incubated for 15 min at 37°c in shading light for both igg and igm. finally, 50 μl of stop solution was added to each well and the optical density was measured at 450 nm and 620 nm for igg and igm.
2.4. viral rna extraction and amplification
rna extraction was performed according to the manufacturer's protocol of the addprep viral nucleic acid extraction kit (add bio-tech, korea). the kit's buffer system provides effective binding of rna to a microfiber-silica-based membrane through mixing with lysis and binding buffers, after which the impurities on the membrane are washed away by two different washing buffers. starting from 200 µl of swab sample, spin-column purification was performed with a final elution of 150 µl of rna. the extracted viral nucleic acid was stored at −80°c until the day of examination.
2.5. pcr reaction
a total reaction volume containing 2x addscript rt-pcr master mix, nuclease-free water, forward primer, reverse primer, and nasopharyngeal swab fluid/standard/negative/positive control was prepared as directed by the supplying company:

item | volume
nuclease-free water (d.w) | 5 μl
forward primer | 1 μl
reverse primer | 1 μl
nasopharyngeal swab fluid/standard/negative/positive control | 3 μl
2x master mix addscript rt-pcr | 10 μl
total volume | 20 μl

the pcr program for detecting iav nucleic acid started with the reverse transcription step, followed by denaturation, annealing, and elongation, after which the data were collected:

step | temperature (°c) | duration | cycles
cdna synthesis | 50 | 30 min | 1
initial denaturation | 95 | 10 min |
denaturation | 95 | 15–30 s | 35
annealing | 55–65 | 15–30 s |
extension | 72 | 1 min |
final extension | 72 | 5 min |

2.6. primers and probes

subtype | oligo | seq
influenza a (h1n1) pdm09 | h1f1 | 5ʹ agcaaaagcaggggaaaataaaagc 3ʹ (25mer)
influenza a (h1n1) pdm09 | h1r1264 | 5ʹ cctactgctgtgaactgtgtattc 3ʹ (24mer)
influenza a (h1n1) pdm09 | h1f848 | 5ʹ gcaatgcaaagaaatgctggatctg 3ʹ (25mer)
influenza a (h1n1) pdm09 | h1ruc | 5ʹ atatcgtctcgtattagtagaaacaagggtgtttt 3ʹ (35mer)
influenza a (h1n1) pdm09 | n1f1 | 5ʹ agcaaaagcaggagtttaaaatg 3ʹ (23mer)
influenza a (h1n1) pdm09 | n1r1099 | 5ʹ cctatccaaacaccattgccgtat 3ʹ (24mer)
influenza a (h1n1) pdm09 | n1f401 | 5ʹ ggaatgcagaaccttcttcttgac 3ʹ (24mer)
influenza a (h1n1) pdm09 | naruc | 5ʹ atatggtctcgtattagtagaaacaaggagtttttt 3ʹ (36mer)
influenza a (h1n1) pdm09 | nafuc | 5ʹ tattggtctcagggagcaaaagcaggagt 3ʹ (29mer)
influenza a (h1n1) pdm09 | mf1 | 5ʹ agcaaaagcaggtagatattgaaaga 3ʹ (26mer)
influenza a (h1n1) pdm09 | mr1027 | 5ʹ agtagaaacaaggtagttttttactc 3ʹ (26mer)
influenza a (h3n2) | nafuc | 5ʹ tattggtctcagggagcaaaagcaggagt 3ʹ (29mer)
influenza a (h3n2) | h3n2r109 | 5ʹ tcatttccatcatcraaggccca 3ʹ (23mer)
influenza a (h3n2) | n2f387 | 5ʹ catgcgatcctgacaagtgttatc 3ʹ (24mer)
influenza a (h3n2) | naruc | 5ʹ atatggtctcgtattagtagaaacaaggagtttttt 3ʹ (36mer)
influenza a (h3n2) | h3a1f6 | 5ʹ aagcaggggataattctattaacc 3ʹ (24mer)
influenza a (h3n2) | h3a1r1 | 5ʹ gtctatcattccctcccaaccatt 3ʹ (24mer)
influenza a (h3n2) | h3a1f3 | 5ʹ gtctatcattccctcccaaccatt 3ʹ (24mer)
influenza a (h3n2) | haruc | 5ʹ atatcgtctcgtattagtagtagaaacaagggtgtttt 3ʹ (35mer)
3. results
both sexes were included in the current study. out of 300 participants (163 males and 137 females), 71 (23.7%) showed seropositive results for anti-h1n1 igg, considering the gender (table 1). seropositive observations for anti-h1n1 igm were lower than for anti-h1n1 igg: 40 cases (13.33%) were seropositive for anti-h1n1 igm among males and females (table 1). it was also noticed that some tested cases (10%) were seropositive for both anti-h1n1 igg and igm at the same time (fig. 1). the percentage of seropositive results among males was relatively higher (56.3%) than among females (43.7%), and there were significant differences considering gender with regard to anti-h1n1 igg (p < 0.05) (fig. 2). as with the igg results, igm seropositivity was higher among males (55%) than among females (45%), and statistical analysis showed significant differences between males and females for anti-h1n1 igm (p < 0.05) (fig. 2). furthermore, the percentage of cases positive for both igg and igm was higher among males (11.7%) than among females (7.8%) (fig. 2). the pcr-positive results occurred only among seropositive males (0.67%), while the seropositive females showed negative pcr results (fig. 3). when the relationships of certain risk factors with the seropositive observations were evaluated, it appeared that gender has significant effects on the h1n1 seropositive results for anti-igg and anti-igm (p < 0.05) (table 1). moreover, as mentioned in the methodology, some cases were symptomatic and others asymptomatic; depending on the presence of flu syndrome, the occurrence of flu symptoms had significant relations with the obtained seropositive results (p < 0.05), which indicates that the symptoms are dependable in h1n1 diagnosis (table 1). studying the effect of residency indicated that it has no significant effect on the percentage of seropositive results (p > 0.05) (table 1). in addition, the effect of smoking was evaluated; smoking had significant effects on the results (p < 0.05), so smoking can be considered a risk factor for h1n1 infection (table 1). similarly, both a previous history of chronic diseases and employment were strongly related to the observations recorded in the current study (p < 0.05 for both factors) (table 1). based on the complete blood count (cbc) picture obtained for all studied cases, no valuable changes were seen between seropositive cases and negative ones (p > 0.05). likewise, comparison of the calculated hematologic parameters with the normal ranges from reference textbooks showed that no significant abnormalities (elevation or decline) of these parameters were recorded; slight changes or elevations in some parameters were seen but were nonsignificant (p > 0.05) (table 2).
4. discussion
life-threatening infection by influenza a virus remains a cause of health complaints and deaths worldwide. it has been estimated that seasonal influenza causes about 3–5 million cases of severe illness and about 290,000–650,000 respiratory deaths worldwide each year [11].
certain factors may explain the low percentage of positive rt-pcr results in the current study; among them are the limited number of samples, technical errors, and the high sensitivity of viral rna to degradation by enzymes and environmental factors, since most of the analyzed samples had previously been collected and preserved in the specified hospitals. the relatively high seropositivity rate (23.3%) of influenza a (h1n1) virus infection among the studied cases can be explained by the fact that the vast majority of patients had a previous history of flu infection and were suspected of having an influenza virus infection. on the other hand, most of the studied cases were from cancer treatment centers, suffered from immunologic complaints, and were at high risk for different infections, including influenza. several studies and investigators have reported a higher prevalence of influenza a virus infection than our observation: in previous studies, the prevalence of influenza a virus seropositivity (anti-igg and anti-igm) was relatively higher than the current results [12]-[14], whereas the current results were in agreement with conclusions reported by other investigators [15]. it has been reported that some factors significantly affect the seropositivity of influenza a (h1n1), which is parallel with observations recorded by a study from the american society of clinical oncology, which found that occupation, immunocompetency, previous history of chronic diseases, and smoking showed significant effects on respiratory viral infections, especially influenza a virus [16]. the current observations were relatively similar to and agreed with results reported by iranian research groups in 2019 [17], and our conclusions are close to results reported in a study done in switzerland [18]. moreover, other investigators reported a relatively higher prevalence of influenza a viral infections and transmissions [19].
fig. 1. seropositive results of iva (h1n1) among tested patients (percentage positive vs. negative for anti-igg, anti-igm, and igg+igm).
fig. 2. seropositive results of anti-h1n1 igg and igm among males and females (percentage).
fig. 3. positive pcr results among males and females (percentage).
table 1: evaluation of relations between some risk factors and h1n1 seropositive results
variable | group | anti-igg positive (no., %) | anti-igm positive (no., %) | anti-igg+igm positive (no., %) | p-value
gender | males | 40 (13.33) | 22 (07.33) | 18 (06.00) | p<0.05; p˃0.05; p<0.05
gender | females | 31 (10.33) | 18 (06.00) | 12 (04.00) |
flu symptoms | yes | 37 (12.33) | 23 (07.67) | 20 (06.67) |
flu symptoms | no | 34 (11.33) | 17 (05.67) | 10 (03.33) |
residency | urban | 33 (11.00) | 19 (06.33) | 12 (04.00) |
residency | rural | 38 (12.67) | 21 (07.00) | 18 (06.00) |
smoking | smoker | 33 (11.00) | 29 (09.67) | 21 (07.00) | p<0.05
smoking | non-smoker | 18 (06.00) | 11 (03.67) | 09 (03.00) |
chronic diseases | yes | 40 (13.33) | 22 (07.33) | 18 (06.00) |
chronic diseases | no | 31 (10.33) | 18 (06.00) | 12 (04.00) |
employed | yes | 47 (15.67) | 31 (10.33) | 22 (07.33) |
employed | no | 24 (08.00) | 09 (03.00) | 08 (02.67) |

some factors may lie behind the higher prevalence of h1n1 anti-igg and anti-igm seropositivity among males in the current study, including cultural behavior, since males more often enter crowded areas without following standard protection protocols, and smoking, which is more common among males than among females. these observations agree with results reported by epidemiological studies conducted in different areas among different groups and populations [20]. preparedness planning, shaped by the response to the first influenza pandemic of the 21st century, delivered a unique opportunity for constructing and applying a global system of surveillance to meet both global and national needs [21]-[23]. the current work found a limited number of pandemic influenza a (h1n1) cases among the tested patients, although the vast majority of them presented with flu-like syndrome. this may be due to the other pandemic viral infection, sars-cov-2, known as covid-19: the overlap between the symptoms of the two diseases may confuse physicians and researchers in their decisions, and further laboratory investigations are necessary. reports by other workers support this conclusion and explanation [24]. this observation also highlights that it is essential to recognize co-infections, since some individuals may need to be treated with both antibiotics and antivirals [25]. the current study revealed that the cbc may not help identify influenza a virus (h1n1), which is parallel to conclusions reported by others [26], although other investigators reported that monocytosis and lymphopenia could be considered a good indicator [27].

table 2: hematologic parameter evaluation of seropositive cases against normal ranges
hematologic parameter | units | h1n1 seropositive (mean±sd) | normal range | p-value
wbc total | 10^9 | 7.4±2.2 | 4–11 | p>0.05
hemoglobin | g/dl | 11.92±1.71 | 11.5–15.5 |
rbc | 10^12 | 2.72±0.88 | 3.8–4.8 |
platelets | 10^9 | 186.61±59.24 | 150–450 |
lymphocytes | % | 28.8±7.67 | 2–45 |
granulocytes | % | 56.48±12.73 | 40–80 |
mid | % | 7.64±1.92 | 2–10 |

5. conclusions
it was concluded that the rates of anti-igg and anti-igm seropositivity for influenza a (h1n1) viral infection were within a relatively acceptable range in sulaimani governorate. smoking, a previous history of chronic diseases, and the employment status of the tested cases appeared to be among the significant risk factors for influenza a viral infections, especially h1n1. it was also concluded that hematologic tests and parameters are not dependable in h1n1 diagnosis. only a limited number of the studied cases showed positive rt-pcr results compared with the serologic investigations.
references
[1] n. takeshi. “native morphology of influenza virions”. frontiers in microbiology, vol. 2, p. 269, 2012.
[2] s. v. bourmakina and a. garcía-sastre. “reverse genetics studies on the filamentous morphology of influenza a virus”. journal of general virology, vol. 84, no. 3, pp. 517-527, 2003.
[3] b.
szewczyk, k. bieńkowska-szewczyk and e. król. “introduction to molecular biology of influenza a viruses”. acta biochimica polonica, vol. 61, no. 3, pp. 397-401, 2014.
[4] d. b. smith, e. r. gaunt, p. digard, k. templeton and p. simmonds. “detection of influenza c virus but not influenza d virus in scottish respiratory samples”. journal of clinical virology, vol. 74, pp. 50-53, 2016.
[5] c. brockwell‐staats, r. g. webster and r. j. webby. “diversity of influenza viruses in swine and the emergence of a novel human pandemic influenza a (h1n1)”. influenza and other respiratory viruses, vol. 3, no. 5, pp. 207-213, 2009.
[6] k. wu, j. liu, r. saha, d. su, v. d. krishna, m. c. j. cheeran and j. p. wang. “magnetic particle spectroscopy for detection of influenza a virus subtype h1n1”. acs applied materials and interfaces, vol. 12, no. 12, pp. 13686-13697, 2020.
[7] z. yu, k. cheng, h. he and j. wu. “a novel reassortant influenza a (h1n1) virus infection in swine in shandong province, eastern china”. transboundary and emerging diseases, vol. 67, no. 1, pp. 450-454, 2020.
[8] j. a. pulit-penaloza, c. pappas, j. a. belser, x. sun, n. brock, h. zeng, t. m. tumpey and t. r. maines. “comparative in vitro and in vivo analysis of h1n1 and h1n2 variant influenza viruses isolated from humans between 2011 and 2016”. journal of virology, vol. 92, no. 22, p. e01444-18, 2018.
[9] p. j. campbell, s. danzy, c. s. kyriakis, m. j. deymier, a. c. lowen and j. steel. “the m segment of the 2009 pandemic influenza virus confers increased neuraminidase activity, filamentous morphology, and efficient contact transmissibility to a/puerto rico/8/1934-based reassortant viruses”. journal of virology, vol. 88, no. 7, p. 3802, 2014.
[10] c. w. potter. “a history of influenza”. journal of applied microbiology, vol. 91, no. 4, pp. 572-579, 2001.
[11] s. davis. “the different types of flu explained-seasonal influenza, swine flu, and avian flu”. sa pharmacists assistant, vol. 19, no. 2, pp. 10-11, 2019.
[12] p. j. gavin and r. b. thomson jr. “review of rapid diagnostic tests for influenza”. clinical and applied immunology reviews, vol. 4, no. 3, pp. 151-172, 2004.
[13] m. petric, l. comanor and c. a. petti. “role of the laboratory in the diagnosis of influenza during seasonal epidemics and potential pandemics”. the journal of infectious diseases, vol. 194, no. suppl 2, pp. s98-s110, 2006.
[14] s. dellière, m. salmona, m. minier, a. gabassi, a. alanio, j. le goff, c. delaugerre, m. l. chaix and saint-louis core (covid research) group. “evaluation of the covid-19 igg/igm rapid test from orient gene biotech”. journal of clinical microbiology, vol. 58, no. 8, p. e01233-20, 2020.
[15] m. von lilienfeld-toal, a. berger, m. christopeit, m. hentrich, c. p. heussel, j. kalkreuth, m. klein, m. kochanek, o. penack, e. hauf, c. rieger, g. silling, m. vehreschild, t. weber, h. h. wolf, n. lehners, e. schalk and k. mayer.
“community-acquired respiratory virus infections in cancer patients guideline on diagnosis and management by the infectious diseases working party of the german society for haematology and medical oncology”. european journal of cancer, vol. 67, pp. 200-212, 2016. [16] r. el ramahi and a. freifeld. “epidemiology, diagnosis, treatment, and prevention of influenza infection in oncology patients”. journal of oncology practice, vol. 15, no. 4, pp. 177-184, 2019. [17] v. rahmanian, m. shakeri, h. shakeri, a. s. jahromi, a. bahonar and a. madani. “epidemiology of influenza in patients with acute lower respiratory tract infection in south of iran (2015-2016)”. acta facultatis medicae naissensis, vol. 36, no. 1, pp. 27-37, 2019. [18] l. p. hariri, c. m. north, a. r. shih, r. a. israel, j. h. maley, j. a. villalba, v. vinarsky, j. rubin, d. a. okin, a. sclafani, j. w. alladina, j. w. griffith, m. a. gillette, y. raz, c. j. richards, a. k. wong, a. ly, y. p. hung, r. r. chivukula, c. r. petri, t. f. calhoun, l. n. brenner, k. a. hibbert, b. d. medoff, c. c. hardin, j. r. stone and m. mino-kenudson. “lung histopathology in coronavirus disease 2019 as compared with severe acute respiratory syndrome and h1n1 influenza”. chest, vol. 159, no. 1, pp. 73-84, 2020. [19] e. kenah, d. l. chao, l. matrajt, m. e. halloran and i. m. jr. longini. “the global transmission and control of influenza”. plos one, vol. 6, no. 5, p. e19515, 2011. [20] v. m. konala, s. adapa, v. gayam, s. naramala, s. r. daggubati, c. b. kammari and a. chenna. “co-infection with influenza a and covid-19”. european journal of case reports in internal medicine, vol. 7, no. 5, p. 001656, 2020. [21] s. briand, a. mounts and m. chamberland. “challenges of global surveillance during an influenza pandemic”. public health, vol. 125, no. 5, pp. 247-256, 2011. [22] b. n. archer, c. cohen, d. naidoo, j. thomas, c. makunga, l. blumberg, m. venter, g. timothy, a. puren, j. mcanerney, a. cengimbo and b. schoub. “interim report on pandemic h1n1 influenza virus infections in south africa, april to october 2009: epidemiology and factors associated with fatal cases”. eurosurveillance, vol. 14, no. 42, pp. 19369, 2009. [23] d. miyazawa. “why obesity, hypertension, diabetes, and ethnicities are common risk factors for covid‐19 and h1n1 influenza infections”. journal of medical virology, vol. 93, no. 1, pp. 127-128, 2021. [24] z. a. memish, a. m. assiri, r. hussain, i. alomar and g. stephens. “detection of respiratory viruses among pilgrims in saudi arabia during the time of a declared influenza a (h1n1) pandemic”. journal of travel medicine, vol. 19, no. 1, pp. 15-21, 2012. [25] z. shimoni, j. glick and p. froom. “clinical utility of the full blood count in identifying patients with pandemic influenza a (h1n1)”. the journal of infection, vol. 66, no. 6, pp. 545-547, 2013.‏ [26] o. coșkun, i. y. avci, k. sener, h. yaman, r. ogur, h. bodur and c. p. eyigün. “relative lymphopenia and monocytosis may be considered as a surrogate marker of pandemic influenza a (h1n1)”. journal of clinical virology, vol. 47, no. 4, pp. 388-389, 2010. [27] y. egawa, s. ohfuji, w. fukushima, y. yamazaki, t. morioka, m. emoto, k. maeda, m. inaba and y. hirota. “immunogenicity of influenza a (h1n1) pdm09 vaccine in patients with diabetes mellitus: with special reference to age, body mass index, and hba1c”. human vaccines and immunotherapeutics, vol. 10, no. 5, pp. 1187-1194, 2014. tx_1~abs:at/tx_2:abs~at 66 uhd journal of science and technology | jul 2021 | vol 5 | issue 2 1. 
introduction tracking health results is fundamental to reinforce quality initiative, managing health care, and educating consumer. at present, employing computer applications in medical fields have had a direct impact on doctor’s productivity and accuracy. health results measurement is one of these applications. health outcomes are playing an increasing role in health-care purchasing and administration. nowadays and in most countries, cancer is becoming one of the leading causes of death. at present, lung cancer is the most common presage for thoracic surgery [1]. in the last several decades, there has been a lot of study in the field of medical science that has used various computing approaches. in the case of medical care, new approaches to data abstraction make data extraction quick and accurate, providing a larger opportunity to work with data for measuring health results. cancer is a serious health threat that the world is confronting, thus knowing how to anticipate results is essential [2]. selecting attribute and features in a massive amount of data and using machine learning approaches in recent medical technique might cause the computing process faster and decrease the amount of redundant data. removing unnecessary data are advantageous since it decreases the difficulty of data processing. attribute classifier of the data is significant, in the case of thoracic cancer, it leads to the extraction of varied information regarding a specific case of a patient. to reduce and control the victims of lung cancer and comparative study of supervised machine learning algorithms on thoracic surgery patients based on ranker feature algorithms hezha m.tareq abdulhadi1, hardi sabah talabani2 1department of information technology, national institute of technology (nit), sulaymaniyah, krg, iraq, 2department of applied computer, college of medical and applied sciences, charmo university, sulaymaniyah, krg, iraq a b s t r a c t thoracic surgery refers to the information gathered for the patients who have to suffer from lung cancer. various machine learning techniques were employed in post-operative life expectancy to predict lung cancer patients. in this study, we have used the most famous and influential supervised machine learning algorithms, which are j48, naïve bayes, multilayer perceptron, and random forest (rf). then, two ranker feature selections, information gain and gain ratio, were used on the thoracic surgery dataset to examine and explore the effect of used ranker feature selections on the machine learning classifiers. the dataset was collected from the wroclaw university in uci repository website. we have done two experiments to show the performances of the supervised classifiers on the dataset with and without employing the ranker feature selection. the obtained results with the ranker feature selections showed that j48, nb, and mlp’s accuracy improved, whereas rf accuracy decreased and support vector machine remained stable. index terms: ranker feature selection, information gain, gain ratio, supervised machine learning algorithms, thoracic surgery, cross-validation corresponding author’s e-mail:  hezha m.tareq abdulhadi, department of information technology, national institute of technology (nit), sulaymaniyah, krg, iraq. e-mail: hezha.abdulhadi@nit.edu.krd received: 25-07-2021 accepted: 12-12-2021 published: 15-12-2021 access this article online doi: 10.21928/uhdjst.v5n2y2021.pp66-74 e-issn: 2521-4217 p-issn: 2521-4209 copyright © 2021 abdulhadi and talabani. 
this is an open access article distributed under the creative commons attribution noncommercial no derivatives license 4.0 (cc by-nc-nd 4.0) o r i g i n a l r e s e a r c h a r t i c l e uhd journal of science and technology abdulhadi and talabani: comparative study of smla on tsp based on rfa uhd journal of science and technology | jul 2021 | vol 5 | issue 2 67 thoracic surgery patients, ranker feature selection techniques became an important and necessary method, because it can challenge and solve this kind of problems. in general, machine learning and ranker algorithms are a technique for classifying patient and disease datasets and separate the data to relevant and irrelevant. there are several studies worked on thoracic surgery. therefore, this work shed a light on the success rate of machine learning algorithms with ranker feature selections in classifying thoracic surgery patients. the major goal is to obtain an accurate prediction of the result after employing different approaches [3]. this research is done by a famous tool which is weka, used for analyzing and classifying data with famous machine learning algorithms. five different machine learning algorithms employed in this study which are j48, random forest (rf), naïve bayes, multilayer perceptron, and support vector machine (svm) with two famous ranker feature selections algorithms, information gain and gain ratio (gr). we have performed a classification on the thoracic surgery dataset through machine learning techniques and ranker algorithms. the rest of this paper is organized as follows: section 2 describes some background concepts relevant to our review. section 3 describes the problem and proposed method. section 4 will present the experiments and results, and finally, the conclusion is stated in section 5. 2. literature review various studies have been published that emphasize the significance of methodology in the realm of medical diagnosis. this research used various methods to the problem and obtained reasonable classification accuracies. following are some examples: several studies have been implemented in the medical field for analyzing data to discover patterns and predict outcomes. techniques such as synthetic minority over-sampling technique (smote) are used to rectify the unbalanced data. various measures are used for predicting results. for balancing the data by oversampling the minority class, the comparison between prediction methods such as artificial neural network (ann), naive bayes techniques, and decision tree algorithm is explained in [3] by employing 10-fold cross-validation and smote. the receiver operating characteristics summed the classifier performance based on the true positives and true negatives error rates; the ann achieves the highest accuracy in this scenario. another 10 folds cross-validation study in life expectancy prediction was conducted by [1] using naïve bayes, logistic regression, and svm with the rf concept, which uses the tree classification technique to average deep multiple trees that are trained using different fragments of the current training set. jahanvi joshi et al. offered the detailed proof that k-nearest neighbor (knn) provides preferable accuracy than expectation-maximization classification technique. employing the farthest first algorithm, they showed that 80% of patients were healthy and 20% of patients were sick, which are very close to knn technique outcome [4]. vanaja et al. 
explained that each feature selection approach has its own strengths and weak points and that the inclusion of a greater number of characteristics reduces accuracy; their survey demonstrated that feature selection algorithms consistently improve classifier accuracy [5]. zieba et al. employed boosted svm to estimate post-operative life expectancy in their study; during the research, an oracle-based technique was used to extract decision rules from the boosted svm in order to solve problems with unbalanced data [6]. sindhu et al. analyzed thoracic surgical data using six classification techniques (naive bayes, j48, part, oner, decision stump, and rf); the experiments showed that rf provides the greatest classification accuracy across all split percentages [1]. another study evaluated the performance of four machine learning algorithms (naive bayes, simple logistic regression, multilayer perceptron, and j48) together with their boosted variants using various measures; the outcomes showed that the boosted simple logistic regression approach outperforms, or is at least competitive with, the other machine learning techniques, with an average score of 84.5% [7]. in this work, supervised machine learning algorithms are used for post-operative life expectancy estimation after thoracic surgery, together with two ranker metrics, information gain (ig) and gain ratio (gr), which can be used to improve the accuracy of the algorithms and provide reasonable results.
3. methodology
in this work, as illustrated in fig. 1, the thoracic surgery dataset is used after being pre-processed to remove unbalanced and useless data and to fill missing values. the pre-processed dataset is then used in two different tests. the two main purposes of this paper are as follows: first, to analyze the effect of the number of attributes on the accuracy of machine learning for predicting post-operative life in lung cancer patients; reducing the number of attributes while increasing the accuracy is required to minimize the computational time of the prediction techniques. second, to compare the performance of the supervised classifiers before and after using ranker feature algorithms, employing the 10-fold cross-validation technique for splitting the dataset. notably, cross-validation is a method for evaluating a predictive model by partitioning the original sample into a training set to train the model and a validation/test set to evaluate it. the first test is done on the dataset employing the supervised machine learning classifiers, and the results are then compared with those of the other test according to some measurement criteria. the second test is done on the dataset using the attribute ranking methods (ig and gr) to eliminate the redundant and irrelevant attributes from the original set of attributes and to evaluate the importance of each attribute by measuring its ig and gr with regard to the class. after attribute evaluation, the dataset is separated randomly by applying 10-fold cross-validation, and then the classification process begins with the supervised classifiers to find the best performance among them. the final classification models of both tests are evaluated and compared based on the performance criteria explained in the next section. 3.1.
thoracic surgery corpus
the dataset used in this paper was collected from the records of patients suffering from lung cancer who underwent lung resections between 2007 and 2011 at the center for thoracic surgery in wroclaw, which is affiliated with the lower silesian center for pulmonary diseases and the department of thoracic surgery of wroclaw medical university. it is worth noting that this dataset was extracted from the wroclaw thoracic surgery centre data gathered by the national lung cancer registry of the polish institute of lung diseases and tuberculosis in warsaw [8]. in general, the dataset consists of 17 attributes (14 nominal and three numeric) with 470 records, which are detailed in table 1.

table 1: descriptions of thoracic surgery dataset attributes
id | attribute name | attribute type | attribute description
1 | dgn | nominal | diagnosis-specific combination of icd-10 codes for primary and secondary as well as multiple tumors if any (dgn3, dgn2, dgn4, dgn6, dgn5, dgn8, and dgn1)
2 | pre4 | numeric | forced vital capacity – fvc
3 | pre5 | numeric | volume exhaled at the end of the first second of forced expiration – fev1
4 | pre6 | nominal | performance status – zubrod scale (prz2, prz1, and prz0)
5 | pre7 | nominal | pain before surgery (t, f)
6 | pre8 | nominal | hemoptysis before surgery (t, f)
7 | pre9 | nominal | dyspnea before surgery (t, f)
8 | pre10 | nominal | cough before surgery (t, f)
9 | pre11 | nominal | weakness before surgery (t, f)
10 | pre14 | nominal | t in clinical tnm – size of the original tumor, from oc11 (smallest) to oc14 (largest) (oc11, oc14, oc12, and oc13)
11 | pre17 | nominal | type 2 dm – diabetes mellitus (t, f)
12 | pre19 | nominal | mi up to 6 months (t, f)
13 | pre25 | nominal | pad – peripheral arterial diseases (t, f)
14 | pre30 | nominal | smoking (t, f)
15 | pre32 | nominal | asthma (t, f)
16 | age | numeric | age at surgery
17 | risk1y | nominal | 1-year survival period – (t)rue value if died (t, f)

3.2. pre-processing
the dataset is pre-processed by removing unbalanced and useless data through the synthetic minority over-sampling technique (smote), a bootstrapping algorithm used to correct the class imbalance; another method, random over-sampling (ros), was also tested for the same issue. in this work, several new features are designed to better describe the underlying connections among the different dataset features, resulting in enhanced model performance [9]. correcting discrepancies in the data, reducing noise in outliers, and filling in missing values are carried out using the data pre-processing step known as data cleansing.
3.3. ranker feature selection
the two basic principles of ranker-based feature selection algorithms are as follows: first, features are evaluated according to their impact on the data classification or analysis process; second, a ranking list is built based on the scores, and the desired features (those most influential on the accuracy of the algorithm) are identified to create a subset. among the different types of rank-based feature selection algorithms, two main types, gr and ig, were adopted and applied to check whether they have a positive effect on the performance accuracy of the supervised algorithms used in this paper. indeed, the obtained results showed that after their application there was a relative increase in the performance of the algorithms [10].
fig. 1. flowchart of the proposed method.
3.3.1. gr
gr is an enhanced version of ig. it calculates the gain ratio of each feature with respect to the class. whereas ig tends to favour features with a large number of values, this method aims to maximize a feature's information gain while reducing the bias toward features with many values [11]:

gainratio(feature) = gain(feature) / splitinfo(feature)   (1)

the value of the splitting information is given below; it results from splitting the training dataset d into v partitions, each corresponding to one of the v outcomes of the attribute feature:

splitinfo_a(d) = − Σ_{j=1}^{v} (|d_j| / |d|) × log2(|d_j| / |d|)   (2)

3.3.2. ig
the attribute values are evaluated by the ig method by calculating the information gain with respect to the class, which is the difference in information between the case where the feature's value is known and the case where it is not. each feature is assigned a score indicating how much more information about the class is obtained when that feature is used [11]:

infogain(feature) = h(class) − h(class | feature)   (3)

where h refers to the entropy:

h(x) = − Σ_{i=1}^{n} p(x_i) log2 p(x_i)   (4)
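to make equations (1)-(4) concrete, the following is a small sketch of the two ranker scores computed over a pandas dataframe. this is not the weka implementation used in the paper: the csv file name is hypothetical, the class column risk1y and the other column names follow table 1 above, and numeric attributes such as pre4, pre5, and age would normally be discretized before ranking (weka handles this internally).

```python
import numpy as np
import pandas as pd

def entropy(series):
    # equation (4): h(x) = -sum_i p(x_i) * log2 p(x_i)
    p = series.value_counts(normalize=True).to_numpy()
    return float(-(p * np.log2(p)).sum())

def info_gain(df, feature, target):
    # equation (3): ig(feature) = h(class) - h(class | feature)
    h_class = entropy(df[target])
    h_cond = sum(len(sub) / len(df) * entropy(sub[target])
                 for _, sub in df.groupby(feature))
    return h_class - h_cond

def split_info(df, feature):
    # equation (2): -sum_j (|d_j|/|d|) * log2(|d_j|/|d|)
    p = df[feature].value_counts(normalize=True).to_numpy()
    return float(-(p * np.log2(p)).sum())

def gain_ratio(df, feature, target):
    # equation (1): gain(feature) / splitinfo(feature)
    si = split_info(df, feature)
    return info_gain(df, feature, target) / si if si > 0 else 0.0

# example (hypothetical file name): rank all attributes against the class risk1y
df = pd.read_csv("thoracic_surgery.csv")
scores = {col: (info_gain(df, col, "risk1y"), gain_ratio(df, col, "risk1y"))
          for col in df.columns if col != "risk1y"}
for col, (ig, gr) in sorted(scores.items(), key=lambda kv: kv[1][0], reverse=True):
    print(f"{col}: ig={ig:.3f}, gr={gr:.3f}")
```

the ranked list produced this way is what the second experiment of the paper uses to drop the lowest-scoring attributes before classification.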
10-fold cross-validation cross-validation is one of the standard machine learning techniques used in weka workbench. ten-fold crossvalidation is a mechanism for evaluating predictive models by dividing the original dataset into two subsets: the training set and the test set in which the used dataset is randomly divided into 10 equal-sized of subparts, one subpart is kept as validation data for testing, and the remaining nine parts are used as training data. hence, iterating the cross-validation process 10 times, the results for 10-fold can then be averaged to produce one evaluation. the advantage of this technique is that all the datasets will be used in both training set and testing set [12]. the reason for the selection of the cross-validation technique is that it reduces the variance in the estimation a lot more than the other techniques. accordingly, the dataset used in this paper has been separated according to this technique. this ensures that we will obtain the necessary estimations as well as monitor the performance of the classifiers. 3.5. supervised machine learning classifiers supervised learning mechanism is a type of machine learning in which machines are trained employing labelled training data. in other words, when the used dataset is divided into training and testing. the supervised learning mechanism is used on a training dataset consisting of known input data (x) table 1: descriptions of thoracic surgery dataset attributes attribute id attribute name attribute type attribute description 1 dgn nominal diagnosis-specific combination of icd-10 codes for primary and secondary as well multiple tumors if any (dgn3, dgn2, dgn4, dgn6, dgn5, dgn8, and dgn1) 2 pre4 numeric forced vital capacity – fvc 3 pre5 numeric volume that has been exhaled at the end of the first second of forced expiration – fev1 4 pre6 nominal performance status – zubrod scale (prz2, prz1, and prz0) 5 pre7 nominal pain before surgery (t,f) 6 pre8 nominal hemoptysis before surgery (t,f) 7 pre9 nominal dyspnea before surgery (t,f) 8 pre10 nominal cough before surgery (t,f) 9 pre11 nominal weakness before surgery (t,f) 10 pre14 nominal t in clinical tnm – size of the original tumor, from oc11 (smallest) to oc14 (largest) (oc11, oc14, oc12, and oc13) 11 pre17 nominal type 2 dm – diabetes mellitus (t,f) 12 pre19 nominal mi up to 6 months (t,f) 13 pre25 nominal pad – peripheral arterial diseases (t,f) 14 pre30 nominal smoking (t,f) 15 pre32 nominal asthma (t,f) 16 age numeric age at surgery 17 risk1y nominal 1 year survival period – (t)rue value if died (t,f) abdulhadi and talabani: comparative study of smla on tsp based on rfa 70 uhd journal of science and technology | jul 2021 | vol 5 | issue 2 and output variable (y) to build a module and implement it to predict the output variables (y) of the testing data [13]. the following are the supervised learning algorithms that have been used in this paper. 3.5.1. rf a rf algorithm, as its name suggests, is made up of a large number of individual decision trees that act as a set. each tree in the rf emerges from the prediction of the class and becomes the class with the most votes the basic principle behind the rf algorithm is a simple but powerful concept – the wisdom of the majority crowd. in data science, the reason the rf model is so successful is that a large number of relatively uncorrelated (trees) models acting as a committee will outperform any of the single-component models. the low correlation coefficient between the models is key. 
just like investments with low correlation coefficients that are combined in a portfolio, uncorrelated models can produce aggregate forecasts that are more accurate than any of the individual forecasts. the reason for this effect is that the trees protect each other from their individual mistakes (as long as they do not all err in the same direction constantly). while some trees may be wrong, many others will be right, so that the trees as a group move in the right direction [14]. the feature-importance formula of the algorithm is as follows [15]:

RFfi_i = \frac{\sum_{j \in all\ trees} normfi_{ij}}{T} (5)

where RFfi_i is the significance of feature i calculated from all trees in the rf model, normfi_{ij} is the normalized importance of feature i in tree j, and T is the total number of trees.
3.5.2. j48
classification with a decision tree uses information gain to split the tree. the first step is to compute the information gain of each attribute; the attribute with the largest ig becomes the root node of the decision tree. the decision tree technique divides the database toward a goal that has been determined in advance, and an element ends up in one of the groups (represented here by the branches) because it satisfies the series of conditions leading to that branch, not merely because it is similar to the rest of the elements in the group [16], although similarity is not otherwise defined in this case. the j48 algorithm and the procedures used to produce the tree can be complex, but the results can be presented in a simple, easy-to-understand, and highly useful form. the algorithm steps are as follows: first, if the instances belong to the same class, the leaf is labelled with that class; second, the expected information for each attribute is calculated, together with the information gain obtained from testing on that attribute; third, based on the current selection criterion, the best attribute is chosen for the split.
3.5.3. naive bayes
naive bayes is a classification model in machine learning based on probability. a naive bayesian model is simple to construct and does not require iterative parameter estimation, which makes it suitable for huge datasets [17]. from p(c), p(x), and p(x|c), bayes' theorem gives the posterior probability p(c|x). according to the naive bayes classifier, the effect of the value of a predictor (x) on a given class (c) is independent of the values of the other predictors. the model is formulated as:

P(c|x) = \frac{P(x|c)\, P(c)}{P(x)} (6)

P(c|x) = P(x_1|c)\, P(x_2|c) \cdots P(x_n|c)\, P(c)

where p(c|x) is the posterior probability of the class (target) given the predictor (attribute), p(c) is the prior probability of the class, p(x|c) is the likelihood, that is, the probability of the predictor given the class, and p(x) is the prior probability of the predictor.
3.5.4. multilayer perceptron
the multilayer perceptron is a category of feedforward ann that creates a set of outputs from a set of inputs. the perceptron, which combines numerous inputs x_i, each multiplied by a scalar value known as the weight w_ij, with a bias b_j, was one of the earliest processing elements (pes) constructed [18].
a specified activation function f is used to process the result, which may be expressed as follows:

y_j = f\left(\sum_i (w_{ij} \cdot x_i) + b_j\right) (7), (8)

the hyperbolic tangent function tanh, shown below, is the activation function f most frequently used in the perceptron:

\tanh(x) = \frac{2}{1 + e^{-2x}} - 1 (9)

where x is the weighted sum of the inputs. the mlp network is used to solve nonlinear separation problems by connecting numerous perceptrons in one or more hidden-layer topologies. the aim is to find the connection weights that give the lowest possible value of the error function, which is defined as follows:

E = \frac{1}{2} \sum_m (y_m - \hat{y}_m)^2 (10)

with \hat{y}_m being the desired output for the m-th output y_m.
3.5.5. svm
the svm algorithm classifies data into two divisions by taking input data and predicting the output. the technique builds a model from training data in which every training sample belongs to one of the two classes; the data are then separated by constructing an n-dimensional hyperplane. to separate the data, svm builds two parallel hyperplanes, one on each side of the separating hyperplane, while the separating hyperplane maximizes the space between them [19]. svm is also capable of performing regression analysis by extending the same idea through a numerical calculation. the formula of the algorithm is shown below:

K(x, y) = (x \cdot y + c) (11)

4. experiments and results
in machine learning, and specifically in the field of data classification, there are many commonly accepted criteria for measuring the classification performance of machine learning algorithms. in this research, the measures shown in the following tables were used to explain the difference in the performance of the algorithms used to classify the data. then, the performance of each algorithm is compared before and after applying each of the ranker feature selection algorithms gr and ig. in general, the results in tables 2 and 3 and figs. 2-5 make it clear that there is a difference in the stability of the classification performance of the algorithms with and without ranker feature selection when classifying the thoracic surgery dataset.

table 2: performance measurements before implementing ranker attribute evaluators
supervised algorithms   precision   recall   f-measure   accuracy
j48                     0.200       0.014    0.027       84.46%
rf                      0.182       0.029    0.049       83.62%
nb                      0.208       0.157    0.179       78.51%
mlp                     0.259       0.214    0.234       79.14%
svm                     0.724       0.849    0.782       84.89%

fig. 2. accuracy of the classifiers before feature selections.
fig. 3. precision/recall and f-measure of the classifiers before feature selections.
fig. 4. accuracy and error rate of the classifiers after using feature selections.

to begin with, the classification accuracy of mlp is 79.14% without ranker feature selections, as shown in table 2 and fig.
2, this accuracy is enhanced with ranker gr and ig to 81.063% and 83.404%, respectively, as shown in table 3 and fig. 4. another point to consider is with regard to the rf algorithm, we notice a decrement in performance accuracy of rf which was 83.62% without ranker feature selections, as shown in table 2 and fig. 2, the accuracy is raised after employing ranker gr and ig to 81.702% and 81.063%, respectively, as shown in table 3 and fig. 4. whereas, in testing svm algorithm, there are no changes observed in the accuracy during classification as it remains equal in both cases and its performance did not change with both feature selections, the accuracy without ranker selections was 84.89%, as shown in table 2 and fig. 2, and it remains stable with no any effectiveness with ranker selections with accuracy 84.89% for both feature selection algorithms gr and ig, as shown in table 3 and fig. 4. in table 4, it is clear that svm is the most accurate algorithm in classifying instances correctly with 399 instances out of a total of 470 instances units without ranker feature selections. however, it is not the fastest in constructing the model, as it took 0.09 seconds for classifying the whole dataset records. table 3: performance measurements after implementing ranker attribute evaluators (gr)/(ig) performance measurements j48 rf nb mlp svm gr ig gr ig gr ig gr ig gr ig precision 0.724 0.724 0.750 0.751 0.744 0.745 0.767 0.799 0.724 0.724 recall 0.851 0.849 0.817 0.811 0.828 0.819 0.811 0.834 0.849 0.849 f-measure 0.783 0.782 0.777 0.776 0.777 0.776 0.785 0.811 0.782 0.782 error rate % 14.893 15.106 18.297 18.936 17.234 18.085 18.936 16.595 15.106 15.106 accuracy % 85.106 84.893 81.702 81.063 82.766 81.914 81.063 83.404 84.893 84.893 table 4: classification/time measurements before implementing ranker attribute evaluators classification measurements j48 rf nb mlp svm correctly classified instances 397 393 369 372 399 incorrectly classified instances 73 77 101 98 71 time (milliseconds) 30 210 9 1820 90 table 5: classification/time measurements after implementing ranker attribute evaluators (gr)/(ig) classification measurements j48 rf nb mlp svm gr ig gr ig gr ig gr ig gr ig correctly classified instances 400 399 384 381 389 385 381 392 399 399 incorrectly classified instances 70 71 86 89 81 85 89 78 71 71 time (milliseconds) 40 10 140 90 10 9 1150 1290 30 40 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 gr ig gr ig gr ig gr ig gr ig j48 rf nb mlp svm precision recall f-measure fig. 5. precision/recall and f-measure of the classifiers after ranker evaluators. in j48, nb, and mlp algorithms, we noticed an increment in accuracy of the classification performance, in which the accuracy of j48 is 84.46% without using ranker feature selections, as shown in table 2 and fig. 2, this performance has been improved using ranker feature selections gr and ig to 85.106% and 84.893%, respectively, as shown in table 3 and fig. 4. furthermore, the classification performance accuracy of nb is 78.51% without ranker, as shown in table 2 and fig. 2, the performance is raised with ranker gr and ig to 82.766% and 81.914%, respectively, as shown in table 3 and fig. 4. moreover, the classification performance abdulhadi and talabani: comparative study of smla on tsp based on rfa uhd journal of science and technology | jul 2021 | vol 5 | issue 2 73 besides, mlp is the slowest algorithm among the other algorithms in the classification process as it took 1.82 seconds without using ranker feature selections. 
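the comparisons reported in tables 2-5 can be reproduced in outline with standard tooling. the sketch below is illustrative only: the paper used the weka workbench and its ranker attribute evaluators, whereas here scikit-learn's mutual-information score stands in for ig (gain ratio has no direct scikit-learn equivalent), and the file name, column names, and the choice of eight retained attributes are assumptions rather than details taken from the paper.

```python
# illustrative sketch: 10-fold cv with and without a ranker-style feature filter.
# assumptions: a thoracic.csv file with a binary "risk1y" target column; k=8 kept
# features; mutual information used as a stand-in for weka's infogain evaluator.
import pandas as pd
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.tree import DecisionTreeClassifier       # rough analogue of j48
from sklearn.ensemble import RandomForestClassifier   # rf
from sklearn.naive_bayes import GaussianNB             # nb
from sklearn.neural_network import MLPClassifier       # mlp
from sklearn.svm import SVC                             # svm

data = pd.read_csv("thoracic.csv")                     # hypothetical file name
X = pd.get_dummies(data.drop(columns="risk1y"))        # one-hot encode nominal attributes
y = data["risk1y"]

classifiers = {
    "j48": DecisionTreeClassifier(),
    "rf": RandomForestClassifier(n_estimators=100),
    "nb": GaussianNB(),
    "mlp": MLPClassifier(max_iter=1000),
    "svm": SVC(kernel="linear"),
}
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)

for name, clf in classifiers.items():
    plain = cross_val_score(clf, X, y, cv=cv, scoring="accuracy").mean()
    ranked = make_pipeline(SelectKBest(mutual_info_classif, k=8), clf)
    filtered = cross_val_score(ranked, X, y, cv=cv, scoring="accuracy").mean()
    print(f"{name}: accuracy {plain:.3f} (all attributes) vs {filtered:.3f} (top-8)")
```

the exact numbers will differ from tables 2-5, since weka's evaluators and classifier implementations are not identical to scikit-learn's.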
in contrast, nb is the lowest in classifying instances correctly with 369 instances out of a total of 470 instances without using franker feature selections. however, it is the fastest in building the model, as it took 0.00 seconds to classify the whole dataset records. in table 5, a drastic change can be observed, it is clear that j48 is the most accurate algorithm in classifying instances correctly with 400 instances out of a total of 470 instances units with ranker feature. however, it is one of the fastest algorithms in constructing the model using ig which took 10 milliseconds for classifying the whole dataset records. in contrast, both rf using ig and mlp using gr are the lowest in classifying instances correctly with 381 instances out of a total of 470 instances without ranker feature. furthermore, mlp remained the slowest in building the model, as it took 1290 milliseconds to classify the whole dataset records using ig. the nb remained the fastest algorithm among the others in the classification models as it took 0.00 seconds with ig. in contrast, both rf using ig and mlp using gr are the lowest in classifying instances correctly with 381 instances out of a total of 470 instances without franker feature selections. however, mlp remained the slowest in building the model, as it took 1290 milliseconds in classifying the whole datasets using ig. finally, nb remained the fastest algorithm among the other algorithms in classifying the dataset as it took 9 milliseconds with using ig. 5. conclusion the comparison made in this paper showed a significant effect of the ranker features on supervised classification algorithms. through the obtained results, we concluded that the use of ranker feature selections leads to improving the classification performance of particular algorithms, as done with j48, mlp, and nb algorithms. in contrast, ranker feature selection reduced the performance of rf. moreover, specific algorithms such as svm remained stable before and after ranker feature selection concerning classification performance. similarly, as for the speed of building the model, the nb algorithm did not change its speed in both cases by recording the least time for data classification and the fastest among the other algorithms, 9 milliseconds. eventually, the highest performance in the accuracy of classification was the j48 algorithm using gr, which amounted to 85.1%. other feature selection algorithms can be employed to improve the used algorithms’ performance in future work. references [1] s. prabha, s. veni and s. prabha. “thoracic surgery analysis using data mining techniques”. international journal of computer technology and applications, vol. 5, no. 1, pp. 578-586, 2014. [2] k. kourou, t. p. exarchos, k. p. exarchos, m. v. karamouzis and d. i. fotiadisa. “machine learning applications in cancer prognosis and prediction”. computational and structural biotechnology journal, vol. 13, pp. 8-17, 2015. [3] a. s.dusky and l. m. el bakrawy. “improved prediction of postoperative life expectancy after thoracic surgery”. advances in systems science and applications, vol. 16, no. 2, pp. 70-80, 2016. [4] j. joshi, r. doshi and j. patel. “diagnosis of breast cancer using clustering data mining approach”. international journal of computer applications, vol. 101, no. 10, pp. 13-17, 2014. [5] s. vanaja and k. r. kumar. “analysis of feature selection algorithms on classification: a survey”. international journal of computer applications, vol. 96, no. 17, pp. 29-35, 2014. [6] m. zięba, j. tomczak, m. 
lubicz and j. świątek. “boosted svm for extracting rules from imbalanced data in application to prediction of the post-operative life expectancy in the lung cancer patients”. applied soft computing, vol. 14, pp. 99-108, 2014. [7] m. u. harun and n. alam. “predicting outcome of thoracic surgery by data mining techniques”. international journal of advanced research in computer science and software engineering, vol. 5, no. 1, pp. 7-10, 2015. [8] m. lubicz, k. pawelczyk, a. rzechonek and j. kolodziej. “uci machine learning repository: thoracic surgery data data set”, 2021. available from: https://archive.ics.uci.edu/ml/datasets/ thoracic+surgery+data [last accessed on 2021 oct 08]. [9] s. xu. “machine learning-assisted prediction of surgical mortality of lung cancer patients”. the ieee international conference on data mining, 2019. [10] s. subbiah and j. chinnappan. “an improved short term load forecasting with ranker based feature selection technique”. journal of intelligent and fuzzy systems, vol. 39, no. 5, pp. 6783-6800, 2020. [11] d. el zein and a. kalakech. “feature selection for android keystroke dynamics”. 2018 international arab conference on information technology, 2018. [12] h. talabani and a. v. c. engin. “performance comparison of svm kernel types on child autism disease database”. international conference on artificial intelligence and data processing, 2018. [13] f. y. osisanwo, j. e. t. akinsola, o. awodele, j. o. hinmikaiye, o. olakanmi and j. akinjobi. “supervised machine learning algorithms: classification and comparison”. international journal of computer trends and technology, vol. 48, no. 3, pp. 128-138, 2017. [14] m. rathi and v. pareek. “spam mail detection through data mining a comparative performance analysis”. international journal of modern education and computer science, vol. 5, no. 12, pp. 3139, 2013. [15] j. wong. “decision trees medium”, 2021. available from: https:// towardsdatascience.com/decision-trees-14a48b55f297 [last accessed on 2021 oct 08]. [16] a. yadav and s. chandel. “solar energy potential assessment of western himalayan indian state of himachal pradesh using j48 algorithm of weka in ann based prediction model”. renewable energy, vol. 75, pp. 675-693, 2015. [17] k. vembandasamy, r. sasipriya and e. deepa. “heart diseases abdulhadi and talabani: comparative study of smla on tsp based on rfa 74 uhd journal of science and technology | jul 2021 | vol 5 | issue 2 detection using naive bayes algorithm”. international journal of innovative science, engineering and technology, vol. 9, no. 29, pp. 441-444, 2015. [18] m. khishe and a. safari. “classification of sonar targets using an mlp neural network trained by dragonfly algorithm”. wireless personal communications, vol. 108, no. 4, pp. 2241-2260, 2019. [19] h. talabani and a. v. c. engin. “impact of various kernels on support vector machine classification performance for treating wart disease”. international conference on artificial intelligence and data processing, 2018. tx_1~abs:at/tx_2:abs~at uhd journal of science and technology | jul 2021 | vol 5 | issue 2 11 1. introduction unusual behavior detection refers to a finding patterns process in a dataset that does not have the expected behavior. network intrusion is also known as a set of unusual behaviors in the network. mode detection provides essential information in a variety of applications that will improve network performance. 
an intrusion detection system (ids) is a device or software application that monitors the network to look for suspicious activity, threats, or policy breaching, and on encountering such activities, it alerts the security personnel. ids monitors inbound as well as outbound network flow for abnormal behavior and then alert the admin or user that a network intrusion might be occurring. it performs the task by comparing signatures of a known malware against the system. it monitors the user behavior, system processes, and system configurations for any unusual behavior. security personnel is alerted on security breaches with data consisting of the addresses of the source, the target, and the type of attack. the problem of intrusion detection is a complicated issue. a compromise must be made between detection accuracy, detection speed, intrinsic network dynamics, and high data volume for processing, and the methods used must be able to distinguish between state and abnormal behaviors. be normal behaviors in the network. the ids’s primary purpose can be considered network display for any mode such as dos, u2r, r2l, some of which are listed in table 1 [1]. network intrusion detection using a combination of fuzzy clustering and ant colony algorithm yadgar sirwan abdulrahman it department kurdistan technical institute, sulaymaniyah, kurdistan region, iraq a b s t r a c t as information technology grows, network security is a significant issue and challenge. the intrusion detection system (ids) is known as the main component of a secure network. an ids can be considered a set of tools to help identify and report abnormal activities in the network. in this study, we use data mining of a new framework using fuzzy tools and combine it with the ant colony optimization algorithm (acor) to overcome the shortcomings of the k-means clustering method and improve detection accuracy in idss. introduced ids. the acor algorithm is recognized as a fast and accurate meta-method for optimization problems. we combine the improved acor with the fuzzy c-means algorithm to achieve efficient clustering and intrusion detection. our proposed hybrid algorithm is reviewed with the nsl-kdd dataset and the iscx 2012 dataset using various criteria. for further evaluation, our method is compared to other tasks, and the results are compared show that the proposed algorithm has performed better in all cases. index terms: intrusion detection, data mining, fuzzy clustering, ant colony corresponding author’s e-mail:  yadgar sirwan abdulrahman, it department kurdistan technical institute, sulaymaniyah, kurdistan region, iraq. e-mail: yadgar.abdulrahman@kti.edu.krd receied: 01-04-2021 accepted: 07-07-2021 published: 16-07-2021 access this article online doi: 10.21928/uhdjst.v5n2y2021.pp11-19 e-issn: 2521-4217 p-issn: 2521-4209 copyright © 2021 abdulrahman. this is an open access article distributed under the creative commons attribution non-commercial no derivatives license 4.0 (cc by-nc-nd 4.0) o r i g i n a l r e s e a r c h a r t i c l e uhd journal of science and technology abdulrahman: nid using fc-acor 12 uhd journal of science and technology | jul 2021 | vol 5 | issue 2 among these modes, distributed denial of service (ddos) is one of the security threats associated with computer networks, especially the internet, which targets access to network resources, and the purpose of this mode is to disrupt network service. 
one of the most dangerous and most recent situations on the internet is not to disrupt the service, but to force the network and server to be unable to provide regular service to target network bandwidth. it is done with a victim who drowns the victim’s network or processing capacity in information packets and prevents users and customers from accessing the service. one of the most common and significant threats on the internet today is a denial of service by interfering with configuration information. routers and ip source fraud occurs, leading to reduced network performance. for years, experts have warned about the poor security of internet-connected devices and equipment, and poor security has made them vulnerable configuration of equipment and the heterogeneity of operating systems such devices a very convenient yet easy target for attackers. one of the main exploits of hackers and destroyers of these devices and equipment is to capture them to execute the distributed model. during this state, an army of these hacked devices bombards it by sending simultaneous requests to the victim’s server, which is called this type of hacked equipment with a net. receiving a request from thousands and sometimes tens of thousands of devices with different ip addresses at the same time will eventually lead to slowing down or even stopping the server service to users. when a ddos attack occurs, the first step is to determine what layer of the open systems interconnection model the attack is on; the mode is usually on layers 3 and 4 of the network and the scope of an attack depends on features such as volume and the number of packets sent per second. layers 3 and 4 are very difficult to control. dispose of it. the issue of identifying and providing a suitable solution is one of the biggest challenges facing network security professionals. the methods of diagnosis and prevention that have been presented so far have either not been effective or have not been adequately responded to by attackers with increasing level of knowledge, and most of the detection is in the form of statistical methods and monitoring and control of network traffic. in the case of this type of attack and high traffic, if two attacks occur with different traffics combined in the network, this type of method will no longer be a good answer us, and because the attack speed and the network traffic volume is very high in a short time, it should be possible to detect and deal with the attack as soon as possible, in other words, the system notices the denial of service when the network attack increases and affects traffic volume and it is no longer possible to deal with, and when it detects an attack on the traffic volume network, it will be more careful with the defense layers on the internet, before the defense layer service can react. for example, if we can detect the mode in network routers and these hand-held routers can make a proper diagnosis, the probability of service denial mode is reduced and further risks are avoided. in this problem, the attack detection importance in the lower layers of the network, such as the network layer and the data link layer, is seen more, and in the higher layers, such as the application layer, it requires data packets to be examined, which will take many of our resources and time. a computer network attack detection system is one of the most important parts to prevent illegal intrusions in the network. 
detecting and detecting intrusion can reduce the misuse of individuals’ personal information as well as prevent financial risks for users and service companies. various algorithms and methods for this classification have been proposed in the previous works, each of which has its advantages and disadvantages. in this study, we present our proposed framework along with the improvements in the algorithms used to distinguish dos/ddos mode from normal network mode in the iscxids2012 dataset, which are fully described in section 3. this study tries to provide a suitable framework for attack detection using fuzzy clustering and feature selection. this paper’s structure is as follows: section 2, we mentioned an overview of the related work. section 3 describes the table 1: the network attack types classification. attack class attack name attack description probing probe an attacker performs port scanning and monitoring activities to gather information or find vulnerabilities in a network dos denial of services an attacker fills a busy network resource (such as memory or bandwidth) with repeated requests, causing network resources to overflow and users’ requests not to be answered user to root u2r an attacker accesses a regular account and searches for a vulnerability to gain unauthorized root access to the system r2l remote to local an attacker gains access to a system through a remote network and attempts to gain unauthorized local access through a remote system abdulrahman: nid using fc-acor uhd journal of science and technology | jul 2021 | vol 5 | issue 2 13 problem and the proposed method. section 4 will present the experiments and results and finally, the conclusion is stated in section 5. 2. related work much work has been done in the field of intrusion detection in computer networks, and in this section, we will briefly mention some of these studies. in chitrakar and chuanhe [2], to solve the problem of high data volume requirements in works that use k-means or k-means clustering and forward neural network, a combination of support vector machine (svm) with k-means clustering the kyoto 2006+ dataset is used in this work and the simulation results show that the use of svm in any volume of data has higher classification accuracy. to evaluate the operations performed in this work, sensitivity criteria. and false alarm has been used in chitrakar and chuanhe [2], to detect network intrusion, the combined method of k-means clustering with naive bayes classification with the same working criteria and the same dataset (kyoto 2006+ dataset) has been used. the results of this work were somewhat weaker than the work done in chitrakar and chuanhe [2]. in saifullah [3], they propose a defense mechanism to detect an attack using a distributed algorithm that runs a moderate load valve in the opposite direction of the router. the valve has a medium load because the traffic intended for the server is controlled [3]. the operation (increase or decrease) is performed using a perforated bucket in the router based on the number of connected users, who are directly or indirectly connected to the router. at the beginning of the algorithm, the remaining capacity is underestimated by the router. the remaining initialization capacity is the minimum or normal value at the beginning of the algorithm. the speed is updated (increase or decrease), sends to small routers based on server feedback, and finally multiplies all routers in descending order. 
the convergence of the whole server is loaded with an acceptable capacity range. in syarif et al. [4] there are three objectives: (1) effective feature selection and dimension reduction, –(2) a strong algorithm selection in the classification field, and –(3) unconventional detection, using clustering algorithms based on segmentation; to achieve the first goal genetic algorithm and particle swarm optimization (pso) are used. for the classification operation, the nearest neighbor classification method has been used, and finally, by comparing different types of clustering methods, the expectation maximization method has had the best performance. the results for falsepositive and accuracy criteria have been investigated using four classification methods, the highest accuracy being related to the decision tree. in revathi and malathi [5], just like [4] methods of optimizing collective intelligence have been used. this work uses simplified collective intelligence, which is a kind of simplified and improved pso algorithm (sso) along with the random forest method on the kddcup 1999 dataset. in singh and singh [6] an attack detection method is presented in manet based on the aco algorithm. in this work, after the intrusion detection by the ant colony algorithm, the genetic algorithm has been used to retrieve the network, in which the number of recovered nodes has been investigated for the number of 10–80 replications and the probability of mutation between 0.2 and 0.4. in both papers cheng et al. [7] and xia et al. [8] ip address interaction (iai), the algorithm is proposed due to sudden traffic changes, the interaction between addresses, the asymmetry between many addresses, source distribution, and focus of targeted goals, iai algorithm is designed to describe the network stream validity important features. the svm classifier, which is sorted by iai interval with normal attack current, is used to classify the current network stream validity and identify ddos mode. the method is defined in real-time attack flow detection as well as attacker assessment power based on fuzzy reasoning. the process consists of two steps: (a) statistical analysis of network traffic interval and (b) ddos attack identification and evaluation based on intelligent fuzzy reasoning mechanism. in kamarudin et al. [9], the feature selection was performed using the random forest genetic algorithm without an observer and a subset of the total features for each of the two datasets darpa1999 and iscx2012 was obtained. the use of random forest classification has been performed. in vargas-munoz et al. [10], an ids system based on the bayesian network is also presented. eight features, the iccx2012 feature set is used in the classification section.[11] feature selection is based on entropy. in their work, the raw features of the iscx2012 dataset and five features based on entropy values are considered. they classify mlp neural network, rnn (alternative decision tree) adt has been used. 3. proposed method in this study, we present our proposed framework along with the improvements in the algorithms used to distinguish dos/ ddos attacks in normal network mode in the iscxids2012 abdulrahman: nid using fc-acor 14 uhd journal of science and technology | jul 2021 | vol 5 | issue 2 dataset. all the steps to achieve attack detection using data mining and artificial intelligence can be summarized in the following sections. 
• data preprocessing • combined clustering using ant colony optimization and fuzzy clustering • classification and review of quantitative criteria. after performing these steps, the constructed approach can be used. the most important feature of this study is to present a model for ddos mode detection that improves the accuracy of the combination of the fuzzy c-means algorithm and the ant colony optimization and improves this algorithm in clustering, which improves the detection accuracy. in the following, the main steps of the proposed method are discussed. 3.1. data collection and preprocessing the study uses two datasets, nsl-kdd and iscx datasets, which are described in the following. the number of training instances in each attack class is shown in both kdd train (kdd cnp99) and nsl-kdd(kdd train+) datasets in table 2. the nsl-kdd dataset also includes two training datasets. kdd train+ and kdd train+_20% of which kdd train+_20% is the improved version of kdd train+. test examples for the nsl-kdd dataset also include two sets test kdd+ and kdd test-21. kdd test-21 has more difficulty distinguishing samples than kdd test+ as can be seen, most of the samples removed from kdd cup are in dos mode with a removal rate of 82.98%. nsl-kdd is obtained by removing approximately 43.97% of the samples in the kupd cup99 dataset. in total, the nsl-kdd dataset has 25,192 samples and 43 features. (this dataset is comprised four sub-datasets: kddtest+, kddtest-21, kddtrain+, and kddtrain+_20percent, although kddtest-21 and kddtrain+_20percent are subsets of the kddtrain+ and kddtest+. from now on, kddtrain+ will be referred to as train and kddtest+ will be referred to as a test. the kddtest-21 is a subset of the test, without the most difficult traffic records (score of 21), and the kddtrain+_20percent is a subset of the train, whose record count makes up 20% of the entire train dataset. that being said, the traffic records that exist in the kddtest-21 and kddtrain+_20percent are already in test and train, respectively, and are not new records held out of either dataset.) as mentioned earlier, this article also uses the iscx dataset [12]. the structure of the network used to generate this dataset is shown in fig. 1. as shown in fig. 1, the test structure consists of four separate lans, and the fifth lan consists of servers that provide web, email, dns, and nat services. all links are set on 10m bits/s. the data began on friday, june 11, 2020, and lasted exactly 7 days. this article examines the ddos mode detection performed on tuesday compared to the normal network mode (no attack) performed on friday. given that this dataset is available in pcap format, we use cicflowmeter software. together with winpcap software, we have used to extract 24 features. then specify the data path in pcap format and the csv data storage path to obtain user data in the preprocessing section. 4. clustering 4.1. ant colony optimization algorithm (acor) algorithm we discuss the design process of the ant colony optimization algorithm of the continuous domain for solving unconstrained optimization problems and constrained optimization problems based on the position distribution model of ant colony foraging [13]. assuming the whole ant colony consists of m groups of substructure, each group contains n of ants. 
table 2: nsl-kdd data information
class     kddcup'99 (kdd train)   nsl-kdd (kddtrain+)   % reduction   % in nsl-kdd
normal    972,781                 67,343                93.07         53.46
dos       3,883,370               45,927                98.82         36.46
probe     41,102                  11,656                71.64         9.25
u2r       52                      52                    0             0.04
r2l       1126                    995                   11.63         0.79
total     4,898,431               125,973               97.43

fig. 1. data generation network structure [12].

this arrangement is shown in the following equation:

ant = \begin{bmatrix} ant_{11} & ant_{12} & \cdots & ant_{1n} \\ ant_{21} & ant_{22} & \cdots & ant_{2n} \\ \vdots & \vdots & & \vdots \\ ant_{m1} & ant_{m2} & \cdots & ant_{mn} \end{bmatrix} (1)

where the columns correspond to the variables x_1, x_2, ..., x_n. the position ant_{ij} corresponds to the value of the variable x_j for ant j in sub-colony i, and the sequence {ant_{i1}, ant_{i2}, ..., ant_{in}} of all the ants in sub-colony i represents one solution of the optimization problem. in the position distribution model of ant colony foraging, each ant releases pheromone according to the quality of the food source at its position; the pheromone is dispersed over the entire space, with the concentration decreasing as the distance from the source increases. therefore, we need to choose a probability density function as the distribution model of the ant pheromone in the continuous-domain optimization algorithm. the gaussian function is a common probability density function, and we assume the ants of the colony release pheromone according to it. at this point, the pheromone distribution model τ_{ij}(x) corresponding to ant j in any sub-colony ant_i can be expressed as

\tau_{ij}(x) = \frac{1}{\sqrt{2\pi}\,\sigma_j}\, e^{-\frac{(x-\mu_{ij})^2}{2\sigma_j^2}} (2)

where µ_{ij} is the position ant_{ij} of ant j in sub-colony i, namely the distribution centre, σ_j (σ_j > 0) is the width of the distribution function, u_j is the maximum allowable value of the variable x_j, l_j is the minimum allowable value of the variable x_j, n is the dimension of the solution of the optimization problem, and ψ (ψ > 0) is a parameter used to adjust the size of σ_j. before updating the position of the ant colony, we need to choose one group as a parent from the m sub-colonies. first, we use formula (3) to calculate the assessment value of the solution corresponding to each sub-colony:

eval_i = \frac{1}{1 + e^{\,f(ant_{i1},\, ant_{i2},\, \ldots,\, ant_{in})/t}} (3)

where f(ant_{i1}, ant_{i2}, ..., ant_{in}) is the objective value of sub-colony ant_i, and t (t > 0) is the adjustment coefficient used to tune the selection pressure. after the assessment value of each sub-colony is obtained, we calculate the selection probability of each sub-colony according to

p_i = \frac{eval_i}{\sum_{j=1}^{m} eval_j} (4)

finally, we select the parent colony c according to formula (5):

c = \begin{cases} \arg\max_{i=1,2,\ldots,m} (eval_i), & q \le q_0 \\ C, & q > q_0 \end{cases} (5)

where q_0 (0 ≤ q_0 ≤ 1) is a given parameter, q is a random variable distributed uniformly in [0, 1], and C is a random variable generated according to the probabilities in formula (4). after the parent ant colony c is obtained, its pheromone distribution model τ_{cj}(x) is used as a random number generator for sampling, and k groups of child colonies are generated. then, according to the assessment values, we select the m groups with the largest assessment values from the (m + k) groups to update the position of the ant colony.
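the sampling scheme of equations (1)-(5) can be summarized in a short sketch. the code below is a simplified illustration under stated assumptions: the objective function f, the parameters t, q0, k, and psi, and the width rule for the gaussian are placeholders, and the width is taken directly from the variable ranges rather than from the exact rule typeset in the paper.

```python
# simplified acor-style update: rank-based parent choice (eqs. 3-5) and
# gaussian sampling around the parent (eq. 2). illustration only.
import numpy as np

rng = np.random.default_rng(0)

def acor_step(colony, f, lower, upper, t=2.0, q0=0.9, k=15, psi=0.1):
    """colony: (m, n) ant positions; f: objective to minimise; returns updated colony."""
    m, n = colony.shape
    evals = 1.0 / (1.0 + np.exp(np.apply_along_axis(f, 1, colony) / t))   # eq. (3)
    probs = evals / evals.sum()                                            # eq. (4)
    # pseudo-random proportional parent selection, eq. (5)
    if rng.random() <= q0:
        parent = colony[np.argmax(evals)]
    else:
        parent = colony[rng.choice(m, p=probs)]
    sigma = psi * (upper - lower)                 # assumed width rule for eq. (2)
    children = rng.normal(parent, sigma, size=(k, n))
    children = np.clip(children, lower, upper)
    # keep the m best of the (m + k) candidate sub-colonies
    pool = np.vstack([colony, children])
    pool_eval = 1.0 / (1.0 + np.exp(np.apply_along_axis(f, 1, pool) / t))
    return pool[np.argsort(pool_eval)[::-1][:m]]
```

a full run would iterate acor_step until the colony converges and then hand the best positions to the fuzzy c-means stage described next.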
4.2. basic fuzzy c-means
in this section, the basic fuzzy c-means algorithm [14] is briefly introduced. the objective function of this algorithm is defined as below:

J_m = \sum_{i=1}^{n} \sum_{j=1}^{c} \mu_{ij}^{m}\, d^2(x_i - v_j) (6)

the membership µ_{ij} determines the degree to which the i-th sample belongs to the centre of cluster j, and m determines the degree of fuzziness. here d^2(x_i − v_j) is the squared distance, equal to (x_i − v_j)^2, where x_i is the i-th sample and v_j is the centre of the j-th cluster. for this objective function, the constraints 0 < \sum_{i=1}^{n} \mu_{ij} < n and µ_{ij} ∈ [0, 1] hold, with the indices in the ranges 1 ≤ i ≤ n and 1 ≤ j ≤ c. based on the objective function introduced in equation (6), the update equations for the cluster centres and the membership functions are as follows:

\mu_{ij} = \frac{1}{\sum_{l=1}^{c} \left( \frac{d(x_i - v_j)}{d(x_i - v_l)} \right)^{2/(m-1)}} (7)

v_j = \frac{\sum_{i=1}^{n} \mu_{ij}^{m}\, x_i}{\sum_{i=1}^{n} \mu_{ij}^{m}} (8)

4.3. acor improvement
as mentioned before, the x matrix is the original data matrix with dimensions n×d, where n is the number of observations and d is the number of attributes. the output of the first layer is the x matrix with dimensions n×r, where r ≤ d is the number of selected attributes. the second layer is responsible for clustering the data into k clusters. the method proposed for this layer combines the fuzzy c-means clustering algorithm with the acor optimization algorithm. furthermore, changes have been applied to the acor algorithm [13] and the fuzzy c-means algorithm [14] to improve the performance of the proposed method, as stated below. in the acor algorithm, an exponential relationship is used to determine the weights, which causes a large difference between the weights of high- and low-ranked answers. in other words, high-ranking answers dominate each iteration, which reduces the breadth of the search and can trap the algorithm in a local minimum. for this reason, in this study we propose the following weighting function. in this case, the weight still increases as the rank of the answer improves, but the change is much smoother than in the scheme suggested by socha and dorigo [13]. the proposed function, for a fixed mean and different standard deviations (σ = kq), is shown in fig. 2; as can be seen, it is an s-shaped function: the cumulative distribution function (cdf) of the normal, or gaussian, distribution with standard deviation σ and mean µ. the weighting τ_{ij}(x) is then calculated as described below. instead of using formula (2) of [13] directly, and taking σ = kq, we can rewrite it as formula (9); this arithmetic estimation gives a smoother weighting function. as in reference [15], we can compute erf(x); this is not a new formula, but an approximation applied to the main weighting function of the acor algorithm, which gives a smoother result in practice. we can therefore use the following cdf in place of formula (2), which is a pdf (gaussian) function:

a\, e^{-b\left(\frac{x-\mu}{\sigma}\right)^2} \;\approx\; \frac{1}{2}\left[1 + \operatorname{erf}\!\left(\frac{x-\mu}{\sigma\sqrt{2}}\right)\right] (9)

here, if we take a = 1/(\sqrt{2\pi}\,\sigma) and b = 1/2, then from formulas (9) and (2) we have

\tau_{ij}(x) = \frac{1}{2}\left[1 + \operatorname{erf}\!\left(\frac{x-\mu_{ij}}{\sigma\sqrt{2}}\right)\right] (10-a)

and, replacing σ with kq, we reach

\tau_{ij}(x) = \frac{1}{2}\left[1 + \operatorname{erf}\!\left(\frac{x-\mu_{ij}}{kq\sqrt{2}}\right)\right] (10-b)

erf(x) can be calculated as in [15]:

\operatorname{erf}(x) = \frac{2}{\sqrt{\pi}} \int_{0}^{x} e^{-t^2}\, dt (11)

the combination of the acor and fuzzy c-means algorithms can be achieved in three ways, as follows:
• in the first case, the cluster centres are selected by the acor algorithm, which calculates them based on the defined objective function; these centres are then passed to the fuzzy c-means clustering algorithm as the initial means
• in the second case, after random generation of the initial states, the fuzzy c-means algorithm is executed first and, over successive iterations of the desired population, the cluster centres are obtained; these centres are then given to the acor algorithm as the initial population, and the acor algorithm performs the clustering operation based on the defined objective function. this will usually not produce the desired result
• in the third case, the acor algorithm starts clustering the data, with the difference that the fuzzy c-means algorithm is applied at the same time and improves the location of the best available solution. this method requires more time than the previous two modes.

fig. 2. weighting functions for a fixed mean and different standard deviations.

in this paper, we used the first case, as discussed above: we use the acor algorithm for cluster-centre selection, and these centres are then applied to the fuzzy c-means clustering algorithm as the initial means. as an improvement of the acor weighting function, instead of formula (2) of [13], which is a pdf, we used the smoother arithmetic estimation (cdf) function of formula (9). the weighting function then becomes formula (10-a) and, by using σ = kq, we arrive at formula (10-b); erf(x) is calculated with formula (11). we thus have a new, smoother weighting function, given by formula (10-b), which can be computed easily.
5. simulation results
in this study, we used python with the pycharm ide for the implementation, on an hp laptop with 8 gb of ram, a core i7-6600u cpu, and windows 10. the first combination case is used, and the results are compared with those obtained by the pso-k-means, ica-k-means, k-means, fuzzy c-means++, and dbscan methods and by the methods proposed in kumar and kumar [16], kaur et al. [17], and soheily-khah et al. [18]. it should be noted that, when optimization methods are used for clustering, the objective function must be appropriate to the clustering problem; here the objective function of the k-means problem is used, which is as follows:

cost(C) = \sum_{i=1}^{n} \min_{j=1,2,\ldots,k} \lVert x_i - c_j \rVert

where the distance is taken between each sample and the centre of its cluster. to present and compare the results in this section, the training curves of the hybrid algorithms are reported together with the correct data clustering rate (d.r.), the accuracy, and the false alarm rate:

detection\ rate = \frac{TP}{TP + FN} (12)

accuracy = \frac{TP + TN}{TP + TN + FP + FN} (13)

false\ alarm = \frac{FP}{FP + TN} (14)

the parameters used for the aco algorithm are given in table 3. furthermore, the parameters of the fuzzy c-means algorithm are obtained by the colonial competition optimization algorithm with the davies–bouldin clustering cost function. fig. 3 shows the training curve for the proposed method; as it turns out, the proposed method has the best performance in minimizing the cost function. table 4 shows the measures of relations (12)–(14) for all methods using the iscx dataset, along with a comparison of the methods of kumar and kumar [16], soheily-khah et al. [18], and mbgwo [19] with the proposed method.
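the clustering stage that consumes the acor-selected centres (the first combination case above) can be sketched in a few lines. this is a minimal illustration of the update rules in equations (7) and (8), not the authors' implementation; the fuzziness degree m, the tolerance, and the iteration cap are assumed values.

```python
# minimal fuzzy c-means, seeded with externally chosen centres (e.g. from acor).
# implements the membership update (eq. 7) and the centre update (eq. 8).
import numpy as np

def fuzzy_c_means(X, centers, m=2.0, tol=1e-5, max_iter=300):
    X = np.asarray(X, dtype=float)
    V = np.asarray(centers, dtype=float)          # (c, d) initial centres
    for _ in range(max_iter):
        d = np.linalg.norm(X[:, None, :] - V[None, :, :], axis=2)   # (n, c) distances
        d = np.fmax(d, 1e-12)                     # avoid division by zero
        # eq. (7): ratio [i, j, l] = d_ij / d_il, summed over l
        U = 1.0 / np.sum((d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0)), axis=2)
        Um = U ** m
        new_V = (Um.T @ X) / Um.sum(axis=0)[:, None]                # eq. (8)
        if np.linalg.norm(new_V - V) < tol:
            V = new_V
            break
        V = new_V
    return U, V   # memberships (n, c) and centres (c, d)
```

in the full framework, the centres returned by the acor step are passed as the centers argument, and the resulting memberships are used to assign each traffic record to a cluster (attack or normal).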
in table 3: parameters used in aco iteration algorithm (max) iteration(max( population #of antes α k mu 1000 100 15 1 2 0.1 table 4: clustering evaluation indicators for different methods using all 24 attributes in the iscx dataset clustering methods accuracy false alarm rate detection rate proposed method 99.93 0.04 99.55% ica-fuzzy c-means 97 % 0.06 96.2% pso k-means 94.6% 0.06 94.1% fuzzy c-means++ 91.4% 0.08 91% k-means 67.45% 0.12 67% dbscan 68.67% 0.12 68% (kumar, 2013) method 95.2% 0.07 94.5% (soheily-khah, 2018) method 99.91% 0.05 99.51% mbgwo (m. alzubi1 2019) 99.22 0.0064 99.10 fig. 3. training curves for the proposed method and single-particle swarm optimization algorithm. abdulrahman: nid using fc-acor 18 uhd journal of science and technology | jul 2021 | vol 5 | issue 2 calculating the criteria, the total number of performances is nt=10. as shown in table 4, the proposed method in clustering has shown the best performance in accuracy and detection rate indicators. the following results are reviewed for the nsl-kdd dataset. in this dataset, the number of clusters is equal to four types of attacks and normal mode (a total of five clusters) has been considered. for this dataset, all the available features have been considered in the clustering section and no further reduction has been made. table 5 shows all the parameters of relationships 2–3 for all methods using the nsl-kdd dataset with a comparison of the method [17] with the proposed method. as can be seen in this case, the combined algorithm fuzzy c-means and aco for continuous domains give better results for three metrics met with a slight detection of the condition. tables 6 and 7 show the comparison of proposed algorithm to some recent deep network idss. both nsl-kdd and iscx dataset are included in the study. the algorithms are described in the table. in table 6, there are bgru [20], long short-term memory (lstm) [21], and ann [22] accuracy calculated on nslkdd dataset. and two deep algorithms have better performance than proposed algorithm. in table 6, there are deep cnn [23], lstm [24], and predicting future tokens [25] accuracy calculated on nslkdd dataset. and two deep algorithms have better performance than proposed algorithm. experiments on iscx dataset shows that the proposed method does well and is a bit better than the three deep algorithms compared. 6. conclusion in this study, the proposed framework for identifying ddos is discussed. at first, the necessary preprocessing was performed on the data, and then the state diagnosis was performed using a combined fuzzy clustering algorithm and compared to some other methods. the results showed that, the proposed method of this study, has shown better performance in the quantitative criteria considered at most. in comparison to both classic methods and deep methods for intrusion detection, our proposed method is doing well on iscx dataset, but for nsl-kdd dataset deep algorithms shown better performance. as future work we should try to improve our proposed method to discover attacks, or if it will be possible, extend it to a deep method. 
table 5: clustering evaluation indicators for different methods in the nsl-kdd dataset clustering methods accuracy false alarm rate detection rate proposed method 86.98% 0.14 78% ica-fuzzy c-means 84% 0.18 74.52% pso k-means 81.16% 0.212 70.1% fuzzy c-means++ 74.14% 0.219 69.21% k-means 63.35% 0.31 57% dbscan 65.37% 0.31 59% (kaur, 2017) method 82.1% 0.211 72.57% table 6: clustering evaluation indicators for different methods in the nsl-kdd dataset algorithm accuracy description jiang et al. (2019) 98.94 training multiple long short term memory nets (one hidden layer) for different features extracted xu et al. (2018) 99.24 5-class classification gru and bidirectional gru (bgru) nets. model has one layer with 128 gru nodes, 3 feed-forward layers with 48 nodes bgru gives best results with fast convergence vinayakumar et al. (2019) 78.5 ann(shallow neural network) has five hidden layers with 1024, 768, 512, 256, and 128 nodes. relu activation proposed method 86.98 table 7: clustering evaluation indicators for different methods in the iscx dataset algorithm accuracy description zeng et al. (2019) 99.85 5-class classification deep cnn (convolutional nural network): 2 1d convolutional layers, 1 fully connected layers chilamkurti (2018a) 99.91 binary classification 30 embedding layers, 10 lstm (long short term memory) layers, and sigmoid output layer radford et al. (2018) 97.01 anomaly detection by predicting future tokens (unsupervised) token embedding layer proposed method 99.93 abdulrahman: nid using fc-acor uhd journal of science and technology | jul 2021 | vol 5 | issue 2 19 references [1] m. mazini, b. shirazi and i. mahdavi. “anomaly network-based intrusion detection system using a reliable hybrid artificial bee colony and adaboost algorithms”. journal of king saud universitycomputer and information sciences, vol. 31, no. 4, pp. 541-553, 2019. [2] r. chitrakar and h. chuanhe. “anomaly based intrusion detection using hybrid learning approach of combining k-medoids clustering and naïve bayes classification”. ieee, united states, 2012. [3] a. saifullah. “defending against distributed denial-of-service attacks with weight-fair router throttling”, 2009. available from: https://www.openscholarship.wustl.edu/cse_researchhttps://www. openscholarship.wustl.edu/cse_research/23. [last accessed on 2021 may 10]. [4] i. syarif, a. prügel-bennett and g. wills. “data mining approaches for network intrusion detection: from dimensionality reduction to misuse and anomaly detection”. journal of information technology review, vol. 3, no.2, pp. 70-83, 2012. [5] s. revathi and a. malathi. “data preprocessing for intrusion detection system using swarm intelligence techniques”. international journal of computer applications, vol. 75, no. 6, pp. 22-27, 2013. [6] k. singh and k. singh. “intrusion detection and recovery of manet by using aco algorithm and genetic algorithm”. advances in intelligent systems and computing, vol. 638, pp. 97-109, 2018. [7] j. cheng, c. zhang, x. tang, v. s. sheng, z. dong and j. li. “adaptive ddos attack detection method based on multiple-kernel learning”. security and communication networks, vol. 2018, p. 5198685, 2018. [8] z. xia, s. lu, j. li and j. tang. “enhancing ddos flood attack detection via intelligent fuzzy logic”. informatica, vol. 34, no. 4, pp. 497-507, 2010. available from: http://www.informatica.si/index. php/informatica/article/view/323. [last accessed on 2021 may 11]. [9] m. h. kamarudin, c. maple, t. watson and n. s. safa. 
“a new unified intrusion anomaly detection in identifying unseen web attacks”. security and communication networks, vol. 2017, p. 2539034, 2017. [10] m. j. vargas-munoz, r. martinez-pelaez, p. velarde-alvarado, e. moreno-garcia, d. l. torres-roman and j. j. ceballos-mejia. “classification of network anomalies in flow level network traffic using bayesian networks”. in: 2018 28th international conference on electronics, communications and computers, conielecomp 2018, vol. 2018, pp. 238-243, 2018. [11] a. koay, a. chen, i. welch and w. k. g. seah. “a new multi classifier system using entropy-based features in ddos attack detection”. in: international conference on information networking, vol. 2018, pp. 162-167, 2018. [12] a. shiravi, h. shiravi, m. tavallaee and a. a. ghorbani. “toward developing a systematic approach to generate benchmark datasets for intrusion detection”. computers and security, vol. 31, no. 3, pp. 357-374, 2012. [13] k. socha and m. dorigo. “ant colony optimization for continuous domains”. european journal of operational research, vol. 185, no. 3, pp. 1155-1173, 2008. [14] j. c. bezdek. “pattern recognition with fuzzy objective function algorithms”. springer, united states, 1981. [15] l. c. andrews. “special functions of mathematics for engineers”, 2021. available from: https://www.books. google.nl/books?id=2caqsf-rebgc and pg=pa110 and redir_ esc=y#v=onepage&q&f=false. [last accessed on 2021 jun 02]. [16] g. kumar and k. kumar. “design of an evolutionary approach for intrusion detection”. the scientific world journal, vol. 2013, p. 962185, 2013. [17] a. kaur, s. k. pal and a. p. singh. “hybridization of k-means and firefly algorithm for intrusion detection system”. international journal of systems assurance engineering and management, vol. 9, no. 4, pp. 901-910, 2018. [18] s. soheily-khah, p. f. marteau and n. bechet. “intrusion detection in network systems through hybrid supervised and unsupervised machine learning process: a case study on the iscx dataset”. in: proceedings-2018 1st international conference on data intelligence and security, icdis 2018, pp. 219-226, 2018. [19] q. m. alzubi, m. anbar, z. n. m. alqattan, m. a. al-betar and r. abdullah. “intrusion detection system based on a modified binary grey wolf optimisation”. neural computing and applications, vol. 32, no. 10, pp. 6125-6137, 2020. [20] c. xu, j. shen, x. du and f. zhang. “an intrusion detection system using a deep neural network with gated recurrent units”. ieee access, vol. 6, pp. 48697-48707, 2018. [21] f. jiang, y. fu, b. b. gupta, f. lou, s. rho, f. meng and z. tian. “deep learning based multi-channel intelligent attack detection for data security”. ieee transactions on sustainable computing, vol. 5, no. 2, pp. 204-212, 2020. [22] y. bengio, a. courville and p. vincent. “representation learning: a review and new perspectives”. ieee transactions on pattern analysis and machine intelligence, vol. 35, no. 8, pp. 1798-1828, 2013. [23] y. zeng, h. gu, w. wei and y. guo. “deep-full-range: a deep learning based network encrypted traffic classification and intrusion detection framework”. ieee access, vol. 7, pp. 4518245190, 2019. [24] a. diro and n. chilamkurti. “leveraging lstm networks for attack detection in fog-to-things communications”. ieee communications magazine, vol. 56, no. 9, pp. 124-130, 2018. [25] b. j. radford, l. m. apolonio, a. j. trias and j. a. simpson. “network traffic anomaly detection using recurrent neural networks”, 2018. available from: http://arxiv.org/abs/1803.10769. 
[last accessed on 2021 jun 08].

the luminosity function of galaxies in some nearby clusters

mariwan a. rasheed1,2 and khalid k. mohammad2
1development center for research and training, university of human development, sulaimani, kurdistan region, iraq, 2department of physics, college of science, university of sulaimani, sulaimani, kurdistan region, iraq

abstract
in the present work, the galaxy luminosity function (lf) has been studied for a sample of seven clusters in the redshift range (0.0 ≲ z ≲ 0.1), within abell radius (1.5 h−1 mpc), in the five sdss passbands ugriz. in each case, the absolute magnitude distribution is found and then fitted with a schechter function. the fitting is done using the χ2-minimization method to find the best values of the schechter parameters φ* (normalization constant), m* (characteristic absolute magnitude), and α (faint-end slope). no remarkable changes are found in the values of m* and α, for any cluster, in any passband. furthermore, the lf does not seem to vary with such cluster parameters as richness, velocity dispersion, and bautz–morgan morphology. finally, it is found that m* becomes brighter toward redder bands, whereas almost no variation is seen in the value of α with passband, being around (−1.00).

index terms: galaxies, clusters, luminosity function, galaxy formation, galaxy evolution

corresponding author's e-mail: khalid k. mohammad, department of physics, college of science, university of sulaimani, sulaimani, kurdistan region, iraq.
e-mail: khalid.mohammad@univsul.edu.iq received: 21-04-2021 accepted: 01-07-2021 published: 03-07-2021 access this article online doi: 10.21928/uhdjst.v5n2y2021.pp1-10 e-issn: 2521-4217 p-issn: 2521-4209 copyright © 2021 rasheed and mohammad. this is an open access article distributed under the creative commons attribution non-commercial no derivatives license 4.0 (cc by-nc-nd 4.0) original research article uhd journal of science and technology

1. introduction
galaxies come in a diversity of sizes and cover a very wide range of luminosities, extending from the faintest dwarfs to the most luminous giant ellipticals. to know how these galaxies are distributed with respect to their luminosities, the luminosity function (lf) is used. it is one of the most important techniques used for studying galaxy formation and evolution. a suitable approximation to this function was given by paul schechter in 1976 [1]. it can be written as

$\phi(L)\,dL = \phi^{*}\left(\frac{L}{L^{*}}\right)^{\alpha}\exp\!\left(-\frac{L}{L^{*}}\right)d\!\left(\frac{L}{L^{*}}\right)$   (1)

where l* is a characteristic luminosity, indicating the change from a power law (l < l*) to an exponential law (l > l*), α is the faint-end slope, and φ* is a normalization constant for the distribution. these parameters may take different values for different morphological types and also for different environments. considering an interval dl in luminosity, φ(l) dl gives the number density of galaxies. galaxy clusters are ideal systems for studying the galaxy lf due to the existence of a large number of galaxies at almost the same distance. many studies have thus been devoted to the lf of cluster galaxies to discover the influence of environment on their evolution. after the earlier works on the lf, carried out by hubble (1936) [2], [3], zwicky (1942) [4], oemler (1974) [5], and others, schechter (1976) [1] proposed the analytic expression given by equation (1), which is called the schechter function. he suggested that the cluster lf is universal in shape. this universality has been supported by various studies [6], [7], [8]. however, studies carried out by others [9], [10], [11] have demonstrated that the shape of the cluster lf is not universal.

the lf of cluster galaxies has been compared to that of field galaxies through several studies. some of these studies found them to be identical [12], [13], [14], while others found them to be different [8], [15], [16]. the cluster lf has been found to vary with cluster-centric radius [11], [17]. this is because different galaxy morphological types have different lfs [18] and the mixture of these morphological types varies with cluster-centric radius, according to the morphology-density relation [19]. in fact, studying the variation of the cluster lf with such characteristics as cluster-centric radius, galaxy morphologies, and, also, galaxy colors is very important in constraining theories of galaxy formation and evolution. in the present work, we study the lf of a sample of seven abell-type galaxy clusters having redshifts in the range (0.0 ≲ z ≲ 0.1). a detailed description of the sample is given in section 2, and the results and discussion are presented in section 3. our conclusions are outlined in section 4. throughout the work, λcdm parameters (ωm = 0.27, ωλ = 0.73, h0 = 73 km s−1 mpc−1) are used.

2. sample and data
in this work, we consider a sample of seven nearby galaxy clusters, selected from the abell catalogue [20] within the redshift range (0.0 ≲ z ≲ 0.1). their basic data are given in table 1. all possible member galaxies within abell radius (ra = 1.5 h−1 mpc) of each cluster were taken into account. for membership confirmation, redshift data were obtained from the sloan digital sky survey (sdss-dr9) [21] database (for a1656, a2199, and a2147) and the nasa/ipac extragalactic database (ned) (for a2255 and a2142). for the other two clusters, a85 and a2029, redshift data were obtained from agulli et al. (2016) [22] and sohn et al. (2017) [23], respectively. petrosian magnitudes, taken from the sdss database, were used for calculating the absolute magnitudes in the five bands u (3551 å), g (4686 å), r (6166 å), i (7480 å), and z (8932 å). these magnitudes were then corrected for galactic foreground extinction, using values given by schlafly and finkbeiner (2011) [24], and, also, k-corrected, using a method given by chilingarian et al. (2010) [25] and chilingarian and zolotukhin (2012) [26]. with both of these corrections taken into consideration, the relation between absolute and apparent magnitudes for any passband can be written as:

$M = m - 5\log_{10}(d_{L}) - 25 - K(z) - \frac{A_{\lambda}}{\sin(b)}$   (2)

where dl is the luminosity distance, k(z) is the k-correction, aλ is the galactic foreground extinction, and b is the galactic latitude.

table 1: the basic data of the cluster sample
cluster   r.a. (j2000.0)   dec. (j2000.0)   redshift   velocity dispersion σ (km/s)   richness class   bautz–morgan type
a1656     12 59 48.7       +27 58 50        0.0231     970      2   ii
a2199     16 28 38.0       +39 32 55        0.0302     733      2   i
a2147     16 02 18.7       +16 01 12        0.0350     859      1   iii
a0085     00 41 50.1       −09 18 09        0.0551     963      1   i
a2029     15 10 55.0       +05 43 12        0.0773     1247     2   i
a2255     17 12 31.0       +64 05 33        0.0806     998      2   ii-iii
a2142     15 58 20.6       +27 13 37        0.0909     1008     2   ii

3. results and discussion
it is convenient to write the lf in terms of absolute magnitude, m, rather than luminosity [27]. these two quantities are related through the expression

$M - M^{*} = -2.5\,\log_{10}\!\left(\frac{L}{L^{*}}\right)$   (3)

hence, the lf becomes [28]

$\phi(M) = 0.4\,\ln(10)\,\phi^{*}\,10^{\,0.4(\alpha+1)(M^{*}-M)}\exp\!\left(-10^{\,0.4(M^{*}-M)}\right)$   (4)
where m* is the characteristic absolute magnitude corresponding to l*.

figures 1-5 show the absolute magnitude distributions of galaxies in the ugriz bands, within ra = 1.5 h−1 mpc, for the whole cluster sample, each fitted with a schechter function. the fitting is done using the χ2-minimization method, and for each case, we vary the magnitude bins until we get the best χ2 that gives the optimal values of the schechter parameters.

fig. 1. the luminosity distributions (histograms) of the cluster sample in the u-band, fitted with schechter functions (solid curves). [panels show the number of galaxies versus m(u) for a1656, a2199, a2147, a85, a2029, a2255, and a2142; the fitted m* and α of each panel are those listed in table 2.]
fig. 2. the luminosity distributions (histograms) of the cluster sample in the g-band, fitted with schechter functions (solid curves).
fig. 3. the luminosity distributions (histograms) of the cluster sample in the r-band, fitted with schechter functions (solid curves).
fig. 4. the luminosity distributions (histograms) of the cluster sample in the i-band, fitted with schechter functions (solid curves).
fig. 5. the luminosity distributions (histograms) of the cluster sample in the z-band, fitted with schechter functions (solid curves).
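to make the fitting procedure concrete, the following is a minimal python sketch (not the authors' code) of how a χ2 fit of the magnitude-form schechter function in equation (4) can be set up with scipy; the binned counts are synthetic placeholders generated only so the example is self-contained.

```python
import numpy as np
from scipy.optimize import curve_fit

def schechter_mag(M, phi_star, M_star, alpha):
    # schechter lf in absolute-magnitude form (equation 4)
    x = 10.0 ** (0.4 * (M_star - M))
    return 0.4 * np.log(10.0) * phi_star * x ** (alpha + 1.0) * np.exp(-x)

# synthetic binned magnitude distribution (placeholder, r-band-like numbers):
# bin centres and "number of galaxies" drawn around a known schechter shape
rng = np.random.default_rng(1)
M_bins = np.arange(-23.0, -17.8, 0.4)
counts = rng.poisson(schechter_mag(M_bins, 40.0, -21.2, -1.0))

# chi-square fit: weighted non-linear least squares with poisson errors
sigma = np.sqrt(np.clip(counts, 1, None))
p0 = (40.0, -21.0, -1.0)  # initial guesses for phi*, M*, alpha
popt, pcov = curve_fit(schechter_mag, M_bins, counts, p0=p0, sigma=sigma)
phi_star, M_star, alpha = popt
chi2 = np.sum(((counts - schechter_mag(M_bins, *popt)) / sigma) ** 2)
print(f"phi* = {phi_star:.2f}, M* = {M_star:.2f}, alpha = {alpha:.2f}, chi2 = {chi2:.1f}")
```

in practice, varying the bin width and repeating the fit until the χ2 is minimized, as described above, selects the optimal binning for each cluster and passband.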
table 2 summarizes the results of the best-fitting schechter parameters φ*, m*, and α, for the whole clusters, in all passbands.

table 2: the best-fitting schechter parameters for the cluster sample in the ugriz bands
cluster   parameter   u-band    g-band    r-band    i-band    z-band
a1656     φ*          5.19      4.46      4.69      5.28      5.40
          m*          −18.90    −20.66    −21.32    −21.47    −21.67
          α           −1.12     −1.10     −1.08     −1.04     −1.01
a2199     φ*          5.03      3.54      3.15      2.76      2.99
          m*          −18.52    −20.21    −21.05    −21.52    −21.65
          α           −0.86     −1.00     −1.03     −1.06     −1.00
a2147     φ*          3.52      3.44      3.48      5.20      4.21
          m*          −18.92    −20.64    −21.31    −21.27    −21.68
          α           −1.15     −1.06     −1.03     −0.87     −0.94
a0085     φ*          3.19      2.63      2.42      2.35      2.61
          m*          −18.85    −20.54    −21.39    −21.76    −21.85
          α           −1.00     −1.01     −1.02     −1.02     −0.98
a2029     φ*          6.04      6.09      6.30      4.79      3.35
          m*          −18.66    −20.16    −20.80    −21.38    −22.25
          α           −0.97     −1.00     −0.91     −1.11     −1.04
a2255     φ*          7.09      3.81      5.07      2.36      3.52
          m*          −18.85    −21.23    −21.67    −22.71    −22.44
          α           −0.75     −1.10     −0.91     −1.22     −1.10
a2142     φ*          4.36      7.67      8.37      5.17      5.31
          m*          −19.46    −20.43    −21.03    −21.90    −22.19
          α           −1.14     −0.87     −0.82     −1.00     −0.95

since φ* is just a normalization constant which defines the overall density of galaxies, we focus our attention only on the characteristic absolute magnitude, m*, and the faint-end slope, α, as the shape of the lf is defined by these two parameters [29]. no remarkable variations are seen in both of these parameters, for all clusters, in each band, within the redshift range considered in this work. furthermore, by noting the basic data listed in table 1, we conclude that, in each band, the lf does not vary with such cluster characteristics as velocity dispersion (in agreement with propis et al. [8]), richness (in agreement with colless [6], propis et al. [8]), and bautz–morgan morphology (in agreement with colless [6], propis et al. [8], lugger [30]). the above results confirm the universality of the cluster lf, in agreement with several other works (for example, colless [6]). for this reason, we can deal with the mean values of the schechter parameters m* and α, for the whole clusters. these values are listed in table 3.

table 3: the mean values of the schechter parameters m* and α for the cluster sample in the ugriz bands
band   m*             α
u      −18.88±0.11    −1.00±0.06
g      −20.55±0.13    −1.02±0.03
r      −21.22±0.11    −0.97±0.04
i      −21.72±0.18    −1.05±0.04
z      −21.96±0.12    −1.00±0.02

it is obvious from table 3 that the characteristic absolute magnitude, m*, becomes brighter towards redder bands, while no remarkable change is noted in the value of the faint-end slope with passband. the reason for this variation of the galaxy lf with passband is the contribution of different mechanisms in galaxy evolution. at ultraviolet wavelengths, for example, the shape of the lf is strongly influenced by star formation, since most of the flux is generated by young stars [31]. on the other hand, the lf in the red bands determines the typical stellar distribution [28]. the results in the present work are in good agreement with the previous works [32], [33]. the flat faint-end slope (α ~ −1) obtained in the present work (table 3) agrees well with the one obtained by blanton et al. (2003) [32]. this flat faint-end slope is a result of the disruption of a large number of dwarf galaxies inside clusters during the first stages of cluster formation [10]. at the bright end of the lf, the exponential decrease of the number density of galaxies is caused by various feedback processes quenching star formation in massive galaxies. the mechanisms proposed for this quenching are either the effect of supernova explosions or an accreting supermassive black hole. in either case, the gas content is heated and then ejected out of the galaxy, quenching the star formation process.

4. conclusions
the galaxy lfs of some nearby clusters were studied in all of the sdss passbands ugriz. in each case, a schechter function was fitted to the bright end of the distribution, using the χ2-minimization technique, to obtain the best-fitting schechter parameters, φ*, m*, and α. for each passband, no noticeable variations were observed in the values of m* and α in any cluster. further, it was found that the lf does not change with such cluster parameters as richness, velocity dispersion, and bautz–morgan morphology. from the mean values of m* and α, it was found that m* becomes brighter toward redder bands, whereas no remarkable change was noted in the value of α with passband, being about (−1.00).

5. acknowledgment
funding for sdss-iii has been provided by the alfred p. sloan foundation, the participating institutions, the national science foundation, and the u.s. department of energy office of science. the sdss-iii web site is http://www.sdss3.org/. sdss-iii is managed by the astrophysical research consortium for the participating brazilian participation group, brookhaven national laboratory, carnegie mellon university, university of florida, the french participation
mohammad: luminosity function of galaxies uhd journal of science and technology | feb 2021 | vol 5 | issue 2 9 group, the german participation group, harvard university, the instituto de astrofisica de canarias, the michigan state/ notre dame/jina participation group, johns hopkins university, lawrence berkeley national laboratory, max planck institute for astrophysics, max planck institute for extraterrestrial physics, new mexico state university, new york university, ohio state university, pennsylvania state university, university of portsmouth, princeton university, the spanish participation group, university of tokyo, university of utah, vanderbilt university, university of virginia, university of washington, and yale university. this research has made use of the ned which is operated by the jet propulsion laboratory, california institute of technology, under contact with the national aeronautics and space administration. references [1] p. schechter. “an analytic expression for the luminosity functions for galaxies. astrophysical journal, vol. 203, pp. 297-306, 1976. [2] e. hubble. “the luminosity function of nebulae. i. the luminosity function of resolved nebulae as indicated by their bright stars”. astrophysical journal, vol. 84, p. 158, 1936. [3] e. hubble. “the luminosity function of nebulae. ii. the luminosity function as indicated by residuals in velocity-magnitude relations”. astrophysical journal, vol. 84, p. 270, 1936. [4] f. zwicky. “on the large scale distribution of matter in the universe”. physical review, vol. 61, pp. 489-503, 1942. [5] a. oemler. “the systematic properties of clusters of galaxies. i. photometry of 15 clusters”. astrophysical journal, vol. 194, pp. 1-20, 1974. [6] m. colless. “the dynamics of rich clusters-ii. luminosity functions”. monthly notices of the royal astronomical society, vol. 237, pp. 799-826, 1989. [7] e. j. gaidos. “the galaxy luminosity function from observations of twenty abell clusters”. astrophysical journal, vol. 113, pp. 117-129, 1997. [8] r. de propis, m. colless, s. p. driver, w. couch, j. a. peacock, i. k. baldry, c. m. baugh, j. bland-hawthorn, t. bridges, r. cannon, s. cole, c. collins, n. cross, g. b. dalton, g. efstathiou, r. s. ellis, c. s. frenk, k. glazebrook, e. hawkins, c. jackson, o. lahav, i. lewis, s. lumsden, s. maddox, d. s. madgwick, p. norberg, w. percival, b. peterson, w. sutherland and k. taylor. “the 2df galaxy redshift survey: the luminosity function of cluster galaxies”. monthly notices of the royal astronomical society, vol. 342, pp. 725, 2003. [9] a. dressler. “a comprehensive study of 12 very rich clusters of galaxies. i. photometric technique and analysis of the luminosity function”. astrophysical journal, vol. 223, pp. 765-787, 1978. [10] o. lόpez-cruz, h. k. c. yee, j. p. brown, c. jones and w. forman. “are luminous cd halos formed by the disruption of dwarf galaxies”? apj, vol. 475, p. l97, 1997. [11] p. popesso, a. biviano, h. böhringer and m. romaniello. “rasssdss galaxy cluster survey. iv. a ubiquitous dwarf galaxy population in clusters”. astronomy astrophysics, vol. 445, pp. 2942, 2006. [12] r. de propris, p. r. eisenhardt, s. a. stanford and m. dickinson. “the infrared luminosity function of galaxies in the coma cluster”. astrophysical journal, vol. 503, p. l45, 1998. [13] l. cortese, g. gavazzi, a. boselli, j. iglesias-paramo, j. donas and b. milliard. “the uv luminosity function of nearby clusters of galaxies”. astronomy astrophysics, vol. 410, p. l25, 2003. [14] l. bai, g. h. rieke, m. j. 
rieke, j. l. hinz, d. m. kelly and m. blaylock. “infrared luminosity function of the coma cluster”. astrophysical journal, vol. 639, pp. 827, 2006. [15] c. a. valotto, m. a. nicotra, h. muriel and d. g. lambas. “the luminosity function of galaxies in clusters”. astrophysical journal, vol. 479, p. 90, 1997. [16] m. yagi, n. kashikawa, m. sekiguchi, m. doi, n. yasuda, k. shimasaku and s. okamura. luminosity functions of 10 nearby clusters of galaxies. ii. analysis of the luminosity function”. astrophysical journal, vol. 123, p. 87, 2002. [17] s. m. hansen, t. a. mckay, r. h. wechsler, j. annis, e. s. sheldon and a. kimball. “measurement of galaxy cluster sizes, radial profiles, and luminosity functions from sdss photometric data”. astrophysical journal, vol. 633, p. 122, 2005. [18] b. binggeli, a. sandage and g. a. tammann. “the luminosity function of galaxies”. annual review of astronomy and astrophysics, vol. 26, pp. 509-560, 1988. [19] a. dressler. “galaxy morphology in rich clusters: implications for the formation and evolution of galaxies”. astrophysical journal, vol. 236, pp. 351-356, 1980. [20] g. o. abell, h. g. corwin and r. p. olowin. “a catalog of rich clusters of galaxies”. astrophysical journal supplement series, vol. 70, p. 1, 1989. [21] c. p. ahn, r. alexandroff, c. a. prieto, s. f. anderson, t. anderton, b. h. andrews, é. aubourg, s. bailey, e. balbinot, and r. barnes. “the ninth data release of the sloan digital sky survey: first spectroscopic data from the sdss-iii baryon oscillation spectroscopic survey”. astrophysical journal supplement series, vol. 203, p. 21, 2012. [22] i. agulli, j. a. l. aguerri, r. sánchez-janssen, c. dalla vecchia, a. diaferio, r. barrena, l. dominguez palmero and h. yu. “deep spectroscopy of nearby galaxy clusters i. spectroscopic luminosity function of abell 85”. monthly notices of the royal astronomical society, vol. 458, p. 1590-1603, 2016. [23] j. sohn, m. j. geller, h. j. zahid, d. g. fabricant and a. diaferio. “the velocity dispersion function of very massive galaxy clusters: abell 2029 and coma”. astrophysical journal supplement series, vol. 229, p. 20, 2017. [24] e. f. schlafly and d. p. finkbeiner. “measuring reddening with sloan digital sky survey stellar spectra and recalibrating sfd”. astrophysical journal, vol. 737, p. 103, 2011. [25] i. v. chilingarian, a. l. malchoir and i. y. zolotukin. “analytical approximations of k-corrections in optical and near-infrared bands”. monthly notices of the royal astronomical society, vol. 405, pp. 1409-1420, 2010. [26] i. chilingarian and i. zoloyukin. “a universal ultraviolet-optical colour-colour-magnitude relation of galaxies”. monthly notices of the royal astronomical society, vol. 419, pp. 1727-1739, 2012. [27] m. s. longair. galaxy formation. springer, germany, 2008. [28] p. schneider. extragalactic astronomy and cosmology: an introduction, springer, germany, 2006. mariwan a. rasheed and khalid k. mohammad: luminosity function of galaxies 10 uhd journal of science and technology | feb 2021 | vol 5 | issue 2 [29] h. karttunen, p. kröger, h. oja, m. poutanen and k. j. donner. fundamental astronomy. springer, germany, 2007. [30] p. m. lugger. luminosity functions for nine abell clusters”. astrophysical journal, vol. 303, pp. 535-555, 1986. [31] r. de propris, m. bremer and s. phillips. “luminosity functions of cluster galaxies. the near-ultraviolet luminosity function at < z > ~ 0.05”. astronomy astrophysics, vol. 1807, p. 10775, 2018. [32] m. r. blanton, j. brinkmann, i. csabai, m. doi, d. 
eisenstein, m. fukugita, j. e. gunn, d. w. hogg and d. j. schlegel. “estimating fixed-frame galaxy magnitudes in the sloan digital sky survey”. astronomical journal, vol. 125, pp. 2348-2360, 2003. [33] p. popesso, h. böhringer, m. romaniello and w. voges. “rasssdss galaxy cluster survey. ii. a unified picture of the cluster luminosity function”. astronomy astrophysics, vol. 433, pp. 415429, 2005. tx_1~abs:at/tx_2:abs~at uhd journal of science and technology | jan 2022 | vol 6 | issue 1 43 1. introduction it has been described two experiments that examine our technique of selecting eye objects using traditional mouse selection. we have already looked at how people behave when interacting with their eyes in the demos. the next step is to show that our method can withstand tougher use and that people like to select objects using their gaze for an extended period. we rated the overall performance of gaze interaction with that of a widely used and widely used device: the mouse. the interaction of the eye requires different hardware and software. it’s a question of whether it’s worth it. if it works properly, we could also get some secondary benefits that are difficult to quantify through an additional, passive, or mild input channel. for example, we’ve found that when the visual interaction works well, the device senses almost as if it were waiting for user controls. just like you are studying the mind of the user. you want no more guidance input and you have your palms free from various tasks. it slows down the interaction and can “cover costs” in a simple experimental comparison with the mouse, regardless of the immaturity of current eye tracking technology. the eye interaction approach is faster, but we see it as an advantage; however, it is not now the primary motivation for using eye monitoring in most environments. our experiments have measured the time required to perform simple and representative direct manipulative arithmetic tasks. one asked to select a highlighted circle from a circle grid. the second asked the test person to select the named letter on a loudspeaker from a letter grid. our results show a wonderful and measurable pacemaker benefit for searching the mouse in the same experimental setting, persevering in every eye tracking technique for controlling computer game objects tara qadir kaka muhammad1, hawar othman sharif1, mazen ismaeel ghareb2 1department of computer, college of science, university of sulaimani, sulaimani, iraq, 2department of computer science, college of science and technology, university of human development, kurdistan region, iraq a b s t r a c t the study explored the employment of associate in accessible eye tracer with keyboard and mouse input devices for video games. an interactive game has been developed using unity with multiple balls objects and by hitting they could collect more point for each player. it has been used different techniques to hit the balls using mouse, keyboard, and mixed. eye tracker input has been help to increase the performance of collected the player points. the research explains how the eye tacking techniques can be used in widely in video game and it is very interactive. finally, we examine the use of visual observation in relevancy the keyboard and mouse input control and show the difference. our results indicate that the employment of a watch huntsman will increase the immersion of a computer game and considerably improve the video game technology. 
index terms: computer games, eye tracking, eye gaze interaction, facial expressions as game input, evaluating peripheral interaction corresponding author’s e-mail: tara qadir kaka muhammad, department of computer, college of science, university of sulaimani, sulaimani, iraq. e-mail: tara.qadir@univsul.edu.iq received: 01-09-2021 accepted: 30-03-2022 published: 20-04-2022 access this article online doi: 10.21928/uhdjst.v6n1y2022.pp43-51 e-issn: 2521-4217 p-issn: 2521-4209 copyright © 2022 muhammad, et al. this is an open access article distributed under the creative commons attribution non-commercial no derivatives license 4.0 (cc by-nc-nd 4.0) o r i g i n a l re se a rc h a rt i c l e uhd journal of science and technology muhammad et al.: game object interactions using eye tracking 44 uhd journal of science and technology | jan 2022 | vol 6 | issue 1 experiment. the key points of the test allow us to understand how our method of visible interaction works and why it is effective. as expected, the method is a little faster than the mouse. our search suggests that the eye can also go faster than the hand. our method verification is how our entire interaction method and algorithm preserve this eye speed advantage on a proper object. we study the physiology of the eye and use these records to extract useful facts about the user’s overall intentions from noisy and fearful eye movement data. even if it does, this algorithm is primarily based on understanding eye movement. it used to be no longer clear that our eye interaction approach would keep pace with the eye, as the eye monitoring hardware entails additional latencies [1]. the overall performance of any interaction science results from its software and hardware program. the previous experiments show that keyboard, mouse and joystick considered as traditional game inputs and there are many new techniques need to be considers [2], [3], [4]. manipulate the voice [5] and monitor the head [6]. for the past few years, while using various entry level techniques, the goal of researchers has been to find out which technique is most accurate, most immersive, and most convenient for users. here, we explain the benefits of using eye movements as a game controller. the eye tracking techniques have been compared to mouse input and have showed the results. eye tracking technology has been shown to increase immersion and make games more fun for the player. as in ivanchenko et al. [7], almansouri [8], it shows that eye tracking as game input is very precise. the comparison of mouse, keyboard, and appearance in jiménez-rodríguez et al. [9] is also used as an entry level solo controller. in contrast to these studies, which deal specifically with the comparison between mouse and eye control in terms of precision and effectiveness, many studies have focused on investigating the game experiment. in article [10], you focused on this immersive evaluation and the user experience. however, their research showed that when comparing the game with mouse data, the players were more immersed in the game, but in paper [11], they achieved a reliable questionnaire evaluation and a high score in terms of stamina. feelings of fluidity and immersion in gaze-controlled play compared to the study by modi and singh [10], the result of which had to be further investigated. 
the way people treat the computer as human-computer interaction (hci) has user actions on three different levels: physical, cognitive, and emotional; however, the emotional level is a new topic that not only tries to make the interaction experience pleasant but also affects the further use of the machine by the user [12]. however, to better understand the emotional level at hci, that is, the user’s involvement in the machine’s use, an evaluation of the user is required for the use of a peripheral device with test emotions. emotional use of a peripheral device at the same time is difficult. research [13] examined this primary function that requires continuous interaction and secondary work that takes place on the periphery. some research on hci has used emotions as a starting point. in research [14], bernhaupt et al. designed an emotion flower set and used facial emotions as input. they used positive emotions (joy and surprise) to grow flowers and negative emotions (disgust, anger, sadness, and fear) to slow growth. their game was intended for the workplace and they understood that their game improved the player’s emotional state while playing, while the game did not affect people’s general mood, but their work has become a fundamental work for lankes et al. group [15]. his research works alongside the redesign of the emotion flower game; they tried wearing it in a mall and examined the player’s emotional feedback to add more contrast to their basic work. our approach is to combine existing input techniques such as (mouse, keyboard, and eye tracking). add more facial emotions (joy, anger, and surprise) than inputs to a main game used in the proposed game [16]. the user can choose one. the game is using all three types of inputs such as (mouse, keyboard and eye tracker). later, the damaged balloons are collected, and the score is increased. however, the emotion of the face has a peripheral role that helps the user control the speed of the balloons and gets more points in a time [17]. in addition, evaluation and effectiveness are important to us; measure effectiveness by comparing recorded results from different users. however, the assessment is made by comparing the input parameters in two different categories (emotional and unemotional). 2. related work 2.1. review stage since there is extensive research on evaluating emotions and emotions with the help of users, research on emotions is not limited to facial expression, as the ability to become aware of people’s emotions has an impact on social interaction and human behavior [18], moreover, some researchers worked on the body as in ghareb [19] argue that recognition of emotions through non-verbal communication can be achieved by sensing body expression. however, our research on facial emotions touches a small portion of this area. muhammad et al.: game object interactions using eye tracking uhd journal of science and technology | jan 2022 | vol 6 | issue 1 45 ekman in chittaro and sioni [20] defined that facial expression is an example of things one can do through the face, his dialogue focused on the set of facial expressions – happiness, surprise, anger, sadness, worry, and disgust – that are culturally international and on which cultures depend to decide, to show rules. as in emotion recognition and its application in software engineering [21], it examines a number of scenarios to assess the possibility of applying emotional cognition strategies in four areas: software programming, website personalization, school, and games. 
video games are contingencies that can dynamically respond to the emotions of the diagnosed contemporary gamer. massive investigations worked on a specific reenactment for capturing emotions as discussed in ekman et al. [22] an open-source evg (emotion evoking game) and a first and formative comparison strange end result ordinary variations were determined by comparison and facial expressions of surprise, joy, and disappointment been. there is a lot of research on recovery that has been used in focusing emotions in deferent approaches [18], [23], [24]. in addition to the emotion, the evaluation of peripheral units is a trend phase that has led to extensive discussions, countless findings focused on a unique type of peripheral evaluation. some of the systems in the literature were evaluated on the basis of studies of test subjects [25], [26]. in addition, a learnable subject [27] using unique modalities (tangible, tactile, and hands-free); however, for the peripheral interaction, all peripheral works are ultimately a simple interaction, unless the focus is on controlling the audio participant. collecting the points using user friendly method and easy way for evaluation different technology in the game. besides working on the contrast of consumer emotions at a certain point in the game, our device focuses on evaluating the input devices (keyboard, mouse, and gaze); some research has observed the usefulness of the gaze as an access system [28], [29], [30], [31], [32], [33]. the rating of the normal mouse with momentary was examined in almansouri [8], among the recently introduced (adult, middle aged, and the elderly) the rating of the eye over size for middle aged and the elderly; while jacob [34], the evaluation was equated and the operability, in contrast to the menu resolution selection technique of a developed web browser, was experienced again as an operator together with the component, because it results in the system’s operability. in roose and veinott [3], ivanchenko et al. [7], sibert and jacob [35], murata et al. [36], as a tutorial on how the view can be combined with different input methods. however [9], he worked on using the gaze as a solo input and studied the general performance variations for gaze, mouse, or keyboard for a similar project in the game. today, the most common structure of the eye tracker is the “corneal reflection” unit of the laptop. these structures place the surroundings of the user’s gaze as a display screen coordinate on a monitor. to decide where the consumer is looking. these structures sing to one or each of the eyes they look at with a digital camera equipped with an infrared (ir) filter. the previous game have configured the input of eye using eye cornea and most be configured near the eye using new camera. because the user’s corneal floor is roughly spherical, the corneal reflex area remains constant as the user’s eyes move relative to the head, the role of the student in relation to this reflex creates the position of the wearer’s eye. a calibration sequence is used to map eye movements to display screen coordinates. there are also portable structures that are beneficial for ubiquitous computing scenarios. these structures use the same method as computer systems, but the archive view as a coordinate in a digital camera built into the user’s head [37]. there are several games has been research of have performance issues of using keyboard and mouse as an input. 
by testing eye input versus mouse input in three extraordinary pc games, smith and graham [40] concluded that using eye monitoring can guarantee the player an additional immersive journey using the eye tracker in first-person shooters. each test participant was asked to play the same sport using three specific input techniques: (1) mouse, keyboard, and eye tracker; (2) mouse and keyboard only; or (3) a console gamepad. the results are not exactly encouraging now, suggesting that the overall performance with the eye tracker was once well below the two different ones. however, isokoski and martin attributed these results to the players’ greater experience and expertise. the concern about the players who using typical input methods and when offering alternating input for playing the game it need further training. other studies came to comparable results. the authors in research [39] created a simple look in which the participant was once asked to remove 25 balloons that were moving across the screen at extraordinary speeds. the participant would move the mouse or the eye tracker over the pointer and remove the balloons with the help of a mouse click. two prerequisites were tested: with and besides the time limit for completing the task. the results confirmed that besides the time limit, the accuracy and time to complete the task were earlier worse when using the eye tracker than when using a mouse [40]. performance was once based solely on the percentage of balls that the player wiped out. michael the paper [43] ended with clearly contrary results. players also mentioned that the eye tracker used to be exceptionally fun to use. these contradicting research consequences suggest that exercising and the approach to enhancing fair recreation are key factors in achieving a continued satisfactory outcome [44], [45]. muhammad et al.: game object interactions using eye tracking 46 uhd journal of science and technology | jan 2022 | vol 6 | issue 1 2.2. designing a proposed game the comparison of peripheral interplay requires at least two tasks: a foremost task, which must be the focal point of the participant’s attention, and a secondary task, which needs to be carried out in the periphery. this secondary project is normally a given: the assignment supported using the peripheral gadget being evaluated. 2.2.1. designing the primary task the tasks are to tell the player to play with keyboard first, then focused on playing with keyboard and mouse if possible, then calculating the results and timing from the players. how many balls have been crashed and how much point has been collected with timing for each techniques. later on, it has been concentrated on eye tracking for each ball movement and how can control the speed of the ball to hit and crash as much balls as possible. most of the players have been collected full points but with more time is needed it. 3. methodology the methodology of this study is to observe the players interaction with the games before and after using eye tracking as an input methods for the game. the players all have experience in gaming which they have training for using this game and how can used eye tracker device. the eye tracking techniques help the players to be more interaction with the game. the observation has been conducted and extracted the data from the player before and after using the eye tracker and illustrates the difference between the results using several statistical measurements. 3.1. 
game scenario the game is designed and implemented to destroy difference color balloons to collect points. the balloons have been destroyed by three means of inputs, mouse, keyboard, and eye. the main objective of the game is finding the difference efficiency of several inputs such as mouse, keyboard, and eye focused. 3.1.1. hardware requirements the game has developed on pc with these requirements, cpu intel i7-2600, 3.6 ghz, ram 4gb, hard desk 512 gb, and graphic card nvid quard 600. tobii eye tracker has been used in this game. the only devices are capable of tracking both head and eye movements for game interaction, exports training, and streaming. the tobii 4c eye tracker is the hardware device which will track eye movement in the game. it has a driver define it and then define the eye of the player and later will detect the eye movement in the game and program it in game source code. fig. 1 shows the tobii 4c eye tracker. 3.1.2. game design the figures below have been explained the game interface and game rules how player can use the game and what is the interactive with players. final figure explain how the interaction between eye tracker with ball movement to score point for the player. the game has been developed using unity version 18.3, c# ultimate 2012 and using microsoft windows 10 64 bit. fig. 1 shows the game interface and how can user hit the balloons and collects points. fig. 2 explains users using mouse and fig. 3 shows how user can use eye movement to hit the balloons and score point. fig. 2. users select input option for playing the game. fig. 1. tobii 4c eye tracker device. fig. 3. mouse input for users. muhammad et al.: game object interactions using eye tracking uhd journal of science and technology | jan 2022 | vol 6 | issue 1 47 4. data collection collecting data about the player have been conducted selecting 48 users. the player was undergraduate students from computer science department and all have been trained to use the game. each user has been played with mouse, eye, and eye with space for collecting the points. table 1 shows the time collected using mouse, eye, or combination of the two. the results have been showed that 45% eye input performance is better than mouse input and the performance of eye input is 66% which is better performance than combination input. this indicates that the eye input has acceptable values as input for the players. the linear pearson’s correlation has been used for this study. we have been used pearson correlation coefficient (pcc) to measure a liner correlation between eye control dataset and mouse and other two correlations between eye and eye input control. the measure calculates the covariance of two variables and the product of standard deviations. the results are between –1 and 1. the covariance only can reflect the relationships or correlations. pearson correlation coefficient = ρ(x,y) = σ[(xi – x ̄) * (yi –ȳ)]/(σx*σy) [46] table 2 has shown that the pcc [46] between mouse and eye input is 0.44 which indicates that the timing for eye input is related to mouse and has acceptable values as input. table 3 has shown that the pcc [46] between eye and eye space is 0.54 which indicates that the timing for eye input is related to mouse and has acceptable values as input. table 4 shows the regression statistics of difference between eye and mouse tracker and has been generated using spss statistical tool. all the results indicate significant results for user timing compare to mouse input. 
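for readers who want to reproduce the kind of regression summaries reported in tables 4 and 5 (which were generated with spss), the following is a rough python cross-check, an assumed workflow rather than the authors' code; the two arrays are the first ten players' mouse and eye times from table 1, rounded to one decimal place.

```python
import numpy as np
from scipy import stats

# first ten rows of table 1 (seconds), rounded to one decimal place
eye   = np.array([50.7, 46.0, 54.2, 96.9, 45.5, 68.8, 43.6, 52.4, 70.5, 44.9])
mouse = np.array([46.3, 49.4, 47.0, 35.7, 36.8, 40.8, 47.9, 62.9, 65.5, 42.1])

# simple linear regression: mouse ~ intercept + slope * eye
res = stats.linregress(eye, mouse)
print("multiple r :", res.rvalue)        # pearson r
print("r square   :", res.rvalue ** 2)
print("intercept  :", res.intercept)
print("slope      :", res.slope)
print("p-value    :", res.pvalue)        # significance of the slope
print("std error  :", res.stderr)
```

running this on the full 48-player columns of table 1 would give values directly comparable with the multiple r, r2, coefficient, and p-value entries of tables 4 and 5.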
these statistical results indicate that eye input has slightly better performance than mouse. this means that game industry can use eye interaction techniques beside mouse input. p values show significant values of eye movement regarding effect of the eye input to the user interaction in the game with mouse input also. table 5 shows the regression statistics of difference between eye tracker input and keyboard input; these results have been extracted from spss statistical tool. all the results indicate that eyes tracker input slightly has less performance for user timing compare to mouse input. these statistical results indicate that eye tracker input has slightly slower performance than keyboard. this means that game industry can use eye interaction techniques beside mouse input. descriptive statistics for the three games input for mean, standards error, median, standard deviation, sample variance, and confidence level are explained in table 6. fig. 4 has been explain the using of eye tracking as it shows the ball speed has been changes according of eye focus. the results explain better results for eye tracker in some of statically factors. these results indicate that users can use eye tracking and eye tracking with keyboard combination and make it the game more interactive and better performance. figs. 5-7 have shown the histogram for speed performance of three different method of game input eye, mouse, and eye keyboard. 5. results an evaluation of pearson correlation [46] and regression records used to be carried out on the overall performance measures for every sport to realize any big variations between the two entry modalities. users carried out visually well. however, no sizable overall performance variations had been found for each mouse and keyboard. for pointing tasks, for example, the consumer will frequently appear at the goal and then go the cursor solely when he picks a target. however, with the eye pointer, the cursor strikes every time the consumer strikes their eyes. these effects in a widespread expand in the quantity of remarks the person receives from the game, even if the person does now not consciously raise out an express action. users additionally confirmed a robust choice for the eyepiece tracker throughout playback. we suppose this is because of the decreased quantity of effort it takes to go through the persona when the use of the eyes. to entire the task, customers had to make over one cursor fig. 4. eye input for users. muhammad et al.: game object interactions using eye tracking 48 uhd journal of science and technology | jan 2022 | vol 6 | issue 1 moves throughout the whole screen. when the use of the mouse, the consumer appears at the favored goal and then explicitly acts with the mouse to pass the cursor. however, with the eye pointed, truly searching at the favored spot shifted the cursor, casting off the want for any hand motion. a participant commented that, “i ought to discover with the view freely and only clicked on the mouse in case of need.” people naturally use eye actions to factor when speaking with table 1: player time difference for different inputs no. 
of players mouse eye combination of two 1 46.29168 50.68896 72.26347 2 49.36595 45.9901 47.98271 3 47.02565 54.16142 74.47321 4 35.69894 96.90728 64.29069 5 36.75071 45.54176 39.36242 6 40.79372 68.80875 63.11514 7 47.87828 43.61938 55.02164 8 62.88597 52.35537 52.68043 9 65.5258 70.47275 56.20034 10 42.11875 44.93463 43.45302 11 63.0501 77.57199 90.01323 12 42.8105 24.0855 41.06178 13 39.97028 72.89788 71.18179 14 64.43954 70.54054 52.99361 15 80.71351 91.87479 83.32175 16 36.42005 23.09589 46.32096 17 47.7062 64.47396 60.83725 18 66.36465 61.36431 44.36486 19 69.42387 72.35069 81.43118 20 49.9991 70.50659 78.45703 21 44.26379 51.86754 60.57484 22 57.42333 24.09982 67.70646 23 48.06614 38.37986 46.19806 24 76.02664 64.09002 102.3053 25 40.44497 40.8985 38.73352 26 81.27607 79.20071 38.13029 27 32.56762 24.599 27.99892 28 60.89792 43.21954 81.91699 29 66.81228 53.31702 69.39323 30 68.82234 31.56778 62.89731 31 80.09913 62.05175 74.31324 32 48.12384 81.68127 84.82584 33 30.01091 58.38366 60.69171 34 56.67857 54.97182 62.73943 35 63.83644 77.20864 98.52775 36 71.37659 63.88522 73.57787 37 32.53166 34.77013 40.04244 38 70.75024 47.13963 51.47002 39 84.64526 61.72007 73.09223 40 47.7101 70.00181 76.68742 41 31.31399 39.43198 34.08674 42 35.38079 36.04421 48.4115 43 47.42 53.64031 40.08095 44 38.87392 46.62032 55.43325 45 53.62331 53.33913 58.65588 46 133.2718 85.60275 62.73277 47 48.19358 67.05249 62.56892 48 64.12367 58.64465 68.15075 table 3: pearson correlation between eye and eye space input eye eye space eye 1 0.54 eye space 0.54 1 table 2: pearson correlation between mouse and eye input mouse input eye input mouse input 1 0.44 eye input 0.44 1 fig. 6. mouse game inputs performance. fig. 5. players eye game input performance. fig. 7. eye and keyboard input performance. 
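as a concrete illustration of how the pcc values in tables 2 and 3 can be obtained from the timing columns of table 1, here is a minimal python sketch (again an assumed workflow, not part of the original study), reusing the same ten rounded rows as in the regression sketch above.

```python
import numpy as np

# first ten players' completion times from table 1 (seconds, rounded)
mouse = np.array([46.3, 49.4, 47.0, 35.7, 36.8, 40.8, 47.9, 62.9, 65.5, 42.1])
eye   = np.array([50.7, 46.0, 54.2, 96.9, 45.5, 68.8, 43.6, 52.4, 70.5, 44.9])

# pearson correlation coefficient:
# r = cov(x, y) / (sigma_x * sigma_y)
#   = sum((x_i - x_mean)(y_i - y_mean)) / (n * sigma_x * sigma_y)
def pearson(x, y):
    xc, yc = x - x.mean(), y - y.mean()
    return np.sum(xc * yc) / (len(x) * x.std() * y.std())

print(pearson(mouse, eye))             # manual computation
print(np.corrcoef(mouse, eye)[0, 1])   # same value via numpy
```

applied to all 48 rows of table 1, this reproduces the coefficients of 0.44 (mouse vs. eye) and 0.54 (eye vs. eye-with-keyboard) reported in tables 2 and 3.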
muhammad et al.: game object interactions using eye tracking uhd journal of science and technology | jan 2022 | vol 6 | issue 1 49 table 6: descriptive statistics for the three game input mouse, keyboard, and eye tracker mouse keyboard eye tracking mean 55.20413 mean 56.36817 mean 61.26604458 standard error 2.717134 standard error 2.594117 standard error 2.469118725 median 48.77977 median 54.56662 median 61.703085 standard deviation 18.82486 standard deviation 17.97257 standard deviation 17.10655632 sample variance 354.3753 sample variance 323.0133 sample variance 292.6342692 kurtosis 4.833343 kurtosis –0.42544 kurtosis –0.35056645 skewness 1.58376 skewness 0.050186 skewness 0.27816714 range 103.2609 range 73.81139 range 74.30638 minimum 30.01091 minimum 23.09589 minimum 27.99892 maximum 133.2718 maximum 96.90728 maximum 102.3053 sum 2649.798 sum 2705.672 sum 2940.77014 count 48 count 48 count 48 largest (1) 133.2718 largest (1) 96.90728 largest (1) 102.3053 smallest (1) 30.01091 smallest (1) 23.09589 smallest (1) 27.99892 confidence level (95.0%) 5.466169 confidence level (95.0%) 5.21869 confidence level (95.0%) 4.967226088 table 4: regression statistics for eye and mouse inputs regression statistics multiple r 0.442484013 r square 0.195792102 adjusted r square 0.178309321 standard error 17.0641992 observations 48 anova df ss ms f significance f regression 1 3261.042739 3261.043 11.19914 0.001638 residual 46 13394.59714 291.1869 total 47 16655.63987 coefficients standard error t stat p-value lower 95% upper 95% lower 95.0% upper 95.0% intercept 29.07932151 8.185906615 3.552364 0.000895 12.60195 45.5567 12.60195 45.55669665 eye input 0.463467353 0.138492678 3.346512 0.001638 0.184696 0.742239 0.184696 0.742238651 table 5: regression statistics for keyboard and eye tracker input regression statistics multiple r 0.549515 r square 0.301967 adjusted r square 0.286792 standard error 15.17813 observations 48 anova df ss ms f significance f regression 1 4584.343 4584.343 19.89942 5.23e-05 residual 46 10597.28 230.3757 total 47 15181.62 coefficients standard error t stat p-value lower 95% upper 95% lower 95.0% upper 95.0% intercept 20.99721 8.226233 2.55247 0.014081 4.438661 37.55576 4.438661 37.55576 eye space 0.577334 0.129422 4.460877 5.23e-05 0.316822 0.837846 0.316822 0.837846 muhammad et al.: game object interactions using eye tracking 50 uhd journal of science and technology | jan 2022 | vol 6 | issue 1 different people and eye monitoring permits the identical visible cues to be prolonged in the digital world. we consider that the distinction in overall performance between the eye tracer and the mouse throughout the sport is because of the latency that took place when taking pictures on a target. remember, it took about a 2d for a shot to attain the place it was fired. users regarded to have a hard time getting “led” the missiles through searching out into empty area in the front of the cellular target. 6. conclusion this paper has focused on introducing new input for video games. eye movement is one of the important inputs in video games it increases the interactive between the players and makes it more interesting and challenging. the case study has been conducted on 48 players (undergraduate bsc students) to test the game with different input mouse, keyboard, mouse keyboard, and eye movement. 
the outputs have been shown the significant effect on the playing the game regarding scoring the point using eye input techniques and this adds privilege to using more input and makes the game more interactive. however, the game will slow down but this depends on game scenario. finally, the results show significant correlation between all inputs eye, mouse, and keyboard. acknowledgment we would like to thank all the opportunity and support university of human development for usual support, head of computer science department of university of sulaimani and all students of computer science department that participate in playing the game. references [1] m. r. mine. “virtual environment interaction techniques.” unc chapel hill computer science technical representative, vol. 18, pp. 1-18, 1995. [2] b. yuan, e. folmer and f. c. harris. “game accessibility: a survey.” univers. access information society, vol. 10, no. 1, pp. 81-100, 2011. [3] k. m. roose and e. s. veinott. “understanding game roles and strategy using a mixed methods approach”. in: acm symposium on eye tracking research and applications. association for computing machinery, new york, united states, pp. 1-5, 2021. [4] z. li, p. guo and c. song. “a review of main eye movement tracking methods.” journal of physics: conference series, vol. 1802, no. 4, p. 042066, 2021. [5] c. biele. “eye movement.” in: human movements in humancomputer interaction. springer, cham, pp. 23-37, 2022. [6] a. goettker and k. r. gegenfurtner. “a change in perspective: the interaction of saccadic and pursuit eye movements in oculomotor control and perception.” vision research, vol. 188, pp. 283-296, 2021. [7] d. ivanchenko, k. rifai, z. m. hafed and f. schaeffel. “a lowcost, high-performance video-based binocular eye tracker for psychophysical research.” journal of eye movement research, vol. 14, no. 3, p. 3, 2021. [8] a. s. almansouri. “tracking eye movement using a composite magnet.” ieee transactions on magnetics, vol. 58, no. 4, p. 3152085, 2022. [9] c. jiménez-rodríguez, l. yélamos-capel, p. salvestrini, c. pérez-fernández, f. sánchez-santed and f. nieto-escámez. rehabilitation of visual functions in adult amblyopic patients with a virtual reality videogame: a case series. virtual reality, vol. 2021, pp. 1-12, 2021. [10] n. modi and j. singh. “a review of various state of art eye gaze estimation techniques.” in: advances in computational intelligence and communication technology. springer, germany, 2021, pp. 501-510. [11] l. e. nacke, s. stellmach, d. sasse and c. a. lindley. “gameplay experience in a gaze interaction game”. arxiv, vol. 2010, pp. 49-54. [12] k. saroha, s. sharma, g. bhatia and a. professor. “human computer interaction: an intellectual approach.” international journal of computer science and management studies., vol. 11, no. 2, p. 2, 2011. [13] s. bakker, e. van den hoven and b. eggen. “evaluating peripheral interaction design.” human-computer interact, vol. 30, no. 6, pp. 473-506, 2015. [14] r. bernhaupt, a. boldt, t. mirlacher, d. wilfinger and m. tscheligi. “using emotion in games: emotional flowers.” acm international conference proceedings series, vol. 203, pp. 41-48, 2007. [15] m. lankes, s. riegler, a. weiss, t. mirlacher, m. pirker and m. tscheligi. “facial expressions as game input with different emotional feedback conditions.” in: ace ‘08: proceedings of the 2008 international conference on advances in computer entertainment technology. association for computing machinery, new york, united states, pp. 253-256, 2014. [16] a. 
covaci, g. ghinea, c. h. lin, s. h. huang, and j. l. shih. “multisensory games-based learning -lessons learnt from olfactory enhancement of a digital board game.” multimed. tools appl., vol. 77, no. 16, pp. 21245-21263, 2018. [17] y. a. sekhavat and p. nomani. “a comparison of active and passive virtual reality exposure scenarios to elicit social anxiety.” international journal of serious games, vol. 4, no. 2, pp. 3-15, 2017. [18] a. m. darwesh, m. i. ghareb. and s. karimi. “towards a serious game for kurdish language learning.” journal of university of human development, vol. 1, no. 3, pp. 376-384, 2015. [19] m. i. ghareb. html5, future to solve cross-platform issue in serious game development. journal of university of human development, vol. 2, no. 4, pp. 443-450, 2016. [20] l. chittaro and r. sioni. “affective computing vs. affective placebo: study of a biofeedback-controlled game for relaxation training.” international journal of human-computer studies, vol. 72, no. 8-9, pp. 663-673, 2014. [21] a. ahmed and m. ghareb. design a mobile learning framework for students in higher education. journal of university of human development, vol. 3, no. 1, p. 288, 2017. muhammad et al.: game object interactions using eye tracking uhd journal of science and technology | jan 2022 | vol 6 | issue 1 51 [22] p. ekman, w. v. friesen and p. ellsworth. emotion in the human face. elsevier, netherlands, 1972. [23] a. kołakowska, a. landowska, m. szwoch, w. szwoch and m. r. wróbel. “emotion recognition and its applications.” advances in intelligent systems and computing, vol. 300, pp. 51-62, 2014. [24] n. wang and s. marsella. “introducing evg: an emotion evoking game.” lecture notes in computer science, vol. 4133, pp. 282-291, 2006. [25] a. landowska and m. r. wrobel. “affective reactions to playing digital games.” in: proceedings-2015 8th international conference on human system interaction. ieee, united states, pp. 264-270, 2015. [26] w. szwoch. “model of emotions for game players.” in: proceedings 2015 8th international conference on human system interaction. ieee, united states, pp. 285-290, 2015. [27] s. bakker, e. van den hoven, b. eggen and k. overbeeke. “exploring peripheral interaction design for primary school teachers,” proceedings 6th international conference dedicated to research in tangible, embedded, vol. 1, no. 212, pp. 245-252, 2012. [28] s. bakker, e. van den hoven and b. eggen. “fireflies: supporting primary school teachers through open-ended interaction design.” in: proceedings 24th australian computer interaction conference. association for computing machinery, new york, united states, pp. 26-29, 2012. [29] d. hausen, h. richter, a. hemme, and a. butz. “comparing input modalities for peripheral interaction: a case study on peripheral music control.” lecture notes in computer science, vol. 8119, no. 3, pp. 162-179, 2013. [30] r. j. k. jacob. “what you look at is what you get: eye movementbased interaction techniques.” proceedings acm, vol. 90, pp. 11-18, 1990. [31] r. j. k. jacob. “no titlethe use of eye movements in humancomputer interaction techniques: what you look at is what you get.” acm transactions on information systems, vol. 9, pp. 152-169, 1991. [32] r. j. k. jacob. “eye movement-based human-computer interaction techniques: toward non-command interfaces”. in: h. r. harst and d. hix. (eds.), advances in human/computer interaction. vol. 4. hindawi, united kingdom, pp. 151-190, 1993. [33] r. j. k. jacob. 
“what you look at is what you get: using eye movements as computer input.” proc. virtual real. syst., vol. 93, pp. 164-166, 1993. [34] r. j. k. jacob. eye tracking in advanced interface design. in: virtual environments and advanced interface design. vol. 258. oxford university press, inc., oxford, p. 288, 1995. [35] l. e. sibert and r. j. k. jacob. “evaluation of eye gaze interaction.” in: conference on human factors in computing systems proceedings. association for computing machinery, new york, united states, pp. 281-288, 2000. [36] a. murata, t. miyake and m. moriwaka. “effectiveness of the menu selection method for eye-gaze input system.” japanese journal of ergonomics, vol. 47, no. 1, pp. 20-30, 2011. [37] p. isokoski and b. martin. “eye tracker input in first person shooter games.” in: proceedings 2nd conference communication by gaze interact. association for computing machinery, new york, united states, pp. 78-81, 2006. [38] p. isokoski, a. hyrskykari, s. kotkaluoto and b. martin. “gamepad and eye tracker input in fps games: data for the first 50 min.” proceedings 3rd conference communication by gaze interact. cogain, denmark, pp. 1-5, 2007. [39] a. t. duchowski. eye tracking methodology: theory and practice. springer verlag, london, uk, 2003. [40] j. d. smith and t. c. n. graham. “use of eye movements for video game control.” in: proceedings of the 2006 acm sigchi international conference on advances in computer entertainment technology. ace, hollywood, ca, 2006. [41] j. leyba and j. malcolm. “eye tracking as an aiming device in a computer game.” in: course work (cpsc 412/612 eye tracking methodology and applications by a. duchowski). clemson university, clemson, ca, 2004. [42] p. isokoski and b. martin. “eye tracker input in first person shooter games.” in: proceedings of the 2nd conference on communication by gaze interaction: communication by gaze interaction-cogain 2006 gazing into the future. cogain, turin, italy, pp. 78-81, 2006. [43] p. isokoski, m. joos, o. spakov, and b. martin. “gaze controlled games.” universal access in the information society, vol. 8, pp. 323-337, 2009. [44] e. lacorte, g. bellomo, s. nuovo, m. corbo, n. vanacore and p. piscopo. “the use of new mobile and gaming technologies for the assessment and rehabilitation of people with ataxia: a systematic review and meta-analysis.” cerebellum, vol. 20, no. 3, pp. 361-373, 2021. [45] m. dorr, l. pomarjanschi and e. barth. “gaze beats mouse: a case study.” psychnology journal, vol. 7, pp. 197-211, 2009. [46] b. jacob, j. chen, y. huang and i. cohen. “pearson correlation coefficient.” in: noise reduction in speech processing. springer, berlin, heidelberg, pp. 1-4, 2009. tx_1~abs:at/tx_2:abs~at 38 uhd journal of science and technology | jul 2021 | vol 5 | issue 2 1. introduction databases and database systems are an indispensable part of contemporary life; most of us engage in at least one database-related activity each day [1]. simply, everyone can save data and information into a database in order to keep business apparatuses safe and protected. in the case of an emergency, technology has dramatically increased our odds of survival. in reality, technology has enhanced how we live, travel, communicate, study, and be treated medically, as well as how we conduct our lives. technology employed in essential infrastructure that sustains our daily lives is becoming a necessity, and life would be unthinkable without it [2]. today’s databases and whole systems are often subjected to a variety of security risks. 
many of these risks are prevalent in small businesses, but in large businesses and institutions, vulnerability is critical since they contain sensitive information that is utilized by many individuals and departments [3]. it is concerned with protecting databases against some form of unwanted access or danger at any stage. server protection entails allowing or disallowing user behavior on the database and its properties. the security of their database has been sought by well-functioning organizations. they do not let the unlicensed user admittance their files or documents. they also state that their information is safe from any deceptive or unintended variations. the security priority is on data protection and privacy [4]. a review of database security concepts, risks, and problems ramyar abdulrahman teimoor* department of computer, college of science, university of sulaimani, sulaymaniyah, iraq a b s t r a c t currently, data production is as quick as possible; however, databases are collections of well-organized data that can be accessed, maintained, and updated quickly. database systems are critical to your company because they convey data about sales transactions, product inventories, customer profiles, and marketing activities. to accomplish data manipulation and maintenance activities the database management system considered. databases differ because their conclusions based on countless rules about what an invulnerable database constitutes. as a result, database protection seekers encounter difficulties in terms of a fantastic figure selection to maintain their database security. the main goal of this study is to identify the risk and how we can secure databases, encrypt sensitive data, modify system databases, and update database systems, as well as to evaluate some of the methods to handle these problems in security databases. however, because information plays such an important role in any organization, understanding the security risk and preventing it from occurring in any database system require a high level of knowledge. as a result, through this paper, all necessary information for any organization has been explained; in addition, also a new technological tool that plays an essential role in database security was discussed. index terms: database security, attack, threats, protection, encryption, database vulnerability corresponding author’s e-mail: ramyar abdulrahman teimoor, department of computer, college of science, university of sulaimani, sulaymaniyah, iraq. ramyar.teimoor@univsul.edu.iq received: 17-07-2021 accepted: 30-09-2021 published: 10-10-2021 o r i g i n a l r e s e a r c h a r t i c l e uhd journal of science and technology access this article online doi: 10.21928/uhdjst.v5n2y2021.pp38-46 e-issn: 2521-4217 p-issn: 2521-4209 copyright © 2021 teimoor. this is an open access article distributed under the creative commons attribution non-commercial no derivatives license 4.0 (cc by-nc-nd 4.0) ramyar abdulrahman teimoor: database security concepts, risks, and problems uhd journal of science and technology | jul 2021 | vol 5 | issue 2 39 this study concentrates on the database security risks that the database forensics can mitigate, as it is becoming an increasingly important topic for investigation. the study aims at assisting organization to protect data by introducing the highest nine vulnerabilities found in database. 
context information, being up-to-date risk information for each threat, protection data and information user database safety gateway protections, and some other method has also been investigated [5]. what is every organization next issue, are data using database protected? nowadays, security is one of the most critical and difficult tasks people encounter. it is difficult to maintain databases. practitioners of database protection do not comprehend the assaults as well as associated to database protection issues. companies are unaware of the sensitive data contained within databases, tables, and columns, as per it experts and database administrator (admin), since either they are managing inherited presentations or taking no records or maintain the data model documentations. if you know the database properties, the databases are more difficult to be secured due to their specific implementation and procedures. we can describe the database protection as the tool for implementing a wide scale controlling data security, protecting databases internally and externally, as well as compromising database privacy, truthfulness, and accessibility, such as technological, managerial, and bodily controls, are used to ensure security [6]. the following is a breakdown of how this study is structured. in section 2 a further overview of related works is presented. section 3 describes type of attack in any system. in section 4, threat and prevention that may be used against any database system has been explained. in section 5 describe some methods for protecting information in the database, as well as several brand of new technologies that have a positive impact on database security, have been introduced. finally, the study’s conclusion has been described in section 6. 2. literature review in this section, significant amount of work presented. it enabled us to check and use the following sources accordingly. 2.1. thilina [7] this review has focused on the utilization of virtual resources in storing data for database users. it also includes information on the strategies used to address database security problems, as well as database attack and privacy risk mitigation measures. organizations may store data and information in databases using an innovative business model that requires no initial investment. it also includes information on database security needs and assets. this review paper covers database security breaching risks, malware activities on such data, and how to address or mitigate those issues, as well as oracle security database implementation. 2.2. sharma [8] this article discusses the security of relational database protection and security frameworks as an example of how internet application security for explicit database authorization may be designed and implemented. because relational database securities are the most popular target for attackers, protection associations and substance are regarded as significant company resources that must be meticulously protected. this research was conducted to identify the problems and risks associated with relational database security, as well as the requirements for relational database set security and how database relations are used at different levels to provide security. 2.3. albalawi [9] they propose an intelligent system for hiding sensitive data when statistical searches are combined. to begin with, the framework is helpful for defining sensitive information in order for the admin to make decisions and establish regulations. 
second, in the event of rule discrimination based on attribute-orientation, the framework investigates the connection between sensitive and other characteristics, allowing for the selection of attributes that may be used to drive private data. 2.4. juma and makupi [10] in their view, databases are the heart of information systems (is), therefore it is critical to maintain database quality to ensure is quality. recently, determining what constitutes a good database model or architecture has proven to be difficult. as a result, they measured certain characteristics and aspects in a database implementation in their discussion. a measure of evaluation is created using the many elements and qualities inherent in a database. 2.5. odirichukwu and asagba [11] they believe that the number of businesses putting their data online is growing every day, enabling people to engage with and manipulate data all around the world. more information on the internet. as the number of websites on the internet grows, so does the number of database security risks. ramyar abdulrahman teimoor: database security concepts, risks, and problems 40 uhd journal of science and technology | jul 2021 | vol 5 | issue 2 ensuring security in developed applications, owing to a combination of factors a lack of security incentives, a tight timeline, and a web deficit training on application security testing a review of the literature is presented in this article on twenty database security risks that affect web applications. control actions that might be taken to prevent this attacks were investigated to raise awareness about online security the general population, as well as application developers. the work expresses an opinion that developers should make every effort to incorporate all of the required features while developing apps, take security precautions. involvement is also important. all developers should get security testing training. the task at hand despite ensuring sufficient security, the author finds that admin should develop a method of maintaining a continuous backup of their database apps available online. 2.6. paul and aithal [3] this article discusses the fundamentals of databases, such as their meaning, features, and roles, with an emphasis on various database security issues. furthermore, this article emphasizes the fundamentals of security management, as well as relevant technologies. as a result, various aspects of database security have been briefly discussed in this article. 2.7. mousa et al. [12] according to the authors of this study, assaulters would rather attack the database because of the data sensitivity and value. databases compromised in many different ways. the database should be secured against different forms of attacks and threats. most of the assaults identified in this study can be solved. some of the assaults are real and some are not. in this article, they discuss various types of assaults. 2.8. singh and rai [13] they concluded that databases are the foundation of modern applications. for businesses, they are the primary storage option. as a result, database attacks are on the rise, but crucially threatening. they give the intruder (int) access to sensitive information. this study discusses a variety of database assaults. this study also includes a review of relevant database security strategies as well as potential study in the field of database protection. this study will result in a more concrete approach to the database security issue. 2.9. 
tabrizchi and rafsanjani [14] the goal of this project is to examine the many components of cloud computing as well as the current security and privacy issues that these systems confront. furthermore, this work introduces a new classification system for recent security solutions in this field. this study also addressed outstanding problems and suggested future approaches, as well as introducing different kinds of security risks that are affecting cloud computing services. this article will concentrate on and investigate the security issues that cloud organizations, such as cloud service providers, data owners, and cloud users, confront. 3. type of attack in a database, there are several protection layers. an int will compromise protection at all of these levels, which include the database admin, server admin, security officer, developers, and employees [5]. three types of attackers can be found [15]: a. intruder (int) int is an unwanted user who attempts to obtain useful information from a computer device by manipulating it excessively. b. insider (ins) ins is one of the members of trustworthy users who violates his or her permission and attempts to obtain knowledge outside his or her own allocation. c. administrator (admin) admin is a user with authority to operate a computer system who, in violation of the organization’s security policies, abuses his or her management rights by spying on database management systems (dbms) activities and obtaining sensitive data. when an attacker breaks into the system, the two of the following attacked can be conducted [1]: 3.1.1. direct attacks it refers to targeting the goal data first. these attacks are only possible and effective if the database has no security mechanism in place. if this attack is unsuccessful, the int will move on to the next. 3.1.2. indirect attacks it does not explicitly attack the goal, nonetheless data from or around the goal can be obtained by other in-between items, as the name suggests. many of the variations of various questions are used and try to get through the authentication mechanism. it is difficult to keep track of these threats. in general, database attacks are composed of two types [5] which are: ramyar abdulrahman teimoor: database security concepts, risks, and problems uhd journal of science and technology | jul 2021 | vol 5 | issue 2 41 3.2.1. passive attack in this case, the int just inspects the data in the database and makes no changes. the following are few examples of passive attacks: 1. static leakage: this attack obtains data about database plaintext content by analyzing a database snap taken at a given period. 2. outflow of information: in this case, data about plaintext values can be accessed by connecting database values to the index location of mentioned values. 3. dynamic leakage: modifications made to a database over time may be detected and evaluated, as well as facts about plain text values. 3.2.2. active attack: real database values are changed during an ag gressive attack. these are more dangerous than passive attacks because they can lead to consumer confusion. for instance, a user can incorrectly capture information as a result of a query [5]. there are many methods for carrying out such an attack, which are mentioned below: 1. spoofing – in this attack, a produced value is substituted for the cipher text value. 2. splicing – this involves replacing a cipher text value with a new cipher text value. 3. 
replay – this is an attack in which the cipher text value is replaced with an older version that was previously changed or removed. because of the data they carry and their size, databases are the most popular target for cybercriminals [1]. a variety of database security risks and issues are addressed in this article. 4. threat and prevention in this part, we go through nine of the most dangerous threats that may be utilized against databases, as well as how to prevent them from happening. 4.1. first threat: excessive privilege abuse. when users are granted database access privileges that go beyond what their job requires, those privileges can be exploited for malicious purposes. for example, a university administrator whose job only requires the ability to alter student contact details could use excess database update privileges to change grades. because administrators cannot define and maintain granular privilege management processes for every database user, users often end up with default access privileges that exceed the requirements of their tasks [16]. prevented by: query-level access control – excessive privilege abuse prevention. query-level access control can be the solution to excessive rights: it restricts database privileges to the minimum required sql operations (select, update, and so on) and data. the granularity of the controlled data should extend beyond the table to individual rows and columns. a granular query-level access control mechanism would permit the previously mentioned university administrator to update contact records while raising an alarm if he or she tried to change grades. query-level access control is valuable not only for detecting malicious workers who misuse their privileges but also for detecting unnecessary privilege abuse in general, and for preventing most of the attacks identified here. (a simplified illustration appears in the sketch below.) database software implementations include a certain level of query-level management (triggers, row-level security, and so forth), but the manual nature of these "built-in" features makes them impractical for all but the smallest deployments: manually defining a query-level access control policy over the rows, columns, and operations of all database users takes a great deal of time and, to make matters worse, as user roles change over time, the query policies must be updated to reflect those changes. most database administrators would struggle to define a useful query policy for a handful of users at a single point in time, let alone for a larger group of users over time. consequently, most organizations grant users access rights far broader than their work requires. automated tools are necessary for real-time query-level access control to become a reality [16]. 4.2. second threat: authentic privilege abuse. users can use valid database privileges to perform unauthorized tasks. consider a hypothetical malicious healthcare worker who has access to patient details through a custom web application. the web application architecture usually limits users to viewing the medical history of a single patient at a time; it is not possible to display several records at once, and electronic copies are not permitted.
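to make the query-level access control described under the first threat more concrete, the following is a minimal, hypothetical python sketch (not a tool discussed in this article): each role is mapped to the sql operations and columns it may touch, and a request is refused before it ever reaches the dbms if it falls outside that allow-list. the role, table, and column names are illustrative only.

```python
# minimal illustration of query-level (granular) access control:
# a role may only run the listed operations on the listed columns.
POLICY = {
    "registrar_admin": {
        ("update", "students"): {"phone", "email", "address"},   # contact details only
        ("select", "students"): {"student_id", "phone", "email", "address"},
    },
}

def is_allowed(role: str, operation: str, table: str, columns: set[str]) -> bool:
    """return True only if every requested column is permitted for this role."""
    allowed = POLICY.get(role, {}).get((operation.lower(), table.lower()))
    return allowed is not None and columns <= allowed

# the registrar admin may update contact details...
print(is_allowed("registrar_admin", "UPDATE", "students", {"phone"}))        # True
# ...but an attempt to change grades is refused (and could raise an alarm).
print(is_allowed("registrar_admin", "UPDATE", "students", {"final_grade"}))  # False
```

a real deployment would enforce such rules inside the dbms or in an automated monitoring gateway rather than in application code, which is exactly why the article stresses automated tools for real-time query-level access control.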
the malicious worker, on the other hand, can get around those obstacles using a different client, such as ms-excel connected to the database. using ms-excel and his or her legitimate login credentials, the worker can retrieve and save all patient records. such personal copies of medical record files are unlikely to adhere to any of the healthcare organization's patient record security policies. two risks must be considered. 1) the malicious employee may trade personal information for cash. 2) a careless employee may retrieve and save significant quantities of data to a client computer for legitimate work purposes; once the information is stored on an endpoint device, it is exposed to trojan horses, computer theft, and other attacks. prevented by: legitimate privilege abuse prevention. the solution for legitimate privilege misuse is database access management that applies not only to queries but also to the context surrounding database access. by enforcing policy for client applications, location, and time of day, one can identify users who are abusing legitimate database access rights. 4.3. third threat: privilege elevation. attackers can also use vulnerabilities in database platform software to convert a regular user's access privileges into those of an administrator. stored procedures, built-in functions, protocol implementations, and sql statements may all be vulnerable. for instance, a software developer at a financial institution might take advantage of a vulnerable function to acquire database administrative privileges. with those privileges, the malicious developer can disable audit mechanisms, create fictitious accounts, transfer funds, and more [13]. prevented by: privilege elevation prevention – a combination of intrusion prevention systems (ips) and query-level access control (qlac). privilege elevation exploits can be avoided by combining traditional ips with qlac (see excessive privileges above). ips examines database traffic for patterns that correspond to known weaknesses. when a function is identified as vulnerable, for instance, ips can block all access to the vulnerable procedure or, where possible, block only those requests that carry embedded attacks. 4.4. fourth threat: platform vulnerabilities. unauthorized access, data corruption, or denial of service can result from flaws in the underlying operating systems (windows 2000, unix, and so forth) and additional services installed on a database server. for instance, the blaster worm exploited a weakness in windows 2000 to create denial-of-service (dos) conditions [13]. with the advancement of technology, security has improved as well, and many such vulnerabilities have been fixed in later versions of windows and other platforms. prevented by: avoiding attacks, updating the software, and blocking intruders. protecting database assets requires a mixture of regular software updates and network-level ips. over time, vendor-supplied updates mitigate vulnerabilities discovered in the database platform. unfortunately, businesses cannot always deploy and enforce software upgrades promptly, so databases remain unprotected during the periods between updates. furthermore, compatibility problems can sometimes prevent software upgrades altogether. ips must be introduced to address these problems; as mentioned before, ips examines database traffic and detects attacks that target known vulnerabilities.
4.5. fifth threat: sql injection. in a sql injection attack, the perpetrator inserts (or "injects") unauthorized database statements into a vulnerable sql data channel. stored procedures and web application input parameters are typical examples of targeted data channels. the injected statements are then passed to the database, where they are executed. sql injection permits attackers to acquire unrestricted access to the entire database [13]. prevented by: sql injection prevention. ips, query-level access control (see excessive privilege abuse), and event correlation are three strategies that can be combined to effectively fight sql injection, supported by measures such as: 1. input validation and parameterized queries 2. avoiding administrative privileges 3. using a web application firewall. (a brief illustration of parameterized queries is given after section 4.7 below.) ips can detect sql injection strings or stored procedures that are vulnerable to attack. however, we believe that ips alone is unreliable because sql injection signatures are prone to false positives; security managers who rely exclusively on ips are inundated with "possible" sql injection alerts. nevertheless, by correlating a sql injection signature with another violation, such as a query-level access control violation, a real attack can be pinpointed with high precision, because a sql injection signature and another violation are unlikely to occur in the same request during normal business operations. 4.6. sixth threat: weak audit trail. any database deployment should include automated recording of all sensitive and/or unusual database transactions. a weak database audit policy poses a serious threat to the organization in several respects: • regulatory risk • deterrence • detection and recovery • lack of user accountability • performance degradation • separation of duties • limited granularity • proprietary. prevented by: preventing weak audit. the majority of the weaknesses associated with native audit tools are resolved by high-quality network-based audit appliances, which provide: • high performance • separation of duties • cross-platform auditing. 4.7. seventh threat: denial of service (dos). another common form of cyber-attack is dos, in which legitimate users are denied access to network applications or data. dos conditions can be created by many common techniques, several of which are linked to the vulnerabilities already mentioned. dos can be motivated by different factors: extortion scams are often associated with dos attacks, in which a remote attacker repeatedly crashes servers until the victim deposits money into an international bank account; a worm infection may instead be to blame for a dos condition. whatever the motive, dos poses a serious threat to availability for most companies [17]. prevented by: dos prevention. dos prevention necessitates several layers of protection; protections at the network, application, and database levels are all critical. this study focuses on database-specific security, and the recommendations concern connection rate controls, query access control, ips, and response timing controls in database-specific contexts.
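as a concrete illustration of the input validation and parameterized queries listed under the fifth threat (sql injection) above, the short python sketch below contrasts an injectable query built by string concatenation with a parameterized one. it uses only the standard sqlite3 module and a throwaway in-memory database; the table and values are hypothetical, not taken from this article.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patients (id INTEGER, name TEXT)")
conn.execute("INSERT INTO patients VALUES (1, 'alice'), (2, 'bob')")

user_input = "1 OR 1=1"   # a classic injection payload

# vulnerable: the payload becomes part of the sql text, so every row is returned
unsafe_sql = "SELECT * FROM patients WHERE id = " + user_input
print(conn.execute(unsafe_sql).fetchall())               # [(1, 'alice'), (2, 'bob')]

# parameterized: the payload is bound as a plain value, never parsed as sql
safe_sql = "SELECT * FROM patients WHERE id = ?"
print(conn.execute(safe_sql, (user_input,)).fetchall())   # [] - no row has this id
```

the same principle applies to stored procedure parameters and to any other data channel listed above: values supplied by users should reach the database only as bound parameters, never as part of the statement text.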
4.8. eighth threat: weak authentication. by stealing or otherwise obtaining login credentials, attackers can assume the identity of legitimate database users under vulnerable authentication schemes. to acquire credentials, an attacker can use any of a number of methods [18]: • cryptanalytic attack • social engineering • direct credential theft. prevented by: preventing authentication attacks. 1. strong authentication: it is crucial to use the strongest authentication technologies and rules that are practical. where possible, two-factor authentication (tokens, certificates, biometrics, etc.) is preferred. unfortunately, cost and ease-of-use considerations frequently make strong authentication impractical; such situations call instead for strict username and password policies (minimum length, character variety, and obscurity). 2. directory integration: integrating strong authentication mechanisms with the enterprise directory infrastructure improves scalability and ease of use. a directory infrastructure, among other things, allows a user to employ one set of login credentials for numerous databases and applications; it also increases the practicality of a two-factor authentication system and makes it easier for users to remember passphrases that change on a regular basis. 4.9. ninth threat: backup data exposure. database backup storage media are, in certain cases, completely unprotected. as a result, the theft of backup disks and hard drives has been at the center of many high-profile security breaches. prevented by: preventing backup data exposure. all database backups should be encrypted; indeed, certain vendors have suggested that future dbms products may not support the creation of unencrypted backups. encryption of online production database information is also frequently advised; however, performance and cryptographic key management issues often make it impractical, and it is generally regarded as a weak substitute for the granular privilege controls described above. 5. method for protecting database system. in this section, two types of method are explained. the first is to eliminate security risks: any company must have a security policy in place that is actually followed. authentication is crucial in a security policy, since proper authentication reduces the probability of attacks. different users have different access rights on different database objects, and the management of access rights is the responsibility of access control systems. these are the most elementary practices for safeguarding database content, and the majority of dbmss support them [5]. the control methods concerning database protection are depicted in fig. 1. 5.1. access control. access control is the most basic service that any dbms can offer; it safeguards data from unauthorized reads and writes. all access to the database and to other system objects must adhere to the policies defined by access control; errors here may be serious enough to disrupt a company's operations. controlling access rights can help mitigate risks that have a direct effect on the security of the main database server. access control is able to prevent the accidental deletion or modification of a table, and it can support rollback and prevent the deletion of particular files.
access control systems consist of: • file permissions – the rights to create, read, edit, or delete files on the server. • program permissions – the rights to execute an application program on the server. • data rights – the rights to retrieve or update data in a database. 5.2. inference strategy. protecting data at a particular level is critical; inference control is applied when the processing of specific data items must stop at a given level of protection, and it helps determine how to keep knowledge from being disclosed. the aim of inference control is to prevent information from being revealed indirectly. unauthorized data disclosure can occur in one of three ways: • correlated data – a popular channel, in which visible data x and invisible data y are semantically linked. • missing data – null values in a query result mask sensitive data; even so, the existence of the data can be detected. • statistical inference – common in databases that contain numerical data about individuals. 5.3. identification or authentication of the user. knowing your users is a basic security requirement. after you have classified users, you need to determine what privileges and access permissions they hold, as well as verify the data they are allowed to use. before a user is allowed to connect to a database, they should be authenticated in one of several ways. user identification and authentication are part of database authentication; the os or a network service can perform an external authentication process. to establish user authentication, secure sockets layer (ssl), enterprise components, and middle-tier server authentication, also known as proxy authentication, can all be used. identification, which defines the collection of people permitted to access the data, is the most basic prerequisite for ensuring protection. to ensure confidentiality, identity authentication is the starting point for preventing unauthorized users from modifying sensitive data. attackers use various methods such as authentication bypass, default passwords, privilege escalation, brute-force password guessing, and rainbow-table attacks when attempting to breach user identification and authentication [16]. 5.4. audit and accountability. auditing monitors the configured database behavior of database and non-database users. accountability refers to the method of keeping track of user activities on a device. auditing checks and accountability are required to ensure the physical integrity of the data, which means that specific database accesses must be carried out under audit, helping to maintain the resiliency of the data. when an authenticated user accesses a resource, the system tracks both successful and unsuccessful attempts, and the attempted accesses and their statuses appear in the audit trail files [16]. 5.5. encryption. encryption is a method of translating information into cipher or code so that only those who have access to the key can read it; encrypted data is referred to as cipher or encoded text. for data security in a database, data exists in two states: at rest (data stored in a database, on a backup disk, or on a hard drive) and in motion (data transiting the network), and each state necessitates its own encryption solutions. the problems of data at rest can be solved by encrypting it. fig. 1. control methods for protecting a database system: access control, inference strategy, user identification/authentication, accountability and auditing, and encryption.
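as a minimal sketch of the encryption-at-rest idea in section 5.5 (not a method prescribed by this article), the python snippet below uses the third-party cryptography package to encrypt a sensitive value before it is stored; the record content is hypothetical, and a real deployment would still face the key management issues the article warns about.

```python
# minimal sketch of encrypting a sensitive value before it is stored (data at rest);
# requires the third-party "cryptography" package.
from cryptography.fernet import Fernet

key = Fernet.generate_key()      # in practice the key lives in a key-management system
fernet = Fernet(key)

plaintext = b"patient: alice; diagnosis: ..."    # hypothetical sensitive record
ciphertext = fernet.encrypt(plaintext)            # store this in the database or backup

# only holders of the key can recover the original value
assert fernet.decrypt(ciphertext) == plaintext
```

for data in motion, the tls configuration mentioned in the next paragraph is the usual counterpart.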
utilize solutions such as ssl/transport layer security for data in transit [16]. in the second method, any organization may take advantage of new technology tools that have a significant effect on database security, such as: 1. database firewalls: a kind of web application firewall that monitors databases to detect and defend against database-specific attacks, which are usually aimed at gaining access to sensitive data contained in the databases. database firewalls also allow you to monitor and audit every database access via the logs they keep, and specific compliance reports for regulations such as pci, sox, and others may be generated by a database firewall [19]. here are some tools: • cloudflare • sitelock • tufin securetrack • manageengine firewall analyzer • firemon • algosec. 2. real-time data monitoring (rtdm): an admin may examine, analyze, and change the addition, deletion, modification, and usage of data on software, a database, or a system using rtdm. through graphical charts and bars on a single interface/dashboard, data managers may examine the general operations and functions performed on the data in real time, as they happen [20]. here are some tools: • real-time database profiler tool • firebase console • cloud monitoring. 3. multi-factor database authentication: a technique and technology for confirming a user's identity that requires two or more credential categories for the user to log into a system or complete a transaction. this technique requires successfully presenting at least two separate credentials, such as entering a password, e-mail verification, phone verification, or answering a security question [21]. here are some tools: • lastpass • duo security • ping identity • rsa securid access. 6. conclusion. the database security problems, and research into the various issues affecting the industry, have been surveyed in this paper. organizations now depend on data to make decisions about different business processes that will improve their bottom line; as a result, it is a smart idea to keep confidential details secure from prying eyes. research papers on server security have attempted to investigate potential assaults on database systems, such as loss of confidentiality and integrity. because of the knowledge and volume contained in databases, they are the most common and simplest targets for attackers. there are many options for protecting a database. today, there are several forms of attacks and threats against which a database should be secured. this paper discusses the decisions that must be made in order to protect personal data from attackers. it also goes into depth about how a loss of privacy can lead to extortion and humiliation in the workplace. this survey also looked at strategies for dealing with each form of hazard; views and authentication should be used in this case. another method is to use an encryption strategy, which means the information is secured so that if an int finds it, he or she is unable to use it. the criteria for a reliable dbms were also discussed. references [1] m. malik and t. patel. "database security attacks and control methods". international journal of information technology, vol. 6, no. 1/2, pp. 175-183, 2016. [2] i. ghafir, j. saleem, m. hammoudeh, h. faour, v. prenosil, s. jaf, s. jabbar and t. baker.
“security threats to critical infrastructure: the human factor”. the journal of supercomputing, vol. 74, no. 10, pp. 4986-5002, 2018. [3] p. k. paul and p. s. aithal. “database security: an overview and analysis of current trend”. international journal in management and social science, vol. 4, no. 2, pp. 53-58, 2019. [4] s. b. sadkhan. “related papers”. over rim, pp. 191-199, 2017. [5] h. kothari, a. suwalka and s. kumar. “various database attacks, approaches and countermeasures to database security”. international journal of advance research in computer science and management studies, vol. 5, no. 5, pp. 357-362, 2019. [6] j. c. ogbonna, f. o. nwokoma and a. ejem. “database security issues: a review”. international journal of engineering inventions, vol. 6, no. 8, pp. 1812-1816, 2017. [7] t. dharmakeerthi. “a study on security concerns and resolutions”. researchgate. net, no. may, 2020. [8] e. technology and v. sharma. “an analytical disparity of harbor tools erection for database system”. international research journal of modernization in engineering technology and science, vol. 3, no. 2, pp. 501-510, 2021. [9] u. albalawi. “countermeasure of statistical inference in database security”. proceeding 2018 ieee international conference big data, big data 2018, pp. 2044-2047, 2019. [10] j. juma and d. makupi. “understanding database security metrics: a review”. vol. 1. mara international journal of social sciences research publications, pp. 40-47, 2017. ramyar abdulrahman teimoor: database security concepts, risks, and problems 46 uhd journal of science and technology | jul 2021 | vol 5 | issue 2 [11] j. c. odirichukwu and p. o. asagba. “security concept in web database development and administration a review perspective. 2017 ieee 3rd international conference electro-technology national development nigercon 2017, vol. 2018-janua, pp. 383-391, 2018. [12] a. mousa, m. karabatak and t. mustafa. “database security threats and challenges”. 8th international symposium digital forensics secur. isdfs 2020, vol. 3, no. 5, pp. 810-813, 2020. [13] s. singh and r. k. rai. “a review report on security threats on database”. international journal of computer science and information technologies, vol. 5, no. 3, pp. 3215-3219, 2014. [14] h. tabrizchi and m. k. rafsanjani. “a survey on security challenges in cloud computing: issues, threats, and solutions”. vol. 76. springer, united states, 2020. [15] p. sharma. “database security: attacks and techniques”. international journal of scientific and engineering research, vol. 7, no. 12, pp. 313-319, 2016. [16] s. s. sarmah. “database security threats and prevention”. international journal of computer trends and technology, vol. 67, no. 5, pp. 46-53, 2019. [17] t. mahjabin, y. xiao, g. sun and w. jiang. “a survey of distributed denial-of-service attack, prevention, and mitigation techniques”. international journal of distributed sensor networks, vol. 13, no. 12, 2017. [18] h. b. hashim. “challenges and security vulnerabilities to impact on database systems”. al-mustansiriyah journal of science, vol. 29, no. 2, p. 117, 2018. [19] w. lee. “lecture notes in electrical engineering 461 proceedings of the 7th international conference on emerging databases”, 2019. [20] i. kotsiuba, m. nesterov, y. yanovich, i. skarga-bandurova, t. biloborodova and v. zhygulin. “multi-database monitoring tool for the e-health services. proceeding 2018 ieee international conference big data, big data 2018, pp. 2442-2448, 2019. [21] c. hamilton and a. olmstead. 
“database multi-factor authentication via pluggable authentication modules”. 2017 12th international conference internet internet technology and secured transactions icitst 2017, pp. 367-368, 2018. tx_1~abs:at/tx_2:abs~at 76 uhd journal of science and technology | jan 2022 | vol 6 | issue 1 1. introduction bread is the most consumed and staple food in many countries around the world. it is made from dough of flour such as wheat and barley, and water. it usually contains several ingredients such as table salt, sugars, flavors, and flour improver [1]. potassium bromate (kbro 3 ) was commonly used due to its low cost and acts as a slow oxidizing agent and it makes the dough more strength, and more elastic [2]. many studies have confirmed the deleterious effects of potassium bromate on human health. for example, according to a study done on mice, potassium bromate administration caused impairment in renal and hepatic tissues. it also increased plasma creatinine levels and decreased antioxidant capacity [3]. another study found that kbro 3 exposed mice had increased lipid peroxidation, protein oxidation, and numerous degenerative changes in the cerebellum tissues [4]. in addition, important vitamins in bread such as thiamine (b1) and niacin (b3) were destroyed by the effects of kbro 3 . carcinogenic and mutagenic effects of kbro 3 were also confirmed in experimental animals [5]. the center for science and environment (cse) [6] indicated that some determination of potassium bromate in bread brands in sulaimani city, kurdistan-iraq sardar m. weli1, sabiha m. salih2, abdullah a. hama2,3*, ary b. faiq2, fatimah m. ali4 1department of nursing and research center, college of health and medical technology, sulaimani polytechnic university, kurdistan region, iraq, 2department of medical laboratory and research center, college of health and medical technology, sulaimani polytechnic university, kurdistan region, iraq, 3department of medical laboratory science, college of health science, university of human development, kurdistan region, iraq, 4department of nursing and research center, sulaimani technical institute, sulaimani polytechnic university, kurdistan region, iraq a b s t r a c t bread is the most consumed and staple food in many countries worldwide. it is made from dough of flour such as wheat and barley, and water. it usually contains flour improver potassium bromate (kbro 3 ) which is used by bakers. however, many studies have confirmed the deleterious effects of kbro 3 on human health. therefore, this study aimed to determine the rate of kbro 3 in five main types of bread in sulaimani city, kurdistan-iraq. the duration of the study was from august 2021 to november 2021. thirty bread samples were collected from five main products that are extremely consumed by kurdish citizens. the bread-type products were bakery bread (nani frn), white hamburger bread (samun), white bread known as kurdish bread (nani hawrami), pizza, and brown barley bread. single beam uv–visible spectrophotometer apel-303 was used for the quantification of kbro 3 in bread samples. the results found that all 30 samples were had kbro 3 residues in their products with different concentrations. samples of brown barley bread were having the least content of kbro 3 while samples from pizza dough were having the highest concentration of kbro 3 . the present study concludes that all bread samples from five major bread types had potassium bromate above the permitted levels allowed by the united states food and drug agency (fda). 
index terms: potassium bromate, white bread, barley bread, flour improver, spectrophotometer corresponding author's e-mail: abdullah a. hama, department of medical laboratory and research center, college of health and medical technology, sulaimani polytechnic university, kurdistan region, iraq, and department of medical laboratory science, college of health science, university of human development, kurdistan region, iraq. e-mail: abdullah.hama@spu.edu.iq received: 04-03-2022 accepted: 07-04-2022 published: 19-05-2022 access this article online doi: 10.21928/uhdjst.v1n1y2017.pp76-79 e-issn: 2521-4217 p-issn: 2521-4209 copyright © 2022 weli et al. this is an open access article distributed under the creative commons attribution non-commercial no derivatives license 4.0 (cc by-nc-nd 4.0) original research article uhd journal of science and technology studies found a link between bromate and cancer, so the global scientific expert committees and the cse suggested reducing the allowed limit of use; they also recommended that kbro 3 should not be used as a flour treatment agent. many studies have shown that potassium bromate may cause detrimental health effects in humans [7,8]. in the same area (port harcourt metropolis, rivers state, nigeria), the concentration of kbro 3 in all samples was above the allowed concentration, and the authors advised consumers that bread from the study area may be harmful to their health [9]. due to the harmful effects of this substance, many countries, including france, the united kingdom, and canada, have removed kbro 3 from the list of acceptable additive substances for flour [10]. however, the maximum permitted dose of kbro 3 in bread in other countries such as japan, china, and the usa is 10 mg/kg, 50 mg/kg of flour mass, and 0.02 mg/kg, respectively [11]. most studies indicate that the concentration of potassium bromate in bread exceeds the acceptable limit of 0.02 µg/g set by the fda: in delta state, all 15 bread brand samples contained a higher concentration of kbro 3 than the permitted range [12], and the authors state that this can be very dangerous for bread consumers in the study area; in a study in erbil, the level of kbro 3 was found to be higher (6.66 mg/l–67.45 mg/l) than the permissible limit set by the fda [13]. this study aimed to determine the level of kbro 3 in different types of bread in sulaimani city, kurdistan-iraq. 2. materials and methods 2.1. collection of samples bread samples were collected during the day (morning and afternoon) from different bakeries and from different locations in sulaimani city from august 2021 to november 2021. the locations were the city center, ibrahim-pasha, ibrahim-ahmad, kani-ba, sarchnar, tui-malik, and family mall. thirty bread samples were collected from five main products that are heavily consumed by kurdish citizens. the bread-type products were bakery bread (nani frn), white hamburger bread (samun), white bread known as kurdish bread (nani hawrami), pizza, and brown barley bread. 2.2. preparation of samples samples were prepared according to a procedure that has been described and used by abdulla and hassan [14]. a small part (about 2 cm) from the center of each bread sample was dried in the oven for 72 h at 55°c. after drying, the sample was ground to a powder with an electric grinder. 2.5 g of the powder was dissolved in 25 ml of distilled water.
after centrifuging, the liquid fraction was diluted to 50 ml. 2.3. standard preparations a stock solution of 200 ppm potassium bromate (kbro 3 ) was prepared by dissolving 0.200 g of kbro 3 in 1 l of distilled water. the standard series solutions of potassium bromate were prepared from the stock solution at 0, 4, 12, 20, and 40 ppm. 2.4. method 5 ml of standard or sample solution was mixed with 5 ml of 1% ki and 10 ml of 0.1 n hcl, and then made up to 100 ml. the standards and samples were read after 10 min on a single beam uv–visible spectrophotometer apel-303 at a wavelength of 420 nm, with a calibration curve used for quantification of the samples. 2.5. data analysis data were entered into the statistical package for the social sciences "spss" version 26 for storage and statistical analysis. the one-way anova test was applied to test for association between different groups, with p = 0.05 or less considered significant. 3. results the results of this study found that all 30 samples from five main types of bread contained different amounts of kbro 3 residues. sample number 26 (brown barley bread) had the lowest content of kbro 3 , while sample number 21 (pizza dough) had the highest concentration of kbro 3 (table 1). the calibration curve of this study is shown in fig. 1. in addition, this study found that the concentrations of kbro 3 were highest in the pizza group and lowest in the brown barley bread group. the means and standard errors of all groups, with significant differences between each group of bread types, are shown in table 2 and fig. 2. there were differences in the means of all groups; however, there were no significant differences between kurdish bread, white bakery bread, and brown barley bread. on the other hand, there was a significant difference between pizza flour, brown barley bread, and kurdish bread. 4. discussion this study was carried out to determine the level of potassium bromate (kbro 3 ) in the bread samples and to find the highest and lowest concentrations of kbro 3 in different types of bread. thirty samples from five major consumed types of bread were analyzed and kbro 3 was found in all samples. according to the us food and drug agency (fda), an amount of kbro 3 in bread higher than 0.02 µg/g (i.e., 0.02 parts per million) is considered not safe for human consumption [15]. all 30 samples of the present study had concentrations of kbro 3 higher than the permitted levels, so that bread of all major types in sulaimani city may be unsafe for human consumption. this is in agreement with a study done in hawler city, kurdistan region of iraq; they found that the residual bromate level in the analyzed bread samples measured by spectrophotometer was in the range from 6.66 mg/l to 67.45 mg/l [13]. in addition, a study in iraq (baghdad city) found that electrical samun and loaf had 10 and 0.3 µg/g potassium bromate, respectively; these levels were higher than the permissible level set by the us food and drug agency (fda). they also found that exposed bread industry workers had elevated chromosomal aberrations (ca), represented by chromatid breaks (cb), micronuclei (mn), and ring chromosomes (rc) [16]. another study in iraq (basrah city) found harmful effects of potassium bromate on both hematological and biochemical parameters in rats.
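referring back to the calibration procedure in sections 2.3 and 2.4, the short python sketch below illustrates, with made-up absorbance readings (the study's actual measured values are not reproduced here), how a calibration line fitted to the standard series can be used to read off a sample concentration.

```python
# illustrative only: the absorbance values below are hypothetical, not measured data.
import numpy as np

std_conc = np.array([0, 4, 12, 20, 40], dtype=float)   # ppm standards, as in section 2.3
std_abs  = np.array([0.00, 0.11, 0.33, 0.55, 1.10])     # hypothetical readings at 420 nm

slope, intercept = np.polyfit(std_conc, std_abs, 1)      # least-squares calibration line

def to_ppm(absorbance: float) -> float:
    """convert a sample absorbance into a kbro3 concentration via the calibration line."""
    return (absorbance - intercept) / slope

sample_abs = 0.27                                        # hypothetical bread-extract reading
print(round(to_ppm(sample_abs), 3), "ppm kbro3 in the measured solution")
```

any dilution made during sample preparation (section 2.2) would then be factored back in to express the result per gram of bread.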
liver enzymes (a.l.t and a.s.t) were increased and blood table 1: concentrations of potassium bromate (ppm) in all bread samples samples type of breads quantity of kbro3 (ppm) 1 kurdish bread (nani hawrami) 9.747 2 kurdish bread (nani hawrami) 9.747 3 kurdish bread (nani hawrami) 7.58 4 kurdish bread (nani hawrami) 6.137 5 kurdish bread (nani hawrami) 4.693 6 kurdish bread (nani hawrami) 5.415 7 white hamburger bread (samun) 14.801 8 white hamburger bread (samun) 11.913 9 white hamburger bread (samun) 21.227 10 white hamburger bread (samun) 10.65 11 white hamburger bread (samun) 16.968 12 white hamburger bread (samun) 21.227 13 white bakery bread (nani frn) 7.581 14 white bakery bread (nani frn) 13.791 15 white bakery bread (nani frn) 12.635 16 white bakery bread (nani frn) 10.65 17 white bakery bread (nani frn) 16.968 18 white bakery bread (nani frn) 5.415 19 pizza flour 11.913 20 pizza flour 9.747 21 pizza flour 29.783 22 pizza flour 26.534 23 pizza flour 27.076 24 pizza flour 21.227 25 brown barley bread 7.581 26 brown barley bread 4.693 27 brown barley bread 5.415 28 brown barley bread 6.859 29 brown barley bread 9.747 30 brown barley bread 6.137 table 2: concentrations of kbro3 (ppm) in all five groups of bread samples group number type of bread concentrations of kbro3 means±se 1 kurdish bread (nani hawrami) 7.22±2.18 a 2 white hamburger bread (samun) 16.13±4.52 be 3 white bakery bread (nani fern) 11.17±4.22 ab 4 pizza flour 21.05±8.41 cb 5 brown barley bread (nani jo) 6.74±1.79 ad values are presented as means±se (n=6 sample/group). different capital letters denote significant differences between groups (p<0.05). fig. 1. calibration curve. fig. 2. concentrations of kbro 3 (ppm) in all five groups of bread samples. values are presented as means ± se (n=6 sample/group). weli et al.: determination of potassium bromate in bread uhd journal of science and technology | jan 2022 | vol 6 | issue 1 79 parameters (rbc, hb, wbc, and pcv) were decreased [17]. a recent study which is done in dhaka city in bangladesh showed that 67% of collecting samples were had kbro 3 above the permitted level [18]. the present study also found that there were different concentrations of residues of kbro 3 in different types of bread. the concentrations of kbro 3 were highest in the pizza group and lowest in the brown barley bread group. this agrees with a study done in tunis country. they observed different concentrations of bromate residues in different types of bread. the muffin contained the highest mean concentration of bromate residue as opposed to bread without salt, which had the lowest mean bromate level [6]. moreover, a study in nigeria found that 25% of the bread samples were had potassium bromate above the permissible limit allowed by the us food and drug agency (fda) and explained that these samples are unsafe for human consumption [19]. 5. conclusion the present study concludes that all bread samples from five major bread types had potassium bromate above the permitted levels allowed by the us food and drug agency (fda). in general, all samples are unsafe for human consumption; however, the riskiest samples that have a greater concentration of potassium bromate were pizza flour and white bakery bread. the kurdish bread and brown barley bread have a lower concentration of potassium bromate compared to others. references [1] m. o. emeje, s. i. ofoefule, a. c. nnaji, a. u. ofoefule and s. a. brown. “assessment of bread safety in nigeria: quantitative determination of potassium bromate and lead”. 
african journal of food science, vol. 4, no. 6, pp. 394-397, 2010. [2] a. abu-obaid, s. abu-hasan and b. shraydeh. “determination and degradation of potassium bromate content in dough and bread samples due to the presence of metals”. american journal of analytical chemistry, vol. 7, pp. 487-493, 2016. [3] n. g. altoom, j. ajarem, a. a. allam, s. n. maodaa and m. a. abdel-maksoud, “deleterious effects of potassium bromate administration on renal and hepatic tissues of swiss mice”. saudi journal of biological sciences, vol. 25, no. 2, pp. 278-284, 2018. [4] h. b. saad, d. driss, i. jaballi, h. ghozzi, o. boudawara, m. droguet, c. magne, m. nasri, k. m. zeghal, a. hakim and i. b.amara. “potassium bromate-induced changes in the adult mouse cerebellum are ameliorated by vanillin”. biomed environ science, vol. 32, no. 2, pp. 115-125, 2018. [5] l. a. alli, m. m. nwegbu, b. i. inyang, k. c. nwachukwu, j. o. ogedengbe, o. onaadepo, m. a. jamda, g. a. akintan, s. o. ibrahim and e. a. onifade. “determination of potassium bromate content in selected bread samples in gwagwalada, abuja-nigeria”. international journal of health and nutrition, vol. 4, no. 1, pp. 1520, 2013. [6] a. tewari and a. khurana. “potassium bromate/iodate in bread and bakery products”. centre for science and environment, pp. 1-12, 2016. available from: https://www.cseindia.org. [last accessed on 2022 may 06]. [7] a. m. magomya, g. g. yebpella, u. c. okpaegbe and p. c. nwunujui. “analysis of potassium bromate in bread and flour samples sold in jalingo metropolis, northern nigeria. journal of environmental science, vol. 14, no. 2, pp. 1-5, 2020. [8] n. a. ugochukwu, o. elechi and e. a. ozioma. “determination of bromate content of selected bread brands consumed within port harcourt and its environs”. chemistry research journal, vol. 4, no. 3, pp. 86-91, 2019. [9] a. u. naze, o. a. epete and e.owhoeke. “bromate content in thirty different brands of bread baked in port harcourt metropolis rivers state, nigeria”. journal of applied sciences and environmental management, vol. 22, no. 8, p. 1321, 2018. [10] m. el ati-hellal, r. doggui, y. krifa and j. el ati. “potassium bromate as a food additive: a case study of tunisian breads”. environmental science and pollution research, vol. 25, pp. 27022706, 2018. [11] j. el harti, y. rahali, m. ansar, h. benziane, j. lamsaouri, m. o. b. idrissi, m. draoui, a. zahidi and j. taoufik. “a simple and rapid method for spectrophotometric determination of bromate in bread”. journal of materials and environmental science, vol. 2, no. 1, pp. 71-76, 2011. [12] a. uwague and o. c. oghenekohwoyan. “investigation into the health danger of potassium bromate in bread consumed in sapele town, delta state”. international journal of modern engineering research, vol. 7, no. 9, pp. 1-3, 2017. [13] s. a. narmin and a. h media. “spectrophotometric determination of bromate in bread by the oxidation of dyes”. kirkuk university journal-scientific studies,vol. 4, no. 1, pp 31-39, 2009. [14] n. s. abdulla and m. a. hassan. “spectrophotometric determination of bromate in bread by the oxidation of dyes”. journal of kirkuk university-scientific studies, vol. 4, no. 1, pp. 31-39, 2009. [15] a. s. ekop, i. b. obot and e. n. ikpatt. “anti-nutritional factors and potassium bromate content in bread and flour samples in uyo metropolis, nigeria”. e-journal of chemistry, vol. 5, no. 4, pp. 736741, 2008. [16] a. haleem. 
“cytogenetic effects of potassium bromate kbro3 associated with iraqi baking industry cytogenetic effects of potassium bromate kbro3 associated with iraqi baking industry”. indian journal of applied research, vol. 4, no. 6, pp. 10-12, 2015. [17] s. a. zainab and r. f. ghadhban. “effect of potassium bromate on some hematological and biochemical parameters and protective role of vitamin c on laboratory rats (rattus rattus). annals of the romanian society for cell biology, vol. 25, no. 2, pp. 669-674. [18] s. s. mahmud, m. moni, a. b. imran and t. foyez. “analysis of the suspected cancer causing potassium bromate additive in bread samples available on the market in and around dhaka city in bangladesh”. food science and nutrition, vol. 9, pp. 3752-3757, 2021. [19] h. i. kelle, v. u. oguezi and i. p. udeozo. “qualitative and spectrophotometric determination of potassium bromate in bread samples sold in asaba, delta state, nigeria”. pakistan journal of chemistry, vol. 5, no. 2, pp. 93-95, 2015. tx_1~abs:at/tx_2:abs~at 24 uhd journal of science and technology | july 2022 | vol 6 | issue 2 1. introduction montelukast sodium (mtk) which has the following chemical structure (fig. 1) is considered as a good alternative to corticosteroid inhaler in treating asthma and rhinitis since it has fewer side effects [1]. mtk mechanism of action is by blocking the action of cyslt receptor type 1 in respiratory system that results in relaxing smooth muscle and decreasing inflammation. mkt hydrophobic acidic drug that has water solubility about 0.2–0.5 µg/ml at room temperature; therefore, it is considered as class ii compound according to biopharmaceutic category system [2]. montelukast base solubility enhanced through salt formation as sodium salt of montelukast mtk. mtk possess acidic lipophilic property with a pka between 2.7 and 5.8 and logp 8.79 which make it soluble in higher ph media [2]. mtk is available as a tablet dosage form under the brand name of singulair for both adult and children from age 6 months and older with no detected adverse effect [3]. different methods have been studied to determine amount of mtk in its dosage form such as capillary electrophoresis [4], cyclic voltammetry [5], high performance liquid chromatography (hplc) with florescence detection [6], and hplc with ultraviolet (uv) detection [7], this study develop simple, specific, accurate, and precise method by uv-spectrophotometry and validate it according to international council for harmonisation (ich) guideline, and evaluate this new method with previously published method that has the same way of determination. 2. materials and methods 2.1. instrumentation for deter mination uv double beam (spekol 2000, analytikjena, canada) with two identical 1 cm quartzes cell newly simple quantitative determination of montelukast sodium by ultraviolet-spectrophotometry dlivan fattah aziz1, yehia ismail khalil2 1department of pharmaceutics, college of pharmacy, university of sulaimani, sulaimanyia, iraq, 2department of pharmaceutics, college of pharmacy, university of baghdad, baghdad , iraq a b s t r a c t montelukast sodium is well known pharmaceutically for its action as leukotriene antagonist and reliving symptoms associated with asthma is available in the market as tablet, chewable tablet, and powder. the aim of this study was to develop newly simple selective ultraviolet spectrophotometry (uv) method for daily routine analysis of quality control department. the uv method was developed with wavelength at 287.0 nm. 
this newly developed method was effectively applied to tablet dosage form of the motelukast sodium follow the beer’s lamberts at range 2.5–50 µg/ml. the validated parameters were carryout such as linearity, accuracy, precision, and specificity. the result of validation statistically studied and found to be satisfactory. index terms: ultraviolet, montelukast, determination, validation, quantitative, method corresponding author’s e-mail:  dlivan fattah aziz, department of pharmaceutics, college of pharmacy, university of sulaimani, sulaimanyia, iraq. e-mail : dlivan.aziz@univsul.edu.iq received: 03-04-2022 accepted: 01-08-2022 published: 15-08-2022 access this article online doi: 10.21928/uhdjst.v6n2y2022.pp24-28 e-issn: 2521-4217 p-issn: 2521-4209 copyright © 2022 aziz and khalil. this is an open access article distributed under the creative commons attribution non-commercial no derivatives license 4.0 (cc by-nc-nd 4.0) o r i g i n a l re se a rc h a rt i c l e uhd journal of science and technology aziz and khalil: quantitative determination of montelukast sodium by uv uhd journal of science and technology | july 2022 | vol 6 | issue 2 25 was used, all materials were weighed by electronic sensitive balance (sartorius, germany), water bath sonicator (starsonic, italy) used to aid dissolving solute in solvent during solution preparation. 2.2. materials pure mtk, lactose monohydrate, magnesium stearate, microcrystalline cellulose, and croscar melose sodium were kindly provided by pioneer for pharmaceutical industry, iraq. ethanol 96% was purchased from merck, germany. 2.3. standard stock solution and calibration curve solution preparation stock solution of 100 µg ml-1 of active pharmaceutical ingredient (api) was prepared by dissolving 25 mg of api into 250 ml of diluent (water: ethanol) (1:1 v/v) sonicated for few minutes. series solutions of the following concentration were prepared from stock solution in same the diluent (2.5, 5, 10, 15, 25, and 50 µg/ml). each solution was read triplicate and average of each sample was put into linear graph. 2.4. method development different media for dissolving mtk were evaluated to choose best solvent for api depending on solubility of mtk, stability, cost, selectivity, and toxicity. first water used as solvent then ethanol was added gradually till found that ethanol with water by 1:1 (v/v) will give clear solution. the prepared standard solution scanned and found that best absorption would be at 287.0 nm. 2.5. stability stability of mtk solution of calibration was determined at room temperature in day light condition for a period of 24 h by observing change in absorbance at the same wavelength. 2.6. analytical validation ich (q2) r guideline of validation of analytical procedure was applied for validating developed method as the following: 2.6.1. precision both interday and intraday of precision were analyzed with median concentration of api. intraday precision was completed by evaluating the median concentration of the mtk at the fig. 1. chemical structure of montelukast sodium. 
same day, while interday precision was studied over consecutive days for the same concentration, repeated six times. evaluating the precision of an analytical procedure provides statistical data on the random (unsystematic) error. it expresses the agreement between a series of measurements obtained from multiple samplings of the same homogeneous sample under the prescribed conditions. the percentage relative standard deviation (% rsd) values were studied; a low % rsd value indicates that the analytical procedure is precise. according to the ich guideline, the % rsd for the precision study should be <2% (interday precision) to confirm good precision of the method.
2.6.2. recovery
recovery was determined by adding known quantities of standard mtk, equivalent to 75, 100, and 125% of the nominal linear concentration, to placebo. the samples were read three times, and the percentage amount of api was calculated at each level.
2.6.3. linearity
the calibration curve for the standard mtk solution was obtained in the range from 2.5 µg/ml to 50 µg/ml. the peak absorbance of each solution is plotted against the respective concentration, and linear regression analysis should give a correlation coefficient higher than 0.999 to confirm an excellent relationship between the absorbance and the concentration of the samples, i.e., that the method has a linear response.
2.7. statistical analysis
basic statistical measures such as the mean, standard deviation, and rsd% were calculated using microsoft excel.
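the linearity criterion above reduces to a least-squares fit of absorbance against concentration. a minimal sketch, assuming the averaged absorbances reported in table 1 (section 3.2) are entered manually; numpy is used here in place of excel, and names are illustrative:

import numpy as np

# nominal concentrations (µg/ml) and average absorbances from table 1
conc = np.array([2.5, 5.0, 10.0, 15.0, 25.0, 50.0])
absorbance = np.array([0.11387, 0.2449, 0.4784, 0.7093, 1.22653, 2.3433])

# least-squares line: absorbance = slope * conc + intercept
slope, intercept = np.polyfit(conc, absorbance, 1)

# correlation coefficient r and r^2 for the linearity criterion (> 0.999)
r = np.corrcoef(conc, absorbance)[0, 1]
print(f"slope={slope:.5f}, intercept={intercept:.5f}, r={r:.5f}, r2={r * r:.4f}")

# back-calculate the concentration of an unknown sample from its absorbance
def concentration(abs_reading):
    return (abs_reading - intercept) / slope

print(concentration(0.4784))  # should come out close to 10 µg/ml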
3. results and discussion
this rapid technique for the determination of mtk is useful in drug analysis, especially in the pharmaceutical industry where time matters; using hplc for the determination of mtk is time consuming and requires effort and high cost. choosing the best solvent for preparing the mtk solution was somewhat challenging in this study. different solvents were tried, and it was found that an equal-volume mixture of water and ethanol gives a clear solution; it is cheap, available in almost every laboratory, and easy to use. this diluent distinguishes this study from previous studies, in most of which methanol 100% [8]-[10], methanol 50% [11], [12], methanol with 0.1 n naoh [13], chloroform [14], or ph 7.4 phosphate buffer with 0.5% sodium lauryl sulfate [15] was used during preparation of the mtk solution. different wavelengths have been suggested for reading the mtk standard solution in different articles, such as 344.4 [11], 359 [8], 344.3 [10], 283 [9], 287.3 [15], 286.5 [13], and 280 nm [12], but in the present research, when the standard solution of mtk in an equal mixed volume of ethanol and water was scanned by uv between 400 and 200 nm at 0.2 nm intervals to find the characteristic peak, it was found that the best absorbance is at 287.0 nm at a nominal concentration of 10–15 µg/ml. the suggested method was validated according to the ich guideline in the following aspects.
3.1. specificity and selectivity
an mtk solution of concentration 10 µg/ml in diluent was prepared both alone and mixed separately with common excipients (lactose monohydrate, magnesium stearate, microcrystalline cellulose, and croscarmellose sodium) to check the interference of these excipients with the api. both solutions were scanned at wavelengths between 400 and 200 nm. the method was specific and selective, and there was no interference in the readings between the api and the excipients.
3.2. linearity
for linearity, six different concentrations were prepared, from a lower concentration of 2.5 µg/ml to a higher concentration of 50.0 µg/ml. each concentration was read three times, as shown in table 1. a linear relationship was observed between the concentrations of mtk and the mean absorbance readings at each point, as is clear in fig. 2; the determination coefficient (r2) equals 0.999.

table 1: different concentrations of mtk solution at different levels with the corresponding absorbances (linearity, sample at zero time)
level   conc. (µg/ml)   absorbance readings       average
50%     2.50            0.1136, 0.114, 0.114      0.1138666667
75%     5.00            0.2447, 0.245, 0.245      0.2449
100%    10.00           0.4784, 0.4784, 0.4784    0.4784
125%    15.00           0.7093, 0.7093, 0.7093    0.7093
150%    25.00           1.225, 1.2273, 1.2273     1.226533333
200%    50.00           2.3433, 2.3433, 2.3433    2.3433
regression of average absorbance on concentration: slope = 0.046999, intercept = 0.010643, r = 0.999677, r2 = 0.9994

fig. 2. calibration curve of mtk at different concentrations.
3.3. precision
the method was evaluated to confirm its precision by the repeatability of analyzing six samples. the samples were prepared, and the percentage of the label claim of api in each sample was statistically evaluated. the results are shown in table 2. the results were accepted according to the acceptance criteria: for assay values obtained by a single analyst, the %rsd should be <2.0%, while the %rsd for two analysts performing the same samples on both days should not be more than 3.0%.

table 2: precision of the method (n=6)
parameter (amount obtained by the proposed method, mg)   interday   intraday
mean                                                     10.13      10.44
sd                                                       0.00923    0.0105
rsd%                                                     0.91%      1.01%

3.4. recovery
the accuracy and recovery of the assay were demonstrated by analyzing data obtained from standard addition into placebo solution at three levels. the recovery of each sample was determined as a percentage at each level, and the % rsd was calculated; a value <2.0% shows good accuracy of the method, as is clear in table 3. according to the ich guidelines, good recovery of the api should lie within the range of 98–102%, which means the percentage recovery of api added to placebo should be in the range of 100 ± 2.0% for the average of the three weighed samples at each level.

table 3: accuracy of mtk
standard data: conc. 10.000 µg/ml; areas 0.47840, 0.47840, 0.47840; average area 0.4784; sd 0.00; rsd 0.0%
recovery of mtk:
level   quantity spiked (µg/ml)   area      quantity recovered (µg/ml)   % recovery   average recovery (%)   %rsd
75%     5.00                      0.2395    5.01                         100.13       100.25                 0.11
                                  0.2400    5.02                         100.33
                                  0.2399    5.01                         100.29
100%    10.00                     0.4760    9.95                         99.50        99.45                  0.04
                                  0.4757    9.94                         99.44
                                  0.4756    9.94                         99.41
125%    15.00                     0.7105    14.85                        99.01        99.04                  0.04
                                  0.7110    14.86                        99.08
                                  0.7107    14.86                        99.04
overall average recovery: 99.58%

3.5. stability
the prepared solutions were stored at room temperature (25°c) and analyzed after 24 h, and it was found that mtk is stable in the diluent for this period of analysis.
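the precision and recovery figures above are simple summary statistics. one reading of the numbers in table 3 is that the recovered amounts were obtained by single-point comparison against the 100% standard; the sketch below follows that reading and is only illustrative:

import statistics

# percentage relative standard deviation, the acceptance metric used above (< 2%)
def rsd_percent(values):
    return 100 * statistics.stdev(values) / statistics.mean(values)

# recovery at the 75% spike level, using the replicate areas quoted in table 3
std_conc, std_area = 10.0, 0.4784   # the 100% standard
spiked = 5.00                        # µg/ml of mtk added to placebo
areas = [0.2395, 0.2400, 0.2399]

recovered = [std_conc * a / std_area for a in areas]
recoveries = [100 * r / spiked for r in recovered]

print([round(r, 2) for r in recoveries])       # individual % recoveries
print(round(statistics.mean(recoveries), 2))   # average recovery, close to 100%
print(round(rsd_percent(recoveries), 2))       # % rsd, well below 2%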
4. conclusion
the validated uv method for mtk determination indicated that the method is linear, accurate, rapid, and specific. the simplicity of the method allows it to be used in laboratories that have only simple equipment and lack hplc, liquid chromatography-mass spectrometry, or ultra-performance liquid chromatography, especially for repetitive analysis of mtk in pharmaceutical dosage forms or during the development of a new dosage form of mtk. the current method is also useful for the quantitative determination of mtk in the quality control department of the pharmaceutical industry.
5. acknowledgment
the authors would like to thank pioneer pharmaceutical industry for providing facilities.
references
[1] n. kittana, s. hattab, a. ziyadeh-isleem, n. jaradat and a. n. zaid. "montelukast, current indications and prospective future applications". expert review of respiratory medicine, vol. 10, pp. 943-956, 2016.
[2] s. sawatdee, t. nakpheng, b. t. w. yi, b. t. y. shen, s. nallamolu and t. srichana. "formulation development and in-vitro evaluation of montelukast sodium pressurized metered dose inhaler". journal of drug delivery science and technology, vol. 56, pp. 101534, 2020.
[3] c. cingi, n. b. muluk, k. ipci and e. şahin. "antileukotrienes in upper airway inflammatory diseases". current allergy and asthma reports, vol. 15, pp. 1-11, 2015.
[4] y. shakalisava and f. regan. "determination of montelukast sodium by capillary electrophoresis". journal of separation science, vol. 31, pp. 1137-1143, 2008.
[5] i. alsarra, m. al-omar, e. a. gadkariem and f. belal. "voltammetric determination of montelukast sodium in dosage forms and human plasma". il farmaco, vol. 60, pp. 563-567, 2005.
[6] h. ochiai, n. uchiyama, t. takano, k. i. hara and t. kamei. "determination of montelukast sodium in human plasma by column-switching high-performance liquid chromatography with fluorescence detection". journal of chromatography b: biomedical sciences and applications, vol. 713, pp. 409-414, 1998.
[7] a. k. shakya, t. a. arafat, n. m. hakooz, a. n. abuawwad, h. al-hroub and m. melhim. "high-performance liquid chromatographic determination of montelukast sodium in human plasma: application to bioequivalence study". acta chromatographica, vol. 26, pp. 457-472, 2014.
[8] s. s. patil, s. atul, s. bavaskar, s. n. mandrupkar, p. n. dhabale and b. s. kuchekar. "development and statistical validation of spectrophatometry method for estimation of montelukast in bulk and tablet dosage form". journal of pharmacy research, vol. 2, pp. 714-716, 2009.
[9] m. arayne, n. sultana and f. hussain. "spectrophotometric method for quantitative determination of montelukast in bulk, pharmaceutical formulations and human serum". journal of analytical chemistry, vol. 64, pp. 690-695, 2009.
[10] p. v. adsule, k. sisodiya, a. g. swami, v. p. choudhari and b. s. kuchekar. "development and validation of uv spectrophotometric methods for estimation of montelukast sodium in bulk and pharmaceutical formulation". international journal of pharmaceutical sciences review and research, vol. 12, pp. 106-108, 2012.
[11] w. badulla and g. arli. "comparative study for direct evaluation of montelukast sodium in tablet dosage form by multiple analytical methodologies". revue roumaine de chimie, vol. 62, pp. 173-179, 2017.
[12] s. muralidharan, l. j. qi, l. t. yi, n. kaur, s. parasuraman, j. kumar and p. v. raj. "newly developed and validated method of montelukast sodium estimation in tablet dosage form by ultraviolet spectroscopy and reverse phase-high performance liquid chromatography". ptb reports, vol. 2, pp. 27-30, 2016.
[13] k. singh, p. bagga, p. shakya, a. kumar, m. khalid, j. akhtar and m. arif. "validated uv spectroscopic method for estimation of montelukast sodium". international journal of pharmaceutical sciences and research, vol. 6, pp. 4728-4732, 2015.
[14] s. r. bhagade. "spectrophotometric estimation of montelukast from bulk drug and tablet dosage form".
international journal of pharmaceutical sciences and research, vol. 4, pp. 4432, 2013. [15] k. pallavi and s. babu. “validated uv spectroscopic method for estimation of montelukast sodium from bulk and tablet formulations”. international journal of advances in pharmacy, biology and chemistry, vol. 1, pp. 450-453, 2012. ole_link2 _goback . 38 uhd journal of science and technology | april 2017 | vol 1 | issue 1 1. introduction biosensors can sense single molecule through using nanopores. they may sense unlabeled biopolymers such as dna and rna and single proteins. the sensing takes place when ion currents reduced largely due to blocking pores by passing molecules [1]. the nanopore diameter is very important for sensing molecules. the main step for finding that is through using segmenting of them from scanning electron microscope (sem) image. image segmentation defined as dividing images into multiple parts that have homogeneity in pixel intensity, color, or texture [2]. one of the simple, sometimes useful, segmenting methods is threshold technique. however, it is time-consuming for its strategy based on trial and error method, and for sometimes, a single threshold value does not work well, especially, for a series of image frames of video data. akhtaruzzaman et al. [3] used an automated threshold detection on a video which is a series of image frames of human walking to segment human lower limbs. they applied automated threshold detection to convert the image frames into grayscale image, line fill algorithm to smoothing the edges of object, and remove background to get out the object. in general, image enhancing through denoising is an important previous step before segmenting objects. one of the denoising filters is bilateral filter which reduces noise with remaining sharp edges of the objects. besides, nguyen et al. [4] denoised specific artifacts and segmented the full body bone structure by employing 3d bilateral filter and 3d graph-cut, respectively. on the other hand, sahadevan et al. [5] increased the accuracy of super vector machine classifier using a bilateral filter which merges spatial contextual information to spectral domain. estimation of nanopore size using image processing haidar jalal ismail, azeez abdullah azeez barzinjy, and kadhim qasim jabbar department of physics, college of education, salahaddin university-erbil, zanko, erbil, iraq a b s t r a c t nanopores, which are nanometer-sized holes, have been utilized in apparatus that point toward sensing a range of molecules such as dna and rna and single proteins the important factor for sensing molecules is diameters of nanopores which can be found through a substantial process called segmenting for nanopores of scanning electron microscope (sem) images. in this investigation, four segmentation methods, namely, threshold, bilateral filter, k-means, and expectation maximizationgaussian mixture model (em-gmm) which has been utilized to segment three sem images of nanopores efficiently. the quality of segmentation evaluated objectively through computing rand index among them. consequently, the nanopore size of al 2 o 3 films computed by means of sem images. this study found that em-gmm segmenting method gives promising results among other examined methods. it is for their high r-index, minimum adjustment parameters (just one variable which set usually 2), and low consuming time. hence, it can be used efficiently for computing nanopore count and size. 
index terms: feature extraction, image segmentation, nanopore, segmentation evaluation corresponding author’s e-mail: haidar.ismail@su.edu.krd received: 10-03-2017 accepted: 25-03-2017 published: 12-04-2017 access this article online doi: 10.21928/uhdjst.v1n1y2017.pp38-44 e-issn: 2521-4217 p-issn: 2521-4209 copyright © 2017 ismail, et al. this is an open access article distributed under the creative commons attribution non-commercial no derivatives license 4.0 (cc by-nc-nd 4.0) o r i g i n a l re se a rc h a rt i c l e uhd journal of science and technology haidar jalal ismail, et al.: estimation of nano-pore size uhd journal of science and technology | april 2017 | vol 1 | issue 1 39 another method of segmentation is k-means which put the image into multi cluster of pixels according to factors such as their intensities. chen et al. [6] propose a semiautomatic segmentation method, using k-means, to determine object’s mean temperature and variance through segmenting contours of thermal images taken by the optical camera. fu and wang [7] applied expectation maximizationgaussian mixture model (em-gmm) on color images to segmenting them, and their results approve the power of it. the em-gmm and fuzzy-c-means (fcm) methods are widely used in image segmentation. however, they have a major drawback for their sensitivity to the noise. kalti and mahjoub [8] proposed a variant of these methods to resolve this problem. their results showed improvement compare to standard version of em-gmm and fcm. several researches work to find geometrical structures of nanopores. alexander et al. [9] computed nanopore size, perimeter, and some other geometric features using histogram equalization, morphological, and statistical operations. in another work, that done by phromsuwan et al. [10], size of nanopores of sem images obtained through using morphological and canny edge detector techniques. parashuram and vidyasagar [11] used morphological and global thresholding for obtaining nanopore diameter and statistical features. same authors with muralidhara [12] using same operations to obtain perimeter of the nanopores. it can be realized that all above methods using methods that need trial and error parameters to give proper results. this work aims to find semiautomatic algorithm to find diameter of nanopores of sem images through examine four segmenting techniques. the performance will be evaluated objectively, and the average of nanopore’s diameter will be computed. 2. materials and methods three sem images of the nanopores anodic alumina film [13] used in this study for segmenting by our segmenting techniques and compute their diameters and number of pores. they consist of three sem images with different widening times, namely, 0, 10, and 20 minutes as shown in fig. 1. the images segmented by four methods. the simple method is thresholding method that used here as ground truth images for objective evaluating of other segmenting methods. the second and third segmenting methods utilize bilateral filter, k-means, as the first step and using region selector as the second step. the fourth one is segmenting images using em-gmm (fig. 2). thresholding is a technique of selecting optimum gray level value which separates the region of interest from other regions. thresholding produced binary images from gray level by making pixels lower or greater than a gray level value to zero and other remaining pixels to one. 
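a minimal sketch of this fixed-threshold segmentation, assuming the sem image is already available as a grayscale numpy array; the threshold value is illustrative, and equation (1) below states the same rule formally:

import numpy as np

def threshold_segment(gray_image, t=128):
    # pixels brighter than t become 1 (foreground), the rest become 0
    return (gray_image > t).astype(np.uint8)

# example on a tiny synthetic image; a real sem frame would be loaded from disk
demo = np.array([[10, 200], [150, 90]], dtype=np.uint8)
print(threshold_segment(demo, t=100))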
if g(x, y) is the thresholded output of an input image f(x, y) at a specific gray-level threshold t, it can be described as follows [14]:
g(x, y) = 1 if f(x, y) > t, and 0 otherwise (1)
the k-means method divides pixels into a number of separate clusters. its algorithm consists of two steps: first, it finds k centroids (k is the number of clusters) for the pixels of the image, and second, it relates each pixel to a centroid using some measure of the distance between them. the euclidean distance may be used to measure this distance, defined as follows:
d = ||p(x, y) − c_k|| (2)
where p(x, y) is an input pixel to be clustered and c_k is the cluster center. after grouping the pixels into k sets (i.e., clusters), new euclidean distances are evaluated between each center and the pixels, and each pixel is assigned to the cluster with the minimum euclidean distance [15].
bilateral filtering is a technique for smoothing an image while preserving sharp edges. it is obtained by applying one gaussian filter in the spatial domain and another one in the intensity domain. the filter output at pixel s is given by the following equation:
j(s) = (1/k(s)) ∑_{p ∈ ω} f(p − s) g(i_p − i_s) i_p (3)
where k(s) is the normalization term:
k(s) = ∑_{p ∈ ω} f(p − s) g(i_p − i_s) (4)
where f and g are gaussians in the spatial domain and in the intensity domain (the range filter), respectively [16].
the region selector method uses the roicolor command in matlab, which selects the wanted region according to color or intensity levels in a grayscale image.
fig. 1. four segmentation methods for three scanning electron microscope images with pore widening times: (a) 0 min, (b) 10 min, and (c) 20 min (scale bar = 500 nm) [13]
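a minimal sketch of the bilateral-filter and k-means steps described above, assuming opencv and scikit-learn are available and that the sem image is loaded from a file path of your choosing; the parameter values are illustrative, not the ones used in the paper:

import cv2
import numpy as np
from sklearn.cluster import KMeans

# load the sem image as grayscale (the path is a placeholder)
img = cv2.imread("sem_nanopores.png", cv2.IMREAD_GRAYSCALE)

# edge-preserving smoothing: gaussian weights in space and in intensity (eqs. 3-4)
smoothed = cv2.bilateralFilter(img, d=9, sigmaColor=50, sigmaSpace=50)

# k-means on pixel intensities with k=2 (pores vs. film), cf. eq. 2
pixels = smoothed.reshape(-1, 1).astype(np.float32)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(pixels)
segmented = labels.reshape(img.shape)

# keep the darker cluster as the pore regions (plays the role of the region selector)
pore_cluster = np.argmin([pixels[labels == c].mean() for c in range(2)])
pore_mask = (segmented == pore_cluster).astype(np.uint8) * 255
cv2.imwrite("pore_mask.png", pore_mask)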
the last probability for expectation stage of em obtained as follows: p |x n x n x t k n k n k j= k j n j θ θ θ ( ) = ( ) ( )∑ π | | 1 ππ (9) in the maximization step of em, the parameters (π k , µ k , and σ k ) are changed iteratively through the following formulas: πk t+1 n n t k n n n n t k n p x x p |x = ( ) ( ) = = ∑ ∑ 1 1 θθ θθ | (10) σσk t n n t k n n k n k t n n t k n p |x �(x -m )(x -m ) p x + = = = ( ) ( ) ∑ ∑ 1 1 1   | (11) πk t+1 n n t k np x n = ( ) =∑ 1 θθ | (12) where t denotes the iteration value. the loop is stopped in the convergence condition. the value from equation 9 for maximum posterior criterion used to get the class label for each pixel [17]. a. rand index the rand index, which founded by william rand, used for the comparison of two arbitrary segmentations using pairwise label relationships. it obtained by division of the number of pixel pairs that have the same label relationship in both segmentations. the n uv is the number of points labeled u in s and that labeled v in s’. the labeled points u in the first part of s, and labeled points v in second part s’ are termed as n u■ and n ■v , respectively. afterward: n u■ =∑ v n uv n ■v =∑ u n uv (13) clearly ∑ u n u■ =∑ v n u■ =n is the entire data points. hence, the rand index is as follows: r s,s n n n n(n-1) /2 ' u u 2 v u 2 u,v uv 2 ( ) = − +( )−∑ ∑ ∑ 1 1 2 � � � (14) the r-index is 1 when both segmentations have total similarities and 0 for zero ones. this type of similarity measurements takes small running time when unique labels in s and s’ are smaller than total data numbers n [18]. 3. results and discussion all three sem images segmented using threshold technique obtained after a large number of trial and error for optimize threshold intensity pixel value, morphological operation, and removing small objects. they considered as ground truth images through visual perception to objective evaluating other segmenting methods (fig. 1). the results of other segmenting methods are shown in same figure too. fig. 1a shows sem that suffers from some noise effect. the wiener filter and adaptive histogram equalization used fig. 2. image processing steps for different segmentation methods haidar jalal ismail, et al.: estimation of nano-pore size 42 uhd journal of science and technology | april 2017 | vol 1 | issue 1 fig. 3. (a-c) total counting nanopores and distribution of nanopore sizes for threshold segmenting (ground truth image) and expectation maximization-gaussian mixture model (higher r index) for all three scanning electron microscope types a cb haidar jalal ismail, et al.: estimation of nano-pore size uhd journal of science and technology | april 2017 | vol 1 | issue 1 43 for denoising and contrast enhancement before segmenting by threshold technique. nevertheless, it still effects on segmenting by other methods. fig. 1b and c show good segmenting for all segmenting methods. fig. 3 presents all three images that segment by threshold, ground truth image, and em-gmm, higher r index, that number labeled each pore. furthermore, the distribution of pore size which mentioned showed in same figure. they can be fitted mainly as gaussian distribution as appear charts of fig. 3 and that in agreement with what in results of macias et al. [13]. the time consuming for running code, rand index, number, and diameter of nanopores for segmenting methods and for all three sem images presented in table i. the obtained results for diameter of nanopores are in a good agreement with macias et al. 
3. results and discussion
all three sem images were segmented using the threshold technique, obtained after a large number of trial-and-error runs to optimize the threshold pixel intensity value, the morphological operations, and the removal of small objects. they were considered ground-truth images, through visual perception, for objectively evaluating the other segmenting methods (fig. 1). the results of the other segmenting methods are shown in the same figure. fig. 1a shows an sem image that suffers from some noise. the wiener filter and adaptive histogram equalization were used for denoising and contrast enhancement before segmenting by the threshold technique; nevertheless, the noise still affects segmentation by the other methods. fig. 1b and c show good segmentation for all segmenting methods. fig. 3 presents all three images segmented by threshold (the ground-truth image) and by em-gmm (the higher r index), with a number labeling each pore. furthermore, the distributions of pore size are shown in the same figure; they can be fitted mainly by a gaussian distribution, as the charts of fig. 3 show, in agreement with the results of macias et al. [13]. the time consumed in running the code, the rand index, and the number and diameter of nanopores for the segmenting methods and for all three sem images are presented in table i. the obtained results for the diameter of the nanopores are in good agreement with the results of macias et al. [13] for threshold segmenting and are smaller for em-gmm segmenting. the method of analyzing the sem images in the mentioned reference is unknown. the em-gmm, being a semiautomatic method with a high r index and relatively small time consumption, is better than the other segmenting methods studied here.

fig. 2. image processing steps for different segmentation methods.
fig. 3. (a-c) total counted nanopores and distribution of nanopore sizes for threshold segmenting (ground-truth image) and expectation maximization-gaussian mixture model (higher r index) for all three scanning electron microscope images.

table i: time consumed, rand index, nanopore diameter, and nanopore counts for all sem images segmented by the four techniques
image   measure         threshold    bilateral filter + region selector   k-means + region selector   em-gmm
a       time (s)        1.31         29.21                                15.51                       19.96
a       rand index      -            0.781                                0.757                       0.91
a       diameter (nm)   36.7±10.4    -                                    -                           39.3±19.5
a       pore count      51           -                                    -                           60
b       time (s)        1.28         30.44                                20.07                       11.24
b       rand index      -            0.599                                0.592                       0.870
b       diameter (nm)   54.2±11.9    -                                    -                           46.6±9.1
b       pore count      81           -                                    -                           83
c       time (s)        1.36         30.35                                14.79                       13.68
c       rand index      -            0.503                                0.503                       0.755
c       diameter (nm)   71.8±15.9    -                                    -                           59.8±12.7
c       pore count      83           -                                    -                           84
sem: scanning electron microscope, em-gmm: expectation maximization-gaussian mixture model

4. conclusion
four different segmenting methods were applied to three sem images with various pore-widening times of 0, 10, and 20 minutes. it can be noticed that threshold segmenting gives good results, but it needs a large number of trial-and-error runs for choosing the optimum threshold pixel intensity, and it also needs morphological operations and the removal of small objects. the authors also concluded that em-gmm is superior to the bilateral filter and to k-means with region selector, since it has a higher r index than them; consequently, its segmenting results were used for pore counting and for computing pore diameters. likewise, it has a relatively small running time. accordingly, em-gmm can be used effectively for segmenting sem images and finding the number of pores and their diameters.
5. acknowledgment
the authors would like to express their acknowledgment to salahaddin university for support with the available tools.
references
[1] c. raillon, p. granjon, m. graf, l. j. steinbock and a. radenovic. "fast and automatic processing of multi-level events in nanopore translocation experiments." nanoscale, vol. 4, no. 16, pp. 4916, 2012.
[2] m. ahmed, s. abd el-attysoliman and j. adamkani. "performance study of innovative and advanced image segmentation techniques." singaporean journal of scientific research, vol. 7, no. 12015, pp. 320-326, 2014.
[3] m. akhtaruzzaman, a. a. shafie and r. khan. "automated threshold detection for object segmentation in colour image." arpn journal of engineering and applied sciences, vol. 11, no. 6, pp. 4100-4104, 2016.
[4] c. nguyen, j. havlicek, q. duong, s. vesely, r. gress, l. lindenberg, p. choyke, j. h. chakrabarty and k. williamset. "an automatic 3d ct/pet segmentation framework for bone marrow proliferation assessment." 2016 ieee international conference on image processing (icip), phoenix, az, 2016, pp. 4126-4130.
[5] a. s. sahadevan, a. routray, b. s. das and s. ahmad. "hyperspectral image preprocessing with bilateral filter for improving the classification accuracy of support vector machines." journal of applied remote sensing, vol. 10, no. 2, pp. 025004, apr. 2016.
[6] y. y. chen, w. s. chen and h. s. ni. "image segmentation in thermal images." 2016 ieee international conference on industrial technology (icit), taipei, 2016, pp. 1507-1512.
[7] z. fu and l. wang.
“color  image segmentation using gaussian  mixture.” in multimedia and signal processing: second international conference, cmsp 2012, shanghai, china, 2012.   [8]  k.  kalti  and  m.  mahjoub.  “image  segmentation  by  gaussian  mixture models and modified fcm algorithm.”  the international arab journal of information technology, vol. 11, no. 1, pp. 11-18, 2014. [9] s. alexander, r. azencott, b. bodmann, a. bouamrani, c. chiappini, m. ferrari, x. liu and e. tasciotti. “sem image analysis for quality control of nanoparticles.” in computer analysis of images and patterns, springer, berlin, heidelberg, 2009. pp. 590-597. [10] u. phromsuwan, y. sirisathitkul, c. sirisathitkul, p. muneesawang and b. uyyanonvara. “quantitative analysis of x-ray lithographic pores by sem image processing.” mapan-journal of metrology society of india, vol. 28, no. 4, pp. 327-333, 2013. [11] p. bannigidad and c. vidyasagar. “effect of time on anodized al2o3 nanopore fesem images using digital image processing techniques: a study on computational chemistry.” international journal of emerging trends and technology in computer science (ijettcs), vol. 4, no. 3, pp. 15-22, 2015. [12]  c.  vidyasagar,  p.  bannigidad  and  h.  muralidhara.  “influence  of anodizing time on porosity of nanopore structures grown on flexible tlc aluminium films and analysis of images using matlab  software.” advanced materials letters, vol. 7, no. 1, pp. 71-77, 2016. [13] g. macias, j. ferré-borrull, j. pallarès and l. marsal. “effect of pore diameter in nanoporous anodic alumina optical biosensors.” the analyst, vol. 140, no. 14, pp. 4848-4854, 2015. [14] m. h. j. vala and a. baxi. “a review on otsu image segmentation algorithm.” international journal of advanced research in computer engineering and technology, vol. 2, no. 2, pp. 387-389, 2013. [15]  n. dhanachandra, k. manglem and y. chanu. “image segmentation  using  k-means  clustering  algorithm  and  subtractive  clustering  algorithm.” procedia computer science, vol. 54, pp. 764-771, 2015. [16]  s. agarwal and p. kumar. “denoising of a mixed noise color image  through special filter.” international journal of signal processing, image processing and pattern recognition, vol. 9, no. 1, pp. 159176, 2016. [17] t. xiong, l. zhang and z. yi. “double gaussian mixture model for image segmentation with spatial relationships.” journal of visual communication and image representation, vol. 34, pp. 135-145, 2016. [18] r. unnikrishnan and m. hebert. “measures of similarity.” application of computer vision, 2005. wacv/motions ‘05. vol. 1. seventh  ieee workshops on, breckenridge, co., 2005, pp. 394. tx_1~abs:at/tx_2:abs~at uhd journal of science and technology | july 2022 | vol 6 | issue 2 19 1. introduction semantic-web (as a recommender system) together with all the necessary tools and methods required for creation, maintenance, and application. in actual history, the semanticweb is usually future as an heightening of the present world wide web (www) or (3w) with machine-justifiable data (rather than a large portion of the ongoing web, which is generally focused on at human utilization), together with services – intelligent agents [1]. nevertheless, in our proposed system used to facilitate and improve human electronic recommendation management (herm) is mean that the current web became semantic web (sw) with recommender system (rs). the human consumption artificial intelligent (ai) modify to semantic-web recommender system (swrs). 
our contribution is sw instead of human consumption and rs instead ai and combine to swrs. however, our proposed system is sw with the cosine similarity is a method and part of content-based algorithm (cba) for filtering all titlemovie in dataset of movie-lens [2]. the resource description framework (rdf) suggest to graph-based data model, which became part of the semantic-web vision [3], the rdf in our proposed system is very necessity with a view to represent data that recommended the title-movie and store into the rdf file. the rdf is much more accurate than the ontology file due to: (1) easy to use, (2) easy to understand also, nd (3) accurate. apart from one parameter that used two parameter to enhance accuracy and execution consume-time. the ontology modified from only one parameter to two parameters in propose of semantic web recommender system over different operating platforms halo khalil sharif, kamaran hama ali. a. faraj department of computer science, college of science, sulaimani university, sulaimani, krg, iraq a b s t r a c t semantic-web recommender system (swrs) evaluation over different operating systems (oss) used to facilitate and improve human electronic recommendation management (herm). the herm is address the needs of user and dataset of movie in our proposed system through internetworking means which increase the speed of automated recommendation and enhance the goodness of swrs and services also electronically to select right movies-title to user demand. furthermore, it will be a benefit for selection a right favor by user for right selection from (i.e., 3000 records in dataset of movie-lens) in the backend. there are a direct relation between time-consume of selection movie-title, also the time-consume, and accuracy. the two-mentioned parameters, namely, time-consume and accuracy over two different operation system (oss) which designed by web technology python. in our research, swr system is proposed; it is provide with some recommendation methods. the system designed and improved using content-based algorithm (cba). investigational results indicate that the developed algorithm technique confident a reasonable performance such as accuracy and time consuming compared to other existing works with a testing average accuracy of 85.63 for windows and 88.35 for linux operating system. in conclusion, swrs investigated on two different operating platforms and could be seen that the linux is faster than windows in accuracy and time consuming. index terms: semantic web, e-recommender system, content-based, rdf, sparql, python corresponding author’s e-mail:  halo.sharif@univsul.edu.iq received: 14-07-2022 accepted: 28-07-2022 published: 10-08-2022 access this article online doi: 10.21928/uhdjst.v6n2y2022.pp21-19-24 e-issn: 2521-4217 p-issn: 2521-4209 copyright © 2022 sharif and faraj. this is an open access article distributed under the creative commons attribution non-commercial no derivatives license 4.0 (cc by-nc-nd 4.0) o r i g i n a l re se a rc h a rt i c l e uhd journal of science and technology halo khalil sharif and kamaran hamaali. a. faraj: semantic web recommender system 20 uhd journal of science and technology | july 2022 | vol 6 | issue 2 the system. content-based rs make suggestions that consider the users the ratings that users give to items according to their preferences and the content of the items (e.g., extracted keywords, title, pixels, and disk space). the content based algorithms with using the filtering technique is a main idea of our proposed system. 
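as noted above, the recommended movie-titles are represented in rdf and queried with sparql. a minimal sketch of how this could look with the rdflib package, assuming an illustrative namespace; the predicate names and file name are placeholders, not the ones used by the authors:

from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

EX = Namespace("http://example.org/movies/")  # illustrative namespace

g = Graph()
recommended = ["Toy Story", "A Bug's Life", "Monsters, Inc."]  # placeholder recommender output
for i, title in enumerate(recommended):
    movie = URIRef(EX[f"movie{i}"])
    g.add((movie, RDF.type, EX.RecommendedMovie))
    g.add((movie, EX.title, Literal(title)))

# store the recommendations as an rdf file
g.serialize(destination="recommendations.rdf", format="xml")

# query the graph back with sparql
results = g.query(
    "SELECT ?t WHERE { ?m a <http://example.org/movies/RecommendedMovie> ; "
    "<http://example.org/movies/title> ?t }"
)
for row in results:
    print(row.t)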
the training algorithm is start first for training all dataset to predict the movie-title that situated between limitation and after that, the test algorithm is start to filtering of training output. the activates are depend on training algorithm and test algorithms between the user’s demands and movie’s title (plus demands) to build the swrs decisions. semantic-web utilizes the resource description framework (rdf) and the simple-protocol and query/update languages (sparql) as uniform logical data illustration and handling models, permitting machines to straight interpret data from the web. as semantic-web, applications is growing progressively popular, new-fangled and stimulating threats of security arise [4], it is impossible to achieve our proposed system or any evaluation without rdf because in rdf is store and transfer data to web application through sparql. the two parameters that mentioned namely tag line and original title. nevertheless, the only used parameter is overview parameter used in cami et al. [5]. the contribution in our proposed system is two parameters. while the deployment of (www) and the internet was swiftly increasing, the recommendation outfits become electronic to support e-commerce (ec) business. usually, the concept of e-recommender is relevant with all kinds of digitalizes businesses and it uses three-tier architecture [6]. regardless of the fantastic measure of data that is accessible in the reality or on the web, it is difficult for the searcher to track down items or services that he may be interested in. decision-making is an essential part that the traditional and electronic recommendation should do. the vast amounts of digitally available candidate information denote a sizeable opportunity for improving matching quality and it leads to better web semantic recommendation performance [7]. this paper proposes a new procedure for recommending movietitles using a content-based filtering algorithm and generally used dataset (movielens). the whole of the paper is arranged as follows. section 2 places forward a literature review. section 3 shows a complete swrs for the recommending of movietitles, containing units like an outline of system architecture, movielens dataset description, data preprocessing, feature extraction, and performance metrics. section 4 discusses the experimental results achieved after applying different feature extractors and comparing them from different platforms with the existing methods. finally, section 5 deals with the conclusion of the work. 2. literature survey in semantic-web recommender system (swrs), techniques have conveyed exceptional outcomes; these techniques are regularly acted in the recommendation on movie-titles dataset. recently, various works were executed with the assistance of various content-based methodology to distinguish and predict of movie-titles. a short audit of a few significant contributions from the current literature is given. soumya prakash rana (2020) [8] proposed arrangement, health recommender systems (hrs) have arisen for patientsituated decision-making to suggest better medical care guidance in light of profile health records (phr) and patient data sets. the hrs can upgrade medical services frameworks and at the same time oversee patients experiencing a scope of various sicknesses utilizing prescient investigation and suggesting fitting therapies. 
a content-based recommender system (cbrs) is a tweaked hrs approach that focuses on the assessment of a patient’s set of experiences and “learns,” through ai (ml), to produce forecasts. moreover, cbrs plans to offer personalized and believed data to the patient’s with respect to their health status. donghui wang (2018) [9] they fostered a content-based diary and meeting recommender framework for software engineering and innovation. to the extent that, there is no comparative recommender system or distributed strategy like what they have presented here. besides, there was no dataset to utilize. hence, the web crawler has been intended to gathering information and creates preparing and testing informational indexes. then, unique component determination techniques and played out few trials used to choose a decent system and recreate include space. despite the fact that accomplishing 61.37% exactness for paper proposal. ibukun tolulope afolabi (2019) [10], in this examination, showed a semantic-web content digging approach for recommender frameworks in web based shopping. the strategy depends on two significant stages. the primary stage is the semantic preoperational of text-based information utilizing the blend of a created cosmology and a current metaphysics. the subsequent stage utilizes the naïve bayes calculation to make the proposals. the result of the framework is assessed utilizing accuracy, review and f-measure. carlos luis sanchez bocanegra (2017) [11] this shows the practicality of utilizing a semantic content-based recommender framework to enhance youtube health halo khalil sharif and kamaran hamaali. a. faraj: semantic web recommender system uhd journal of science and technology | july 2022 | vol 6 | issue 2 21 recordings. assessment with end-clients, notwithstanding medical services experts, will be expected to distinguish the acknowledgment of these suggestions in a no simulated data looking for setting. most of sites suggested by this framework for health recordings were pertinent, in view of evaluations by health experts. albatayneh (2018) [12], this examined to present an original proposal engineering that can prescribe intriguing post messages to the students in an e-learning on the web conversation gathering in view of a semantic contentbased separating and students’ negative appraisals. we assessed the planned e-learning recommender framework against leaving e-learning recommender frameworks that utilization comparable sifting methods concerning suggestion exactness and students’ exhibition. the got exploratory outcomes display that the suggested e-learning recommender framework beats other comparative e-learning recommender frameworks that utilization non-semantic content-based separating strategy (cb), non-semantic content-based sifting method with students’ negative appraisals (cb-nr), semantic content-based sifting procedure (scb), concerning framework precision of around 57%, 28%, and 25%, separately. 3. proposed methodology 3.1. system architecture tf-idf is used for the vectorization of the information and cosine similarity is utilized to compute the similarity measure between the vectors. tf-idf is normally used as a portion of content-based algorithm recommendations systems in proposed system. it contains of two positions: term-frequency (tf) and inverse-document-frequency (idf). tf deals-with the occurrence of interests and preferences in user profile. whereas, idf deals with inverse of the word frequency among the entire data provided by user profile. 
these two theories are joint together to present the recommendation for a user based on the data’s presented by user profile. cosine similarity be able to catch the similarity among two attribute or more from the dataset found by determining cosine value between two vectors or more. use of cosine similarity can be executed on any two texts such as documents, sentences, attributes or paragraph. occasionally through the similarity measurement between the vectors which produce unstable results. finally, the swrs are build using famous algorithm content-based (cb) and rdf. the important steps in proposed structure design are shown in fig. 1. in the below figure shoe all steps as an instruction of our system. row one show all main steps, but the underneath raw is subset of first raw. raw one and two are complete each other’s for the sake of processes of the system. 3.2. dataset explanation the proposed system was trained as well as tested on the movielens dataset. the dataset consists of movies released on or before july 2019. information focuses incorporate cast, group, plot, watchwords, spending plan, income, banners, delivery dates, dialects, creation organizations, nations, tmdb vote counts, and vote midpoints. the complete movielens datasets comprising 26 million evaluations and 750,000 label applications from 270,000 clients on every one of the 45,000 motion pictures. this dataset is a troupe of information gathered from tmdb and grouplens. the movie detail, credit and keyword have been gathered from the (tmdb) open an api. this item utilizes the tmdb api however is not embraced or affirmed by tmdb. their api likewise gives admittance to information on numerous extra motion pictures, entertainers and entertainers, group individuals, and tv shows. the movie links and ratings have been gotten from the official grouplens site. a portion of the things you can do with this dataset: predicting film income or potentially film achievement in view of a specific measurement. what motion pictures will generally get higher vote counts and vote midpoints on tmdb? building content-based and collaborative filtering based recommendation engines [13]. 3.3. preprocessing to be capable handling information concurring appropriately, really, and productively, that it requires the capacity as far as fig. 1. architecture of proposed system. halo khalil sharif and kamaran hamaali. a. faraj: semantic web recommender system 22 uhd journal of science and technology | july 2022 | vol 6 | issue 2 a specific programming language that is explicitly devoted to handling information or data in many place of origin in the association or the web to turn into a valuable information researcher for associations or organizations [14], because of in the proposed technique (fillna) method is used to cleansing data from the dataset to achieve the best result. scikit-learn is a permitted software (utility) machine-learning library for the python programming language. it assists python numerical and scientific libraries, in which tfidf-vectorizer is one of them. it alters a group of raw documents to a matrix of tfidf structures. as tf–idf is extremely frequently used for text sorts, the class tfidf-vectorizer merges all the options of count-vectorizer and tfidf-transformer in a particular model. 
tfidf-vectorizer uses an in-memory vocabulary (a python dict) to map the most recurrent words to features indices and henceforward calculate a word occurrence frequency (sparse) matrix, the class of tfidfvectorizer used to vectorizing the two attribute from the dataset (movielens) in our proposed system. 3.4. recommendation engine term-frequency inverse-document-frequency (tf-idf) is utilized to yield recommendations to the user’s favorites. each data attribute from the datasets is converted into a vector by applying the tf-idf vectorization algorithm described before. for each vector, a similarity measure is calculated using the cosine similarity method. when a user requires number of recommendations for a certain movie, the correspondence quantities are produced for the movies with concern to that movie. individually similar movie detected will have a confident score of how similar it is to the represented movie, which is sorted into descending order, because of list the movies with high to low similarity. conferring to the amount of recommendations demanded by the user, the indices of those movies are gathered and showed to the user as a list of movies. the recommendations created by the engine are displayed over a user interface to the user; the engine is trained to yield similarity measures using the training data. the backend is scripted using python language, whereas the calculations performed from equations 1 to 4 to find cosine-similarity and tf-idf [15]. cosine similarity, based on vector similarity, similarity among vectors can be denoted as eq. 1: cos aibi ai bi i n i n i n ( ) �  = = = = ∑ ∑ ∑ 1 1 2 1 2 (1) where, ai and bi are components of vector a and b respectively: tf, i.e. word frequency, indicates the frequency of terms in the text showed in eq. 2. tf n n i j i j k i j , , , = ∑ (2) where, n is the frequency of terms in the movie-title. idf, i.e. inverse document frequency, represents the reciprocal of the quantity of movie-title containing words in the mass displayed in eq. 3. idf j n df j =         log (3) where, n is the frequency of movie-title containing words thus, the tf-idf weight for catchphrase in record can be composed in eq. 4 tf-idf = (frequency of words/total words of sentences) × (total documents/documents containing the word) (4) 3.5. evaluation evaluation is used to assessment the consideration space and results from various models or algorithms. for the recommendation of movie-titles, so when it comes to a classification problem, can be counted on an auc roc curve. because of needed to scan or imagine the performance of the proposed system, it is denoted by the auc (area under the curve) roc (receiver operating characteristics) curve. it is one of the greatest significant estimation metrics for testing any arrangement model’s performance. it is as well written as auroc (area under the receiver operating characteristics).the range auc is between 0 and 1, an brilliant model has auc proximate to the 1 and that implies it has a moral degree of distinguishability. the unwell model has an auc near 0 which denoted it has the poorest measure of separability. three broadly utilized performance metrics were applied to assess the proposed system’s performance: tpr (true positive rate)/recall/sensitivity, specificity and fpr(false positive rate)/precision. to calculate the metrics specified by equations 5–7, three distinct performance factors were selected: true positive (tp), true negative (tn), false positive (fp), and false negative (fn). 
tpr (true positive rate)/recall/sensitivity:

tpr = recall = sensitivity = \frac{tp}{tp + fn}   (5)

specificity:

specificity = \frac{tn}{tn + fp}   (6)

fpr:

fpr = \frac{fp}{tn + fp}   (7)

4. results and discussion
although the semantic-web recommender systems (swrs) built by other developers have used various filtering techniques, they encountered weaknesses that were slightly disturbing. in our paper, we implemented an swrs with the content-based algorithm on two attributes from the movielens dataset, utilizing cosine similarity and term-frequency inverse-document-frequency (tf-idf). the algorithm was then tested on the windows 10 64-bit and linux 18.2 64-bit operating systems with different numbers of records (movie-titles). table 1 displays the results achieved on the windows 10 64-bit operating system for different numbers of records in our dataset, giving the process time, execution time, and accuracy from reading the dataset to creating the rdf file; table 2 displays the same information as table 1, but on a real (not virtual) linux 18.2 64-bit operating system. these marks indicate that building the swrs on the linux operating system is better than on the windows operating system; moreover, fig. 2 (windows 10) and fig. 3 (linux) verify the results found by the area under the curve (auc) for the accuracy of creating the swrs. as a result of all the evaluation, the two parameters are better in quality of service (qos) and quality of information (qoi), although the results (accuracy and speed) can be affected by the features of the computer, such as the cpu, ram, data bus, and graphics card; these issues should therefore be handled before any processing to provide the expected results.

table 1: results of the swrs on the windows 10 64-bit operating system
no. records in dataset | process time (second) | execution time (second) | accuracy (auc)
1000 | 1.00315 | 1.01364 | 88.75%
2000 | 1.07825 | 1.15712 | 88.25%
3000 | 1.40268 | 1.52974 | 87.15%

table 2: results of the swrs on the linux 18.2 64-bit operating system
no. records | process time (second) | execution time (second) | accuracy (auc)
1000 | 0.7104 | 0.6034 | 92.10%
2000 | 0.7445 | 0.6499 | 91.75%
3000 | 0.8268 | 0.6726 | 90.35%

fig. 2. display auc on windows for swrs.
fig. 3. display auc on linux for swrs.

5. conclusions
quick recommendation of movie-titles provides the greatest opportunity to find the correct titles (movies) for users; a semantic web-based recommender system using the content-based algorithm is able to automatically and successfully analyze the required data to identify the movie-titles. the main objective of our research is to use the tf-idf and cosine similarity model to perform recommendations and then create an rdf file as a semantic web resource from the data recommended by the proposed method; after that, the simple protocol and rdf query language (sparql) is used, which is the query language for the semantic web that requests information from databases or any data source that can be mapped to rdf. a minimal sketch of this rdf creation and sparql querying step is given below.
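as a small illustration of this final step, the following sketch builds an rdf graph from recommended titles and queries it with sparql using the rdflib library; the namespace, predicate names, and the input list are illustrative assumptions, not the exact schema of the proposed system:

```python
# minimal sketch: serialize recommended movies to rdf and query them with sparql.
# the namespace, predicate names, and the 'recommended' list are illustrative only.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

EX = Namespace("http://example.org/movies/")
g = Graph()

recommended = [("Toy Story", 0.93), ("A Bug's Life", 0.88)]  # (title, similarity)
for i, (title, score) in enumerate(recommended):
    movie = URIRef(EX[f"movie{i}"])
    g.add((movie, RDF.type, EX.Movie))
    g.add((movie, EX.title, Literal(title)))
    g.add((movie, EX.similarity, Literal(score)))

g.serialize(destination="recommendations.rdf", format="xml")  # the generated rdf file

# sparql query over the generated graph: titles ordered by similarity
q = """
PREFIX ex: <http://example.org/movies/>
SELECT ?title ?sim WHERE { ?m a ex:Movie ; ex:title ?title ; ex:similarity ?sim . }
ORDER BY DESC(?sim)
"""
for row in g.query(q):
    print(row.title, row.sim)
```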
the proposed system offered a high average recommendation accuracy of approximately 88.5% for the windows operating system and 91.25% for the linux operating system. the experimental results show that the proposed method is more effective than the previous works.

references
[1] p. hitzler. "a review of the semantic web field". communications of the acm, vol. 64, pp. 76-83, 2021. [2] i. portugal, p. alencar and d. cowan. "the use of machine learning algorithms in recommender systems: a systematic review". expert systems with applications, vol. 97, pp. 205-227, 2018. [3] j. e. gayo, e. prud'hommeaux, i. boneva and d. kontokostas. "validating rdf data". synthesis lectures on semantic web: theory and technology, vol. 7, pp. 1-328, 2017. [4] h. asghar, z. anwar and k. latif. "a deliberately insecure rdf-based semantic web application framework for teaching sparql/sparul injection attacks and defense mechanisms". computers and security, vol. 58, pp. 63-82, 2015. [5] b. r. cami, h. hassanpour and h. a. mashayekhi. "a content-based movie recommender system based on temporal user preferences". in: 3rd iranian conference on intelligent systems and signal processing (icspis), pp. 121-125, 2017. [6] g. m. zebari, k. faraj and s. zeebaree. "hand writing code-php or wire shark ready application over tier architecture with windows servers operating systems or linux server operating systems". international journal of computer sciences and engineering, vol. 4, pp. 142-149, 2016. [7] k. faraj. "design of an e-commerce system based on intelligent techniques". sulaimani university, sulaimani, krg, iraq, 2010. [8] s. p. rana, m. dey, j. prieto and s. dudley. "content-based health recommender systems". in: recommender system with machine learning and artificial intelligence: practical tools and applications in medical, agricultural and other industries. john wiley and sons, hoboken, pp. 215-236, 2020. [9] d. wang, y. liang, d. xu, x. feng and r. guan. "a content-based recommender system for computer science publications". knowledge-based systems, vol. 157, pp. 1-9, 2018. [10] i. t. afolabi, o. s. makinde and o. o. oladipupo. "semantic web mining for content-based online shopping recommender systems". international journal of intelligent information technologies, vol. 15, pp. 41-56, 2019. [11] c. l. bocanegra, j. l. ramos, a. civitet and l. f. luqure. "healthrecsys: a semantic content-based recommender system to complement health videos". bmc medical informatics and decision making, vol. 17, pp. 1-10, 2017. [12] n. a. albatayneh, k. i. ghauth and f. f. chua. "utilizing learners' negative ratings in semantic content-based recommender system for e-learning forum". journal of educational technology and society, vol. 21, pp. 112-125, 2018. [13] available from: https://www.kaggle.com/datasets/rounakbanik/themovies-dataset [last accessed on 2022 feb 05]. [14] s. sardjono, r. y. alamsyah, m. marwondo and e. setiana. "data cleansing strategies on data sets become data science". international journal of quantitative research and modeling, vol. 1, pp. 145-156, 2020. [15] r. h. singh, s. maurya, t. tripathi, t. narula and g. srivastav. "movie recommendation system using cosine similarity and knn". international journal of engineering and advanced technology, vol. 9, pp. 556-559, 2020.

70 uhd journal of science and technology | jan 2022 | vol 6 | issue 1 a clinical appraisal [1].
most of patients ought to visit the emergency clinic for the span of their treatment, which brought about more prominent medical care costs and tension on country and distant medical services organizations [2]. over the long haul, mechanical progressions have considered the determination of an assortment of problems just as well-being observing by means of minuscule gadgets, for example, smartwatches [3]. despite of this, innovation has changed the medical care framework from being focused on clinics to being fixated on patients [4], [5]. a few clinical preliminaries, for instance, may be done at home without the assistance of a medical care proficient (e.g., blood glucose level, checking pulse, po2, and level) [4], [6], [7]. future iot software in healthcare also exploring iot industry application mustafa n. rashad1, dana l. hussein1, haval d. abdalkarim2, ribwar r. azeez2 1department of information technology, chamchamal technical institute, sulaimani polytechnic university, kurdistan region, iraq, 2department of database technology, computer science institute, sulaimani polytechnic university, kurdistan region, iraq a b s t r a c t there has been a great deal of investigation into medical services ability and specialized advancements during the most recent 10 years. to state the obvious, internet of things (iot) has demonstrated insure associating different clinical hardware, sensors, and medical services experts to give top-notch clinical consideration at a distant area. this has upgraded patient security, diminished medical care costs, expanded admittance to medical services benefits, and expanded functional adequacy in the medical care industry. emerging technologies such as iot have the potential to transform our lives in many ways. a smart ubiquitous framework can only be built using smart objects in the iot system, which is its ultimate building pieces. this research surrenders an audit of potential iot-based innovation applications in medical services conducted to date. this paper records the development of the use of the healthcare internet of things (hiot) in tending to different medical care worries according to the viewpoints of empowering innovation, medical care administrations, and applications. besides, potential hiot framework issues and issues are explored. the current research closes by giving a wellspring of comprehension on the various uses of hiot with expectations of empowering future scholastics that are quick to chip away at and kick off something new in the field to have a superior handle of the subject. iot innovation has helped medical care experts in checking and diagnosing an assortment of well-being concerns, estimating an assortment of well-being factors, and giving demonstrative capacities at far-off areas using these standards. the structure and implementation of a specific framework are the subject of this paper. this has moved the medical services industry’s concentrate away from clinics and toward patients. index terms: future iot applications, healthcare internet of things, medical care, industrial applications corresponding author’s e-mail: mustafa n. rashad, department of information technology, chamchamal technical institute, sulaimani polytechnic university, kurdistan region, iraq. e-mail: mustafa.rashad@spu.edu.iq received: 05-12-2021 accepted: 07-04-2022 published: 19-05-2022 access this article online doi: 10.21928/uhdjst.v6n1y2022.pp70-75 e-issn: 2521-4217 p-issn: 2521-4209 copyright © 2022 rashad et al. 
this is an open access article distributed under the creative commons attribution non-commercial no derivatives license 4.0 (cc by-nc-nd 4.0) o r i g i n a l re se a rc h a rt i c l e uhd journal of science and technology 1. introduction in current history, medical care industry has encountered rapid development, contributing impressively to income and business creation. a couple of years prior, sicknesses, and inconsistencies in human body must be recognized through mustafa n. rashad, et al.: future iot software in healthcare also exploring iot industry application uhd journal of science and technology | jan 2022 | vol 6 | issue 1 71 the current interchanges innovation may likewise be utilized to send clinical information from remote spots to medical services habitats [5], [8]. the iot has turned into a vital supporter of worldwide correspondence on account of future conventions and calculations [8]. the iot has given individuals more autonomy while likewise expanding their capacity to cooperate with the remainder of the world [9]. it associates a wide scope of gadgets to the internet, including remote sensors, home apparatuses, and electrical gadgets [10]. iot applications might be found in horticulture, vehicles, the home, and medical services [11]. the iots is gathering steam due to its advantages of higher precision, lower costs, and the capacity to more readily anticipate future occasions [12]. moreover, more prominent information on programming and applications, just as headways in versatile and pc advancements, universal openness of remote innovation, and the development of the computerized economy, have all upheld the fast iot insurgency [13]. heating, dampness, electrocardiograph (ecg), electroencephalograph (eeg), and other physiological data from the patient’s body are captured using instruments in healthcare applications. climate, dampness, date, and time may all be recorded as well [5], [14]. these data allow for more precise and relevant conclusions about the patients’ health [4], [15]. because a vast quantity of information is obtained derived from a range of sources, data storage and usability are also crucial in the iot system. doctors, caretakers, and other authorized persons have use of obtained information by aforementioned sensing devices. having the ability to communicate this data relating to healthcare practitioners through cloud-server enables for fast patients’ diagnoses and, if necessary, actions in the field of medicine [16]. the iots is an idea wherein every one of the gadgets in our day-to-day existence might associate with the web or to one another to communicate the information and execute all positions through the organization [10]. the improvement of new provisions in the manner medical care products is offered has come about due to admittance to versatile clinical clients and portable well-being administrations. likewise, the development of therapy draws near, just as the spread of arising innovation like robots and man-made brainpower, just as the straightforward exchange, and sharing of clinical information over the web [10], [11]. it is a remarkable idea which spins around utilizing the internet to better our lives [13]. this fundamental pattern is valuable to patient consideration since it permits specialists to make more exact analyses and, accordingly, accomplishes better treatment results. the utilization of iot highlights in clinical gear generously expanded the quality and viability of clinical benefits [17]. 2. 
background iot is a significant a component of current data innovation. iot is a framework that spreads over the internet and is the consequence of ongoing quick development in the field of remote interchanges [10]. to make the “web of everything” a reality, it very well might be important to associate different information gathering devices to the internet [18]. keen urban communities, shrewd homes, sensors were set, and route frameworks are only a couple of the spaces, where the iots is as of now broadly utilized [4]. perhaps the main application area of all is shrewd well-being [5], [8], [16]. great many each year, individuals pass away as a result of different illnesses or medical problems [16]. individuals’ well-being is turning out to be increasingly more of a concern. subsequently, one of the focal points of study in the field of shrewd well-being is the utilization of iot advancements to address medical problems [18]. it is the organization with which it communicates the physical, virtual universes of the internet. the actual world incorporates home devices, autos, modern hardware, structures, clinical gear, and the human body [10]. individuals’ way of life, ongoing sickness the executives, peril id, and lifesaving treatments will all profit from the utilization of iot innovation in medical services [5]. in medical care, the iot has a variety of applications: keep a nearby eye on your well-being. wearable gadgets would now be able to follow essential human body capacities, examine human conduct, and analyze medical conditions [8]. wearable innovation devices (smartwatches) can lessen tension and set aside cash for patients [14]. one more delicate well-being observing innovation used in customary emergency clinics isn’t care for this. patient support in wellbeing related exploration [16]. iot gadgets might be utilized in medical services settings to remind patients to take as much time as necessary [5], [10]. electrocardiograms, blood oxygen, and circulatory strain checking hardware can be interconnected to work on patients’ and parental figures arranged, observing, and framework expected, prompting further developed treatment results and service enhancements [5]. autos can be associated with network frameworks utilizing the iots. if an auto is occupied with a mishap, the framework can survey the seriousness of the impact and inform the traffic right hand director and the medical care mediation focal point of the mishap area and heading. this will help individuals who have been harmed in looking for brief assistance [10]. a few examination utilizing smartwatches, shrewd wearables, keen wristbands, keen homes, and iot innovations are presently embraced in the field of brilliant well-being. by the by, no exploration has equitably investigated and pictured all of the writing on the theme and analyze the current state and future mustafa n. rashad, et al.: future iot software in healthcare also exploring iot industry application 72 uhd journal of science and technology | jan 2022 | vol 6 | issue 1 advances of iot-based wise well-being research, totally, and dispassionately [5]. 3. system architecture overview 3.1. basics of hiot architecture the iot worldview for clinical applications supports the coordination of iot and distributed computing benefits into the field of medication. it likewise gives the conventions to conveying patient information to a clinical association from a scope of sensors and analytic supplies. 
the geography of hiot alludes to the association of various parts of an iot medical care system that is successfully incorporated into medical services setting [19]. the three vital parts of a run of the mill hiot framework are the maker, representative, and supporter, as shown in fig. 1. the distributer addresses an organization of connected sensors and other clinical gadgets that may freely or at the same time accumulate the patient’s fundamental information. the factors that might be estimated incorporate pulse, temperature, oxygen immersion, ecg, eeg, and emg. the distributer can send this information to a specialist routinely over an organization [20]. the information that has been gained in the cloud is handled and put away by the representative. at long last, the endorser participates in constant checking of the patient’s information, which might be gotten to and seen on a cell phone or pc. the distributer can analyze the information and proposition criticism in the wake of seeing any physiological irregularities or weakening in the patient’s well-being status [1]. every hub on the iot organization and server in the medical care network fills a particular need in the hiot, which unions separate parts into a crossover matrix. since the geography is relying upon the medical care prerequisite and application, it’s hard to offer a uniform establishment for hiot [12]. for the hiot framework, a few underlying modifications have been executed before. when fabricating another iot-based medical care framework for ongoing patient checking, it’s basic to make a rundown of all connected activities identified with the planned wellbeing application [12], [14]. the accomplishment of the iot framework is controlled by how well it addresses the issues of medical care experts [4]. the geography should follow the clinical standards and stages in the analysis technique since every infection requires a muddled succession of medical care activities [15]. 3.2. hiot technologies the innovation important to construct a hiot framework is critical [1]. this is on the grounds that the arrangement of explicit advancements can upgrade the abilities of an iot framework [17]. the different state of the art advancements has been joined with an iot framework to incorporate assorted medical care applications [10]. the three fundamental categorizations where these advances fall are as follows [1]: fig. 1. a typical hiot framework has three main components [1]. mustafa n. rashad, et al.: future iot software in healthcare also exploring iot industry application uhd journal of science and technology | jan 2022 | vol 6 | issue 1 73 1. location technology 2. identification technology 3. communication technology, as shown in fig. 2. 3.2.1. authentication technology the availability of the information about the patient derived from endorsed hub sensor device, and possibly situated in remote locations, is a commonsense factor in the plan of a hiot framework [14]. this can be refined by accurately distinguishing the hubs and sensors that exist in the medical care organization [4], [10]. the act of allocating an interesting personality (id) to each allowed substance, so it very well may be effectively recognized and steady information transmission can be cultivated, is known as distinguishing proof [1]. a computerized id is associated with each asset engaged with the medical services framework (emergency clinic, specialist, attendants, cares, clinical gadgets, etc.) [7]. 
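as an illustration of how the publisher-broker-subscriber organization and per-device identities described above can be realized, the following is a minimal sketch assuming an mqtt broker; the paho-mqtt 1.x client api is used, and the broker address, topic layout, and device id are hypothetical:

```python
# minimal sketch of a hiot publish/subscribe exchange over mqtt;
# broker address, topic names, and device id are illustrative assumptions.
import json
import time
import paho.mqtt.client as mqtt

BROKER = "broker.example-hospital.org"   # hypothetical broker ("representative")
DEVICE_ID = "ecg-ward3-bed12"            # unique id assigned to this sensor node

client = mqtt.Client(client_id=DEVICE_ID)  # paho-mqtt 1.x style constructor
client.connect(BROKER, 1883)

# publisher ("maker") side: the sensor node publishes vitals under its own id
reading = {"device": DEVICE_ID, "heart_rate": 72, "spo2": 97, "ts": time.time()}
client.publish(f"hiot/vitals/{DEVICE_ID}", json.dumps(reading), qos=1)

# subscriber ("endorser") side: a caregiver application, normally a separate
# process, receives readings for the whole ward via a wildcard topic
def on_message(cl, userdata, msg):
    print(msg.topic, json.loads(msg.payload))

client.on_message = on_message
client.subscribe("hiot/vitals/#")
client.loop_forever()
```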
be that as it may, due to the consistent headway of iot-based advances, the exceptional character of a part might fluctuate after some time all over the iot framework’s life cycle [11]. to guarantee the uprightness of the medical care gadget/ framework, the gadget should offer an arrangement for refreshing patients’ information [1], [11]. this is because the arrangement alteration not just influences the most common way of following the organization component(s); however, it might likewise bring about a broken conclusion [11]. 3.2.2. telecommunication technology correspondence advancements permit different substances in a hiot organization to speak with each other [9]. it can be parted into two classes: short-reach and medium-range correspondence [10]. the conventions used to interface things inside a little scope of body region networks are known as shortrange correspondence advancements (ban) [6], [8]. mediumrange correspondence frameworks, then again, ordinarily give significant distance correspondence, for example, data trade between a base station and a ban’s focal hub [8], [9], [21]. zigbee is a typical convention for interconnecting clinical hardware and communicating information [9], [11]. the zigbee recurrence range is like that of bluetooth (2.4 ghz) [11], [12]. it does, nonetheless, have a more drawn-out correspondence range than bluetooth gadgets [6]. the lattice network geography is utilized in this innovation [5]. end hubs, switches, and a handling place make up the framework [22]. information examination and conglomeration are taken care of by the handling community, regardless of whether a couple of gadgets fall flat, and the crosssection network guarantees that the remainder of the gadgets stays associated. energy utilization, high transmission rate, and huge organization limit are large benefits of zigbee [11], [12]. 3.2.3. geolocation technology the geolocation advance is regularly utilized in medical services organizations to screen and find the whereabouts of an article [6]. it likewise monitors the treatment cycle dependent on how accessible assets are dispersed [10]. the global positioning system, otherwise called gps, is quite possibly the majority typically applied technology [8]. satellites are utilized for the following purposes [1]. as since a long time ago, there is an unmistakable view between the item and four separate satellites, an article can be identified utilizing gps [8]. it very well may be utilized in hiot to distinguish the area of an emergency vehicle, a medical care proficient, cares. utilization of global positioning systems are, notwithstanding, confined for outside apps since neighboring offices can meddle with correspondence between the item and the satellite. a neighborhood situating framework (lps) organization can be valuable in these circumstances [1]. lps can follow an article by recognizing the radio transmission sent by the moving item and sending it to a variety of pre-situated beneficiaries [1]. lps can likewise be used with an assortment of short-range correspondence innovations such as rfid, wi-fi, zigbee. super wideband (uwb) radio, then again, is inclined toward due to its predominant transient goal. this permits the collector to work out the appearance time with accuracy [7], [8], [13]. the analysts utilized a uwb-based confinement technique to follow the time distinction of appearance (tdoa). 
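as a toy illustration of the tdoa idea mentioned above, the following sketch estimates a tag position from arrival-time differences relative to a reference anchor by least squares; the anchor geometry and measurement values are purely illustrative:

```python
# toy sketch of tdoa-based localization: given anchor positions and arrival-time
# differences relative to anchor 0, solve for the tag position by least squares.
import numpy as np
from scipy.optimize import least_squares

C = 3e8                                   # propagation speed (m/s)
anchors = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 10.0]])
true_pos = np.array([3.0, 4.0])           # unknown position, used to simulate data

dists = np.linalg.norm(anchors - true_pos, axis=1)
tdoa = (dists - dists[0]) / C             # measured time differences vs anchor 0

def residuals(p):
    d = np.linalg.norm(anchors - p, axis=1)
    return (d - d[0]) / C - tdoa

est = least_squares(residuals, x0=np.array([5.0, 5.0])).x
print("estimated position:", est)
```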
other estimating rules, such as a family member and differential season of appearance, and full circle beginning speed, have been archived in the writing while developing a uwb-based restriction framework. gps, just as other high-transmission capacity correspondence innovations, could be utilized to build savvy medical care networks later on [1], [8]. fig. 2. categorization of iot technology. mustafa n. rashad, et al.: future iot software in healthcare also exploring iot industry application 74 uhd journal of science and technology | jan 2022 | vol 6 | issue 1 3.2.4. services and application of hiot clinical contraptions would now be able to do ongoing investigations that specialists could not direct only a couple of years prior in light of late headways in iot innovation [8]. it has likewise helped medical services communities connect with more people on the double and give great consideration for a minimal price. the utilization of huge information and distributed computing has fundamentally worked on the dependability and simplicity of correspondence among patients and specialists [11]. therefore, the patient’s commitment to the treatment interaction was expanded, while the patient’s monetary weight was diminished [14]. the huge impact of iot lately has helped the formation of hiot applications, which incorporate sickness diagnostics, individual consideration for pediatrics and geriatric patients, well-being, and wellness of the executives, and constant illness checking [5], [8], [16]. it has been isolated into two key classes, to be specific administrations and applications, for a superior comprehension of these applications [16]. the previous alludes to the rules that are utilized in the advancement of a hiot gadget, while the last option alludes to medical care applications that are utilized in either diagnosing a particular medical issue or estimating well-being measurements [8]. by giving answers for various medical care concerns, administrations and ideas have changed the medical services business [14]. with rising medical care requests and innovative headways, more administrations are being presented every day. these are currently turning into a significant piece of the hiot framework configuration process [11], [12]. in a hiot setting, each help gives an assortment of medical services arrangements. 4. challenges as of late, the medical care business has seen critical mechanical headways and their utilization in the goal of medical services-related challenges [8]. this has significantly further developed medical care administrations, which are presently accessible at the dash of a button. iot has effectively changed the medical care industry using keen sensors, distributed computing, and correspondence advances [5], [8]. iot, as different innovations, has its arrangement of obstructions and issues that could be investigated more later on. in the accompanying part, we’ll turn out a portion of the worries [13]. • there have been quick specialized enhancements as of late, requiring occasional moves up to hiot-based gadgets. countless associated clinical gadgets and sensors are utilized in each iot-based framework. this involves costly support, fix, and redesign costs, which might impact the organizations just as end-clients financials. accordingly, sensors that can be worked with fewer support costs should be incorporated [11]-[13]. • the larger part of iot gadgets is fueled by means of batteries. it is difficult to modify a sensor’s battery whenever it has occurred introduced. 
subsequently, a high-limit batter y was utilized to control the framework. subsequently, scientists throughout the planet are endeavoring to fabricate medical services devices that can produce their force. connecting the iot framework with environmentally friendly power frameworks is one such likely arrangement. somewhat, these strategies can help with alleviating the worldwide energy issue [10], [23]. • the thought of ongoing observing has been modified by the coordination of distributed computing. be that as it may, this has expanded the weakness of medical services organizations to aggressors. this could bring about the bungle of delicate patient information and affect the treatment cycle. a few preparatory insurances should be thought of while fostering a hiot framework to shield it from this destructive assault [17], [18], [23]. • real check, secure booting, adaptation to internal failure, approval of the executives, whitelisting, secret key encryption, and secure blending conventions should be in every way assessed and utilized by clinical and sensor gadgets in a hiot organization to stay away from an assault [8]. 5. conclusion and future scope the current research investigated a few features of the hiot framework. the engineering of a hiot framework, its parts, and the correspondence among these parts has all been analyzed inside and out here. furthermore, this article gives information on contemporary medical care benefits that have explored iot-based innovation. iot innovation has helped medical care experts in checking and diagnosing an assortment of well-being concerns, estimating an assortment of well-being factors, and giving demonstrative capacities at far-off areas using these standards. this has moved the medical services industry’s concentrate away from clinics and toward patients. we’ve likewise discussed diverse hiot applications and their latest things. the difficulties and issues identified with the plan, creation and utilization of the hiot framework have likewise been talked about. before very long, these troubles will fill in as an establishment for future development and exploration center. moreover, peruses who mustafa n. rashad, et al.: future iot software in healthcare also exploring iot industry application uhd journal of science and technology | jan 2022 | vol 6 | issue 1 75 are keen on beginning their exploration as well as making enhancements in the field of hiot gadgets will get a piece of full modern information on the gadgets. references [1] b. pradhan, s. bhattacharyya and k. pal. “iot-based applications in healthcare devices”. journal of healthcare engineering, vol. 2021, p. 6632599, 2021. [2] m. javaid and i. h. khan. “internet of things (iot) enabled healthcare helps to take the challenges of covid-19 pandemic”. journal of oral biology and craniofacial research, vol. 11, no. 2, pp. 209-214, 2021. [3] proceedings of the 5th eai international conference on smart objects and technologies for social good. acm, new york, usa, 2019. [4] i. de morais barroca filho, g. s. jr. aquino and t. b. vasconcelos. “extending and instantiating a software reference architecture for iot-based healthcare applications”. in: computational science and its applications iccsa 2019. springer international publishing, cham, 2019, pp. 203-218. [5] s. ketu and p. k. mishra. “internet of healthcare things: a contemporary survey”. the journal of network and computer applications, vol. 192, no. 103179, p. 103179, 2021. [6] m. m. alam, h. malik, m. i. khan, t. pardy, a. kuusik and y. 
le moullec. “a survey on the roles of communication technologies in iot-based personalized healthcare applications”. ieee access, vol. 6, pp. 36611-36631, 2018. [7] m. a. akkaş, r. sokullu and h. e. çetin. “healthcare and patient monitoring using iot”. internet of things, vol. 11, no. 100173, p. 100173, 2020. [8] j. qi, p. yang, g. min, o. amft, f. dong and l. xu. “advanced internet of things for personalised healthcare systems: a survey”. pervasive and mobile computing, vol. 41, pp. 132-149, 2017. [9] y. a. qadri, a. nauman, y. b. zikria, a. v. vasilakos and s. w. kim. “the future of healthcare internet of things: a survey of emerging technologies”. ieee communications surveys and tutori-als, vol. 22, no. 2, pp. 1121-1167, 2020. [10] r. c. dharmik, s. gotarkar, p. dinesh and h. s. burde. “an iot framework for healthcare moni-toring system”. journal of physics: conference series, vol. 1913, no. 1, p. 012145, 2021. [11] n. gavrilović and a. mishra. “software architecture of the internet of things (iot) for smart city, healthcare and agriculture: analysis and improvement directions”. journal of ambient intelligence and humanized computing, vol. 12, no. 1, pp. 1315-1336, 2021. [12] l. catarinucci, d. de donno, l. mainetti, l. palano, l. patrono, m. l. stefanizzi, l. tarricone. “an iot-aware architecture for smart healthcare systems”. ieee internet of things journal, vol. 2, no. 6, pp. 515-526, 2015. [13] g. marques, r. pitarma, n. m. garcia and n. pombo.“internet of things architectures, technolo-gies, applications, challenges, and future directions for enhanced living environments and healthcare systems: a review”. electronics (basel), vol. 8, no. 10, p. 1081, 2019. [14] d. castro, w. coral, j. cabra, j. colorado, d. méndez and l. trujillo. “survey on iot solutions applied to healthcare”. dyna (medellin), vol. 84, no. 203, pp. 192-200, 2017. [15] r. de michele and m. furini. “iot healthcare: benefits, issues and challenges”. in: proceedings of the 5th eai international conference on smart objects and technologies for social good, 2019. [16] s. b. baker, w. xiang and i. atkinson. “internet of things for smart healthcare: technologies, challenges, and opportunities”. ieee access, vol. 5, pp. 26521-26544, 2017. [17] p. p. ray, d. dash and d. de. “edge computing for internet of things: a survey, e-healthcare case study and future direction”. the journal of network and computer applications, vol. 140, pp. 1-22, 2019. [18] a. jain, m. singh and p. bhambri. “performance evaluation of ipv4-ipv6 tunneling procedure us-ing iot”. journal of physics: conference series, vol. 1950, no. 1, p. 012010, 2021. [19] l. m. dang, m. j. piran, d. han, k. min and h. moon. “a survey on internet of things and cloud computing for healthcare”. electronics (basel), vol. 8, no. 7, p. 768, 2019. [20] b. oryema, h. s. kim, w. li and j. t. park. “design and implementation of an interoperable messaging system for iot healthcare services”. in: 2017 14th ieee annual consumer communi-cations and networking conference (ccnc), 2017. [21] a. gatouillat, y. badr, b. massot and e. sejdic. “internet of medical things: a review of recent contributions dealing with cyber-physical systems in medicine”. ieee internet of things journal, vol. 5, no. 5, pp. 3810-3822, 2018. [22] f. sallabi, f. naeem, m. awad and k. shuaib. “managing iot-based smart healthcare systems traffic with software defined networks”. in: 2018 international symposium on networks, com-puters and communications (isncc), 2018. [23] d. d. ramlowat and b. k. pattanayak. 
“exploring the internet of things (iot) in education: a re-view”. in: advances in intelligent systems and computing. springer singapore, singapore, 2019, pp. 245-255. tx_1~abs:at/tx_2:abs~at 12 uhd journal of science and technology | jan 2022 | vol 6 | issue 1 1. introduction technologies of biometric are utilized as a source in multiple applications that outline the interests of identification and authorization, such attentiveness demands a high level of privacy and security. compared with traditional security, biometrics is considered an important type of security since it provides unique features through identifying biometrics characteristics [1]. a large number of security-based researchers used biometrics systems to improve the performance of their approaches. the biometric specifications are classified into two main classes: physiological and behavioral, body parts included under fixed human characteristics and classified to be physiological characteristics for instance iris, fingerprint, face, dna, and retina, also classified as passive biometrics, while gait, voice, and handwritten signature classified as active biometrics as it represented by skills or functions performed by an individual and that make them belong to behavioral characteristics. either way, those characteristics led to high authentication and verification for security [2]. biometric security can be achieved through two kinds of categories: uni-model and multi-model [3]. online list of biometric characteristics (active or passive) is used as a feature in the uni-model, this model has a low-security level against the multi-model which uses two or more of those characteristics that achieve a higher security level. the main biometric characteristics for personal verification are obtained from face-fingerprint characteristics, the main aim of this paper is to improve human identification through the multi-model biometric process through merging between two biometric features, face-fingerprint, via these features, the system can compare, detect and identify the candidate within the constructed dataset. both of these characteristics require new feature-level algorithm for a face-fingerprint integral multi-biometrics identification system bayan omar mohammed1, hamsa d. majeed2, siti zaiton mohd hashim3, muzhir shaban al-ani4 1,2,4department of information technology, college of science and technology, university of human development, kurdistan region, iraq, 3department of data science, universiti malaysia kelantan (umk), taman bendahara, 16100 pengkalan chepa, kelantan a b s t r a c t this article delves into the power of multi-biometric fusion for individual identification. a new feature-level algorithm is proposed that is the dis-eigen algorithm. here, a feature-fusion framework is proposed for attaining better accuracy when identifying individuals for multiple biometrics. the framework, therefore, underpins the new multi-biometric system as it guides multi-biometric fusion applications at the feature phase for identifying individuals. in this regard, the facefingerprints of 20 individuals represented by 160 images were used in this framework . experimental resultants of the proposed approach show 93.70 % identification rate with feature-level fusion multi-biometric individual identification. 
index terms: multi-model biometric, dis-eigen algorithm, identification, aspect united moment invariant corresponding author’s e-mail:  bayan.omar@uhd.edu.iq, hamsa.al-rubaie@uhd.edu.iq , sitizaiton@umk.edu.my , muzhir.al-ani@uhd.edu.iq received: 22-11-2021 accepted: 03-02-2022 published: 11-02-2022 o r i g i n a l re se a rc h a rt i c l e uhd journal of science and technology access this article online doi: 10.21928/uhdjst.v6n1y2022.pp12-20 e-issn: 2521-4217 p-issn: 2521-4209 copyright © 2022 mohammes et al. this is an open access article distributed under the creative commons attribution non-commercial no derivatives license 4.0 (cc by-nc-nd 4.0) mohammed et al.: dis-eigen algorithm in feature-level fusion uhd journal of science and technology | jan 2022 | vol 6 | issue 1 13 the existence of reference biometric data samples taken from different volunteers of different ages to be compared against the respective biometrical data of every person enrolled in a database or against a single reference template of particular enrolled individuals for identity confirmation of that person respectively [4]. the identification system accuracy is determined by its success comparison relying on the uniqueness of people’s biometric characteristics, i.e, two persons can never have the same features [1]-[4]. 2. related work according to the importance of using biometric systems in different applications and implementations, this field attracted countless researchers to propose their approaches using different biometrics characteristics. this section presents the most recent published approaches in this field. in fingerprint recognition orientation, several techniques are recommended for accuracy enhancement of the recognition, those techniques were different upon various criteria, certain of these proposals areas of interest was in the preprocessing stage of fingerprint images [5], implemented an approach for the features extraction of both right and left-hand thumbs using many levels of twodimensional discrete wavelet transform (2d-dwt), while another approach [6] used discrete cosine transform (dct) technique for extraction features [7], [8] presented techniques for fingerprint enhancement through localizing and recognizing the minutiae for minutiae extraction relying on the optimal thinning operation that took place in the preprocessing stage of fingerprint image. prasad et al. [9] proposed a system consisting of many stages starting from data gathering which includes fingerprint images belonging to many different people then pre-processing those depending on their characteristics, finally, the algorithm is used for the purpose of recognition of fingerprint. another area of interest was in proposing and developing algorithms for extracted feature stage from the process of recognition of that fingerprint, [10] proposed multi-biometric fusion for identical twins at the feature-level with dis-mean algorithm, aspect united moment invariant (aumi) used to define the individual biometric fingerprint characteristic. the extracted features regard the twin handwriting fingerprint for both word and shape. furthermore, [11] proposed a fusion algorithm using the mean-discrete feature for identical twins fingerprint detection, the main method of the presented algorithm requires the person’s class labeling and multimodel biometric features to uni-modal biometric features conversion. 
as in multi-modal biometrics, the individuality represented by mohammed and hashim [12] using (aumi) for global feature extractions to serve as a means to identical twins fingerprint detection, the procedure of individuality representation measures the aumi capacity for the individuality of the main of twin handwriting-fingerprint. a modified algorithm proposed by mohammed [13] for individuality representation by employing the mean-discrete algorithm, the vector of feature carries the features which generalized the global features owned by individuals. the developed model generalized the features in an earlier stage of classification. in face recognition, many approaches have been proposed over the years that compound pattern recognition with computer vision and image processing, those algorithms are implemented in many forms to accomplish high-efficiency recognition. al-shayea et al. [14] present algorithm to specify the recognition rate of the pca algorithm before and after applying dwt. the outcome was that applying dwt increases the recognition rate with minimization in the feature matrix size. paper [15] proposed an approach based on wavelet-curvelet for facial features extraction, the used technique aims to reduce mathematical computational image analysis by reducing the dimensionality. the nearest mean classifier (nmc) is employed for recognition at a high rate. an efficient approach is proposed [16] for face recognition concentrated on the efficiency of the image to calculate the required features from the image that contained the face. the outcome result gives a fine performance of the face recognized from that system. qeethara al-shayea [17] proposed a system for information measurement between the main features of different face parts like angles and distances. an algorithm has been constructed with all the digital data compared with massive face images databases. histogram equalization technique is used for extraction feature for recognition. research [18] has a main objective of designing an efficient face recognition approach by generating a matrix with a significant rank by applying the technique of singular value decomposition. a relative study is made by prasad et al. [19] scientific survey generated to study various methods and techniques with all the sufferings and benefits of these approaches. al ani and al waisy [20] presented an approach using a kernel machine for face detection from different views, the proposed algorithm shows powerful ability to effectively multi-view face detection. another proposal presented by nejrs and al-ani [21] for face classification through a structured approach that is implemented using mohammed et al.: dis-eigen algorithm in feature-level fusion 14 uhd journal of science and technology | jan 2022 | vol 6 | issue 1 dissimilar levels in two-dimensional discrete fourier transform (2d-dwt). while [22] proposed a technique to track faces and detect happiness parameters. the system is performed using a raspberry pi device, high-resolution camera, and high-definition screen. the study is compatible to apply in real-time. the main goal of this proposal is for studying the happiness level that the students have in the class and raise that level during the lecture. other researchers shared our interest in integrating facefingerprint recognition in one system. sriram j and jacob j [23] proposed a voting system evm where all the data of voters are digitally recorded. 
through the proposed project, the faces of the voters are recognized first by a camera then the fingerprint is used for giving authentication through the data that is already stored in the database. another multimodal system was proposed [24] for identification using face-fingerprint features. the calculations are taking place in the work first for each feature individually rather than when they are combined together. divyakant and meva [25] proposed a biometric system for performance assessment of face-fingerprint biometric traits collected from 30 humans. the recognition stage performance is measured through (false acceptance rate) far and (false rejection rate) frr calculations. a comb filter approach is proposed systems [26] regarding face-fingerprint recognition through an encryption process to present cancelable patterns then make a comparison with any other encrypted biometric that have been randomly attacked. in article [27], firstly in the part of novel step, a system of smart cards accessing is presented by face authentication first then by fingerprint concurrently. the verification should be for both mutual authentications, after that the transaction is approvable [12]. 3. the multi-biometric identification system approach proposal with regards to getting a high accuracy identification, the classifier input is rich with the major features extraction of individuals. in this work, the features are extracted for individual face-fingerprint. after extraction, the classification takes place so as to improve biometric identification. in this module, the feature sets of the defined processed biometrics data are extracted. feature extraction is crucial for resolving raw data into simple, clear, and comprehensible data that would be able for matching learning. this phase is crucial in almost all systems of pattern recognition. the extraction of global features for both face-fingerprints including the extraction of macro features and method of minutiae-based extraction. the aspect united moment invariant (aumi) is extracted to attain the individual features from the images of individual face-fingerprint. the extracted features are all kept in the storage of invariant feature vector. the proposed system for multimodal biometrics identification that employs two biometrics modalities is presented in fig. 1. the adopted phase following feature extraction comprises the suggested fusion feature level. at the level of feature fusion, signals that come from different biometrics channels are processed first. meanwhile, feature vectors are separately extracted then go through a specific field algorithm of fusion known as the dis-eigen algorithm. these feature vectors are composite to generate a combined feature vector prior to being employed for similarity measurement and classification process. 4. extraction of the feature the feature sets are extracted from the defined processed biometrics data in this work using aspect united moment invariant (aumi) [28]. it is powerful in capturing the biometrics’ individual global characteristics of physiognomy and style of the fingerprint and for global features continuously and separately into an individual representation. the scorce of aspect united moment invariant structure is below and represented in fig. 2. 
in total, there are eight aumi features, reported below:

\theta_1 = \frac{\phi_2}{\phi_1}   (1)

\theta_2 = \frac{\phi_6}{\phi_1 \phi_4}   (2)

\theta_3 = \frac{\phi_5}{\phi_4}   (3)

\theta_4 = \frac{\phi_3}{\phi_2 \phi_4}   (4)

\theta_5 = \frac{\phi_1 \phi_6}{\phi_1 \phi_3}   (5)

\theta_6 = \frac{\phi_1 + \phi_2}{\phi_2 \phi_6}   (6)

\theta_7 = \frac{\phi_1 \phi_5}{\phi_3 \phi_6}   (7)

\theta_8 = \frac{\phi_3 + \phi_4}{\phi_3}   (8)

where φ_i represents hu's moment invariants. due to the large values of φ_i, the logarithm is applied, thus giving us: for i = 1 to 7, θ_i ← log_10 φ_i. a set of global features is generated from the extracted features. for the purpose of improving identification performance, these features are gathered individually. since the extracted features are in the multi-representation zone, they are used in combined form. such combined features are termed dis-eigen feature vectors in the uni-representation zone, which is employed after the process of feature extraction [12], [28], [29].

5. the proposed discretized-eigen (dis-eigen) algorithm in feature level fusion
this work attempts to design a more effective multimodal biometrics identification system by introducing the new proposed dis-eigen feature-based fusion, as fig. 3 illustrates, with the capacity to generate distinctive features from numerous modalities of individuals, where fac and fin represent the face and fingerprint features, respectively. first, an improved aumi is used as a global extractor of features obtained from the individual face-fingerprint shape and style. then, the feature-based fusion is examined in terms of its generalization. further, to achieve better classification accuracy, the dis-eigen feature-based fusion algorithm was used. at the start of the process, dis-eigen feature-based fusion processes the raw biometric input images using a suitable processing technique. this results in images of standard size and quality. then, the feature extraction algorithm is applied to these images to generate discriminatory information which can provide a distinction between the identities. it is not easy to come up with a method that could comprehensively capture the discriminatory information from raw input images among numerous identities or modalities and resolve problems in biometrics analysis. this study proposes the use of an enhanced combined feature vector fusion and uni-representation known as the dis-eigen feature for multimodal biometrics identification. also, dis-eigen replaces the feature fusion as a feature transformation agent to provide better feature representation from numerous modalities. in the dis-eigen process, multi-representation features from multi-biometrics are converted to uni- and systematic representation features for reduction of the complexity (dimension) of the feature vector.

fig. 1. proposed multi-biometric system.
fig. 2. aspect united moment invariant structure.

after concatenating an individual's face-fingerprint features, the eigenvector is sorted in ascending order for a person with the aim of determining the discrete intervals on the dis-eigen line. the dis-eigen line is a line of an invariant feature vector that starts from the minimum eigenvalue and ends with the maximum eigenvalue of the eigen feature vector for that person. the number of intervals is equal to the length of the eigenvector minus 1. a minimal sketch of this feature extraction and dis-eigen discretization pipeline is given below.
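the following is a minimal sketch, under stated assumptions, of the pipeline described here and in the next paragraphs: hu-moment-based shape features per modality (standing in for the aumi ratios of eqs. (1)-(8)), mean fusion of face and fingerprint vectors as in eq. (10), and a dis-eigen style discretization using the med-interval rule of eq. (9). the use of covariance eigenvalues as interval dividers, the eighth illustrative feature, and the file names are assumptions, not the exact implementation of this work:

```python
# minimal sketch: hu-moment shape features per modality, mean fusion of face and
# fingerprint features (eq. 10), and a dis-eigen style discretization (eq. 9).
# covariance eigenvalues as interval dividers are an interpretive assumption.
import cv2
import numpy as np

def hu_features(path):
    """global shape features derived from hu moments of a grayscale image."""
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    hu = cv2.HuMoments(cv2.moments(img)).flatten()        # 7 hu invariants
    hu = -np.sign(hu) * np.log10(np.abs(hu) + 1e-30)      # log scaling as in the text
    return np.append(hu, hu[0] / (hu[1] + 1e-30))         # eighth illustrative ratio

fac = np.array([hu_features(f"face_{i}.png") for i in range(4)])         # hypothetical files
fin = np.array([hu_features(f"fingerprint_{i}.png") for i in range(4)])

ff = (fac + fin) / 2.0                                     # eq. (10): mean fusion

# dis-eigen style discretization: sorted eigenvalues of the person's feature
# covariance serve as interval dividers; values in an interval are replaced by
# the med-interval value mi = (ev_i + ev_{i+1}) / 2 from eq. (9).
ev = np.sort(np.linalg.eigvalsh(np.cov(ff, rowvar=False)))
mi = (ev[:-1] + ev[1:]) / 2.0

def discretize(x):
    idx = np.clip(np.searchsorted(ev, x) - 1, 0, len(mi) - 1)
    return mi[idx]

dis_eigen = discretize(ff)
print(dis_eigen)
```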
the eight features columns of aumi have been applied in this study. the eigenvalue is the divider of intervals in the dis-eigen line. each mean face-fingerprint feature (ff) vector that falls within the same interval will have the same med interval value. med interval (mi) for each interval is the average of an interval that is calculated using the formula as in (9): mi ev evi i= − +� � 1 2 (9) med interval value for interval one to seven represents the invariant feature vector that falls within if ff≥evi and if ff≤evi+1. on the other hand, compute the mean features that come from the individual face (fac ij ) and individual fingerprint (fin ij ) invariant feature vector for a person in a twin. the computation and reduction for each face-fingerprint feature are expressed as below: ff fac finij ij ij= +( )/2 (10) where: facij: face features for an individual. finij: fingerprint features for an individual. eight features are created in this study and this is the column number for the features number of the applied aumi for multi-biometrics. these features are called the dis-eigen feature-based fusion vector. tables 1-3 below exemplify the transformation of individual multimodal biometrics feature vector into dis-eigen featurebased fusion vector: tables 1 comprise eight columns representing the eight columns of invariant feature vectors within the aumi then concatenated. these data are further applied in the dis-eigen process. while table 2 represents the dis-eigen features vector composed of the generalized features of individual features for an individual. meantime table 3 presents dis-eigen fig. 3. the proposed dis-eigen feature-based fusion. mohammed et al.: dis-eigen algorithm in feature-level fusion uhd journal of science and technology | jan 2022 | vol 6 | issue 1 17 table 1: real data for face and fingerprint for individual number 1 image f1 f2 f3 f4 f5 f6 f7 f8 1.0075 2.6937 1.7301 0.3341 2.9618 3.3513 1.2401 5.7571 1.0075 5.2967 1.7303 0.334 5.8222 1.7048 630.6227 5.758 1.0054 5.7202 1.7308 0.3338 6.3065 1.5772 583.5766 5.7626 1.0075 2.6937 1.7301 0.3341 2.9618 3.3513 1.2401 5.7571 1.0178 0.1602 1.7237 0.3363 0.0175 56.1404 2.0994 5.7059 1.0142 0.1418 1.7242 0.3361 0.0156 63.2681 2.3703 5.71 1.0305 0.0901 1.7238 0.3363 0.0096 101.1057 3.7337 5.7062 1.0178 0.1021 1.7239 0.3362 0.011 88.8608 3.2916 5.7073 table 2: dis‑eigen face and fingerprint for individual number1 f1 f2 f3 f4 f5 f6 f7 f8 1.2312 1.2312 1.2312 1.2312 1.2312 57.6571 1.2312 1.2312 1.2312 1.2312 1.2312 1.2312 1.2312 57.6571 0 1.2312 1.2312 1.2312 1.2312 1.2312 1.2312 57.6571 0 1.2312 1.2312 1.2312 1.2312 1.2312 1.2312 57.6571 1.2312 1.2312 table 3: example of dis‑eigen feature for individuals  f1 f2 f3 f4 f5 f6 f7 f8 individual 3.4057 3.4057 3.4057 0.437 3.4057 47.802 0 3.4057 p10 3.4057 3.4057 3.4057 0.437 3.4057 47.802 0 3.4057 p10 3.4057 3.4057 3.4057 0.437 3.4057 47.802 3.4057 3.4057 p10 3.4057 3.4057 3.4057 0.437 3.4057 47.802 3.4057 3.4057 p10 0.5237 4.5258 4.5258 0.5237 4.5258 54.7769 0 4.5258 p11 0.5237 4.5258 4.5258 0.5237 4.5258 54.7769 0 4.5258 p11 0.5237 4.5258 4.5258 0.5237 4.5258 54.7769 4.5258 4.5258 p11 0.5237 4.5258 4.5258 0.5237 0.5237 54.7769 4.5258 4.5258 p11 1.2422 5.3632 1.2422 1.2422 5.3632 58.042 0 5.3632 p12 1.2422 5.3632 1.2422 1.2422 5.3632 58.042 0 5.3632 p12 1.2422 5.3632 1.2422 1.2422 5.3632 58.042 58.042 5.3632 p12 1.2422 5.3632 1.2422 1.2422 5.3632 58.042 0 5.3632 p12 1.8191 6.8858 1.8191 1.8191 6.8858 64.3154 0 6.8858 p13 1.8191 6.8858 1.8191 1.8191 1.8191 64.3154 0 6.8858 p13 1.8191 
1.8191 1.8191 1.8191 1.8191 64.3154 0 6.8858 p13 1.8191 6.8858 1.8191 1.8191 6.8858 64.3154 0 6.8858 p13 1.8032 1.8032 1.8032 1.8032 1.8032 52.7556 0 5.5007 p14 1.8032 1.8032 1.8032 1.8032 1.8032 52.7556 1.8032 5.5007 p14 1.8032 5.5007 1.8032 1.8032 5.5007 52.7556 5.5007 5.5007 p14 1.8032 1.8032 1.8032 1.8032 1.8032 52.7556 1.8032 5.5007 p14 1.7582 3.1658 1.7582 0.6345 3.1658 55.8628 0 55.8628 p15 1.7582 55.8628 1.7582 0.6345 55.8628 55.8628 0 55.8628 p15 1.7582 0.6345 1.7582 0.6345 0.6345 55.8628 3.1658 55.8628 p15 1.7582 3.1658 1.7582 0.6345 3.1658 55.8628 0 55.8628 p15 1.3383 3.4765 1.3383 1.3383 3.4765 51.3403 0 51.3403 p16 1.3383 3.4765 1.3383 1.3383 3.4765 51.3403 0 51.3403 p16 1.3383 1.3383 1.3383 1.3383 1.3383 51.3403 1.3383 51.3403 p16 1.3383 3.4765 1.3383 1.3383 3.4765 51.3403 0 51.3403 p16 mohammed et al.: dis-eigen algorithm in feature-level fusion 18 uhd journal of science and technology | jan 2022 | vol 6 | issue 1 features for both face and fingerprint for individuals (10–16). 6. experiment and results precise results of identification process generated by dis-eigen algorithm within the focal point of multirepresentation analysis for an individual’s face-fingerprint. an enhancement level of an individual’s face-fingerprint for dis-eigen feature-based fusion data utilization is proven in this work. in this study, the proposed dis-eigen algorithm is dis-eigen feature-based fusion. datasets with different numbers of both test and other train are used in this work, precisely two kinds of examples are established, the first adopt individual datasets with a split percentage of 60% as training and 40% for testing. the other has a split percentage of 80% as training and 20% as testing. the implementation of training is demonstrated using naivebayes, randomforest, randomforest, and j48 along with folds cross-validation of eight (8) and ten (10). in this experiment, the data sets comprise 160 data which are broken down into two categories with 20 individuals. as presented in tables 4-7 along with figs. 4 and 5. tables 4 and 5 show the performance of dis-eigen, face, fingerprint, and concatenate features for the eight classifiers for the two experimental analysis setup. in average, the performance of dis-eigen feature for all classifiers has table 4: provide the accuracy for classification  process for dis‑eigen feature‑based fusion with  all methods for split percentage of 60% training  and 40% of testing methods naivebayes random forest random tree j48 dis‑eigen 84.37 100 84.37 65.62 face 6.25 21.78 18.75 15.62 fingerprint 18.75 18.75 21.75 12.5 concatenate 10.93 18.75 10.93 14.06 table 5: provide the accuracy for classification  process for dis‑eigen feature‑based fusion with  all methods for split percentage of 80% training  and 20% of testing methods naivebayes randomforest random tree j48 dis‑eigen 87.5 100 100 100 face 25 31.25 25 18.75 fingerprint 12.5 18.75 12.5 6.25 concatenate 15.62 18.75 12.5 15.62 table 6: provide the accuracy for dis‑eigen  feature‑based fusion with all methods for eight (8)  folds cross-validation methods naivebayes randomforest random tree j48 dis‑eigen 88.75 100 98.75 100 face 13.75 23.75 17.5 22.5 fingerprint 32.5 23.75 27.5 27.5 concatenate 5 22.5 22.5 25.62 table 7: provide the accuracy for dis‑eigen  feature‑based fusion with all methods for ten (10)  folds cross-validation methods naivebayes randomforest random tree j48 dis‑eigen 91.25 100 100 98.75 face 16.25 21.25 17.5 25 fingerprint 33.75 25 22.5 28.75 concatenate 4.37 25.62 20 20.62 fig. 4. 
percentage of 60% training and 40% of testing. fig. 5. all methods for eight (8) folds cross. mohammed et al.: dis-eigen algorithm in feature-level fusion uhd journal of science and technology | jan 2022 | vol 6 | issue 1 19 succeeded as the heights performance with 90.23% accuracy rate in average. this is followed by 20.3 % for face rule, 14.64% for concatenate, and 15.21 % for fingerprint. the dis-eigen algorithm has presented the best performance accuracy of 97.18% in average in tables 6 and 7 for all classifiers with eight and ten fold cross validation environment setup. though, six fusion and non-fusion algorithm have achieved quite a lower average performance of 19.68% for face rule, 27.65% for fingerprint and 18.20% for concatenate. this is a very poor performance in comparison toward the performance of dis-eigen features. this has shown that dis-eigen algorithm significantly increased their classification performance. as referred, tables 4-7 and figs. 4 and 5 presented the overall results of the various methods, it is a noticeable sign that the dis-eigen feature-based fusion has the optimum accuracy than face or fingerprint individually besides the concatenate data. the preferable outcome achieves the applied improvement to the features which are individually represented through dis-eigen feature-based fusion. 7. conclusion the dis-eigen feature-based fusion algorithm has been proposed in this work as an attempt for multi-model biometric system improvement for individuality in facefingerprint identification. the proposed algorithm converts the multi-representations of individual features into a uni-representation with the technique of the dis-eigen algorithm. generalized features of an individual have been presented significantly. the new approach has been evaluated and beard comparison with the conventional one with regard to similarity measurement. according to this, an individuals’ facefingerprints were identified. resultant scrutinization has been made. the outcome feature from the dis-eigen feature application is systematic and more informative as experimental resultants of the proposed approach show 93.70 % identification rate with feature-level fusion multi-biometric individual identification. furthermore, a particular improvement in system performance accuracy is achieved. references [1] m. al-ani and k. al-baset. “efficient watermarking based an robust biometric features.” iracst‑engineering science and technology: an international journal, vol. 3, pp. 529-534, 2013. [2] m. s. al-ani and m. a. rajab. “biometrics hand geometry using discrete cosine transform (dct).” science and technology, vol. 3, no. 4, pp. 112-117, 2013. [3] m. al-ani and s. nejrs. “efficient biometric iris recognition based on iris localization approach.” uhd journal of science and technology, vol. 3, pp. 24-32, 2020. [4] z. a. kakarash, d. f. abd, m. al-ani, g. a. omar and k. mohammed. “biometric iris recognition approach based on filtering techniques.” 2019. [5] m. s. al-ani, t. n. muhamad, h. a. muhamad and a. a. nuri. “effective fingerprint recognition approach based on double fingerprint thumb.” in: 2017 international conference on current research in computer science and information technology (iccit), 2017. [6] m. shabanal-ani and w. m. al-aloosi. “biometrics fingerprint recognition using discrete cosine transform (dct).” international journal of computer applicationsvol, 69, no. 6, pp. 44-48, 2013. [7] o. h. a. al-ani. 
“human identification based on thinning minutiae of fingerprint.” journal of theoretical and applied information technology, vol. 96, no. 17, pp. 5918-5929, 2018. [8] m. s. al-ani. “a novel thinning algorithm for fingerprint recognition.” international journal of engineering sciences, vol. 2, no. 2, pp. 4348, 2013. [9] r. s. prasad, m. s. al-ani and s. m. nejres. “an efficient approach for fingerprint recognition.” international journal of engineering innovation and research, vol. 4, no. 2, pp. 307-313, 2015. [10] b. o. mohammed and s. m. shamsuddin. “feature level fusion for multi‑biometric with identical twins.” in: 2018 international conference on smart computing and electronic enterprise (icscee), 2018. [11] b. o. mohammed. “fusion method with mean-discrete algorithm in feature level for identical twins identification.” uhd journal of science and technology, vol. 4, no. 2, pp. 141-150, 2020. [12] b. o. mohammed and z. m. hashim. “individuality representation using multimodal biometrics with aspect unieted moment invariant for identical twins.” journal of theoretical and applied information technology, vol. 98, no. 12, pp. 2148-2157, 2020. [13] b. o. mohammed. “mean-discrete algorithm for individuality representation.” journal of al‑qadisiyah for computer science and mathematics, vol. 13, no. 1, pp. 1-10, 2021. [14] q. k. al-shayea, m. s. al-ani and m. s. a. teamah. “the effect of image compression on face recognition algorithms.” international journal of computer and network security, vol. 2, no. 8, pp. 56-60, 2010. [15] m. s. al ani. “face recognition approach based on waveletcurvelet technique.” signal image process, vol. 3, no. 2, pp. 2131, 2012. [16] r. s. prasad, m. s. al-ani and s. m. nejres. “an efficient approach for human face recognition.” international journal of advanced research in computer science and software engineering, vol. 5, no. 9, pp. 133-136, 2015. [17] m. a. a. qeethara al-shayea. “biometric face recognition based on enhanced histogram approach.” international journal of communication networks and information security, vol. 10, no. 1, pp. 148-154, 2018. [18] o. h. ahmed, j. lu, q. xu and m. s. al-ani. “face recognition based rank reduction svd approach.” the isc international journal of information security, vol. 11, no. 3, pp. 39-50, 2019. [19] r. s. prasad, m. s. al-ani and s. m. nejres. “human identification mohammed et al.: dis-eigen algorithm in feature-level fusion 20 uhd journal of science and technology | jan 2022 | vol 6 | issue 1 via face recognition: comparative study.” iosr journal of computer engineering, vol. 19, no. 3, pp. 17-22, 2017. [20] m. s. al ani and a. s. al waisy. “multi-view face detection based on kernel principal component analysis and kernel support vector techniques.” international journal on soft computing, vol. 2, no. 2, pp. 1-13, 2011. [21] s. m. nejrs and m. s. al-ani. “face image classification based on feature extraction.” solid state technology, vol. 63, no. 6, pp. 13515-13526, 2020. [22] m. s. al-ani. “happiness measurement through classroom based on face tracking.” uhd journal of science and technology, vol. 3, no. 1, pp. 9-18, 2019. [23] sriram j and jacob j. “smart evm based on face and fingerprint recognition.” ijraset, vol. 8, no. 8, pp. 1606-1610, 2020. [24] m. szymkowski and k. saeed. “a multimodal face and fingerprint recognition biometrics system.” in: computer information systems and industrial management. springer international publishing, cham, pp. 131-140, 2017. [25] d. c. k. divyakant and t. meva. 
“performance measurement of face and fingerprint recognition system.” in: rk university first international conference on research and entrepreneurship, 2016. [26] m. abd al rahim, w. el-shafai, e. s. m. el-rabaie, o. zahran and f. e. abd el-samie. “comb filter approach for cancelable face and fingerprints recognition.” menoufia journal of electronic engineering research, vol. 28, no. 1, pp. 89-94, 2019. [27] g. s. g. anjaneyulu and v. jalaja. “novel authentication process of the smart cards using face and fingerprint recognition.” in: advances in automation, signal processing, instrumentation, and control. springer, singapore, pp. 2547-2556, 2021. [28] b. o. mohammed and s. m. shamsuddin. “twins multimodal biometric identification system with aspect united moment invariant.” journal of theoretical and applied information technology, vol. 95, no. 4, pp. 788-803, 2017. [29] b. o. mohammed and s. m. shamsuddin. “a multimodal biometric system using global features for identical twins identification.” journal of computational science, vol. 14, no. 1, pp. 92-107, 2018. tx_1~abs:at/tx_2:abs~at uhd journal of science and technology | july 2022 | vol 6 | issue 2 11 1. introduction covid-19 is a respiratory disease caused by sars-cov-2 coronaviruses derive their name from their spherical viruses, which had a shell and a surface projection similar to the solar corona [1]. unfortunately, the number of deaths from covid-19 is increasing daily, which has led scientists to work tirelessly to develop a tool to diagnose all types of covid. there are several ways to diagnose the disease, including blood tests and chest x-ray (cxr) images [2]. the two most popular imaging studies for diagnosing and managing covid-19 patients are the cxr and computed tomography (ct) scan images. chest radiography and ct scans, on the other hand, are widely available at most medical centers and typically interpreted with a faster turnaround time than the sars-cov-2 laboratory testing. the use of cxr images in the monitoring and examination of numerous lung disorders including tuberculosis, infiltration, atelectasis, pneumonia, and hernia has been known. covid-19 predominantly affects the respiratory system, resulting in severe pneumonia and acute respiratory distress syndrome in extreme cases. for the most part, x-ray images of the chest are used to diagnose covid-19-infected patients [3]. therefore, there are many researches on the diagnosis of covid-19 using cxr images. one of the modern methods used in the diagnosis of covid-19 is the use of deep learning (dl) techniques, which is deep neural network learning. the dl approach has the advantage of automatically extracting features from training data and classifying them more accurately than other traditional methodologies [4]. resnet, abbreviation for residual networks, is a conventional neural network that acts as a foundation for many image processing applications. resnet’s fundamental achievement was that it enabled us to train extraordinarily deep neural networks covid-19 classification based on neutrosophic set transfer learning approach rebin abdulkareem hamaamin1, shakhawan hares wady1, ali wahab kareem sangawi2 1applied computer, college of medicals and applied sciences, charmo university, chamchamal, sulaimani, krg, iraq, 2general science, college of education and language, charmo university, chamchamal, sulaimani, krg, iraq a b s t r a c t the covid-19 virus has a significant impact on individuals around the globe. 
the early diagnosis of this infectious disease is critical to preventing its global and local spread. in general, scientists have tested numerous ways and methods to detect people and analyze the virus. interestingly, one of the methods used for covid-19 diagnosis is x-rays that recognize whether the person is infected or not. furthermore, the researchers attempted to use deep learning approaches that yielded quicker and more accurate results. this paper used the resnet-50 module based on the neutrosophic (ns) domain to diagnose covid patients over a balanced database collected from a covid-19 radiography database. the method is a future work of the n. e. m. khalifa et al.’s method for ns set significance on deep transfer learning. true (t), false (f), and indeterminate (i) membership sets were used to define chest x-ray images in the ns domain. experimental results confirmed that the proposed approach achieved a 98.05% accuracy rate outperforming the accuracy value acquired from previously conducted studies within the same database. index terms: covid-19, chest x-ray, neutrosophic set, resnet-50, classification o r i g i n a l r e s e a r c h a r t i c l e uhd journal of science and technology access this article online doi: 10.21928/uhdjst.v6n2y2022.pp11-18 e-issn: 2521-4217 p-issn: 2521-4209 copyright © 2022 hamaamin, et al. this is an open access article distributed under the creative commons attribution non-commercial no derivatives license 4.0 (cc by-nc-nd 4.0) corresponding author’s e-mail:  rebin abdulkareem hamaamin, applied computer, college of medicals and applied sciences, charmo university, chamchamal, sulaimani, krg, iraq. e-mail: rebin.abdulkarim@charmouniversity.org received: 01-06-2022 accepted: 17-07-2022 published: 01-08-2022 rebin, et al.: covid-19 classification 12 uhd journal of science and technology | july 2022 | vol 6 | issue 2 with 150+ layers. resnet-50 architecture is a well-known convolutional neural network (cnn) dl model with 50 layers for image classification [5]. all cxr images are in the spital domain, then transformed into a new domain called the neutrosophic (ns) domain. the ns includes crisp set, ns graph theory, ns fuzzy set, ns image, and ns topology built on the foundation of ns. the use of these parts on image parts is called advanced image preprocessing, which entails image transformation into the ns domain, the ns domain comprises of three sorts of images, and they are the true (t) images, indeterminacy (i) images, and falsity (f) images [6]. all three membership (true, false, and indeterminate) images were generated in this study. the identification of covid-19 as a classification task is addressed in this study using a system based on resnet-50 architecture in the ns domain. the key contribution of this paper is to analyze the effectiveness of utilizing ns sets based on resnet-50 architecture using huge database images to improve the overall accuracy and thereby reduce the misclassification error rate. the remainder of the paper is organized as follows. section 2 presents a review of related research, a complete proposed framework for the detection of covid-19, including sections such as a database description, image in the ns domain, and resnet-50 model depicted in section 3. section 4 discusses the experimental results and discussions and comparing them with the existing approaches. finally, section 5 provides the conclusion of the work. 2. 
related work over the past 2 years, numerous studies on covid-19 infection diagnosis and detection have been conducted. for instance, lawton [7] proposed covid-19 detection schema from ct lung scans using transfer learning architectures standard histogram equalization and contrast limited adaptive histogram equalization. five pre-trained cnnbased models (inception-resnetv2, inceptionv3, resnet-50, resnet101, and resnet152) for detecting coronavirus pneumonia-infected patients using cxr radiographs were recommended in narin et al. [8] three separate binary classifications comprising four classes (covid-19, normal, bacterial pneumonia, and viral pneumonia) were generated applying five-fold cross-validation. ilyas et al. [9] suggested a real-time rule-based fuzzy logic classifier for covid-19 detection. the suggested methodology collects real-time symptom data from users through an internet of things platfor m to identify symptomatic and asymptomatic covid-19 patients. sharmila and florinabel [10] attempted to classify covid-19 afflicted individuals using cxr scans utilizing new model of cnn and dcgans. hira et al. [11] classified covid-19 patients utilizing nine transfer learning methods. in comparison to other approaches, the resnet-50 produced the best covid-19 detection results for binary and multi-classes, according to the outcomes of the experiments. saiz and barandiaran [12] proposed a new testing methodology to determine whether a patient has been infected by the covid-19 virus using the sdd300 model. the deep feature plus svm-based procedure was proposed in singh et al. [13] for identifying coronavirus infected patients by applying cxr images. svm was utilized for classification rather than dl-based classifiers, which require a large database for training and validation. helwan et al. [14] introduced a transfer learning approach to diagnose patients who were positive for covid-19 and distinguish them from healthy patients using resnet-18, resnet-50, and densenet-201. for this purpose, 2617 chest ct images of non-covid-19 and covid-19 were experimented. alruwaili et al. [15] proposed an improved inception-resnetv2 dl model for accurately diagnosing chest cxr images. a grad-cam technique was also computed to improve the visibility of infected lung parts in cxr scans. aradhya et al. [16] proposed a system for detecting covid-19 from cxr scans. in the case of dl architectures, a novel idea of cluster-based oneshot learning was developed. the suggested schema was a multi-class classification system classifying images into four groups: pneumonia virus, pneumonia bacterial, covid-19, and typical cases. the proposed schema is built using a combination of an ensemble of generalized regression neural network and probabilistic neural network classifiers. ji et al. [17] presented a covid-19 detection approach based on image modal feature fusion. small-sample enhancement preprocessing, including spinning, translation, and randomized transformation, was initially conducted using this methodology. five classic pretraining models including vgg19, resnet152, xception, densenet201, and inceptionresnetv2 were utilized to extract the features from cxr images. gaur et al. [18] presented an innovative methodolog y for preprocessing ct images and identifying covid-19 positive and negative. the suggested approach used the principle of empiric wavelet transformation for preprocessing, with the optimal elements of the image’s red, green, and blue channels being learned on the presented approach. 
deep and transfer learning procedures were recommended by qaid et al. [19] to differentiate covid-19 cases by assessing cxr images. the designed approaches used either cnn or transfer learning models to effectively utilize their potential or hybridized them with sophisticated ml procedures. turkoglu [20] presented a pretrained cnn-based alexnet architecture employing the transfer learning technique deployed for covid-19 identification. the effective features generated using the relief feature selection process were classified using the svm method at all layers of the architecture. finally, al-ani and al-ani [21] reviewed a number of studies on the subject of covid-19 disease based on a variety of important criteria, including the topic, the applied method, the applied database, the researchers by country, and the search by country. the findings showed that the majority of research publications supported the claim that coronavirus attacks the human respiratory system. the rest of this work focuses on expanding and improving the method of applying digital image processing to improve the rate of covid-19 identification performance using cxr images. the effectiveness of the proposed approach is therefore validated based on different metrics.

3. material and methodology
in this paper, an attempt was made to develop a system for the identification and diagnosis of covid-19 as shown in fig. 1. to start, all cxr images were cropped to extract only regions of interest (roi) and resized in the preprocessing step. in the second step, the rgb color input images were converted into the ns domain for all three membership subsets. afterward, the approach divided the ns images into training and testing sets in the ratio of 80:20, and the system then used resnet-50 to classify the cxr images. finally, the recommended system's performance was evaluated using a variety of well-known metrics including accuracy, sensitivity, specificity, precision, f-score, and matthews correlation coefficient (mcc) rates [22]. the details of the proposed method are presented in the subsequent subsections.

fig. 1. general architecture of the proposed framework.

3.1. covid-19 database
the structure of the database is the primary stage in any computerized technique. therefore, a database was created based on the covid-19 radiography database, which is a publicly available database. the database consists of 21165 cxr images, of which 10,192 belong to normal, 3616 correspond to covid-19 positive, 6012 belong to lung opacity (non-covid lung infection), and 1345 are labeled as viral pneumonia cases. the data for this paper include 7232 cxr images (fig. 2), 3616 of which have a positive covid-19 diagnosis and 3616 negatives randomly selected to create a balanced database.

3.2. preprocessing
image preprocessing refers to the steps performed to prepare images before they are used in model training and validation. image data preprocessing is the process of converting image data into a format that machine learning algorithms can understand. it is widely used to increase the model accuracy while also minimizing its complexities. image data are preprocessed using a variety of procedures.
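as a concrete illustration of the cropping-and-resizing step described next, a minimal opencv sketch is given below; the bounding-box coordinates and file name are hypothetical placeholders, since the paper does not state how the box is obtained:

```python
import cv2

def preprocess_cxr(path, bbox, size=(256, 256)):
    """crop a chest x-ray to the region of interest and resize it.

    bbox is a hypothetical (x, y, width, height) tuple marking the roi;
    the text describes bounding-box cropping but not how the box is found.
    """
    image = cv2.imread(path)              # read the raw cxr image
    x, y, w, h = bbox
    roi = image[y:y + h, x:x + w]         # drop the unwanted background
    return cv2.resize(roi, size)          # fixed 256 x 256 network input

# usage with placeholder values
# prepared = preprocess_cxr("covid_0001.png", bbox=(40, 30, 180, 200))
```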
therefore, in this step, the bounding box cropping approach is computed to extract only the roi by removing the unwanted background from the input image. before importing the input cxr images into the proposed framework, the cropped cxr images are resized into a fixed size of 256 × 256 pixels.

3.3. image in the ns domain
neutrosophy is a field of philosophy founded in 1980 by f. smarandache, which broadened dialectics and investigated the genesis, nature, and extent of neutralities and their interactions with various ideational spectrums. advanced image processing includes image transformation into the ns domain, which covers three areas: background subtraction for foreground objects, edge detection for boundary objects, and background detection for background objects. according to the theory of neutrosophy, each event has a specific degree of truth (t), falsity (f), and indeterminacy (i), all of which must be taken into account separately. the ns truth membership degree displays all the true parts of the image in percentage; the result is called the true image. the ns falsity membership degree presents the incorrect parts of the image and becomes an independent image separate from the other parts. the ns indeterminacy membership degree, which contains the least information of the original image, refers to the uncertain parts of the image [23], [24]. an m × n matrix represents the image as a mathematical object (spatial domain). a pixel p(i, j) in the image domain is translated into the ns domain by calculating p_ns(i, j) = {t(i, j), i(i, j), f(i, j)} using equations (1)-(3), where t(i, j), i(i, j), and f(i, j) are taken as the probabilities [25] that pixel p(i, j) belongs to the white set (object), the indeterminate set, and the non-white set (background), respectively (fig. 3):

t(i, j) = \frac{\bar{g}(i, j) - \bar{g}_{min}}{\bar{g}_{max} - \bar{g}_{min}}   (1)

i(i, j) = \frac{\delta(i, j) - \delta_{min}}{\delta_{max} - \delta_{min}}   (2)

f(i, j) = 1 - t(i, j) = \frac{\bar{g}_{max} - \bar{g}(i, j)}{\bar{g}_{max} - \bar{g}_{min}}   (3)

where:
• g(i, j) indicates the pixel intensity value of the image.
• t, i, and f are the true, indeterminacy, and false sets, respectively, in the ns domain.
• \bar{g}(i, j) is the local mean value of g(i, j).
• \delta(i, j) is the homogeneity score of t at (i, j), defined as the absolute value of the difference between the intensity value g(i, j) and its local mean value \bar{g}(i, j).

fig. 2. examples of cxr images: (a) covid-19 and (b) normal.
fig. 3. convert image from spatial domain to ns domain: (a) original image (covid-19), (b) f-domain of covid-19 image, (c) i-domain of covid-19 image, and (d) t-domain of covid-19 image.

3.4. deep residual neural network (resnet-50) model
dl is a machine learning method for learning representations that employs artificial neural networks. there are three sorts of machine learning techniques: supervised, semi-supervised, and unsupervised. using dl, a computer model learns to do classification tasks directly from images, textual, or numeric data. deep feature extraction with pre-trained networks such as alexnet, vgg16, vgg19, googlenet, resnet18, resnet-50, resnet-101, inceptionv3, inceptionresnet-v2, densenet-201, xceptionnet, mobilenetv2, and shufflenet is usually employed for classification tasks [4].
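before turning to the network itself, the per-pixel conversion defined in equations (1)-(3) can be illustrated with a short numpy/scipy sketch; this is only a minimal reading of the published formulas, and the averaging window used for the local mean is an assumption, since the paper does not state it:

```python
import numpy as np
from scipy.ndimage import uniform_filter

def to_ns_domain(gray, window=5):
    """map a grayscale image to the neutrosophic t, i, f membership images.

    implements equations (1)-(3): t from the normalized local mean,
    i from the normalized homogeneity |g - local mean|, and f = 1 - t.
    the 5x5 averaging window is an assumption, not a reported value.
    """
    g = gray.astype(np.float64)
    g_bar = uniform_filter(g, size=window)          # local mean of g(i, j)
    delta = np.abs(g - g_bar)                       # homogeneity score

    t = (g_bar - g_bar.min()) / (g_bar.max() - g_bar.min() + 1e-12)
    i = (delta - delta.min()) / (delta.max() - delta.min() + 1e-12)
    f = 1.0 - t
    return t, i, f
```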
in this research, resnet-50, which is a better version of cnn, was utilized as the basic model in the architectural proposed design to classify covid-19 and normal patients’ cxr images. the model was pre-trained on the imagenet database for object detection. resnet uses shortcuts between layers to reduce interference, which occurs as the network size increases in depth and complexity. with softmax activation, the network ends with a 1000 fully connected layer. there are a total of 50 weighted layers with 23,534,592 trainable parameters [26], [27]. the imagenet database was used to train resnet-50, a collection of over 14 million images divided into over 20,000 categories designed for image recognition competitions [8]. 4. experimental results and discussion the critical objective of the proposed framework is to classify cxr images into normal or covid-19. in this section, resnet-50 was trained on a dataset containing 7232 images as a benchmark and applied to each subset, namely, cxr images in the ns domain, by randomly dividing the database into an 80% training set and a 20% testing set. the proposed method was implemented for covid-19 diagnosis using the matlab r2020a programming language on a windows 10 computer with an intel core i7 processor and 16 gb of ram. the adam optimizer was used for weight updates, a 1e-4 learning rate, and five epochs; each stage uses the same minibatchsize. this method converts images to ns domains with different epochs used for each domain. as figs. 4-6 depicted the accuracy and loss curves for the three domains. in addition, experimentations were executed comprehensively to evaluate the performance of the proposed framework in terms of confusion matrix measurements, in particular, the accuracy, sensitivity, specificity, precision, f-score, and mcc rates. a confusion matrix is a table showing how an algorithm classifies data. the structure of the confusion matrix is divided into four parts, positive true, positive false, negative true, and negative false as shown in fig. 7. as a result, the images of the database were evaluated in the ns domain with different confidence values for selected models, resnet-50 model, which was trained with five epochs, epoch 5, epoch 10, epoch 15, epoch 20, and epoch 25. finally, we calculate the average of each domain and compared with the average of other domains. the model achieved an overall accuracy of about 98.05% in the f-domain on the testing set, as shown in table 1. furthermore, the highest sensitivity rate acquired in the t-domain had a value of 98.01%, as illustrated in table 2. fig. 4. the accuracy and loss curves of the suggested model that resulted from the t-domain. fig. 5. the accuracy and loss curves of the suggested model resulted from the f-domain. fig. 6. the accuracy and loss curves of the suggested model that resulted from the i-domain. rebin, et al.: covid-19 classification 16 uhd journal of science and technology | july 2022 | vol 6 | issue 2 the tables illustrated the model performance evaluation based on specificity, precision, f-score, and mcc, as shown in tables 3-6. first, the specificity scale of the model yields the best result in the f-domain by scoring 98.54% and ultimately outperforming other parts (table 3). it was found that f-domain improved the highest average prediction specificity and precision to reach 98.54%, whereas the average specificity and precision of the f-domain was the lowest, scoring of 92.15% and 92.42%, respectively (tables 3 and 4). 
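as a side note on reproducibility, the training configuration described earlier in this section (imagenet-pretrained resnet-50, an 80:20 split, the adam optimizer with a 1e-4 learning rate, and five epochs) corresponds roughly to the keras sketch below; the authors implemented the system in matlab r2020a, so this is only an illustrative equivalent, and the frozen backbone and two-unit softmax head are assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_covid_classifier(input_shape=(256, 256, 3)):
    """illustrative resnet-50 transfer-learning model for covid-19 vs. normal."""
    backbone = tf.keras.applications.ResNet50(
        include_top=False, weights="imagenet",
        input_shape=input_shape, pooling="avg")
    backbone.trainable = False                      # reuse the pretrained features

    outputs = layers.Dense(2, activation="softmax")(backbone.output)
    model = models.Model(backbone.input, outputs)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    return model

# model = build_covid_classifier()
# model.fit(train_images, train_labels, epochs=5)   # five epochs, as in the text
```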
furthermore, the same fact has been determined to classify covid-19 and regular patients cxr images by examining other performance measures (f1-score and mcc) to assess the proposed framework. the outcomes presented that the f-domain reached the maximum f1-score and mcc rates of 98.06% and 96.12% performing resnet-50, respectively (tables 5 and 6). from the above experimental results, it is clearly evident that the domain of f has better results in all measures except sensitivity, which effectively discriminates covid-19 cases from regular patients cxr images more precisely, which may help doctors make a precise diagnosis depending on their clinical specialists as well as the recommended platform as a proper diagnosis tool. using the same database and computational environment, the performance of the recommended scenarios was also tested using the misclassification error rate metric. as confirmed in fig. 8, the misclassification error rates for the recommended scenarios were calculated. the consequences verified that the resnet-50 falsity domain results in a minor misclassification error of 1.95% rate, which approved that the proposed scenario outperforms all other scenarios by a significant margin. therefore, this scenario was considered as a potential classification method for covid-19 cxr images. finally, the proposed system’s performance was compared to some state-of-the-art techniques, as shown in table 7. table 3: evaluation specificity over all epochs ns domain number of epoch ep5 ep10 ep15 ep20 ep25 average (%) i 92.71 89.84 92.41 92.23 93.54 92.15 t 98.60 97.84 97.17 97.33 97.43 97.67 f 97.82 99.08 97.42 98.95 99.43 98.54 ep5: epoch 5, ep10: epoch 10, ep15: epoch 15, ep20: epoch 20, ep25: epoch 25, the best result per row is highlighted in bold table 1: evaluation accuracy of overall epochs ns domain number of epoch ep5 ep10 ep15 ep20 ep25 average (%) i 87.07 89.21 90.49 89.83 92.32 89.78 t 97.68 97.44 98.34 97.68 97.96 97.82 f 97.51 97.99 98.31 98.55 97.89 98.05 ep5: epoch 5, ep10: epoch 10, ep15: epoch 15, ep20: epoch 20, ep25: epoch 25, the best result per row is highlighted in bold table 2: evaluation sensitivity of overall epochs ns domain number of epoch ep5 ep10 ep15 ep20 ep25 average (%) i 83.61 88.76 88.81 87.70 91.22 88.02 t 96.83 97.06 99.57 98.05 98.54 98.01 f 97.23 96.96 99.23 98.15 96.46 97.61 ep5: epoch 5, ep10: epoch 10, ep15: epoch 15, ep20: epoch 20, ep25: epoch 25, the best result per row is highlighted in bold table 4: evaluation precision of overall epochs ns domain number of epoch ep5 ep10 ep15 ep20 ep25 average (%) i 93.08 89.90 92.74 92.67 93.71 92.42 t 98.62 97.86 97.10 97.30 97.37 97.65 f 97.82 99.10 97.37 98.96 99.45 98.54 ep5: epoch 5, ep10: epoch 10, ep15: epoch 15, ep20: epoch 20, ep25: epoch 25, the best result per row is highlighted in bold table 5: evaluation of f-score over all epochs ns domain number of epoch ep5 ep10 ep15 ep20 ep25 average (%) i 87.84 89.28 90.71 90.11 92.43 90.07 t 97.71 97.45 98.32 97.67 97.94 97.82 f 97.52 98.02 98.29 98.55 97.93 98.06 ep5: epoch 5, ep10: epoch 10, ep15: epoch 15, ep20: epoch 20, ep25: epoch 25, the best result per row is highlighted in bold fig. 7. confusion matrix of the proposed model. 
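since every reported measure is derived from the four confusion-matrix counts in fig. 7, a small helper like the following (a sketch, not the authors' code) makes the definitions explicit:

```python
import math

def classification_metrics(tp, fp, tn, fn):
    """compute the evaluation measures used in the tables from raw counts."""
    accuracy    = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)                    # recall / true positive rate
    specificity = tn / (tn + fp)
    precision   = tp / (tp + fp)
    f_score     = 2 * precision * sensitivity / (precision + sensitivity)
    mcc = ((tp * tn - fp * fn) /
           math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return {"accuracy": accuracy, "sensitivity": sensitivity,
            "specificity": specificity, "precision": precision,
            "f_score": f_score, "mcc": mcc}

# example with hypothetical counts for a 20% test split
# print(classification_metrics(tp=710, fp=11, tn=712, fn=13))
```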
rebin, et al.: covid-19 classification uhd journal of science and technology | july 2022 | vol 6 | issue 2 17 table 6: evaluation of mcc over all epochs ns domain number of epoch ep5 ep10 ep15 ep20 ep25 average (%) i 75.21 78.51 81.10 79.80 84.70 79.86 t 95.40 94.89 96.71 95.37 95.94 95.66 f 95.03 96.01 96.63 97.10 95.84 96.12 ep5: epoch 5, ep10: epoch 10, ep15: epoch 15, ep20: epoch 20, ep25: epoch 25, the best result per row is highlighted in bold table 7: comparison with the current state-of-art/relevant studies articles techniques database accuracy % afifi et al. [28] cnn-densenet161 11,197 cxr images 91.20 abd elaziz et al. [29] mobilenetv3+aqu 21165 cxr images 92.40 ahmad and wady [30] ct, gwt, and lgip 7232 cxr images 96.18 walvekar and shinde [31] resnet-50 359 cxr images 96.23 apostolopoulos and mpesiana [32] mobilenetv2 1427 cxr images 96.78 proposed resnet-50+ns 7232 cxr images 98.05 the best result per row is highlighted in bold compared to other methods, the proposed system produced excellent outcomes, particularly in average classification accuracy. as a final tool for proposed framework performance evaluation, a comparison has been made with the results obtained from the proposed framework and the results of paper [1] as shown in fig. 9. the experimentations from fig. 9 besides obviously confirmed that the proposed system attained the highest result utilizing resnet-50. this is due to the combined resnet-50, and ns domain approaches that helped the model show higher accuracy. furthermore, the best result was obtained with an overall accuracy of 98.05% compared to the previous studies. 5. conclusion covid-19 is the virus that has demolished the world’s states and placed everyone under massive quarantine. the virus attacked the world’s stability and ushered the world into a new area of instability and chaos. using technology to control the spread of the virus through the detection of infected patients still requires more work. the work undertaken in this paper aims to serve the people and help the covid specialist identify the patients more accurately. the study applied the fundamental principles of the ns set. the regulations include true (t) images, indeterminacy (i) images, and falsity (f) images on the cxr images database that belonged to both covid-19 and regular people. unlike the previous studies, the collected images were transformed into a ns domain trained on the dl technique, and resnet-50 was used as a transfer learning method to train it on the database. as a result, the model scored 98.05% average accuracy, outperforming other accuracy achieved by the previous studies on a similar database. references [1] n. e. m. khalifa, f. smarandache, g. manogaran and m. loey. “a study of the neutrosophic set significance on deep transfer learning models: an experimental case on a limited covid-19 chest x-ray fig. 8. misclassification error in the ns domain. fig. 9. comparison of system performance for a different scenario. rebin, et al.: covid-19 classification 18 uhd journal of science and technology | july 2022 | vol 6 | issue 2 dataset”. cognitive computation, vol. 2021, p. 0123456789, 2021. [2] s. saadat, d. rawtani and c. m. hussain. “environmental perspective of covid-19”. science total environment, vol. 728, p. 138870, 2020. [3] s. p. kaur and v. gupta. “covid-19 vaccine: a comprehensive status report”. virus research, vol. 288, p. 198114, 2020. [4] p. k. sethy, s. k. behera, p. k. ratha and p. biswas. 
“detection of coronavirus disease (covid-19) based on deep features and support vector machine”. international journal of mathematical engineering and science, vol. 5, no. 4, pp. 643-651, 2020. [5] n. sharma, v. jain and a. mishra. “an analysis of convolutional neural networks for image classification”. procedia computer science, vol. 132, no. iccids, pp. 377-384, 2018. [6] s. f. ali, h. el ghawalby and s. a a. “from image to neutrosophic image”. in: neutrosophic sets and system. port fuad, egypt: port said university, faculty of science, department of mathematics and computer science apr. 2015, pp. 1-13. [7] s. lawton. “detection of covid-19 from ct lung scans using”. paper, 2021. [8] a. narin, c. kaya and z. pamuk. “automatic detection of coronavirus disease (covid-19) using x-ray images and deep convolutional neural networks”. pattern analysis applications, vol. 24, no. 3, pp. 1207-1220, 2021. [9] t. ilyas, d. mahmood, g. ahmed and a. akhunzada. “symptom analysis using fuzzy logic for detection and monitoring of covid-19 patients”. energies, vol. 14, no. 21, p. 7023, 2021. [10] v. j. sharmila and j. florinabel. “deep learning algorithm for covid-19 classification using chest x-ray images”. computational and mathematical methods in medicine, vol. 2021, p. 9269173, 2021. [11] s. hira, a. bai and s. hira. “an automatic approach based on cnn architecture to detect covid-19 disease from chest x-ray images”. applied intelligence, vol. 51, no. 5, pp. 2864-2889, 2021. [12] f. saiz and i. barandiaran. “covid-19 detection in chest x-ray images using a deep learning approach”. international journal of interactive multimedia and artificial intelligence, vol. 6, no. 2, p. 4, 2020. [13] a. singh, a. kumar, m. mahmud, m. s. kaiser and a. kishore. “covid-19 infection detection from chest x-ray images using hybrid social group optimization and support vector classifier”. cognitive computation, vol. 2021, p. 0123456789. [14] a. helwan, m. k. s. ma’aitah, h. hamdan, d. u. ozsahin, and o. tuncyurek. “radiologists versus deep convolutional neural networks: a comparative study for diagnosing covid-19”. computational and mathematical methods in medicine, vol. 2021, pp. 5527271, 2021. [15] m. alruwaili, a. shehab and s. abd el-ghany. “covid-19 diagnosis using an enhanced inception-resnetv2 deep learning model in cxr images”. journal of healthcare engineering, vol. 2021, no. 4, pp. 1-16, 2021. [16] v. n. m. aradhya, m. mahmud, d. s. guru, b. agarwal and m. s. kaiser. “one-shot cluster-based approach for the detection of covid-19 from chest x-ray images”. cognitive computation, vol. 13, no. 4, pp. 873-881, 2021. [17] d. ji, z. zhang, y. zhao and q. zhao. “research on classification of covid-19 chest x-ray image modal feature fusion based on deep learning”. journal of healthcare engineering, vol. 2021, pp. 6799202, 2021. [18] p. gaur, v. malaviya, a. gupta, g. bhatia, r. b. pachori and d. sharma. “covid-19 disease identification from chest ct images using empirical wavelet transformation and transfer learning”. biomedical signal processing and control, vol. 71, p. 103076, 2021. [19] t. s. qaid, h. mazaar, m. y. h. al-shamri, m. s. alqahtani, a. a. raweh and w. alakwaa. “hybrid deep-learning and machinelearning models for predicting covid-19”. computational intelligence and neuroscience, vol. 2021, p. 9996737, 2021. [20] m. turkoglu. “covidetectionet: covid-19 diagnosis system based on x-ray images using features selected from pre-learned deep features ensemble”, applied intelligence, vol. 51, no. 3, pp. 
1213-1226, 2021. [21] m. s. al-ani and d. m. al-ani. “review study on sciencedirect library based on coronavirus covid-19”. uhd journal of science and technology, vol. 4, no. 2, pp. 46-55, 2020. [22] v. bahel, s. pillai. “detection of covid-19 using chest radiographs with intelligent deployment architecture”. in: a. e. hassanien, n. dey, s. elghamrawy, editors. big data analytics and artificial intelligence against covid-19: innovation vision and approach. vol. 78. studies in big data, springer, cham, 2020. [23] s. h. wady, r. z. yousif and h. r. hasan. “a novel intelligent system for brain tumor diagnosis based on a composite neutrosophicslantlet transform domain for statistical texture feature extraction”. biomed research international, vol. 2020, p. 8125392, 2020. [24] o. g. el barbary, r. a. gdairi. “neutrosophic logic-based document summarization”. journal of mathematics, vol. 2021, pp. 9938693, 2021. [25] a. rashno and s. sadri. “content-based image retrieval with color and texture features in neutrosophic domain”. in: 3rd international conference on pattern image analysis ipria, pp. 50-55, 2017. [26] e. rezende, g. ruppert, t. carvalho, f. ramos and p. de geus. “malicious software classification using transfer learning of resnet-50 deep neural network”. in: proceeding 16th ieee international conference machine learning application, pp. 1011-1014, 2017. [27] k. he, x. zhang, s. ren and j. sun. “deep residual learning for image recognition”. 2016 ieee conference on computer vision and pattern recognition, pp. 770-778, 2016. [28] a. afifi, n. e. hafsa, m. a. s. ali, a. alhumam and s. alsalman. “an ensemble of global and local-attention based convolutional neural networks for covid-19 diagnosis on chest x-ray images”. symmetry, vol. 13, no. 1, pp. 1-25, 2021. [29] m. abd elaziz, a. dahou, n. a. alsaleh, a. h. elsheikh, a. i. saba and m. ahmadein. “boosting covid-19 image classification using mobilenetv3 and aquila optimizer algorithm”. entropy, vol. 23, no. 11, pp. 1-17, 2021. [30] f. h. ahmad and s. h. wady. “covid19 infection detection from chest xray images using feature fusion and machine learning”. the scientific journal of cihan university sulaimaniya, vol. 5, no. 2, pp. 10-30, 2021. [31] s. walvekar and s. shinde. “detection of covid-19 from ct images using resnet50”. in: 2nd international conference on communication and information processing, 2020. available from: https://www.ssrn.com/abstract=3648863 [last accessed on 2022 aug 12]. [32] i. d. apostolopoulos and t. a. mpesiana. “covid-19: automatic detection from x-ray images utilizing transfer learning with convolutional neural networks.” physical and engineering sciences in medicine, vol. 43, no. 2, pp. 635-640, 2020. tx_1~abs:at/tx_2:abs~at uhd journal of science and technology | july 2022 | vol 6 | issue 2 117 1. introduction speech is a natural means for people to communicate with one another. automatic speech recognition (asr) is the technique through which a computer can recognize spoken words and understand what they are saying. the asr is the first component of a smart system. it is a method of converting an auditory signal into a string of words, which can then be used as final outputs or inputs in natural language processing. the purpose of asr systems is to recognize human-spoken natural languages. 
asr technology is commonly utilized in computers with speech interfaces, foreign language applications, dictation, handsfree operations and controls, and other features that enable interactions between machines and humans faster and easier than using keyboards [1]. asrs are designed using a variety of methodologies, the most notable of which is the hidden markov model (hmm) and machine learning-based methods, such as artificial neural networks (anns) and convolutional neural network (cnn) [2]. increasing the accuracy and efficiency of these systems is one of the issues that still exist in this sector. deep learning, a relatively new technology, has been widely employed to address this issue. because an audio signal is a sample of sequential data, meaning its present value is reliant on all past values, in this work rnn (gru) applied in addition with cnn. rnn is a type of artificial neural network. it involves a sequential data connection with the hidden neurons. it can be applied for the applications of text, audio, and video. it deals with sequential data from the kurdish speech to text recognition system based on deep convolutional-recurrent neural networks lana sardar hussein, sozan abdulla mahmood department of computer science, college of science, university sulaimanyah, sulaimanyah, kurdistan region, iraq a b s t r a c t in recent years, deep learning has had enormous success in speech recognition and natural language processing. in other languages, recent progress in speech recognition has been quite promising, but the kurdish language has not seen comparable development. there are extremely few research papers on kurdish speech recognition. in this paper, investigated gated recurrent units (grus) which is one of the popular rnn models to recognize individual kurdish words, and propose a very simplified deep-learning architecture to get more efficient and high accuracy model. the proposed model consists of a combination of cnn and gru layers. the kurdish sorani speech kss dataset was created for the speech recognition system, as its 18799 sound files for 500 formal kurdish words. finally, the model proposed was trained with collected data and yielded over %96 accuracy. the combination of cnn an rnn (gurs) for speech recognition achieved superior performance compared to the other feed-forward deep neural network models and other statistical methods. index terms: deep learning, gated recurrent units, kurdish speech recognition, convolutional neural network corresponding author’s e-mail: lana sardar hussein, department of computer science, college of science, university of sulaimanyah, sulaimanyah, kurdistan region, iraq. lana.salih@univsul.edu.iq received: 29-04-2022 accepted: 07-09-2022 published: 18-11-2022 access this article online doi: 10.21928/uhdjst.v6n2y2022.pp117-125 e-issn: 2521-4217 p-issn: 2521-4209 copyright © 2022 hussein and mahmood. this is an open access article distributed under the creative commons attribution non-commercial no derivatives license 4.0 (cc by-nc-nd 4.0) o r i g i n a l r e s e a r c h a r t i c l e uhd journal of science and technology hussein and mahmood: kurdish speech recognition 118 uhd journal of science and technology | july 2022 | vol 6 | issue 2 analyzed the sequence at each time depending on the previous time in a directed cycle. lstm units and gated recurrent units (grus) are variations type of rnn. thus, recurrent neural networks (rnns) are employed for processing speech signals [3]. 
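to make this sequential dependence concrete, a minimal keras snippet shows a gru layer consuming a framed audio signal, where the output at each step depends on all earlier frames; the shapes here are illustrative assumptions, not the paper's configuration:

```python
import numpy as np
import tensorflow as tf

# one utterance framed into 100 time steps with 40 features each (illustrative)
frames = np.random.rand(1, 100, 40).astype("float32")

gru = tf.keras.layers.GRU(64, return_sequences=True)
outputs = gru(frames)        # shape (1, 100, 64); step t sees frames 0..t
print(outputs.shape)
```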
kurdish is an indo-iranian branch of indo-european languages that are spoken by about 40 million people in western asia, primarily in iraq, turkey, iran, syria, armenia, and azerbaijan [3]. kurdish contains several dialects, as well as its grammatical system and extensive vocabulary [4], [5]. central kurdish (also known as sorani) and northern kurdish are the two most widely spoken dialects of kurdish (also called kurmanji). zazaki and gorani are two further dialects spoken by smaller groups (also known as hawrami). kurmanji is the kurdish language spoken in northern kurdistan (in turkey, syria, and northern iraq) and written in the latin (roman) alphabet; it is also supported by google speech recognition. the sorani dialect is spoken primarily in the southeast, including iran and iraq, and is written in a modified variant of the arabic alphabet. there is no data for the sorani dialect in google speech recognition [6]. in [7] mention that, kurdish is hampered by a lack of resources to support its computational processing needs. only a few attempts to develop voice recognition resources for the kurdish language have been made thus far, necessitating the creation of a dataset for their research. the major contribution of this work is design and implementation of a straightforward hybrid speech to text model for kurdish (sorani) that comprises three cnn layers and three (grus) layers, this combination in the proposed model architecture produced results that were more accurate. the rest of this paper is organized as follows: section two reviews the related works. section three is the data collections workflow. section four presents the model architecture and proposed method. in section five, results are discussed, and finally, the conclusion is in section 6. 2. literature review few attempts have been made to recognize kurdish speech, this review focused on first: those papers in low resources languages (arabic, persian and kurdish), and how audio datasets are built/collected with. second: cnn and rnn techniques used for recognition is concerned. kurdish character recognition has received some recent research such as [8]-[10]; however, our work on speech recognition is still in its early stages. the first attempts for kurdish speech recognition in [7] which presents a dataset extracted from sorani kurdish texts from grades one to three of primary school in iraq’s kurdistan region. the first attempts for kurdish speech recognition in [7] which presents a dataset extracted (bd-4sk-asr) from sorani kurdish texts from grades one to three of primary school in iraq’s kurdistan region, which contains 200 sentences. using cmusphinx to create asr, narrated by a single speaker using audacity software at a sampling rate of 16000 and a 16-bit rate mono single channel. after that, another attempts for kurdish language arise in [11] created a dataset for their work in kurdistan, iran, and used kalditoolkit to develop the identification engine with sgmm and dnn algorithms for the aquatic model. the authors presented wer of jira asr system for different topics (sgmm model trained and evaluated by office data) which are (general: 4.5%, sport: 10.0%, economic: 10.9, conversation: 11.6%, letter: 11.7%, politics: 13.8%, social: 15.3%, novel: 16.0%, religious: 16.2%, scientific/technology: 17.1%, and poet: 25.2). for isolated word recognition, some arabic papers been reviewed. in [12] proposed, an arabic digit classification system using 450 arabic spoken digits. 
based on a speakerindependent system, the accuracy was around 93%, the system is based on combining wavelet transform with linear prediction coding lpc method to extract the feature and the probabilistic neural network pnn for classification. the work by [13] employed sphinx technologies to recognize solitary arabic digits with data provided by six different speakers. the system achieved an 86.66%-digit recognition accuracy, examine the use of a distributed word representation and a neural network for arabic speech recognition. furthermore, the neural network model allows for robust generalization and improves the ability to combat data sparseness. the inquiry approach also comprises a variety of neural probabilistic model configurations, an n-gram order parameter experiment, output vocabulary, normalization method, model size, and parameters. the experiment was carried out on arabic news and discussion broadcasts. then, in [14] utilized an lstm neural network for frame-wise phoneme classification on the farsdat data set, and in [15], they employed a dlstm with a ctc output layer for persian phoneme recognition on the same data set. the rest of this review focused on papers that used cnn, rnn, and gru or combining of these techniques. hussein and mahmood: kurdish speech recognition uhd journal of science and technology | july 2022 | vol 6 | issue 2 119 in [16] a significant study was reported. the authors utilized a deep recurrent neural networks (rnn) model that was endto-end with appropriate regularization. on the timit phoneme recognition benchmark, they found that rnn, namely, long short-term memory (lstm), had a test error of 17.7%. there are, nevertheless, several studies underway to build computational tools for the kurdish language. in [15] collects a tiny corpus named corpus of contemporary kurdish newspaper texts (ccknt), which contains 214k northern kurdish dialect terms. pewan text corpus for central kurdish and northern kurdish was collected from two online news organizations. the pewan corpus contains around 18 million tokens for the central kurdish dialect and approximately 4 million tokens for the northern kurdish dialect. this corpus serves as a validation set for information retrieval applications [17]. in [18], they offered a speech-to-text conversion strategy for the malayalam language that employs deep learning techniques. for the training, the system is looking at 5–10 solitary words. mel-frequency cepstral coefficients are acquired for the preprocessing phase. hmm is used to identify the speech and training after the preprocessing, syllabification, and feature extraction procedures. the lstm was used to construct a speech recognition system based on ann. the system has a 91% accuracy. a recurrent neural network approach called lstm to distinguish individual bengali words was used in [19] the model is a two-layer deep recurrent neural network with 100 lstm cells in each layer, 30 unique phonemes are detected, the last layer is a softmax output layer with 30 units, and the data set was used with a total of 2000 words. fifteen different male speakers contributed to the audio speeches. making a 75:12.5:12.5 split of the dataset for training, validation, and testing purposes. the test run yielded a phoneme detection error rate of 28.7% and a word detection error rate of 13.2%. 
in [20] revised standard grus for phoneme recognition purposes and proposed that li-gru architecture is a simplified version of a standard gru, in which the reset gate is removed and relu activations are considered, this research worked with (timit, dirha, chime, ted) corpus li-gru outperforms gru in all the considered noisy environments, with achieving higher performance the bus (bus) environment (the noisiest) relative improvement of 16% against the relative improvement of 9.5% observed in the street (str). wer % calculated for dirha corpus in real part for (mfcc = 27.8, fbank = 27.6, fmllr = 22.8). the study in [21] employed lstm and two datasets in this project for arabic speech recognition. the 1-digit dataset consists of 8800 tokens with a sampling rate of 11025 hz, and it was created by asking 88 arabic native speakers to repeat all digits 10 times. 2-tv command dataset: 10000 tokens for 10 tv commands at a sampling rate of 16000 hz are contained in this dataset; finally, the author reached over 96% accuracy. the author proposed four different model structures for speech emotion recognition in [22], which are model-a (1d cnns-fcns), model-b (1d cnns-lstm-fcns), model-c (1d cnns-gru-fcns), and ensemble model-d, which combines model-a, model-b, and model-c, adding lstm, and gru after cnn blocks in models b and c results in increased accuracy (tess) in total, there are 2800 audio files with 200 target words. 2-ravdess audio files have a resolution of 1440 pixels and a sample rate of 48 khz. 3-savee has a total of 1920 samples. 4-emo-db berlin it contains 535 german-language audio recordings. crema-d is the fifth step in the crema-d process. it makes use of 7442 records. 3. data collection this section will discuss how to collect data and which type of data should be collected. those data were gathered through official administration papers in the university of sulaimani college of science, which totaled 500 different words and were collected by 30 speakers, 13 are female and 17 are male. 3.1. kurdish sorani speech (kss) dataset as mentioned before, there is no available dataset in kurdish sorani, which lead us to make our dataset for this research work here are the details of workflows, choosing individual words from the governmental worksheet, and arranging them in 30 to 50 words in each paragraph. four hundred words were read by 30 volunteers and the last 100 words were read by two different male and female readers; the total number of words reached 500 words. there were 30 volunteers in total, with 14 males and 16 females, 9 from family, and 21 from universities. the volunteers’ ages ranged from 20 to 40. the volunteer was asked to read the paragraph as individual words, which means between each word makes silence for at least 0.2 s. some speakers were asked to read each paragraph 1 time, which leads to 2–3 min’ duration of each file, but some others were asked to read each same word 3–5 times the duration of these files reached 5–7 min. hussein and mahmood: kurdish speech recognition 120 uhd journal of science and technology | july 2022 | vol 6 | issue 2 3.2. recording circumstance kss data sets were collected in two environments office and home with two recording devices that table 1 – indicates all information needed for data collection. the dataset consists of 500 words of numeric numbers and formal words. 
however, from “١١” in sorani reading in latin reading “yazde” or “yanze” which ”یانزە‘ or ’یازدە“ is eleven in numeric number, to “٢٠” in sorani reading in latin reading “bîst” which is twenty numeric ”بیست“ number, and also for (٣٠،٤٠،٥٠،٦٠،٧٠،٨٠،٩٠،١٠٠،١٠٠٠ ،١٠٠٠٠٠٠) as the same above for each numeric number, in total 41 tokens, the rest of 461 words containing formal words, weekdays, kurdish month name, kurdish pronoun, prefixes, and suffixes, and there are some words in the kurdish language used to join two same or different words or sentences like (لە، بۆ، یان، ی، کە، و ، وە), table 2 shows part of the dataset. 3.3. recording technology this section will discuss the property of the application that is used for recording speech and the recording conditions being explained. for recording sounds in an office environment, audacity was utilized since it is a free, easy-to-use, multi-track audio editor and recorder for windows, mac os, gnu/linux, and other operating systems that can also export sound files as mp3s (mp3, ogg, and wav). using this application, need to set up some recording conditions, that could work properly with the deep learning model, these conditions are described below. using both mono recording channel and stereo recording channel and sample rate of 16000 hz as it is near to the normal human sound and also using 44100 hz after converting it to 16000 hz. 3.4. data preprocessing after collecting data as mentioned before, the following steps for data preprocessing. 1. splitting individual words from sound files and saving each one as a new.wav file about the approximately 1-s duration for each one, using a model called “pydub” that can work with audio files, this library can play, split, merge, and edit.wav audio files. 2. after making chunks of the dataset facing many challenges, one of the challenges is appearing some sounds during recording like breathing in loud sounds “uhhh” the program treats as a separate word, should listen to each sound file carefully and discard these as an un-speech sound. furthermore, some speakers make “umm” or “uh” sounds before or after reading words; in this case, this part has been removed from the entire speech, as a result, it does not affect the dataset. 3. during splitting sound into small chunks (separate words), these small files are read by a package for audio analysis like music generation and asr. improves building blocks necessary to create music information retrieval system. mainly retrieves numerical numpy array, which represents sound data. moreover, sampling rate sr is the number of samples taken per second. by default, samples the file at a sampling rate of 22050 hz, this sample rate could be overridden to any desired sr (8000, 11025, 16000, 22050, 32000, 44100, etc.) hz. sr = number of samples per second. taking from a continuous signal to make a discrete or digital signal, choosing a 16000 sample rate. 4. in fig. 1, easily note that the speech part is ended in 1.2 s and mentioned before in challenge two the 0.2 s, after table 1: the information on data collection title value dataset name kurdish sorani speech office recording device model laptop: dell (latitude e5450) lenovo (20ars0yb08) no. of speakers 30 no. of isolated words 501 no. 
of recorded sound 18,799 sound files (utterance) frequency 16000 hz and 44100 hz recording channels mono and stereo sound files format ms wav (.wav) sampling resolution 32-bit table 2: samples of dataset sorani reading style latin reading style meaning in english سفر sifr zero یەک yek one دوو dû two سێ sê three چوار çwar four پێنچ pênç five شەش şeş six حەوت hewt seven هەشت heşt eight نۆ no nine ەد de ten زانكۆ zanko university خوێندكار xwendkar student یاریده ده ر yaridadar assistant پزیشكی pzishky medicine ئە نجومە ن anjuman council زانست zanst science hussein and mahmood: kurdish speech recognition uhd journal of science and technology | july 2022 | vol 6 | issue 2 121 1.0 s will be removed and the speech will be unclear. in this case, should remove the silence part from the beginning of the speech, but if the silent part is after the speech, do nothing. the duration of the speech signal is 1.0 s, after that time, it does not matter and fixed it before in challenge two. 5. in some cases, when the word contains two or more sections like “هاوپێچ” which is mean “attach” the reader read it separately, the program treats it as a separate word which makes it mean less “هاو“ ”پێچ,” fixed this problem manually by combining these sections. 6. in opposite to challenge four, in some cases, readers read two different words without any silence between them, and the program selected it as one word, also fixed it manually by separating them like “کوردستان ”هەرێمی meaning “kurdistan region,” as shown in fig. 2. 1. the format of sound data retrieved from viber was. m4a which is not supported by audacity, should convert to a.wav file and also change its sampling rate from 48000 hz to 16000hz the size of the file reduced for instance a file size of 7.60 mb becomes 3.8 mb. 2. to prepare the data for fitting into a model, down sampling was applied to the recorded sound files, resulting in 8000 samples per word. 3.5. data augmentation (da) da is the method of applying minor modifications to our original training dataset to produce new artificial training samples. there are many types of da which are time wrapping, frequency masking, time masking, noise reduction, etc. as in [23] used additive white gaussian noise, pitch shifting, and stretching of the signal level. in the proposed work, since the number of speech utterance records in each class is relatively low, this study performs one type of audio da, which is noise reduction. 4. methodology both cnn and rnn have widely used in the speech recognition area and approved satisfactory results, for the kurdish language using these models was challenging. this study proposed four different architecture models, fig. 1. the speech signal (a) with silence part, in the beginning, (b) with the removed silent part, and (c) with silence part in the end. ba c fig. 2. separate speech signal to individual words, (a) single chunk with two words, (b) separate first word, and (c) separate second word. b c a hussein and mahmood: kurdish speech recognition 122 uhd journal of science and technology | july 2022 | vol 6 | issue 2 model generalizations, changing hyper parameters like batch size [24], and optimizers. related to the study result concluded with cnn lower accuracy, then decided to go with a combination of cnn with rnn (gru) to seek higher accuracy than the first model architecture. 4.1. cnn model the initial model architecture, the cnn model, was used to train and test the kss dataset. the model comprises four cnn layers, as shown in fig. 
3; each of them consists of three basic layers. the first is the convolutional layer, which extracts features from the input data by performing mathematical operations between the input data and a filter of a specific size (8 × 8, 16 × 16, 32 × 32, and 64 × 64 for the four cnn layers, respectively). the second is a pooling layer, which usually follows the convolutional layer; its main purpose is to reduce the size of the convolved feature map and thereby lower the computational cost. the third addresses overfitting, which happens when a model fits the training data so closely that its performance degrades on new data. a dropout layer is used to mitigate this problem: a fraction of the neurons is removed from the network during training, resulting in a smaller effective model. with a dropout rate of 0.3, 30% of the nodes in the network are dropped at random. 4.2. rnn (gru) rnns suffer from the vanishing gradient problem during back-propagation. gradients are the values used to update the weights of a neural network; when a gradient shrinks as it is back-propagated through time, it eventually falls below a threshold at which it no longer contributes much to learning. because those layers stop learning, rnns tend to forget what they have seen in longer sequences, resulting in short-term memory. lstms and grus were developed as a solution to this short-term memory: they have built-in mechanisms known as gates that control the flow of information. the gru is a recurrent (rnn) gating mechanism first introduced in [25]. as shown in fig. 4, the gru is similar to an lstm with a forget gate, but it has fewer parameters and lacks an output gate. its performance in polyphonic music modeling, speech signal modeling, and natural language processing has been found comparable to that of an lstm, and on certain smaller and less frequent datasets grus have been shown to perform better [26]. 4.3. training models generalization is the ability of a model to adapt to new, previously unseen data drawn from the same distribution as the data used to build the model. for instance, adding or removing layers in the current model changes the accuracy of the system. table 3 presents four different architecture models; based on its results, model (3) was chosen as the best model architecture. the detailed layers of the proposed architecture are shown in fig. 5. 5. results and discussion in this section, the results of our experiments are presented for two different models: the first is cnn and the second is cnn+rnn (gru), as described in table 3. table 4 shows the accuracy obtained for each batch size; for the cnn model architecture, the best batch size is 16. to carry out a thorough investigation and achieve a fair comparison between the varied systems, the research uses 10:90 and 20:80 splits for testing and training, respectively, as a way of assessing the proposed systems, and the results can then be averaged to compute a single estimate. this is particularly important when carrying out experiments with limited data sources, and the point of this experiment was to ascertain how much data should be reserved for testing. the results are presented in tables 5 and 6 for the 10:90 split of the dataset and in table 7 for the 20:80 split. fig. 3. the cnn architecture model.
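to make the architecture of sections 4.1 and 4.2 concrete, the following is a minimal keras sketch of a cnn + bidirectional gru classifier of the kind described (convolutional blocks with max-pooling and dropout of 0.3, followed by three bidirectional gru blocks, batch normalization, and a softmax over the 500 word classes). the framework choice, filter counts, kernel sizes, gru width, dense width, and the 8000-sample input length are illustrative assumptions, not the authors' exact configuration.

```python
# a minimal keras sketch of a cnn + bidirectional gru word classifier of the kind
# described in sections 4.1-4.3. layer sizes and the input length are assumptions.
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 500        # formal kurdish words in the kss dataset
INPUT_SAMPLES = 8000     # roughly 1-second utterances downsampled to 8000 samples

def build_cnn_gru(num_classes=NUM_CLASSES, input_len=INPUT_SAMPLES):
    inputs = layers.Input(shape=(input_len, 1))
    x = inputs
    # three cnn blocks: convolution -> max-pooling -> dropout(0.3)
    for filters in (8, 16, 32):
        x = layers.Conv1D(filters, kernel_size=9, padding="same", activation="relu")(x)
        x = layers.MaxPooling1D(pool_size=4)(x)
        x = layers.Dropout(0.3)(x)
    x = layers.BatchNormalization()(x)
    # three bidirectional gru blocks
    for _ in range(3):
        x = layers.Bidirectional(layers.GRU(64, return_sequences=True))(x)
    x = layers.BatchNormalization()(x)
    x = layers.Flatten()(x)
    x = layers.Dense(128, activation="relu")(x)
    x = layers.Dropout(0.3)(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return models.Model(inputs, outputs)

model = build_cnn_gru()
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```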
hussein and mahmood: kurdish speech recognition uhd journal of science and technology | july 2022 | vol 6 | issue 2 123 after realizing combining three layers of cnn with three layers of rnn which is model, architecture 3 shows a better result among other 4 architectures, as shown in table 5, with %90 of the dataset for training the model and %10 for testing the model, after changing hyper parameters like batch size, also using sgd and adam optimizer, the result shows in tables 6 and 8. as discussed above indicate that both types of optimizer sgd and adam could be used for speech recognition as they show a confident result, adam optimizer reached the result that in batch size (64 and128), shows a better choice as it is accuracy reached (96% and 96%), respectively, which is higher than among batch size (8, 16, and 32) results (92%, 93%, and 72%). on the other hand, the sgd optimizer represents the result in different batch size values which are (16, 32, 64, and 128), but only in (32) reaches the result to (90%). this experiment discovered that the table 3: different types of model architecture with different layers layers model 1 model 2 model 3 model 4 conv_0      maxpooling_0     dropout_0     conv_1     maxpooling_1     dropout_1     conv_2    ✕ maxpooling_2    ✕ dropout_2    ✕ conv_3   ✕ ✕ maxpooling_3   ✕ ✕ dropout_3   ✕ ✕ flatten   ✕ ✕ batch normalization ✕ ✕   gru bidirectional_0,1,2 ✕ ✕   batch normalization ✕ ✕   flatten ✕ ✕   dense_0     dropout     dense_1     dropout     dense_2     dropout  ✕ ✕ ✕ fig. 4. gated recurrent unit (chung et al. 2014). table 4: cnn model batch size and accuracy number batch size accuracy % 8 0.51661 16 0.61993 64 0.58672 77 0.47 99 0.40 table 7: the effect of splitting dataset to 20:80 on accuracy optimizer batch size epochs accuracy % adam 8 62 0.89734 adam 16 59 0.94176 adam 32 40 0.93697 adam 64 49 0.94495 adam 128 50 0.94441 table 5: the accuracy for each model layers model 1 model 2 model 3 model 4 accuracy 0.61 0.88 0.92 0.87 table 6: using different batch size with adam optimizer optimizer batch size epochs accuracy % adam 8 31 0.92074 adam 16 59 0.93738 adam 32 58 0.7278 adam 64 60 0.9601 adam 128 69 0.96436 table 8: using different batch size with sgd optimizer optimizer batch size epochs accuracy % sgd 16 96 0.83989 sgd 32 16 0.90160 sgd 64 29 0.01489 sgd 128 31 0.01755 hussein and mahmood: kurdish speech recognition 124 uhd journal of science and technology | july 2022 | vol 6 | issue 2 table 9: comparison with recent works related to proposed method, dataset, accuracy acheivments author proposed dataset acc. alkhateeb [12] arabic digit classification system using probabilistic neural network pnn. 450 arabic spoken digits 93% arun et al. [18] malayalam speech recognition, the rnn was used. 5–10 solitary words 91% zerari et al. [21] used rnn for arabic speech recognition. consists of 8800 tokens and tv command dataset: 10000 tokens for 10 tv commands 96% proposed system using cnn with rnn (gru) for kurdish word recognition kss dataset that compose of 18799 sound files for 500 formal kurdish words 96% adam optimizer is more proper for speech recognition as getting a high accuracy in almost all the tests than the sgd optimizer. fig. 6 shows the best accuracy for model cnn and rnn (gru). now by changing the splitting dataset to 20:80 for testing: training with batch size (8, 16, 32, 64, and 128), the results show in table 7. 
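the comparison across optimizers, batch sizes, and dataset splits reported in tables 4-8 follows the general pattern sketched below. this is an illustrative loop under assumed names (x and y for the preprocessed utterances and labels, build_cnn_gru from the earlier sketch, and an assumed early-stopping rule), not the authors' actual training script.

```python
# sketch of the evaluation loop implied by tables 4-8: the dataset is split 10:90 or
# 20:80 (test:train), and the same model is trained with adam and sgd at several
# batch sizes. variable names and the early-stopping criterion are assumptions.
from sklearn.model_selection import train_test_split
from tensorflow.keras.callbacks import EarlyStopping

def evaluate_configurations(X, y, build_model, test_size=0.10):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=42)
    results = {}
    for optimizer in ("adam", "sgd"):
        for batch_size in (8, 16, 32, 64, 128):
            model = build_model()
            model.compile(optimizer=optimizer,
                          loss="sparse_categorical_crossentropy",
                          metrics=["accuracy"])
            model.fit(X_train, y_train,
                      batch_size=batch_size,
                      epochs=100,
                      validation_split=0.1,
                      callbacks=[EarlyStopping(patience=5, restore_best_weights=True)],
                      verbose=0)
            _, acc = model.evaluate(X_test, y_test, verbose=0)
            results[(optimizer, batch_size)] = acc
    return results

# e.g. results_10_90 = evaluate_configurations(X, y, build_cnn_gru, test_size=0.10)
#      results_20_80 = evaluate_configurations(X, y, build_cnn_gru, test_size=0.20)
```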
by comparison with recent works, the table 9 indicates the comparison between those papers referenced with proposed method, its dataset and accuracy achieved. 6. conclusion in this paper, implemented speech recognition for the kurdish language as well as created a kss data set for this research purpose. the data set composed of 18799 sound files for 500 formal kurdish words were read by 30 native kurdish speakers. the research work designed different model architectures with different parameters using cnn and cnn+ rnn(gru), the experimental findings indicate that the accuracy of the model increases when three layers of gru are added to three layers of cnn. the accuracy of the cnn model reaches 61%, but after adding gru, the accuracy increases dramatically to 96%, providing us with a clear vision for selecting the desired architecture. in the future, intend to improve the quality of kurdish language materials, as well as utilize state of the art methods such as fig. 5. the proposed architecture cnn_gru model. fig. 6. (a) accuracy for model cnn. (b) accuracy for model cnn and rnn (gru). horizontal line indicates accuracy, while vertical line indicates number of epochs. b a hussein and mahmood: kurdish speech recognition uhd journal of science and technology | july 2022 | vol 6 | issue 2 125 metaheuristic optimizer with deep learning, to improve the performance. references [1] e. morris. “automatic speech recognition for low-resource and morphologically complex languages”. thesis. rochester institute of technology, 2021. [2] s. ruan, j. o. wobbrock, k. liou, a. ng and j. a. landay. “comparing speech and keyboard text entry for short messages in two languages on touchscreen phones”. journal proceedings of the acm on interactive mobile wearable and ubiquitous technologies archive, vol. 1, no.4, pp. 1-23, 2017. [3] m. assefi, m. wittie and a, knight. “impact of network performance on cloud speech recognition”. in: proceedings of the 24th international conference, pp. 1, 2015. [4] m. asseffi, g. liu, m. p. wittie and c. izurieta. “an experimental evaluation of apple siri and google speech recognition”. isca sede montana state university, bozeman, 2015. [5] a. ganj and f. shenava. “2-persian continuous speech recognition software”. in: the first workshop on persian language and computer.  the 9th iranian electrical engineering conference, iran, 2004. [6] f. a. ganj, s. a. seyedsalehi, m. bijankhan, h. sameti, s. zadegan and j. shenava. “1-persian continuous speech recognition system”. in: the 9th iranian electrical engineering conference, 2000. [7] a. qader and h. hassani. “kurdish (sorani) speech to text: presenting an experimental dataset”. arxiv: 1911.13087v1, 2019. [8] r. yaseen and h. hassani. “kurdish optical character recognition”. ukh journal of science and engineering, vol. 2, pp. 18-27, 2018. [9] r. d. zarro and m. a. anwer. “recognition-based online kurdish character recognition using hidden markov model and harmony search eng.” engineering science and technology an international journal, vol. 20, no. 2, pp. 783-794, 2017. [10] a. t. tofiq and j. a. hussain. “kurdish text segmentation using projection-based approaches”. uhd journal of science and technology, vol. 5, no. 1, pp. 56-65, 2021. [11] h. veisi, h. hosseini, m. amini, w. fathy and a. mahmudi. “jira: a kurdish speech recognition system designing and building speech corpus and pronunciation lexicon”. arxiv abs/2102. 07412, 2021. [12] a. alkhateeb. “wavelet lpc with neural network for spoken arabic digits recognition system”. 
jordan journal of applied science, vol. 4, pp. 1248-1255, 2014. [13] n. turab, k. khatatneh and a. odeh. “a novel arabic speech recognition method using neural networks and gaussian filtering”. ijeecs international journal of electrical, electronics and computer systems, vol.  19, pp. 1-5, 2014. [14] s. malekzadeh, m. h. gholizadeh and s. n. razavi. “persian phonemes recognition using ppnet”. arxiv preprint arxiv: 1812.08600, 2018. [15] h. veisi and a. haji mani. “persian speech recognition using long short-term memory”. in: the 21st national conference of the computer society of iran. university of tehran, iran, 2015. [16] a. graves, a. r. mohamed and g. hinton. “speech recognition with deep recurrent neural networks”. in: icassp conference. institute of electrical and electronics engineers, piscataway, 2013. [17] a. r. mohamed, g. dahl and g. hinton. “deep belief networks for phone recognition”. in: nips workshop on deep learning for speech recognition and related applications. . ijca proceedings on national conference, usa, 2009. [18] h. p. arun, j. kunjumon, r. sambhunath and a. s. ansalem. “malayalam speech to text conversion using deep learning”. iosr journal of engineering (iosrjen), vol. 11, no. 7, pp. 24-30, 2021. [19] m. m. h. nahid, b. purkaystha and m. s. islam. “bengali speech recognition: a double layered lstm-rnn approach”. in: procceding 20th institute of communication culture information and technology, pp. 1-6, 2017. [20] m. ravanelli, p. h. brakel, m. omologo and y. bengio. “light gated recurrent units for speech recognition”. ieee transactions on emerging topics in computational intelligence, vol. 2, pp. 92-102, 2018. [21] n. zerari, s. abdelhamid, h. bouzgou and c. raymond. “bidirectional deep architecture for arabic speech recognition”. open computer science, vol. 9, pp. 92-102, 2019. [22] r. ahmed, s. islam, a. k. m. muzahidul islam and s. shatabda1. “an ensemble 1d-cnn-lstm-gru model with data augmentation for speech emotion recognition”. arxiv: 2112.05666, 2021. [23] c. huang, g. chen, h. yu, y. bao and l. zhao. “speech emotion recognition under white noise”. archives of acoustics, vol. 38. pp. 457-463, 2013 [24] i. kandel and m. castelli. “the effect of batch size on the generalizability of the convolutional neural networks on a histopathology dataset”. ict express, vol. 6, no. 4, pp. 312-315, 2020. [25] k. cho, b. v. merrienboer, d. bahdanau and y. bengio. “on the properties of neural machine translation: encoder-decoder approaches”. arxiv: 1409.1259v2, 2014. [26] j. chung, c. gulcehre, k. cho and y. bengio. “empirical evaluation of gated recurrent neural networks on sequence modeling”. arxiv: 1412.3555v1, 2014. tx_1~abs:at/tx_2:abs~at uhd journal of science and technology | july 2022 | vol 6 | issue 2 77 1. introduction in the context of computing, logs are bits of data that give insight into numerous events that occur during the execution of a computer program [1]. infor mation technolog y utilization has increased at an unparalleled rate during the previous two decades. data of various types are shared through a broad range of networks, from company-wide lan networks to public hub wireless access networks. as the transmission and consumption of data through these networks grows, so does number of breaches and network intrusion efforts aimed at obtaining secret and personal information. as a consequence of this, security for networks and data has become a highly significant topic in both the academic and practical computing communities [2]. 
log data comprised more than 1.4 billion logs each day is used to detect suspicious business-specific activities and user profile behavior [3]. a series of devices and software generate log files in dissimilar formats. log files are used by software systems to retain track of their activities. different system part, like os, may record its events to a remote log server. an os is the machine software that controls computer hardware and software resources and permits the execution of multiple applications [4]. the start or end of occurrences or activities of software system, status information, and error information are all captured in the log files. user information, application information, date and time information, and event information are normally included in each log line. when these files are properly analyzed, they may provide important information about numerous characteristics every system. for monitoring, troubleshooting, and problem detection, logs are often gathered [5]. log file analysis based on machine learning: a survey rawand raouf abdalla, alaa khalil jumaa department of  information technology, technical college of informatics, sulaimani polytechnic university, sulaimani, kurdistan region, iraq a b s t r a c t in the past few years, software monitoring and log analysis become very interesting topics because it supports developers during software developing, identify problems with software systems and solving some of security issues. a log file is a computer-generated data file which provides information on use patterns, activities, and processes occurring within an operating system, application, server, or other devices. the traditional manual log inspection and analysis became impractical and almost impossible due logs’ nature as unstructured, to address this challenge, machine learning (ml) is regarded as a reliable solution to analyze log files automatically. this survey tries to explore the existing ml approaches and techniques which are utilized in analyzing log file types. it retrieves and presents the existing relevant studies from different scholar databases, then delivers a detailed comparison among them. it also thoroughly reviews utilized ml techniques in inspecting log files and defines the existing challenges and obstacles for this domain that requires further improvements. index terms: log files, log analysis, machine learning, anomaly detection, user behavior, log file maintenance corresponding author’s e-mail:  rawand.raouf.a@spu.edu.iq received: 16-07-2022 accepted: 07-09-2022 published: 07-10-2022 access this article online doi: 10.21928/uhdjst.v6n2y2022.77-84 e-issn: 2521-4217 p-issn: 2521-4209 copyright © 2022 abdalla and jumaa. this is an open access article distributed under the creative commons attribution non-commercial no derivatives license 4.0 (cc by-nc-nd 4.0) s u r v e y uhd journal of science and technology abdalla and jumaa: log file and machine learning 78 uhd journal of science and technology | july 2022 | vol 6 | issue 2 normally, log files are saved as text, compressed, or binary files. the most commonly used format is text files, which have the gains of utilizing fewer cpu and i/o resources when producing files, allowing for long-term storage and maintenance, and being easy to read and use. the binary format is a machine-readable log file format created by a platform that requires a particular tool to view, making it unsuitable for long-term team storage. 
while compressing log files, an appropriate compression format with multi-platform compatibility should be used for efficient log storage and usage [6]. log files record activity on different kinds of servers, including data servers and application servers. because log files record a range of server actions and include a huge amount of data in the form of server-produced messages, they are often relatively big files. these messages contain valuable knowledge, such as which applications were operating on the server, when they were run, and by whom. a log file contains many messages that indicate various server actions; in addition, each unique message type may have hundreds, if not thousands, of entries in the log file, each with slight variations in its general structure. the system administrator may use these messages for many purposes, such as intrusion detection, reporting, performance monitoring, and anomaly detection. 2. state of the art this survey tries to explore the most recent existing studies on log file analysis using both supervised and unsupervised machine learning algorithms. different models and techniques have been proposed by researchers for different aspects. several studies have applied techniques and models to predict attacks, user behavior, and system failure in order to increase server and system security, support marketing, and decrease failure time. this survey reveals that there are many gaps which require further improvement, such as: using real datasets in creating models; performing more log analysis or mining to obtain meaningful information and minimize false positive and negative results; and the maintenance aspect, which requires further improvement compared to the other mentioned aspects. 3. common log file types almost every network device produces a unique form of data, and each component logs those data in its own log. as a result, there are several types of logs, such as [7]: 3.1. application logs developers have a strong grip on application logs, which may include any kind of event, error message, or warning that the program generates. application logs provide information to system administrators concerning the status of an application running on a server. application logs should be well-structured, event-driven, and include pertinent data to serve as the basis for higher-level abstraction, visualization, and aggregation. the application logs' event stream is required for viewing and filtering data from numerous instances of the programs. 3.2. web server logs every user interaction with the web is saved as a record in a log file called a "web log file," which is a text file with the extension ".txt." the data created automatically from users' interactions with the web are saved in various log files, including server access logs, error logs, referrer logs, and client-side cookies. in the form of a text file, this web log saves each and every web request executed by the client to the servers, and each line or record in the web log file corresponds to a user's request to the servers. the log files for web data are: web server logs, proxy server logs, and browser logs [8]. web server logs typically include the ip address, the date and time of the request, the exact request line provided by the user, the url, and the requested file type. 3.3.
system logs the os records specified events in system log. in addition, these logs are an excellent resource for obtaining information about external events. typically, a system log includes entries generated by the os, such as system failures, warnings, and errors. individual programs may generate log files related with user sessions that include data on the user’s login time, interactions with the application, authentication result, and so on. while an operating system-generated log file is mentioned to as a system log, files produced by particular programs or users are related to as audit data. examples include records of successful and unsuccessful login attempts, system calls, and user command executions [9]. 3.4. security logs security logs are utilized to give enough capabilities for identifying harmful actions after their occurrence with the intention of prevent them from recurrence. security logs preserve track of a range information that has been predefined by system administrators. for example, firewall logs contain information about packets routed from their sources, rejected ip addresses, outbound activity from internal systems, and failed logins. security logs contain detailed information that security administrators must manage, regulate, and evaluate in conformity with their requirements [7]. 3.5. network logs network logs offer different information on various events that occurred on the networks. among the events are the recording of malicious activity, a rise in network traffic, packet losses, and bandwidth delays. network logs may be gathered from a range of network devices, including switches, routers, and firewalls. by monitoring network logs for various attack attempts, network administrators can monitor and troubleshoot normal networking. 3.6. audit logs record any network or system activity that is not allowed in a sequential order. it aids security managers in analyzing malicious activity during an attack. the source and destination addresses, timestamp and user login information are usually the most significant parts of information in audit log files. 3.7. virtual machine logs this log files include details on the instances that are executing on the vm, such as their startup configuration, operations, also the time they complete their execution. the vm logs retain track of many processes, including the number of instances operating on the vm, the execution duration of each application, and application migration, which assists the csp in identifying malicious activity that happened while the attack [10]. figure 1 show some of log file sources. table 2: a list of studies for the purpose of (security) reference no. classification algorithms performance [20] k-means clustering direction accuracy ratio is 83% [21] svr, lr, and knr offers excellent security protection [22] lr, nn, rf, and xg direction accuracy ratio is 85% with 0.78% false positive rate [23] svm direction accuracy ratio is 99% [24] k-means clustering direction accuracy ratio for som#34 is (84.37%) aau is (90.01%) table 3: a list of studies for the purpose of (maintenance) reference no. classification algorithms performance [25] discovering patterns from temporal sequences provides high system performance [26] principal component analysis (pca) provides high system performance table 1: a list of studies for the purpose of  (identifying user behavior) reference no. 
classification algorithms performance
[16]: nn; prediction accuracy is 90%
[17]: (parzen), (gauss), (pca), and (kmc); prediction accuracy of 90% for the daily activity dataset, 65% for the e-mail content dataset, and 75% for the e-mail communication network dataset
[18]: modified span algorithm and the personalization algorithm; provides high prediction accuracy
4. log mining log mining is a technique which employs statistics, data mining, and ml to automatically explore and analyze vast amounts of log data in order to find useful patterns and trends. the patterns and tendencies gleaned can assist in the monitoring, administration, and troubleshooting of software systems [12]. web usage mining (wum) is taken as an example here because web logs are among the most frequently utilized log file types. web mining is separated into three categories: web usage mining, web content mining, and web structure mining. web usage mining is a procedure for capturing web page access data; the paths leading to the viewed web sites are provided by this usage data, which is often collected automatically by the web server and stored in access logs. other important information provided by common gateway interface (cgi) scripts includes referral logs, survey log data, and user subscription information. this area is significant to the overall use of data mining by businesses and institutions and to their data access and web-based applications. three steps must be taken in web usage mining, which are [13]: 4.1. data preprocessing the web logs contain raw data that cannot be used directly to generate information. during this step, engineers apply techniques to transform the original data into a usable format. typically, real-world data are incomplete, inconsistent, and lacking clear behavior or patterns, in addition to containing many mistakes; data preprocessing is a tried-and-true way of addressing these issues. 4.2. pattern discovery the pre-processing results are then used to determine patterns of frequent user access. to identify significant information, several data mining methods, for instance association rules, clustering, classification, and sequential pattern approaches, are used in pattern discovery. the information obtained can be presented in a number of ways, including graphs, charts, and tables. 4.3. pattern analysis the output of the pattern discovery step is not used directly in the analysis. accordingly, during this phase, a strategy or tool is developed to assist analysts in comprehending the knowledge that has been gathered; this phase might involve visualization approaches, online analytical processing (olap) analysis, or tools and methods such as knowledge query mechanisms. fig. 2 gives an overview of the learning system. 5. types of log format common servers use one of these three types of log file format [15]: 5.1. common log file format web servers create log files using this standardized text file format. the setup of the common log file format is provided in the box that follows (example of common log file format [15]). 5.2. combined log format it is like the previous log file format with the addition of the referer field, user agent field, and cookie field. the setup for this format is shown in the box that follows.
logformat "%h %l %u %t \"%r\" %>s %b \"%{referer}i\" \"%{user-agent}i\"" combined
customlog logs/access_log combined
e.g.: 127.0.0.1 - frank [15/oct/2021:14:59:38 -0700] "get /apache_pb.gif http/1.0" 200 2328 "https://www.example.com/start.html" "mozilla/4.08 [en] (win98; i; nav)"
example of combined log format [15].
5.3. multiple access logs it is a hybrid of the common log and the combined log file formats, with the ability to establish several directories for access logs. the structure of multiple access logs is detailed in the box that follows.
logformat "%h %l %u %t \"%r\" %>s %b" common
customlog logs/access_log common
customlog logs/referer_log "%{referer}i -> %u"
customlog logs/agent_log "%{user-agent}i"
example of multiple access logs [15].
6. data preprocessing and ml 6.1. data preprocessing it is critical to preprocess data to handle the different flaws in raw gathered data, which may include noise such as mistakes, redundancies, outliers, missing values, or unclear data. the most prevalent procedures in data preprocessing are [16]: 6.1.1. data cleaning handles data inconsistencies, noise, and missing values. 6.1.2. data integration seeks to integrate data from several sources into a cohesive data store, which is not an easy operation, since it entails establishing compatibility across several schema types. weak or inefficient data integration might result in inconsistency and redundancy, while a well-implemented solution improves accuracy and subsequent operations. data integration techniques involve entity identification, correlation analysis, tuple deduplication, redundancy elimination, and the discovery and resolution of data value conflicts.
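as a concrete illustration of what such preprocessing produces for web server logs, the combined log format of section 5.2 can be turned into structured records with a simple parser. the sketch below uses an assumed regular expression that mirrors the fields of the example entry above; it is illustrative rather than a complete parser.

```python
# a minimal sketch of turning raw combined-log-format lines (section 5.2) into
# structured records during data preprocessing. fields: host, identity, user,
# timestamp, request line, status, bytes, referer, user agent.
import re

COMBINED_LOG = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"')

def parse_line(line):
    match = COMBINED_LOG.match(line)
    return match.groupdict() if match else None

line = ('127.0.0.1 - frank [15/Oct/2021:14:59:38 -0700] '
        '"GET /apache_pb.gif HTTP/1.0" 200 2328 '
        '"https://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I; Nav)"')
record = parse_line(line)
# record["host"] -> '127.0.0.1', record["status"] -> '200', record["agent"] -> 'Mozilla/4.08 ...'
```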
utilizing current advancements in text analysis using deep neural networks (dnn), research [3] presents an approach to decrease the effort required to study the log file by highlighting the most likely informative content in the failed log file, which may aid in troubleshooting the failure’s causes. in essence, they decrease the size of the log file by deleting lines deemed to be of less significance to the problem. 7. contribution of the log analysis the log analysis contribution split into four categories, as follows [19]: 7.1. performance used to find the system’s performance during the optimization or troubleshooting phase. in the instance of performance, logs assist the administrator in clarifying how a specific system’s resource has been utilized. 7.2. security security logs are a lot used to detect security breaches or misconduct and to conduct postmortem investigations into security occurrences. for example, intrusion detection requires reconstructing sessions from logs that identify illegal system access. 7.3. prediction in addition, logs have ability of producing predictive information. there are predictive analytic systems that utilize log data to assist with marketing plan development, advertising placement, and inventory management. 7.4. reporting and profiling furthermore, log analysis is required for analyzing resource usage, workload, and user activity. for instance, logs will capture the attributes of jobs inside a cluster’s workloads to profile resources utilize within large data center. 8. survey methodology collect articles for this research have been done in a systematic manner comprehensive database for the research on automated log analysis was utilized. relevant articles were identified in online digital libraries, and the repository was extended manually by evaluating the references to these articles. the libraries can now be accessed online. to begin, looked through a range of well-known online digital repositories (e.g., acm digital library, elsevier online, sciencedirect, ieee xplore, springer online, and wiley online). according to these studies, most prevalent uses of log files with ml algorithms is classified into numerous categories, on which we based our study: abdalla and jumaa: log file and machine learning 82 uhd journal of science and technology | july 2022 | vol 6 | issue 2 8.1. identify user behavior user activity analysis using logs may provide significant information about users. user clustering based on logs enables the gathering of clients considering their activity and subsequent analysis of user access patterns, making it an excellent option for problem solving [20]. xu et al. [21] examine use of http traffic to find the identities of users. techniques presuppose access to a proxy server’s log. thus, it is likely to develop web use profiles for people who utilize devices with a static ip address. they demonstrated that given a web use profile, it is feasible to identify users on any other device or to monitor when another user uses a device. technically, they divide web traffic across sessions that link to the traffic of a distinct ip address over a definite time period. they reduce every session to a frequency vector distributed over the vector space of accessible domain. they used a set of methods for instance-based user identification centered on this representation. 
experiments showed that centered on gathered web usage profiles using nearest neighbor classification, user identification is achievable with a prediction accuracy of greater than 90%. this paper needs to examine the usage of more sophisticated identification and obfuscation methods integrating the time series of urls more closely. kim et al.[22], based on user behavior modeling and anomaly detection techniques, the authors offered a framework for detecting insider threats throughout the user behavior modeling process. they constructed three datasets depending on the cert database: users daily action dataset, an e-mail content dataset, and an e-mail communication dataset depending on the user account and sending and receiving information. they proposed insiderthreat identification models using those datasets, applying ml set anomaly detection methods to imitate real-world companies with just a few potentially harmful insiders’ activities. in this work, the authors employed classification algorithms for insider-threat detection. the findings in this study recommend that the suggested framework is capable of detecting malicious insider behaviors relatively effectively. on the basis of the daily activity summaries dataset, the anomaly detection achieved a maximum detection 90% percent by monitoring top 30% of anomaly. according to the e-mail content datasets, the detection 65.64 % detected while 30% of sceptical e-mails have been monitored. the paper’s limitation is that, although the dataset (cert) used to building the system was carefully developed and contains a variety of threat scenarios, it stills an artificially and simulated produced dataset. prakash et al. [23] investigated for the scope of analyzing user prediction behavior based on users personalization obtained from web logs. a web log records the user’s navigation patterns when visiting websites. the user navigation pattern could be analyzed using the user’s recent weblog navigation. the weblog has several posts with data such as the status code, ip address, and amount of bytes sent, along with categories and a time stamp. user interests could be categorized according to categories and attributes, which aids in determining user behavior. the goal of this research is to differentiate between interested and uninterested user behavior through classification. the modified span algorithm and the personalization algorithm are used to identify the user’s interest. table 1 provides a summary list of studies we reviewed for the purpose of identifying user behavior. 8.2. security issues recently, some researchers and programmers utilizing data mining methods to log-based intrusion detection systems (ids) resulted in a powerful anomaly detection-based (ids) which depended solely on the inflowing stream of logs to discern what may be normal and what is not (possibly an attack) [24]. zeufack et al. [25] offered a fully unsupervised framework for real-time detection of abnormalities. this concept is separated into two phases: a knowledge base development stage, that use clustering to identify common patterns, and a streaming anomaly detection stage that detects abnormal occurrences in real time. they test their framework on (hadoop distributed file system) log files and it successfully detects anomalies with an f-1 score of 83%. this framework ought to be improved to get advantages for other features that are embedding in a log file and has positive impact on anomalies detection. 
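the two-stage idea described for zeufack et al. [25], building a knowledge base of normal patterns by clustering and then flagging incoming events that fit none of them, can be sketched as follows. tf-idf features, k-means, and the fixed distance threshold are illustrative substitutions, not the components used in that work.

```python
# a hedged sketch of two-stage unsupervised log anomaly detection: cluster
# historical log messages as a "knowledge base" of normal behavior, then flag
# incoming messages far from every learned cluster. all parameters are assumptions.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

class LogAnomalyDetector:
    def __init__(self, n_clusters=20, threshold=1.0):
        self.vectorizer = TfidfVectorizer(token_pattern=r"[A-Za-z]+")
        self.model = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
        self.threshold = threshold

    def fit(self, normal_lines):
        X = self.vectorizer.fit_transform(normal_lines)
        self.model.fit(X)
        return self

    def is_anomalous(self, line):
        x = self.vectorizer.transform([line])
        distances = self.model.transform(x)   # distance to each cluster center
        return float(np.min(distances)) > self.threshold

# detector = LogAnomalyDetector().fit(history_of_normal_log_lines)
# for line in incoming_log_stream:
#     if detector.is_anomalous(line):
#         print("possible anomaly:", line)
```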
the authors of [26] presented a dempster-shafer (d-s) evidence theory-based host security analysis technique. they acquire information of monitoring logs and use it to design security analysis model. they utilize three regression models as sensors for multi-source information fusion: logistic regression, support vector regression, and k-nearest neighbor regression. the suggested technique offers excellent strong security for host. improved ml approaches may increase accuracy of evidence in this research, resulting in more accurate probability values for host security analysis. study [27] a ml-based system for identifying insider threats in organizations’ networked systems is provided. the research discussed four ml algorithms: neural networks (nn), random forest (rf), logistic regression (lr), and abdalla and jumaa: log file and machine learning uhd journal of science and technology | july 2022 | vol 6 | issue 2 83 xgboost (xg) across multiple data granularities, limited ground truth, and training scenarios to assist cyber security analysts in detecting malicious insider behaviors in unseen data. evaluation results showed that the proposed system can successfully learning from the limited training data and generalize to detect new users with malicious behaviors. the system has a great detection rate and precision, mainly when user-generated findings are considered. the downside: will examine the utilization of temporal information in user activities. specifically, all the systems in this research gave labels based on a single exemplar’s state description. allowing models to view many exemplars or to maintain state (recurrent connections) can allow models to make nonmarkovian decisions. shah et al. [28] offered an expanded risk management strategy for bring your own device (byod) to increase the safety of the device environment. the proposed system makes usage mobile device management system, system logs, and risk management systems to detect malicious activities using machine learning. they can state that the result achieved 99% detection rate with the practice of support vector machine algorithm. tadesse et al. [29] employed multilayer log analysis to discover assaults at several stages of the datacenter. thus, identifying distinct assaults requires considering the heterogeneity of log entries as an initial point for analysis. the logs were integrated in a common format and examined based on characteristics. clustering and correlation are the root of the log analyzer in the center engine, which operate alongside the attack knowledge base to detect attacks. to calculate the quantity of clusters and filter events according on the filtering threshold, clustering methods for instance expectation maximization and k-means were utilized. on the furthermore, correlation establishes a connection or link between log events and provides new attack concepts. then, they analyzed the developed system’s log analyzer prototype and discovered that the average accuracy of som #34 and aau is 84.37% and 90.01%, respectively. the downside: more log analysis or mining must be done to obtain meaningful information and minimize the false positive and negative results. table 2 provides a summary list of studies we reviewed for the purpose of security. 8.3. system maintenance log analysis is typically required during system maintenance because to the intricacy of network structure. chen et al. [30] studied the issue of extracting useful patterns using temporal log data. 
they present a new algorithm discovering patterns from temporal sequences (dts) algorithm for extracting sequential patterns from temporally regular sequence data. engineers can utilize the patterns that find to well know how a network-based distributed system behaves. they apply the minimum description length (mdl) concept to well-known issue of pattern implosion and take another step forward in summarizing the temporal links between neighboring events in a pattern. tests on actual log datasets showed the method’s effectiveness. extensive tests on real-world datasets show that the suggested methodologies are capable of swiftly discovering high-quality patterns. cheng et al. [31] suggested a method for detecting anomalies using log file analysis. they extract normal patterns from log data and then do anomaly detection using principal component analysis (pca). depending on the experimental results, they concluded that the proposed technique is a great success; this enables the technique to be devised and implemented to the real log file analysis, which makes the work of the system auditor easier. with a minimum of 66% and a high of 92.3%, the average accuracy in detecting anomalies is about 80%. table 3 provides a summary list of studies we reviewed for the purpose of maintenance. 9. conclusion log files are records and track of computing events across different kinds of servers and systems. ml is a reliable solution for automatically analyzing log files. log analysis as a ml application is a fast-emerging technique for extracting information from unstructured text log data files. this study analyzed several studies from various academic databases. they each utilized a different ml method for a different objective. we have summarized the importance, methodologies, and algorithms utilized for each element we have studied. many of the recent publications provided models intended to forecast assaults, user behavior, and system failure to improve server and system security, marketing, and failure times. the disadvantage is that methods that discriminate between normal and abnormal data require a threshold. selecting a correct threshold is challenging and involves prior knowledge; utilizing actual datasets in model creation; many log analyses or mining must be performed to gain significant information; and minimizing false positive and negative findings. furthermore, due to the lack of studies on, the maintenance component requires further improvements compared to the other specified features; nonetheless, interested scholars can study it further. abdalla and jumaa: log file and machine learning 84 uhd journal of science and technology | july 2022 | vol 6 | issue 2 references [1] e. shirzad and h. saadatfar. “job failure prediction in hadoop based on log file analysis”. international journal of computers and applications, vol. 44, no. 3, pp. 260-269, 2022. [2] a. u. memon, j. r. cordy and t. dean. “log file categorization and anomaly analysis using grammar inference”. queen’s university, canada, 2008. [3] m. siwach and s. mann. “anomaly detection for web log data analysis: a review”. journal of algebraic statistics, vol. 13, no. 1, pp. 129-148, 2022. [4] h. s. malallah, s. r. zeebaree, r. r. zebari, m. a. sadeeq, z. s. ageed, i. m. ibrahim, h. m. yasin and k. j. merceedi. “a comprehensive study of kernel (issues and concepts) in different operating systems”. asian journal of research in computer science, vol. 8, no. 3, pp.16-31, 2021. [5] i. mavridis, i and h. karatza. 
“performance evaluation of cloudbased log file analysis with apache hadoop and apache spark”. journal of systems and software, vol. 125, pp. 133-151, 2017. [6] t. yang and v. agrawal. “log file anomaly detection”. cs224d fall, vol. 2016, pp. 1-7, 2016. [7] s. khan, a. gani, a. w. a. wahab, m. a. bagiwa, m. shiraz, s. u. khan, r. buyya and r. y. zomaya. “cloud log forensics: foundations, state of the art, and future directions”. acm computing surveys (csur), vol. 49, no. 1, pp. 1-42, 2016. [8] v. chitraa and a. s. davamani. “a survey on preprocessing methods for web usage data”. international journal of computer science and information security, vol. 7, no. 3, p. 1257. 2010. [9] r. a. bridges, t. r. glass-vanderlan, m. d. iannacone, m. s. vincent and q. chen. “a survey of intrusion detection systems leveraging host data”. acm computing surveys (csur), vol. 52, no. 6, pp. 1-35, 2019. [10] h. studiawan, f. sohel and c. payne. “a survey on forensic investigation of operating system logs”. digital investigation, vol. 29, pp. 1-20, 2019. [11] available from: https://www.humio.com/glossary/log-file [last accessed on 2022 sep 01]. [12] s. he, p. he, z. chen, t. yang, y. su and m. r. lyu. “a survey on automated log analysis for reliability engineering”. acm computing surveys (csur), vol. 54, no. 6, pp. 1-37, 2020. [13] m. kumar, m. meenu. “analysis of visitor’s behavior from web log using web log expert tool”. in 2017 international conference of electronics, communication and aerospace technology (iceca). vol. 2, institute of electrical and electronics engineers, manhattan, new york, pp. 296-301, 2017. [14] w. li. “automatic log analysis using machine learning: awesome automatic log analysis version 2.0”. 2013. [15] n. singh, a. jain and r. s. raw. “comparison analysis of web usage mining using pattern recognition techniques”. international journal of data mining and knowledge management process, vol. 3, no. 4, p. 137, 2013. [16] m. a. latib, s. a. ismail, o. m. yusop, p. magalingam and a. azmi. “analysing log files for web intrusion investigation using hadoop”. in: proceedings of the 7th international conference on software and information engineering, pp. 12-21, 2018. [17] j. qiu, q. wu, g. ding, y. xu and s. feng. “a survey of machine learning for big data processing”. eurasip journal on advances in signal processing, vol. 2016, no. 1, pp. 1-16, 2016. [18] n. jones. “computer science: the learning machines”. nature, vol. 505, no. 7482, pp. 146-148, 2014. [19] m. a. latib, s. a. ismail, h. m. sarkan and r. c. yusoff. “analyzıng log ın bıg data envıronment: a revıew”. arpn journal of engineering and applied sciences, vol. 10, no. 23, pp. 1777717784, 2015. [20] h. xiang. “research on clustering algorithm based on web log mining”. journal of physics conf series, vol. 1607, no. 1, p. 012102, 2020. [21] j. xu, f. xu, f. ma, l. zhou, s. jiang and z. rao. “mining web usage profiles from proxy logs: user identification”. in: 2021 ieee conference on dependable and secure computing (dsc). institute of electrical and electronics engineers, manhattan, new york, pp. pp. 1-6, 2021. [22] j. kim, m. park, h. kim, s. cho and p. kang. “insider threat detection based on user behavior modeling and anomaly detection algorithms”. applied sciences, vol. 9, no. 19, p. 4018, 2019. [23] p. g. prakash and a. jaya. “analyzing and predicting user navigation pattern from weblogs using modified classification algorithm”. indonesian journal of electrical engineering and computer, vol. 11, no. 1, pp.333-340, 2018. [24] a. 
abbas, m. a. khan, s. latif, m. ajaz, a. a. shah and j. ahmad. “a new ensemble-based intrusion detection system for internet of things”. arabian journal for science and engineering, vol. 47, no. 2, pp. 1805-1819, 2022. [25] v. zeufack, d. kim, d. seo and a. lee. “an unsupervised anomaly detection framework for detecting anomalies in real time through network system’s log files analysis”. high confidence computing, vol. 1, no. 2, pp. 100030, 2021. [26] y. li, s. yao, r. zhang and c. yang. “analyzing host security using d-s evidence theory and multisource information fusion”. international journal of intelligent systems, vol. 36, no. 2, pp. 10531068, 2021. [27] d. c. le, a. n, zincir-heywood and m. i. heywood. “analyzing data granularity levels for insider threat detection using machine learning”. ieee transactions on network and service management, vol. 17, no. 1, pp. 30-44, 2020. [28] n. shah and a. shankarappa. “intelligent risk management framework for byod”. in: 2018 ieee 15th international conference on e-business engineering (icebe). institute of electrical and electronics engineers, manhattan, new york, pp. 289-293, 2018. [29] s. g tadesse and d. e dedefa. “layer based log analysis for enhancing security of enterprise datacenter”. international journal of computer science and information security, vol. 14, no. 7, pp.158, 2016. [30] j chen, p wang, s du and w wang. “log pattern mining for distributed system maintenance”. complexity, vol. 2020, no. 2, pp. 1-12, 2020. [31] x. cheng and r. wang. “communication network anomaly detection based on log file analysis”. in: international conference on rough sets and knowledge technology. springer, cham, pp. 240-248, 2014.  fig. 1. some of log file sources [11]. fig. 2. overview of the learning system [14]. aq4 aq4 tx_1~abs:at/tx_2:abs~at uhd journal of science and technology | july 2022 | vol 6 | issue 2 29 1. introduction in the digital world, handwriting is one of the most appeared challenges faced in daily life. when handwriting is detected and transformed into a digital device, several pattern analysis problems will appear that need to be solved. the problems include handwriting recognition, script identification and recognition, signature verification, and writer identification. one of the most challenging and researchable fields among mentioned problems is handwriting recognition. the wellknown system in this field is optical character recognition (ocr) which transforms the uneditable text-image format of script into a machine-editable and manageable format of the script. in other words, ocr is a converter software of scanned scripts to a format that could be processed as a character by a computer. for the 1st time, ocr was invented by carley in 1870 for processing scanned retina [1]. it is worth mentioning that nearly all of the ocr systems are script specific in the sight that they are restricted to recognizing a particular language or a writing system excluding several works that focused on multilinguistic handwriting recognition. however, most works focus on a specific script or language, but still, it has been broken down for more specificity which only covers special symbols, numerals, or characters within the same language or script. after performed an in-depth review of several research articles including survey articles [2]–[4], we conclude that the entire process of alphabetic handwriting recognition could be classified under some separated classification types based on several factors as below. 
offline handwritten english alphabet recognition (ohear) hamsa d. majeed, goran saman nariman department of information technology, college of science and technology, university of human development, kurdistan region, iraq a b s t r a c t in most pattern recognition models, the accuracy of the recognition plays a major role in the efficiency of those models. the feature extraction phase aims to sum up most of the details and findings contained in those patterns to be informational and non-redundant in a way that is sufficient to fen to the used classifier of that model and facilitate the subsequent learning process. this work proposes a highly accurate offline handwritten english alphabet (ohear) model for recognizing through efficiently extracting the most informative features from constructed self-collected dataset through three main phases: pre-processing, features extraction, and classification. the features extraction is the core phase of ohear based on combining both statistical and structural features of the certain alphabet sample image. in fact, four feature extraction portions, this work has utilized, are tracking adjoin pixels, chain of redundancy, scaled-occupancy-rate chain, and density feature. the feature set of 27 elements is constructed to be provided to the multi-class support vector machine (msvm) for the process of classification. the ohear resultant revealed an accuracy recognition of 98.4%. index terms: alphabet recognition, handwriting recognition, multi-class support vector machine, feature extraction, optical character recognition corresponding author’s e-mail:  hamsa d. majeed, department of information technology, college of science and technology, university of human development, kurdistan region, iraq. e-mail: hamsa.al-rubaie@uhd.edu.iq received: 24-05-2022 accepted: 15-08-2022 published: 20-08-2022 access this article online doi: 10.21928/uhdjst.v6n2y2022.pp29-39 e-issn: 2521-4217 p-issn: 2521-4209 copyright © 2022 majeed and nariman. this is an open access article distributed under the creative commons attribution noncommercial no derivatives license 4.0 (cc by-nc-nd 4.0) o r i g i n a l re se a rc h a rt i c l e uhd journal of science and technology majeed and nariman: english handwritten recognition 30 uhd journal of science and technology | july 2022 | vol 6 | issue 2 1. script writing system 2. data acquisition (input modes) (online and offline) 3. granularity level of documents 4. source of the collected dataset 5. script recognition process the scriptwriting system type defines the selected language to be recognized in the proposed system. the languages which are in use today throughout the world have been defined under several different systems, more details can be found in sinwar et al. [2], ghosh and shivaprasad [3], pal [4], ubul et al. [5]. the mechanism of data acquisition could be separated into two categories [2], [6], [7]: offline and online handwriting recognition. in online handwriting recognition, a digital device with a touch screen without a keyboard must be involved like a personal digital assistant (pda) or mobile. where screen sensors receive the switching of pushing and releasing the pen on the screen together with the pen tip movements over the screen. while in offline mode, image processing is involved by converting an input image (from a scanner or a camera) of text to character code which is aimed to be utilized by a text processing application. 
granularity level of documents describes the stage of detailed information taken as initial input to the defined and proposed framework, as example, a full page or a single letter of text image uses as initial input. there are two types of sources of collected dataset; public dataset (real-world dataset) and self-constructed dataset. the term “public dataset” refers to a dataset that has been saved in the cloud and made open to the public. mnist, keras, kaggle, and others are examples. while the self-constructed dataset is the dataset that the researchers create and prepare on their own by scanning handwritten documents from different people. the script recognition process is the primary section which is the practical part of the work. in general, it is formed from four main phases, namely, preprocessing (p), segmentation (s), feature extraction (f), and classification (c). the last two phases, f and c, are the common phases in the study, there is not any work without any of these two phases. however, there are many researches in literatures without p and/or s. 2. related work in this section, several works will be illustrated in the field of english alphabet handwritten recognition for bringing to light varied methodologies employed in each step to accomplish the recognition. starting with a review study [8] which summarizes eight research papers with their contributions, limitations coupled with strategies employed to enhance ocr systems. here, we mention two of them and demonstrate their conclusion; patel et al. [9] was working on the ann (artificial neural network). characters were extracted using matlab. the module was analyzed pixel by pixel and transformed into a list of characters. to find edges, they used an edge and skew detection algorithm. moreover, it became normalized thereafter. the authors claim that the accuracy is improved by increasing the hidden layers and neurons. only 100 input neurons were used for testing which accounts for the work’s limitation. the litterateurs of gupta et al. [10] segment the input data at the word level into separated characters using ai and heuristic functions. then, the feature vector is generated by extracting features from the segmented characters. as a property of vectors, blending three types of fourier descriptors are utilized in parallel. finally, svm has been employed as a classifier. the authors claim that a piece of recognition error rates may arise from the usage of low-quality material and ink density diversity, as well, is another point that degrades document quality. the authors of karthi et al. [11] propose a system to recognize cursive handwriting english letters. the initial system input is in pdf for mat of both alphabet and cursive english letters which have been gathered from 100 different people and the total samples are 2k. this module is accomplished through four processes, namely, image preprocessing, skeletonization, segmentation points identification, and contour separation. the final module utilizes a convolutional neural network (cnn) for training the dataset to predict recognition. support vector machine (svm) is the system classifier. the accuracy rate of this work achieved 95.6%. the investigation of pre-processing, feature extraction, and classifier techniques is emphasized in ibrahim et al. [12]. the pre-processing initiates with normalizing image letters to 70x50 pixel dimensions by utilizing the nearest neighbor technique. then, the binarization process is executed using otsu’s threshold sampling procedure. 
character skeleton and contour algorithms have been employed to accomplish the feature extraction step. further, both isolated and combined feature extraction procedures are involved in the experiments. the study employed two different classifications (hibbert majeed and nariman: english handwritten recognition uhd journal of science and technology | july 2022 | vol 6 | issue 2 31 classifier) techniques which are support vector machine (svm) and multilayer perceptron (mlp) classifiers. the recognition experiment outcomes obtained an accuracy of 97%. in parkhedkar et al. [13], a system has been produced that implements all four available steps of the handwritten recognition process. it takes a scanned document as initial input and proceeds through preprocessing for the oncoming step in which each letter of the word will be separated from the other (segmentation). then, the gabor feature is served for extraction of the features that will be passed through the knn classifier on the final step. the accuracy rate of the developed project has not been given. rather, the authors claim that multiple experiments have been established using publicly available data and the achieved accuracy is the highest in the experimentation studies when using constant data. in gautam and chai [14], the proposed work uses the publicly available dataset eminst and minst. this means that no pre-processing and segmentation have been applied. the work only focuses on the last two steps, namely, feature extraction besides classification. the features of english letters and digits have been extracted by employing a hybrid proposal, which combines the zoning method and zig-zag diagonal scan. the feedforward nn (ffnn) is utilized as a classifier. then, the back-propagation learning algorithm is used for the network training. the accuracy rate of english characters (e emnist) and english numbers (mnist) recognition stands for 99.8% and 94%, respectively. the litterateurs of zanwar et al. [15] select 3410 samples of chars74k which is another publicly available dataset. independent component analysis (ica) technique is used in feature extraction phase. backpropagation neural networks have been employed in the final phase (classification). the recognition accuracy shows 98.21% of matching characters. the authors of the previous study have improved their work [16] by hibernating two techniques at the feature extraction phase while the rest remained the same apart from the dataset that mnist employed in this work. the new technique integrates detached component analysis and hybrid pso and firefly optimization for effective selection of features and then applies a supervised learning technique called backpropagation neural network to perform classification. recognition accuracy scores of 98.25% were recorded using the models. 3. proposed method the proposed technique for offline handwritten english alphabet recognition (ohear) is revealed in this section. according to the aforementioned classification of handwritten recognition, table 1 shows the used category of the classes for the presented method. the selected input script to the model is the english alphabet (capital and small). the presented approach acquires data offline, which implies that scanned documents (images) are served as an entry to the model. because the model operates at the character level, it takes character images as input. 
the used dataset nature is self-constructed, stating that it was manually gathered from 120 individuals, each of whom typed 52 characters from a to z and a-z. the contribution takes place in the general script recognition process phases which are the primary and the heart of such works. apart from data acquisition which was mentioned before (commonly referred to as the first phase), it is divided into three major phases (pfc), which are pre-processing, feature extraction, along with classification. each phase’s output will be provided into the next. the phases are illustrated in fig. 1 and described in the subsections that follow. table 1: classification of the proposed technique classes nominated category script writing system english alphabet data acquisition offline granularity level of documents character level source of the collected dataset self-constructed dataset script recognition process pfc fig. 1. script recognition process of the proposed model. majeed and nariman: english handwritten recognition 32 uhd journal of science and technology | july 2022 | vol 6 | issue 2 3.1. pre-processing this step is required and is a critical procedure because we are using a self-constructed dataset rather than public datasets. it should be carefully studied because the model’s accuracy rate directly leans on the output quality of this phase. the reason being such a dataset used instead of using the small image size, cleaned, and noise-free public dataset is that it’s truly close to data actuality in terms of real-world application. the pre-processing procedure is broken down into six isolated processes, as shown in fig. 2. the initial process is converting the inputs to grayscale for the purpose of size reduction which implies higher performance for the following processes without affecting accuracy. the contrast enhancement manipulates and redistributes image pixels to improve the partitioning of hidden structural variations in pixel intensity to assemble a more distinct structural distribution. the distribution of the pixels is calculated utilizing the histogram equalization (he) approach, which represents the probability allocation of the image’s gray levels (pixels). adaptive thresholding, based on otsu’s approach, was used to convert the grayscale picture to a binary image (binarization). this technique is used to divide the pixels into two classes: foreground and background. following the creation of the binary image, the sizes of all input images are uniform such that the output image only comprises the english letter. the compromised area refers to the region of interest (roi). after size uninformed, edge detection is the next step. it was done using the canny approach, which locates all edges with the shortest distance between the detected edge and the processed letter’s true edge. the final step of the pre-processing is for usage of skeletonization and thinning to produce the skeleton of the letter image. the thinning technique removes black foreground pixels, one at a time, until a skeleton of one-pixel width is obtained. 3.2. feature extraction this phase is the uppermost critical and crucial because a proper feature extraction mechanism should be selected for a specified script. it is obvious that various scripts have distinct properties, therefore, factors that are effective in recognizing one script may not be effective in identifying another. 
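to make the six pre-processing stages of section 3.1 concrete before turning to the feature segments, a rough python sketch of that chain is given below. the paper itself was implemented in matlab 2020a, so this is only an illustrative equivalent: the function name preprocess_letter, the use of opencv and scikit-image, the 90 × 90 output size, and the canny thresholds are all assumptions rather than the authors' code.

```python
# illustrative pre-processing chain (assumed python/opencv rendering of section 3.1)
import cv2
import numpy as np
from skimage.morphology import skeletonize

def preprocess_letter(path, out_size=90):
    # 1. grayscale conversion
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)

    # 2. contrast enhancement via histogram equalization (he)
    enhanced = cv2.equalizeHist(gray)

    # 3. binarization (global otsu here; the paper describes an otsu-based adaptive threshold)
    _, binary = cv2.threshold(enhanced, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

    # 4. size uniforming: crop to the region of interest (roi) around the letter
    ys, xs = np.nonzero(binary)
    roi = binary[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
    roi = cv2.resize(roi, (out_size, out_size), interpolation=cv2.INTER_NEAREST)

    # 5. edge detection with the canny operator
    edges = cv2.Canny(roi, 100, 200)

    # 6. thinning/skeletonization down to a one-pixel-wide skeleton
    skeleton = skeletonize(roi > 0).astype(np.uint8)

    return roi, edges, skeleton
```

the 90 × 90 output size is chosen here only so that the later 3 × 3 zoning divides evenly.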
the primary contribution of this study is the identification of features of english letter patterns that will be extracted and prepared for the oncoming and final phase of the recognition process. the feature vector is the output that consists of four segments as illustrated in fig. 3. each segment of the extracted feature vector is described below: 3.2.1. tracking adjoins pixels the first step in feature vector creation starts with studying the image details at the pixel level, discovering the starting point then tracking the flowing of each letter through the pixels owned by concerned image. any pixel with more than 2 adjoins is represented as an intersection point, while the open-end point has precisely one adjoin as illustrated in fig. 4 which is the english letter h with two intersection points and four open-ended points. 3.2.2. chain of redundancy (cr) the next feature is retrieved using freeman chain code [17]. in the proposed ohear, the chain code is employed to describe the form of english alphabets as a linked sequence of pixels in a restricted length and direction. this expression is based on clockwise 8-connectivity, as shown in fig. 5a. the skeleton image is tracked starting from the open-ended pixels and stopped at the last open-ended pixel. as for the intersection pixel, the tracking operation will be done fig. 2. the pre-processing phase of the proposed approach. majeed and nariman: english handwritten recognition uhd journal of science and technology | july 2022 | vol 6 | issue 2 33 by proceeding in the alternative direction defined by that intersection point until it gets to the terminated open-end pixel. this process will continue until the entire pixels of the entered skeleton image of the english alphabet are tracked. a numbering method is employed to code the direction and length belonging to the pixels. for instance, the generated chain code for the letter (s) is illustrated in fig. 5b which shows that the starting pixel is the top-right open-ended one which indicates chain code 7 followed by three more 7s. then, it turns to the left as 5 and so on. these chain code numbers will be adjusted for creating the change of redundancy (cr). cr consists of eight elements starting from index 1 to 8 which index numbers represent the directional numbers from the freeman chain code. for instance, in the full tracking process, 11, 3, and 19 times the chain code directions of 1, 2, and 3 have been repeated, respectively. in the result, the indexes 1, 2, and 3 of cr contain 11, 3, and 19. finally, the cr with eight elements will be added to the feature vector as the second segment. 3.2.3. scaled-occupancy-rate chain (sor) more valuable information can be retrieved from the abovegenerated data (cr) which involves the total pixels’ number occupied by the english letter and considering the repetition fig. 3. feature extracted process of ohear. fig. 4. intersection points and open ended of letter h. fig. 5. (a) eight directions of freeman chain code. (b) s letter with chain code directions. ba majeed and nariman: english handwritten recognition 34 uhd journal of science and technology | july 2022 | vol 6 | issue 2 of the individual number chain code directions. the scaledoccupancy-rate chain (sor) can generate a reasonable value to be added to the feature vector that could be generated, the scaled-occupancy-rate chain (sor). sor is a significant segment of the feature vector that gives weight to each chain code direction. for instance, the ideal cr of direction 3 (from fig. 
5a) for letters i and e is similar but the sor of them is totally different, it gives 100% weight to the direction of 3 for i but much less for e. sor will be generated as follows, the division process applied to each index of cr on the total pixels number occupied by the skeleton image of the english letter, in other words, each index of cr is divided by the summation of cr’s indexes values. for instance, from mentioned cr of s, the total pixels number of s’s foreground is 76, so, the computation of the first and third indexes will be 11/76=0.144 and 19/76=0.25, respectively. finally, a scale factor of 10 will be hands-on to get a more practical value for classification objectives. for example, 0.144 and 0.25 will be 1.4 and 2.5, respectively. the final result with eight elements will be added to the feature vector as the third segment. 3.2.4. density feature (df) the final insertion to the feature set is the information extracted from the demanded character under the employment of the density feature. this segment of feature is achieved using the zoning technique which has been applied to the skeleton image of letters. zoning is a statistical feature extraction that calculates the density of foreground pixels by the zone’s pixel numbers, each letter’s image divided into 9 (3 × 3) zones. the zone’s size of each is 10 × 10 denoting that the entered image will be resized to 90 × 90 before these divisions are applied as illustrated in fig. 6. this density feature (df) will be calculated for all nine zones. consequently, nine values will be generated and will be added to the feature vector as the last segment. the ideal (s) illustrated in fig. 6 takes all the nine zones, but in reality, the handwritten is dissimilar from the ideal state, the results from our dataset plotted in fig. 7 demonstrate that with different handwritten styles, the zones’ occupation will be changed accordingly. fig. 7a3 shows that zone-3 and zone-7 will be discarded in the calculation of df because they have zero density. it is alike the situation for fig. 7b3 zone-3. 3.3. classification the latest phase of the proposed approach is the classification process which determines the recognition output of the given english letter’s image. the multi-class support vector machine (msvm) has been implied which is based on the support vector machine (svm) technique. the svm is a well-known classifier, and it has obtained much traction in machine learning and statistics since it was first introduced. vapnik’s foundational work (1998) [18] set the groundwork for the theory of svm generic statistical fig. 6. ideal resized and zoned s letter. fig. 7. zone density and occupation of different handwritten styles. majeed and nariman: english handwritten recognition uhd journal of science and technology | july 2022 | vol 6 | issue 2 35 learning, which, in turn, inspired several expansions. svm is a binary classifier which means it only handles two-class classification issues. therefore, it does not suit our work while having 52 english alphabet classes. more details about binary svm can be found in cristianini and shawetaylor [19], schoelkopf and smola [20]. as a result of its limitation, the msvm model has been developed to determine the dynamic process instability using multi-class classification. 
it has also found use in a variety of fields, including control chart pattern recognition besides industrial problem diagnosis [21], and is employed for many different language characters and numerals recognition such as romaine, thai, french, and arabic persian [1]. furthermore, it is worth noting that, according to ubul et al. [5], msvm classifiers using various extracted features outperformed k-nn and nn classifiers in handwritten recognition field. the feature vectors from the previous phase which were generated from 80% of the self-constructed dataset will be employed to train msvm to create the classification model. this model creates 52 classes of small and capital english letters. the remaining 20% dataset are for testing operation. 4. results the experimental outcomes have been established to assess the proposed model ohear performance. the model is implemented using matlab 2020a and the evaluation process had been performed through a constructed dataset consisting of 52 offline handwritten english alphabet from a (a) -to -z (z) self-collected from 120 individuals, in a total of 6240 samples collected for capital and small letters together. with the aim of covering most of the various possibilities of the handwritten patterns, various types of writing objects (pen, pencil, and magic marker) with different colors and font sizes were applied to prove the effectiveness of the presented model regarding the recognition process. the first set of results was in image form and from the share of preprocessing phase, as fig. 2 presents, this phase goes through six stages starting from greyscale conversion to thinning, the outcome of this phase is illustrated in fig. 8 for letter g. regardless of the entered image’s color, it will be converted into the grayscale in the early steps of preprocessing phase, in the second stage, the brightness level is equalized yielding the contrast enhancement of that image. the oncoming stage shows the outcome of binary conversion through the adaptive thresholding of the input. size refinement is applied after binarization to determine roi in a preparation step to the following stage where the edge of the interesting region is detected, the final stage represents the resultant thinning output to be ready for the oncoming phase of ohear which is the feature extraction. in this phase, the same stages are applied for all the 52 letters for each individual. in fact, it applied to all the collected dataset to the feature extraction state. the second set of results was in the numbers form where the feature set has been extracted for each letter of the english alphabet through the ohear model where the statistical and structural features have been extracted and combined into one feature set with 27 elements. as fig. 3 revealed, the 27 elements of the extracted features are combined in four portions. in this section, those elements are translated into numbers in four tables, each table describes one of those portions in different capital with small letters. in fact, each table describes the outcomes of that portion of the feature set for all of the 52 letters, but for the publication requirements, the letters are distributed among the tables to show most of the outcomes of the letters. consequently, all portions were gathered from the tables to define each letter in the dataset so as to create the feature set of that letter to be distinguished by the used classifier. 
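as a rough illustration of how the four portions above can be assembled into the 27-element feature set (2 intersection/endpoint counts + 8 cr values + 8 sor values + 9 zone densities), the sketch below uses numpy, scipy, and opencv; the chain-of-redundancy part is simplified to a direction histogram over adjacent skeleton pixels instead of the full endpoint-to-endpoint tracking of section 3.2.2, and the freeman direction numbering and all function names are assumptions.

```python
# assumed python sketch of the four feature portions combined into 27 elements
import numpy as np
import cv2
from scipy.ndimage import convolve

# freeman 8-connectivity offsets, indexed by direction code 1..8 (assumed ordering)
DIRS = {1: (0, 1), 2: (-1, 1), 3: (-1, 0), 4: (-1, -1),
        5: (0, -1), 6: (1, -1), 7: (1, 0), 8: (1, 1)}

def intersections_and_endpoints(skel):
    # portion 1: counts of intersection pixels (>2 neighbours) and open ends (1 neighbour)
    s = (skel > 0).astype(int)
    kernel = np.array([[1, 1, 1], [1, 0, 1], [1, 1, 1]])
    neighbours = convolve(s, kernel, mode="constant")
    return int(np.sum((s == 1) & (neighbours > 2))), int(np.sum((s == 1) & (neighbours == 1)))

def chain_of_redundancy(skel):
    # portion 2 (approximate): per-direction counts of adjacent skeleton pixel pairs
    cr = np.zeros(8)
    ys, xs = np.nonzero(skel)
    pts = set(zip(ys.tolist(), xs.tolist()))
    for (y, x) in pts:
        for code, (dy, dx) in DIRS.items():
            if (y + dy, x + dx) in pts:
                cr[code - 1] += 1
    return cr

def scaled_occupancy_rate(cr, skel):
    # portion 3: cr normalised by the number of foreground pixels, scaled by 10
    total = max(int(np.sum(skel > 0)), 1)
    return cr / total * 10.0

def density_feature(skel, zones=3, size=90):
    # portion 4: foreground density of the 3 x 3 zones of the 90 x 90 skeleton image
    img = cv2.resize((skel > 0).astype(np.uint8) * 255, (size, size),
                     interpolation=cv2.INTER_NEAREST)
    step = size // zones  # 30 x 30 pixels per zone
    return np.array([(img[r * step:(r + 1) * step, c * step:(c + 1) * step] > 0).mean() * 100
                     for r in range(zones) for c in range(zones)])

def feature_vector(skel):
    inter, ends = intersections_and_endpoints(skel)
    cr = chain_of_redundancy(skel)
    return np.concatenate(([inter, ends], cr, scaled_occupancy_rate(cr, skel),
                           density_feature(skel)))  # 2 + 8 + 8 + 9 = 27 elements
```

because this approximation counts every adjacent pixel pair from both of its ends, it only stands in for the exact cr values reported in table 3.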
table 2 illustrates the first portion of the feature set which is outlined by the pair (intersection and endpoints), the fig. 8. pre-processing phase results. majeed and nariman: english handwritten recognition 36 uhd journal of science and technology | july 2022 | vol 6 | issue 2 outcomes of a-to-e small and capital letters were illustrated, some challenges appear in this section of feature collection one of belonged long to the handwriting style in which the lines were not connected properly or more intersection points than normal created. hence, this portion alone was not reliable enough and needs to have more features to be extracted, which lead to the second and third portions where their results are illustrated in tables 3 and 4. tables 3 and 4 contain the portions: chain of redundancy (cr) and scaled-occupancy-rate chain, respectively, each portion has eight elements. the outcomes of f-to-j small and capital letters were illustrated in table 2, while table 3 shows the outcomes of k, l, m, y, and p small and capital letters, the letters in table 3 are not consecutive as trying to decrease the letters with a looks like letters as capital and small or looks as other letters in the same table. those two portions increased the richness of the extracted characteristics from the letters with a minimum number of feature elements. moreover, the combination of features’ outcomes of the three tables so far improved the classification accuracy. yet, some limitations floating to the surface of the process, because of existing different techniques in handwriting tracking the chain through the directions may differ for the same letter, for example, the straight line in a letter been written in bent way, or circles in some letter were not written completed, otherwise, some handwritten styles write circles where it should be a normal line, all these issues affect the chain creating process in those portions because it leans on the directions. these limitations have been solved by using another portion of combination which is the density feature. reaching table 5 which reports the last portion of the feature set, the density chain provides the feature vector with the last nine elements. those elements describe the density of nine zones for each letter, the results of q, r, n, t, and u letters capital and small. combing the outcomes of this portion with the previous chains boost the recognition accuracy, it gives occupied zones for each letter with the exact rate of that occupation in each zone, which advances the amount of information that extracted about each letter although there are some issues appear in some letters causing due to the writing direction sometimes it’s in slant or diagonal way but when it’s combined with the other features from the other portions it gives a cleared version of description to the classifier for recognition operation of that letter. the next and final phase in the ohear model is classification, multi-class svm is employed for this purpose in the proffered model, as a preparation step for this phase, all the features are gathered from the collected samples and then grouped into two packs of data, training data which contain 80% of the constructed dataset (96 samples out of 120 for each letter) fed to classifier for the purpose of training, while the remaining 20% labeled as test data (24 samples out of 120 for each letter) supplied to classifier for performance testing of the presented recognition model. 
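a minimal sketch of this training and testing step is given below, assuming the 27-element vectors have been stacked into a matrix x with one letter label per row in y; scikit-learn's svc handles the 52-class problem internally (one-vs-one), and the kernel and regularization settings shown are assumptions, since the paper does not report them.

```python
# assumed scikit-learn sketch of the 80/20 msvm training and testing procedure
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

def train_and_evaluate(X, y, seed=0):
    # X: (n_samples, 27) feature matrix, y: labels 'A'..'Z' and 'a'..'z'
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.20, stratify=y, random_state=seed)  # 96/24 samples per letter
    model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10, gamma="scale"))
    model.fit(X_tr, y_tr)
    return model, accuracy_score(y_te, model.predict(X_te))
```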
the recognition accuracy out of 100% has been measured for all the gathered samples. according to the outcomes from the self-constructed dataset used in this study, the handwritten english alphabet recognition accuracy in the proposed model can be classified into three groups: first group: the letters which achieved 100% accuracy throughout all the testes samples regardless of the font size, type of used pen, or its color, accompanied by the variety in how it’s written or how straight it is (mostly slanted). the proposed combination of feature extraction mechanisms powered up the recognition ability of the classifier. most of the letters (capital and small) belong to this group and this matter caused the raise of the total recognition accuracy of the proposed model. table 2: intersections and endpoints character a a b b c c d d e e no. of intersection points 2 2 1 1 2 2 0 2 1 0 no. of endpoints 2 1 0 1 0 0 0 2 3 1 table 3: chain of redundancy (cr) character chain of redundancy (cr) f 4 4 6 3 13 0 0 1 f 5 3 16 2 1 0 0 2 g 18 5 9 20 12 5 2 6 g 4 7 19 5 8 2 3 4 h 12 3 20 1 0 0 0 1 h 0 5 25 1 0 0 0 0 i 10 7 23 4 4 1 0 0 i 0 3 19 1 0 0 0 0 j 0 1 20 10 6 2 5 5 j 0 2 14 4 4 2 0 0 majeed and nariman: english handwritten recognition uhd journal of science and technology | july 2022 | vol 6 | issue 2 37 table 4: scaled‑occupancy‑rate chain (sor) character scaled-occupancy-rate chain (sor) k 0.1875 0.3437 0.0312 0.0343 0.0937 0 0 0 k 0.2285 0.4571 0.3142 0 0 0 0 0 l 0.2142 0.0714 0.4285 0.2142 0 0 0 0.0714 l 0 0.1666 0.8333 0 0 0 0 0 m 0.0476 0.1309 0.4047 0 0 0 0.2023 0.2142 m 0.1904 0.0714 0.2857 0.0714 0 0 0.0952 0.2857 y 0 0 0.2500 0.7250 0.0250 0 0 0 y 0.0212 0.1276 0.4042 0.2127 0 0 0 0.1702 p 0 0.0555 0.6944 0.2222 0 0 0 0 p 0.0681 0.0681 0.2045 0.1136 0.0101 0 0 0 table 6: illustrations of accuracy rates for various feature extraction techniques previous work feature extraction approach accuracy rate gautam and chai [14] combination: zoning method+zig-zag diagonal scan 94% zanwar et al. [16] combination: detached component analysis+hybrid pso 98.25% ibrahim et al. [12] combination: features that are based on viewing capabilities+bit map feature. 97% zanwar et al. [15] independent component analysis (ica) technique 98.21% the proposed model (ohear) combination: tracking adjoin pixels+chain of redundancy+scaled-occupancy-rate chain+and density feature 98.4% second group: portion of the letters which belong to this group, precisely (small letter of l, capital letter of i, and z) are not fully recognized successfully, the classifier misclassifies one sample from the testing set of samples (i.e., 23 from 24 testing sample scored). this is due to the common way of handwriting those letters, commonly capital letter of i is similarly written as a small letter of l, beside the used way of writing the capital letter of z with an extra line in the middle which confused the first portion of the feature vector. third group: the letters (i and j) are the reason for this group creation, the classifier misclassifies two of the testing samples (i.e., scored 22 out of 24) for two major reasons, first, the dot (.) above the letters sometimes writing close to the letter, far, or lightly written in a way that excluded in the preprocessing phase. the second reason is produced by roi determination, when the dot is written far from the letter, then it is considered out of the region of interest and excluded from the process. 
despite the fact that the model achieved an excellent recognition rate of (98.4%), there are still areas for improvement, such as reconsidering the mentioned issues in classification groups, which will be discussed in the following section. the proposed combination of extracted features in this work is unique, for that matter, a comparison study has been made for the percentage of recognition rate achieved by other researchers that used different approaches for feature extraction as table 6 illustrates. it is noticeable that the proposed model contributes remarkable efficient table 5: density features character zones density values z1 z2 z3 z4 z5 z6 z7 z8 z9 q 14.166 15.111 12.277 14.166 15.111 16.055 10.622 22.133 0 q 10.818 17 11.333 17 27.818 5.6666 0 0 12.750 r 34.151 36.428 0 31.875 30.222 11.333 19.125 0 20.777 r 21.250 7.0833 18.888 21.250 28.333 0 17.163 4.3589 0 n 3.2692 8.1730 16.346 19.615 19.615 19.615 15.088 15.088 9.0532 n 13.909 23.181 4.2148 6.9545 23.181 23.181 0 18.545 6.3223 t 16.071 28.928 13.928 0 15 0 0 12 0 t 0 10.699 5.3496 13.730 35.664 10.699 0 23.181 12.482 u 20.863 0 11.590 25.500 0 23.181 11.590 23.181 9.2727 u 16.227 0 8.4297 25.500 9.2727 23.181 2.3181 13.909 14.752 majeed and nariman: english handwritten recognition 38 uhd journal of science and technology | july 2022 | vol 6 | issue 2 recognition performance with a non-previously processed self-constructed dataset with different types of writing objects along with avoiding redundancy in the generated data for classification purposes. 5. conclusion and future consideration the most compacted and informative set of features has remarkable effectiveness to enhance the classifier – efficiency, recognition accuracy, and reliable classification accomplishment. this work presents an optimized feature extraction phase by employing both statistical and structural techniques to retrieve the features from constructed dataset self-collected for offline handwritten english alphabets through recognition (ohear) model. the extraction process goes through four stages: tracking adjoins pixels, redundancy chain, adjusted scaled redundancy chain, and density feature. the extracted feature set is provided to the multi-class svm classifier which has been trained and tested using 120 sets of each capital and small letters of handwritten english alphabets. the proffered model achieved a recognition accuracy of 98.4%. despite the good recognition rate, the experimental outcomes reveal some misclassification of some letters, those issues could be enhanced by making slight changing in the used features extraction techniques to raise the classification accuracy. replacing the tracking adjoin pixels with another technique is a suggestion to overcome those misclassification issues, adopting the actual length of chain before redundancy calculation as a number in the features set are another possible suggestion besides expanding the threshold of roi to include all the detailed characteristics of the letters while still, the increasing of the training set is always a valid option to improve the classification accuracy process. all over, reducing the total length of the feature vector with preserving the quality of the system and the level of validation rate is the goal looking forward to, on the other hand, employing another classifier is an important factor to achieve an optimum outcome from the proposed system. 
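one of the suggestions above, employing another classifier on the same 27-element vectors, could be explored with a small comparison harness such as the following; the candidate classifiers and their settings are purely illustrative.

```python
# illustrative harness for comparing classifiers on the same feature vectors
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

def compare_classifiers(X, y, cv=5):
    candidates = {
        "msvm": SVC(kernel="rbf", C=10),
        "knn": KNeighborsClassifier(n_neighbors=5),
        "mlp": MLPClassifier(hidden_layer_sizes=(64,), max_iter=1000),
    }
    return {name: cross_val_score(make_pipeline(StandardScaler(), clf), X, y, cv=cv).mean()
            for name, clf in candidates.items()}
```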
moreover, the presented recognition model (ohear) can be extended for symbols, special characters, or other language recognition. references [1] j. mantas. “an overview of character recognition methodologies”. pattern recognition, vol. 19, no. 6, pp. 425-430, 1986. [2] d. sinwar, v. s. dhaka, n. pradhan and s. pandey. “offline script recognition from handwritten and printed multilingual documents: a survey”. international journal on document analysis and recognition, vol. 24, no. 1-2, pp. 97-121, 2021. [3] d. ghosh and a. p. shivaprasad. “handwritten script identification using the possibilistic approach for cluster analysis”. journal of the indian institute of science, vol. 80,no. 3, pp. 215, 2000. [4] u. pal. “automatic script identification: a survey”. j. vivek, bombay, vol. 16, no. 3, pp. 2635, 2006. [5] k. ubul, g. tursun, a. aysa, d. impedovo, g. pirlo and i. yibulayin. “script identification of multi-script documents: a survey”. ieee access, vol.  5, pp. 6546–6559, 2017. [6] a. priya, s. mishra, s. raj, s. mandal and s. datta. “online and offline character recognition: a survey”. 2016 international conference on communication and signal processing (iccsp), pp. 0967-0970, 2016. [7] x. y. zhang, y. bengio and c. l. liu. “online and offline handwritten chinese character recognition: a comprehensive study and new benchmark”. pattern recognition, vol. 61, pp. 348-360, 2017. [8] b. m. vinjit, m. k. bhojak, s. kumar and g. chalak. “a review on handwritten character recognition methods and techniques”. in: 2020 international conference on communication and signal processing (iccsp), 2020 [9] c. i. patel, r. patel and p. patel. “handwritten character recognition using neural network,” international journal of scientific and engineering research. vol. 2, no. 5, pp. 1-6, 2011. [10] a. gupta, m. srivastava and c. mahanta. “offline handwritten character recognition using neural network”. in: 2011 ieee international conference on computer applications and industrial electronics (iccaie), 2011. [11] m. karthi, r. priscilla and k. s. jafer. “a novel content detection approach for handwritten english letters”. procedia computer science. vol. 172, pp. 1016-1025, 2020. [12] b. ibrahim, h. yaseen and r. sarhan. “english character recognition system using hybrid classifier based on mlp and svm”. international journal of inventions in engineering and science technology, vol. 5, pp. 1-15, 2019. [13] s. parkhedkar, s. vairagade, v. sakharkar, b. khurpe, a. pikalmunde, a. meshram and r. jambhulkar. “handwritten english character recognition and translate english to devnagari words”. international journal of scientific research in computer science, engineering and information technology, vol.  5, pp. 142-151, 2019. [14] n. gautam and s. s. chai. “zig-zag diagonal and ann for english character recognition”. international journal of advanced research in computer science. vol. 8, no. 1-4, pp. 57-62, 2019. [15] s. r. zanwar, a. s. narote and s. p. narote. “english character recognition using robust back propagation neural network”. in: communications in computer and information science. springer, singapore, pp. 216-227, 2019. [16] s. r. zanwar, u. b. shinde, a. s. narote and s. p. narote. “handwritten english character recognition using swarm intelligence and neural network”. in: intelligent systems, technologies and applications. springer, singapore, pp. 93-102, 2020. [17] h. freeman. “computer processing of line-drawing images”. acm computing surveys, vol. 6, no. 1, pp. 57-97, 1974. [18] v. n. vapnik and v. 
vapnik. “statistical learning theory”. vol. 1. wiley, new york, 1998. [19] n. cristianini and j. shawe-taylor. “background mathematics”. majeed and nariman: english handwritten recognition uhd journal of science and technology | july 2022 | vol 6 | issue 2 39 in: an introduction to support vector machines and other kernelbased learning methods. cambridge university press, cambridge, pp. 165-172, 2013. [20]  b. schoelkopf and a. j. smola. “learning with kernels: support vector machines, regularization, optimization, and beyond,” ieee transactions on neural networks and learning systems, vol. 16, no. 3, pp. 781-781, 2005. [21] y. liu and h. zhou. “msvm recognition model for dynamic process abnormal pattern based on multi-kernel functions”. journal of systems science and information, vol. 2, no. 5, pp. 473-480, 2014. ole_link1 tx_1~abs:at/tx_2:abs~at 34 uhd journal of science and technology | jan 2022 | vol 6 | issue 1 1. introduction in today’s world, diabetes is one of the most frequent diseases [1], whereas diabetes is a non-communicable disease that has a significant impact on people’s health today [2]. it is a chronic condition or collection of metabolic diseases in which a person’s blood glucose levels remain elevated for an extended period due to insufficient insulin synthesis or improper insulin response by the body’s cells [3]. diabetes mellitus (dm) is a condition that affects more than 60% of the population and has a high mortality rate [4], this disease has increased at an exponential rate in recent years, according to a statistical study published on the world health organization’s website. the number of diabetic patients worldwide has increased significantly, from 108 million in 1980 to 422 million in 2014 [5]. a type of diabetes is gestational diabetes. gestational diabetes is a kind of diabetes that develop during pregnancy [6]. changes in dietary habits, increased spending power, and climate change, among other factors, are all contributing to an increase in the number of women with gestational diabetes aggravated by pregnancy [7]. gestational dm (gdm) is a type of glucose intolerance that develops during pregnancy and can cause difficulties for both the mother and the fetus [8]. women are also more likely to have diabetes-related comorbidities such as renal disease, depression, and poor vision [9]. it can be detected early, which reduces the patient’s health risk [10]. however, if the illness is not treated promptly, it can have serious consequences for the kidneys, brain system, retina of the eyes, and heart problems [11], to diagnose diabetes, medical experts require a technique of prediction [12]. using the pima dataset and a variety of machine learning (ml) algorithms, it is feasible to give an advanced technique an intelligent and precise method used for detecting gestational diabetes in the early stages safa abdul wahab hameed, alaa badeea ali department of computer science, faculty of engineering and science, bayan university, erbil, iraq a b s t r a c t this paper suggests a naive bayes classifier technique for identifying and categorizing gestational diabetes mellitus (gdm), gdm is a kind of diabetes mellitus that affects a small proportion of pregnant women but recovers to normal once the baby is born. the pima indians diabetes dataset was chosen for a comprehensive analysis of this critical and pervasive health disease because it contains 768 patient characteristics acquired from a machine learning source at the university of california, irvine. 
the goal of the study is to apply smart technology to categorize diseases with high accuracy and precision, practically free of conceivable and potential faults, to provide satisfying findings. the approach is based on eight major characteristics that are present in the operations that are required to establish a precise and reliable categorization system. this approach involves training and testing on real data, as well as for deciding whether or not to construct a categorization model. the work was compared to earlier work and had a 96% accuracy rating. index terms: classifier, feature selection, gestational diabetes, machine learning, naïve bayes corresponding author’s e-mail: safa abdul wahab hameed, alaa badeea ali, department of computer science, faculty of engineering and science, bayan university, erbil, iraq. e-mail: safa.hamid@bnu.edu.iq; alaa.baban@bnu.edu.iq received:  07-12-2021 accepted: 18-02-2022 published: 20-03-2022 access this article online doi: 10.21928/uhdjst.v6n1y2022.pp34-42 e-issn: 2521-4217 p-issn: 2521-4209 copyright © 2022 hameed and ali. this is an open access article distributed under the creative commons attribution non-commercial no derivatives license 4.0 (cc by-nc-nd 4.0) o r i g i n a l r e s e a r c h a r t i c l e uhd journal of science and technology hameed and ali: an intelligent method used for detecting gestational diabetes uhd journal of science and technology | jan 2022 | vol 6 | issue 1 35 for diabetes prediction. because deep learning algorithms may be used in a variety of ways in this industry, models based on artificial neural networks (ann) and the quasi-newton technique, for example [13]. this area lends itself well to ml methods. models are trained using ml techniques. there are three types of ml algorithms: supervised learning (in which datasets are labeled and regression and classification techniques are used), unsupervised learning (in which datasets are not labeled and techniques such as dimensionality reduction and clustering are used), and reinforcement learning (in which the model learns from its every action). ml is a rapidly developing new technology with several applications [14]. ml techniques have advanced at a breakneck pace and are now widely used in a variety of medical applications [15], one of the most commonly explored challenges by dm and ml researchers is classification [16]. one of the most crucial parts of supervised learning is classification. picking the proper classification model is a tradeoff between performance, the execution time of models, and scalability. parameter adjustment should also be considered in order to improve model performance. ml training data are an important input to an algorithm that comprehends and memorizes information from such data to predict the future. understanding the significance of the training set in ml can assist you in obtaining the appropriate quality and amount of training data for your model training. once you understand why it’s essential and how it influences model prediction, you’ll be able to select the best method based on the availability and compatibility of your training data set [17]. it is vital to develop predictive algorithms that are both accurate and simple to use when evaluating large amounts of data and converting it into useful information [18] ml methods are commonly utilized for detection and classification [19]. this diagnosis allows for proper treatment to begin as soon as feasible, avoiding deaths [20], [21]. 
people will be able to seek treatment for this condition if it can be detected and predicted at an early stage [22]. this type of disease is the gestational diabetes is the focus of this research, in this research an effective method was used, which is a naïve bayes classifier, it used to detect and identify gestational diabetes and give high performance and accurate results. 2. literature review similar works on diabetes analysis, prediction, and diagnosis are reviewed in this section. it uses a variety of classification and ml algorithms to handle diabetes management prediction problems. in pradhana et al. [5], the suggested methodology analyzed publically accessible data collected from diabetic patients to identify the causes of diabetes, the most afflicted age groups, job styles, and eating patterns. ann are used in the model to detect diabetes and determine its kind. the authors utilized the “pima indian diabetes” dataset, which has the maximum accuracy of 85.09%. 768 patients’ medical histories are included in the dataset. where as in the filho et al. [7], the suggested method improved the accuracy of the classification techniques by focusing on identifying the features that fail in early diagnosis of diabetes miletus utilizing predictive analysis using support vector machine (svm) and naïve base (nb) algorithms. the accuracy of the improved svm is 77%, while the accuracy of the nb is 82.30%. in prasanth et al. [2] explained, the captured data were fed into supervised ml techniques. the pima indians diabetes dataset was utilized in this study, and a model was created using svm, catboost, and relative frequency to predict dm, with an accuracy of 86.15%. and in rawat and suryakant [18], a comparison was done between suggested approaches and previously published studies. in this work, five ml algorithms, adaboost, logicboost, robustboost, naive bayes, and bagging, were proposed for the analysis and prediction of dm patients. the suggested methodologies were tested on a data set of pima indians with diabetes, and the proposed algorithm, bagging, was applied on the same database with an accuracy of 81.77%. the two algorithms naive bayes and svm used in gupta et al. [1] as classification models, and feature selection to improve the model’s accuracy. the accuracy, precision, and recall values were used to evaluate the results. the model’s improved performance was calculated using the k-fold cross-validation technique. in rajivkannan and aparna [22], the aim of this study was to build an objective method to evaluate dm risk from past gdm data recorded 15 years ago and find a shortlist of the most informative indicators. the research steps involve pre-processing data to evaluate missing values mvs, finding the most informative attributes, and testing standard classification algorithms to combine into the most effective voting metaalgorithm. meta-algorithm-based classification of limited anamnestic gdm related data for dm prediction is proving. relative frequency of occurrence (rfo) analysis of attributes combined with voting meta-algorithm helped find the optimal amount of attributes giving the best possible classification result. the algorithm applied to two-class data set with 12 selected attributes produced an accuracy of 75.85. in [17], the major goal of this research is to look at different forms of machine-learning classification algorithms and compared them. in this study, use machine-learning classification algorithms to detect the start of diabetes in diabetic patients. 
the top performing algorithm, logistic regression, has an accuracy of 80%. in moon [21], the research provided a biological ml hameed and ali: an intelligent method used for detecting gestational diabetes 36 uhd journal of science and technology | jan 2022 | vol 6 | issue 1 method that is both efficient and effective which is applied on pima indian diabetic database (pidd). the proposed ensemble of svm and back-propagation neural network (bp-nn) tested on diabetes diagnosis; one of the most frequently investigated topics in bioinformatics. the findings reveal an accuracy of 88.04% on this problem. in sanakal and jayakumari [16], the application of fuzzy c-means clustering (fcm), fcm, and svm on a collection of medical data linked to diabetes diagnostic difficulties was the subject of this work. the medical data comprises nine input variables relating to diabetes clinical diagnosis and one output attribute that indicates whether or not the patient has been diagnosed with diabetes. fcm had the best outcome, with an accuracy of 94.3%. in lavanya and rani [23], this work provided a quick overview of the old and new data mining approaches utilized in diabetes. this study used fcm and svm procedures and got an accuracy of 94.3%. in sarwar et al. [4], the results showed that the ensemble approach had a 98.60% accuracy, which combines the predictive performance of numerous ai-based algorithms and is superior to all other individual competitors. ann, naive bayes, svm, and k-nearest neighbor are the techniques that were more precise than the others (k-nn). about 400 persons were included in the database, which came from all across the world. and in resti et al. [9] a model validation built which based on 5-fold cross-validation, which divided the data into training and test data. the gaussian nave bayes was the best strategy for predicting diabetes diagnosis, according to the model validation results. the contribution of this research was that the multinomial naïve bayes method’s performance metrics all exceed 93%. with the same explanatory factors, these findings are useful in predicting diabetes status. in this paper, the proposed method is implemented on pima indians diabetes dataset. here in this paper, the work was presented using a naive base algorithm. the work was carried out in stages that relied on training and testing on real data based on certain characteristics as explained later, and a classification result was obtained with high efficiency and accuracy. 3. method this work is done to diagnose gestational diabetes by classification using a naive base algorithm. this work includes several stages, including training and testing on real data, to adopt and use the system. following the stages to implement the method: 3.1. system diabetic detection the diabetes classification system includes two phases; the first is the training stage, which has particular functions such as reading the diabetes dataset, feature selection, discretization, and the classifier model used to create the decision rules. the second stage is the testing stage, which includes the following particular functions: read data set, discretization, decision rules, and output [1], [24], as illustrated in fig. 1. 3.2. 
data set the data set has been taken from 768 women (500 negative cases and 268 positive cases), aged 21 years and above, with eight recorded features as follows: • number of previous pregnancies • plasma glucose concentration at 2 h in an oral glucose tolerance test • diastolic blood pressure (mm hg) • triceps skinfold thickness (mm) • 2-h serum insulin (mu u/ml) • body mass index (weight in kg/[height in m]2) • diabetes pedigree function • age (years). there are a variety of causes for incomplete data, including patient death, device problems, and respondents’ reluctance to answer particular questions. fig. 1. the structure of the training and testing stages of the diabetes classification system. every patient in the database is a pima indian woman of at least 21 years of age who lives in or around arizona. there are eight properties (attributes) in each dataset sample, as shown in table 1. a sample of the data used is shown in table 2; the row header in the table relates to the column header (features) in table 1. there are 768 samples in all, divided into two groups. the distribution of classes is as follows: 1. normal group: (500) samples 2. abnormal group: (268) samples. fold creation: the complete labeled dataset is separated into mutually exclusive folds to perform the cross-validation procedure. the 768 cases in this study were chosen from the pidd database, which means: 1. the training set contains 499 occurrences, 325 of which are normal and 174 of which are abnormal (i.e., 65%) 2. there are 269 examples used in the testing, 175 of which are normal and 94 of which are abnormal (i.e., 35%), as shown in table 3. 3.3. feature selection stage the pidd data set has eight attributes in each sample, and attributes are selected according to the entropy of the attribute with respect to the class. let s (training data) be a data set of c outputs. for the classification issue with c classes, let p(i) indicate the proportion of s that belongs to class i, where i ranges from 1 to c [1], [23], [24]. p(i) is the simple diversity index (1) entropy is the information-theoretical approach for assessing the quality of a split; it calculates the quantity of information in a given attribute: entropy(s) = −∑_{i=1}^{c} p(i) log2 p(i) (2) the information gain of the example set s on the attribute a is defined as gain(s, a): gain(s, a) = entropy(s) − ∑_{v} (|s_v|/|s|) × entropy(s_v) (3) where s_v is the subset of s in which feature a has value v, |s_v| is the number of elements in s_v, |s| is the number of elements in s, and the sum runs over every possible value v of the attribute a. a subset of properties is chosen from the eight options. it takes a long time to choose more than five properties; selecting fewer than five attributes, on the other hand, makes the diabetes classification system less time-consuming but less accurate. according to the greater entropy, the ideal five attributes (1, 3, 4, 5, and 7) for describing diabetes have been determined and will be used as an input to the classification step (a small illustrative sketch of this gain computation is given below). 3.4. discretization before the categorization procedure in the diabetes classification system, a crucial phase must be completed: it is necessary to transform the numerical values of the eight attributes to categorical values.
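the sketch assumes the (already discretized) attributes and the class label are held in a pandas dataframe; the column names are illustrative only.

```python
# assumed pandas/numpy sketch of the entropy and information-gain computation (2)-(3)
import numpy as np
import pandas as pd

def entropy(labels):
    # entropy(s) = -sum_i p(i) * log2 p(i)
    p = labels.value_counts(normalize=True).to_numpy()
    return float(-(p * np.log2(p)).sum())

def information_gain(df, attribute, target="class"):
    # gain(s, a) = entropy(s) - sum_v (|s_v| / |s|) * entropy(s_v)
    total = entropy(df[target])
    weighted = sum((len(sub) / len(df)) * entropy(sub[target])
                   for _, sub in df.groupby(attribute))
    return total - weighted

# example: rank the eight attributes and keep the five most informative ones
# features = ["preg", "plas", "pres", "skin", "insu", "mass", "pedi", "age"]
# best5 = sorted(features, key=lambda a: information_gain(data, a), reverse=True)[:5]
```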
this is accomplished by splitting the range of values for eight characteristics into k equal-sized bins, or equal width intervals, where k is a user-selected quantity based on the length of data. where it can be done in a mechanism of the equal width interval discretization (ewid) algorithm, by the method used in this algorithm, table 1: number of feature of pidd feature description range 1 number of times pregnant 1-4 2 plasma glucose concentration a 2 h in an oral glucose tolerance test. i. 120-140 3 diastolic blood pressure (mm hg). 80-90 4 triceps skin fold thickness (mm). 12 mm (male) 23(female) 5 2-h serum insulin (lu/ml). 16-166 mlu/l 6 body mass index (weight in kg/(height in m)^2) 18-24.9 kg/m2 7 diabetes pedigree function feature 8: 2-h serum insulin (lu/ml). 1-3 8 age (years). 21 and above table 2: a sample of pima indian diabetes dataset 1 2 3 4 5 6 7 8 class 1 116 78 29 180 36.1 0.496 25 normal 2 130 96 0 0 22.6 0.268 21 normal 0 129 110 46 130 67.1 0.319 26 abnormal 0 135 68 42 250 42.3 0.365 24 abnormal 4 112 78 40 0 39.4 0.236 38 normal table 3: number of cases tested and trained training no. of cases normal abnormal 499 325 174 testing 269 175 94 hameed and ali: an intelligent method used for detecting gestational diabetes 38 uhd journal of science and technology | jan 2022 | vol 6 | issue 1 algorithm (1): equal width interval discretization (ewid) input: eight attributes have numerical values. output: eight attributes have categorical values. begin minimum1 = op1(0) : maximum1 = op 1(0) : minimum 2 = op 2(0) : maximum 2 = op 2(0) : minimum 3 = op 3(0) : maximum 3 = op 3(0) minimum4 = op 4(0) : max4 = op 4(0) : minimum 5 = op 5(0) : maximum 5 = op 5(0) : minimum 6 = op 6(0) : maximum 6 = op 6 (0) minimum 7 = op 7(0) : maximum 7 = op 7(0) : minimum 8 = op 8(0) : maximum 8 = op 8(0) for x = 1 to len of data set { if op1(x) > maximum1 then maximum1 = op1(x) : if op1(x) < minimum 1 then minimum 1 = op1(x) if op2(x) > maximum2 then maximum2 = op2(x) : if op2(x) < minimum 2 then minimum 2 = op2(x) if op3(x) > maximum3 then maximum3 = op3(x) : if op3(x) < minimum 3 then minimum 3 = op 3(x) if op4(x) > maximum4 then maximum4 = op4(x) : if op4(x) < minimum 4 then minimum 4 = op4(x) if op5(x) > maximum5 then maximum5 = op5(x) : if op5(x) < minimum 5 then minimum 5 = op5(x) if op6(x) > maximum6 then maximum6 = op6(x) : if op6(x) < minimum 6 then minimum 6 = op6(x) if op7(x) > maximum7 then maximum7 = op7(x) : if op7(x) < minimum7 then minimum 7 = op7(x) if op8(x) > maximum8 then maximum8 = op8(x) : if op8(x) < minimum8 then minimum8 = op8(x) } k=3 // possibilities number g1 = round(((maximum1 minimum1) / k), 3) : g2 = round(((maximum2 minimum2)/ k), 3) g3 = round(((maximum3 minimum3) / k), 3) : g4 = round(((maximum4 minimum4)/ k), 3) g5 = round(((maximum5 minimum5) / k), 3) : g6 = round(((maximum6 minimum6)/ k), 3) g7 = round(((maximum7 minimum7) / k), 3) : g8 = round(((maximum8 minimum8)/ k), 3) // identify k ranges for each of the eight attributes. 
lowop1 = (minimum1 + g1) : medop1 = (minimum 1 + (2 * g1)) : highop1 = (minimum1 + (3 * g1)) lowop2 = (minimum2 + g2) : medop2 = (minimum2 + (2 * g2)) : highop2 = (minimum2 + (3 * g2)) lowop3 = (minimum3 + g3) : medop3 = (minimum3 + (2 * g3)) : highop3 = (minimum3 + (3 * g3)) lowop4 = (minimum4 + g4) : medop4 = (minimum4 + (2 * g4)) : highop4 = (minimum4 + (3 * g4)) lowop5 = (minimum5 + g5) : medop5 = (minimum5 + (2 * g5)) : highop5 = (minimum5 + (3 * g5)) lowop6 = (minimum6 + g6) : medop6 = (minimum6 + (2 * g6)) : highop6 = (minimum6 + (3 * g6)) lowop7 = (minimum7 + g7) : medop7 = (minimum7 + (2 * g7)) : highop7 = (minimum7 + (3 * g7)) lowop8 = (minimum8 + g8) : medop8 = (minimum8 + (2 * g8)) : highop8 = (minimum8 + (3 * g8)) end fig. 2. the structure for the ewid algorithm code. we have obtained the traits under a specified range. it is an important step before classifying to convert it into categories, to be used for training. [25], [26] [27], which consists of steps as shown in fig. 2. 3.5. classifier model the most well-known task classification is the constructing classifier model. this structure is used to determine the diabetes class, which can be either normal or abnormal. the diabetes (training-set) database is made up of attributevalue representations for a large number of patients, with five categorical characteristics (1, 3, 4, 5, and 7) and class attributes. those characteristics are fed into the learning classifier model. the classifier model is used to predict the new case. the decision is used to build the classifier model in this work, which is based on training diabetes [1], [25]. this mechanism is implemented using a naive base algorithm, where the algorithm is fed by the characteristics from the training set, and this helps in building a classifier model for prediction based on the decision, the steps of this algorithm and the mechanism used in it are explained as shown in fig. 3. 4. results as previously stated, the diabetes categorization system’s initial data included pima indian diabetes illness measures. before classification, the retrieved numerical values from attributes must be transformed into categorical values, which will be used to train the classifier using the ewid method. table 4 shows the categorical values of the five attributes according to the diabetes classification system. 
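the ranges in table 4 are the equal-width bins produced by the fig. 2 procedure; a compact python rendering of the same idea, with k = 3 bins per attribute and illustrative column and label names, is sketched below.

```python
# assumed pandas rendering of the equal width interval discretization (ewid) of fig. 2
import pandas as pd

def ewid(df, columns, k=3, labels=("low", "medium", "high")):
    out = df.copy()
    for col in columns:
        lo, hi = df[col].min(), df[col].max()
        width = (hi - lo) / k
        edges = [lo + i * width for i in range(k)] + [hi]
        out[col] = pd.cut(df[col], bins=edges, labels=labels[:k], include_lowest=True)
    return out

# usage: categorical = ewid(raw_data, ["preg", "plas", "pres", "skin", "insu",
#                                      "mass", "pedi", "age"])
```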
3.5. classifier model
the best-known classification task is constructing the classifier model. this structure is used to determine the diabetes class, which can be either normal or abnormal. the diabetes training-set database is made up of attribute-value representations for a large number of patients, with five categorical characteristics (1, 3, 4, 5, and 7) and the class attribute. these characteristics are fed into the learning classifier model, and the resulting classifier model is used to predict new cases. in this work, the classifier model is built from the decisions derived from the diabetes training set [1], [25]. this mechanism is implemented using a naïve bayes algorithm: the algorithm is fed the characteristics from the training set, which helps in building a classifier model that predicts on the basis of the decision. the steps of this algorithm and the mechanism it uses are shown in algorithm (2) and fig. 3.

algorithm (2): naïve bayes algorithm
output: decision
begin
  for i = 0 to 1 // two groups (classes)
    q = 0
    for j = 0 to lentr
      if opr(i) = op(j) then q = q + 1
    end for
    opr(i) = math.round((q / ct), 3)
  end for
  for g = 1 to 8
    if g = 1 then a = b1 : if g = 2 then a = b2 : if g = 3 then a = b3
    if g = 4 then a = b4 : if g = 5 then a = b5 : if g = 6 then a = b6
    if g = 7 then a = b7 : if g = 8 then a = b8
    for i = 0 to a
      for k = 0 to 1
        q = 0 : y = 0
        for j = 0 to lentr
          if g = 1 then if p1(i) = f1(j) and opr(k) = op(j) then q = q + 1
          if g = 2 then if p2(i) = f2(j) and opr(k) = op(j) then q = q + 1
          if g = 3 then if p3(i) = f3(j) and opr(k) = op(j) then q = q + 1
          if g = 4 then if p4(i) = f4(j) and opr(k) = op(j) then q = q + 1
          if g = 5 then if p5(i) = f5(j) and opr(k) = op(j) then q = q + 1
          if g = 6 then if p6(i) = f6(j) and opr(k) = op(j) then q = q + 1
          if g = 7 then if p7(i) = f7(j) and opr(k) = op(j) then q = q + 1
          if g = 8 then if p8(i) = f8(j) and opr(k) = op(j) then q = q + 1
          if g = 9 then if p9(i) = f9(j) and opr(k) = op(j) then q = q + 1
          if g = 10 then if p10(i) = f10(j) and opr(k) = fc(j) then q = q + 1
        next
        // likelihood
        if g = 1 then pp1(i, k) = math.round((q / (qp(k) * ct)), 3)
        if g = 2 then pp2(i, k) = math.round((q / (qp(k) * ct)), 3)
        if g = 3 then pp3(i, k) = math.round((q / (qp(k) * ct)), 3)
        if g = 4 then pp4(i, k) = math.round((q / (qp(k) * ct)), 3)
        if g = 5 then pp5(i, k) = math.round((q / (qp(k) * ct)), 3)
        if g = 6 then pp6(i, k) = math.round((q / (qp(k) * ct)), 3)
        if g = 7 then pp7(i, k) = math.round((q / (qp(k) * ct)), 3)
        if g = 8 then pp8(i, k) = math.round((q / (qp(k) * ct)), 3)
      end for
    end for
  end for
end

fig. 3. naïve bayes algorithm.
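the counting performed by algorithm (2) can also be expressed compactly. the python sketch below estimates the class priors and the per-feature likelihoods from categorical training rows and then uses them to score a new case; it is a simplified illustration of the same idea, not the authors' implementation, and the sample rows are taken from table 5 for demonstration.

from collections import Counter, defaultdict

def train_nb(rows, features, target="class"):
    # priors p(c) and likelihoods p(feature = value | c) estimated by counting,
    # rounded to three decimals as in algorithm (2)
    priors = {c: round(n / len(rows), 3) for c, n in Counter(r[target] for r in rows).items()}
    likelihoods = defaultdict(dict)
    for c in priors:
        class_rows = [r for r in rows if r[target] == c]
        for f in features:
            for value, n in Counter(r[f] for r in class_rows).items():
                likelihoods[(f, value)][c] = round(n / len(class_rows), 3)
    return priors, likelihoods

def classify(case, priors, likelihoods, features):
    # pick the class with the largest prior times product of likelihoods
    scores = {}
    for c, p in priors.items():
        for f in features:
            p *= likelihoods.get((f, case[f]), {}).get(c, 0.0)
        scores[c] = p
    return max(scores, key=scores.get)

# four sample rows from table 5 (five selected attributes plus the class)
rows = [
    {"preg": "low", "press": "low", "skin": "low", "insu": "normal", "pedi": "low", "class": "normal"},
    {"preg": "low", "press": "medium", "skin": "medium", "insu": "normal", "pedi": "high", "class": "normal"},
    {"preg": "high", "press": "medium", "skin": "medium", "insu": "abnormal", "pedi": "high", "class": "abnormal"},
    {"preg": "medium", "press": "high", "skin": "high", "insu": "abnormal", "pedi": "high", "class": "abnormal"},
]
feats = ["preg", "press", "skin", "insu", "pedi"]
priors, lik = train_nb(rows, feats)
print(classify({"preg": "low", "press": "medium", "skin": "medium", "insu": "normal", "pedi": "high"}, priors, lik, feats))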
4. results
as previously stated, the diabetes categorization system's initial data comprised the pima indian diabetes measurements. before classification, the numerical values retrieved from the attributes must be transformed into categorical values using the ewid method, and these categorical values are then used to train the classifier. table 4 shows the categorical values of the five attributes according to the diabetes classification system. attributes, attribute-value, and range of values are the three fields that make up this table. the five attributes are shown in the first field (number of times pregnant, diastolic blood pressure, triceps skin fold thickness, serum insulin, and diabetes pedigree function). the categorical values of the five attributes are shown in the second field. the range of values obtained by the ewid method is represented in the third field.

table 4: categorical features
attributes | attribute-value | range of values
number of times pregnant | low | 0–2
number of times pregnant | medium | 3–5
number of times pregnant | high | 6–17
diastolic blood pressure | low | 0–80
diastolic blood pressure | medium | 80–100
diastolic blood pressure | high | 100–122
triceps skin fold thickness | low | 0–20
triceps skin fold thickness | medium | 20–60
triceps skin fold thickness | high | 60–99
serum insulin | normal | 0–280
serum insulin | abnormal | 280–860
diabetes pedigree | low | 0.084–1.251
diabetes pedigree | high | 1.251–2.42

table 5 shows samples of the categorical values obtained by converting the numerical attribute values of the diabetes cases, together with the class attribute, according to the range-of-values field in table 4.

table 5: samples of categorical feature values
id | preg | press | skin | insu | pedi | class
1 | low | low | low | normal | low | normal
2 | low | medium | medium | normal | high | normal
3 | high | low | medium | normal | low | normal
4 | high | medium | medium | abnormal | high | abnormal
5 | low | high | medium | normal | high | abnormal
6 | medium | high | high | abnormal | high | abnormal

table 6 shows the results for the top five attributes chosen for use in the classifier model. the following properties were used to train and evaluate the classifier model: preg signifies the number of pregnancies a woman has had, pres the diastolic blood pressure, skin the thickness of the triceps skin folds, insu the serum insulin, and pedi the diabetes pedigree function feature. the entropy for each feature is calculated using the entropy equation (2).

table 6: entropy of the categorical feature values
feature | preg | plas | pres | skin | ins | mass | pedi | age
entropy | 0.71 | 0.13 | 0.49 | 0.88 | 0.55 | 0.02 | 0.64 | 0.08

for the diabetes instances indicated in the tables, the following naïve bayes classifier was trained. the likelihood of the number-of-times-pregnant feature for the three ranges of the two classes (normal and abnormal) is represented in table 7. the normal class of the low range is represented by the number 0.322, while the abnormal class of the low range is represented by the value 0.551, and the abnormal class of the medium range is represented by the value 0.301. the number 0.241 represents the normal class of the high range, whereas the value 0.148 represents the abnormal class of the high range. the last row shows that the likelihoods of each class sum to one.

the likelihood of the skin feature for the three ranges of the two classes (normal and abnormal) is represented in table 8. the number 0.35 represents the normal class and the value 0.431 the abnormal class of the low range. the values 0.276 and 0.179 represent the normal and abnormal classes in the medium range, respectively, while the values 0.403 and 0.39 represent the normal and abnormal classes in the high range, respectively. the last row shows that the likelihoods of each class sum to one.

the likelihood of the ins feature for the two ranges of the two classes (normal and abnormal) is represented in table 9. the values 0.977 and 0.865 reflect the normal and abnormal classes, respectively, of the normal range, while the number 0.023 represents the normal class and the value 0.006 the abnormal class of the abnormal range.

the likelihood of the pedi feature for the two ranges of the two classes (normal and abnormal) is shown in table 10, where the value 0.874 represents the normal class and the value 0.935 the abnormal class of the low range. the number 0.126 represents the normal class and the value 0.065 the abnormal class of the high range. the last row sums the probabilities of each class to one.

table 11 provides the confusion matrix of the classifier implementation retrieved from the testing stage using nb. the accuracy and error rate for the diagnosed cases are calculated from table 11. the accuracy of the nb classifier is calculated using equation (4), where tp (true positive) and tn (true negative) are added together and then divided by the sum of tp, tn, fp (false positive), and fn (false negative); the error rate is computed using equation (5) [28].

accuracy = (tp + tn) / (tp + tn + fp + fn) = (173 + 87) / (173 + 4 + 5 + 87) = 0.96   (4)

error = (fp + fn) / (tp + tn + fp + fn) = (4 + 5) / (173 + 4 + 5 + 87) = 0.0334   (5)

these values follow from the work performed and the results obtained. the proposed work enhances accuracy by using the naïve bayes classifier method, and the mechanism was implemented with a performance that gives higher accuracy than the accuracies reported in previous studies.
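as a quick check, equations (4) and (5) reduce to simple counting over the confusion matrix in table 11; the snippet below recomputes both measures from the reported counts. it is illustrative only and not part of the original system.

# counts from table 11 (naïve bayes confusion matrix on the 269 test cases)
tp, fp, fn, tn = 173, 4, 5, 87

total = tp + tn + fp + fn            # 269 tested cases
accuracy = (tp + tn) / total         # equation (4): (173 + 87) / 269
error_rate = (fp + fn) / total       # equation (5): (4 + 5) / 269, i.e. 1 - accuracy

print(f"accuracy   = {accuracy:.4f}")
print(f"error rate = {error_rate:.4f}")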
this work has been compared with the previous works as shown in table 12, as each work used the appropriate mechanism for diagnosing and classifying this disease and obtained an appropriate accuracy rate for the work. after studying and analyzing this problem, a high percentage of accuracy was obtained. 5. conclusion in this study, the nb classifier was used, we attempted to provide an approach for identifying the classification method for detecting and classifying diabetes at an early stage. there are eight properties in each dataset sample (attributes), divided into two classes normal and abnormal, and used in two stages (training and testing), during the training stage, specific functions were performed, such as reading the diabetes dataset, feature selection, discretization, and the classifier model used to create the decision rules, and during the testing stage, specific functions were performed, such as reading the diabetes dataset, discretization, decision rules, and output. the findings of the experiments were run on the dataset and compared with the previous works, and a system that can table 7: preg feature probability preg normal abnormal low 0.322 0.551 medium 0.437 0.391 high 0.241 0.148 sum 1 1 table 8: skin feature probability preg normal abnormal low 0.35 0.431 medium 0.276 0.179 high 0.403 0.39 sum 1 1 table 9: ins feature probability preg normal abnormal normal 0.977 0.865 abnormal 0.023 0.006 sum 1 1 table 10: pedi feature probability preg normal abnormal low 0.874 0.935 high 0.126 0.065 sum 1 1 table 11: the confusion matrix using naïve bayes classifier predicate class actual class normal abnormal normal (tp) 173 (fp) 4 abnormal (fn) 5 (tn) 87 table 12: comparison of accuracy with previous works authors the year dataset the method used accuracy sneha n. and gangil t 2019 pima indians diabetes dataset 1) support vector machine svm 77% 2) naïve base nb 82.30% islam m. a. and jahan n. 2017 pima indians diabetes dataset logistic regression algorithm 80% rawat v and suryakant s. 2019 pima indians diabetes dataset bagging method 81.77% pradhana n , rania g, singh v, dhaka v. s and pooniab r. c. 2020 pima indian diabetes dataset artificial neural networks 85.09% prasanth s, banujan k and btgs k 2021 pima indians diabetes dataset the adaptation model of svm, catboost, and random forest (rf) 86.15%. zolfaghar 2012 pima indians diabetes dataset the support vector machine (svm) and back-propagation neural network (bp-nn) 88% resti y., kresnawati e. s., dewi n. r., zayanti d. a. and eliyati n. 2021 pima indians diabetes dataset naive bayes, discriminant analysis, and logistic regression 93% sanakal r. and jayakumari s. t., 2014 pima indians diabetes dataset fuzzy c means clustering 94.3& jayanthi n. , babu .v. b. and rao s. 2016 pima indians diabetes dataset fcm and svm 94.3% hameed s. a. and baban a. b. (the proposed method in this paper) 2021 pima indians diabetes dataset naive base classifier 96% hameed and ali: an intelligent method used for detecting gestational diabetes 42 uhd journal of science and technology | jan 2022 | vol 6 | issue 1 reliably diagnose and categorize gestational diabetes was shown to be 96% accurate. 6. references [1] s. gupta, h. k. verma and d. bhardwaj. “classification of diabetes using naïve bayes and support vector machine as a technique”. operations management and systems engineering, pp. 365-376, 2020. [2] s. prasanth, k. banujan and k. btgs. “hyper parameter tuned ensemble approach for gestational diabetes prediction”. 
international conference on innovation and intelligence for informatics, computing, and technologies (3ict), ieee. pp. 18-23, 2021. [3] n. sneha and t. gangil. “analysis of diabetes mellitus for early prediction using optimal features selection”. journal of big data, vol. 6, p. 13, 2019. [4] a. sarwar, m. ali, j. manhas and v. sharma. “diagnosis of diabetes type-ii using hybrid machine learning based ensemble model”. international journal of information technology, vol. 12, pp. 419-428, 2020. [5] n. pradhana, g. rania, v. singh, v. s. dhaka and r. c. pooniab. “diabetes prediction using artificial neural network”. in: deep learning techniques for biomedical and health informatics, sciencedirect, pp. 327-339, 2020. [6] m. d. okpor. “prognostic diagnosis of gestational diabetes utilizing fuzzy classifier”. international journal of computer science and network security, vol. 15, no. 6, pp. 44-48, 2015. [7] e. g. filho, p. r. pinheiro, m. c. d. pinheiro, l. c. nunes and l. b. g. gom. “heterogeneous methodology to support the early diagnosis of gestational diabetes. ieee access, vol. 99, p. 1, 2019. [8] m. marozas, s. sosunkevič, m. francaitė-daugėlienė, d. veličkienė and a. lukoševičius. “algorithm for diabetes risk evaluation from past gestational diabetes data”. technology and health care, vol, 26, no. 4, pp. 637-648, 2018. [9] y. resti, e. s. kresnawati, n. r. dewi, d. a. zayanti and n. eliyati. diagnosis of diabetes mellitus in women of reproductive age using the prediction methods of naive bayes, discriminant analysis, and logistic regression. science and technology indonesia, vol. 6, no. 2, pp. 96-104, 2021. [10] m.a. islam and n. jahan. prediction of onset diabetes using machine learning techniques. international journal of computer applications, vol. 180, no. 5, pp. 7-11, 2017. [11] r. saxena, s. k. sharma and m. gupta. “analysis of machine learning algorithms in diabetes mellitus prediction”. journal of physics: conference series, vol. 1921, p. 012073, 2021. [12] n. jayanthi, v. b. babu and s. rao. “data mining techniques for cpd of diabetes”. international journal of engineering computational research and technology, 2014. [13] k. lakhwani, s. bhargava, k. k. hiran, m. m. bundele and d. somwanshi. “prediction of the onset of diabetes using artificial neural network and pima indians diabetes dataset”. 5th ieee international conference on recent advances and innovations in engineering, pp. 1-6, 2020. [14] r. zolfaghar. “diagnosis of diabetes in female population of pima indian heritage with ensemble of bp neural network and svm”. international journal of computational engineering and management, vol. 15, no. 4, pp. 115-121, 2012. [15] a. kaushik, a. sehgal, s. vora, v. palan and s. patil. “presaging the signs of diabetes using machine learning algorithms”. 12th international conference on computing communication and networking technologies, 2021. [16] r. sanakal and s. t. jayakumari. “prognosis of diabetes using data mining approach-fuzzy c means clustering and support vector machine”. international journal of computer trends and technology, vol. 11, no. 2, pp. 94-98, 2014. [17] h. naz and s. ahuja. “deep learning approach for diabetes prediction using pima indian dataset”. journal of diabetes and metabolic disorders, vol. 19, no. 1, pp. 391-403, 2020. [18] v. rawat and s. suryakant. “a classification system for diabetic patients with machine learning techniques”. international journal of mathematical, engineering and management sciences, vol. 4, no. 3, pp. 729-744, 2019. [19] p. 
kaur and r. kaur. “comparative analysis of classification techniques for diagnosis of diabetes”. in: jain, l., virvou, m., piuri, v. and balas, v. (eds.), advances in bioinformatics, multimedia, and electronics circuits and signals advances in intelligent systems and computing. vol. 1064. springer, singapore, 2020. [20] l. jonk. “chronic disease prevention a vital investment”. world health organization, geneva, switzerland, 2005. [21] l. moon. “prevention of cardiovascular disease, diabetes and chronic kidney disease: targeting risk factors”. vol. 118. aihw, 2009. available from: http://www.aihw.gov.au/publications/index. cfm. [last accessed on 2022 mar 09]. [22] a. rajivkannan and k. s. aparna. “a survey on diabetes prediction using machine learning techniques”. international journal of research in engineering, science and management, vol. 4, no. 11, pp. 51-54, 2021. [23] d. lavanya and k. u. rani. “performance evaluation of decision tree classifiers on medical datasets”. international journal of computer applications, vol. 26, no. 4, pp. 1-4, 2011. [24] r. raja, i. mukherjee and b. k. sarkar. “a machine learningbased prediction model for preterm birth in rural india”. journal of healthcare engineering, vol. 2021, p. 6665573, 2021. [25] a. saleha and f. nasari. “implementation of equal-width interval discretization in naive bayes method for increasing accuracy of students' majors prediction”. lontar komputer jurnal ilmiah teknologi informasi. vol. 9, no. 2, pp. 104-113, 2018. [26] r. dash, r. l. paramguru and r. dash. “comparative analysis of supervised and unsupervised discretization techniques”. international journal of advances in science and technology, vol. 2, no. 3, pp. 29-37, 2011. [27] j. dougherty, r. kohavi and m. sahami. “supervised and unsupervised discretization of continuous features”. in: proceedings of the twelfth international conference on international conference on machine learning (icml’95). morgan kaufmann publishers inc., san francisco, ca, usa, 1995, pp. 194-202. [28] j. han and m. kambar. “data mining: concepts and techniques”. 2nd ed. morgan kaufmann publisher, burlington, massachusetts, 2006. tx_1~abs:at/tx_2:abs~at uhd journal of science and technology | jul 2021 | vol 5 | issue 2 47 1. introduction the last decades and the present century have witnessed an acceleration in the pace of change toward the knowledge economy, as the production and organization of knowledge have become a top priority for business organizations [1]. knowledge is an essential ingredient for driving economic growth in countries [1]. knowledge has already become an intangible asset of the organization, prompting organizations to rearrange their priorities (national information technology council, 2004) [2]. many technological applications have been developed that have enhanced organizational capabilities and have created a huge influence of information and their use in organizations [3]. business organizations now face a clear challenge as a result of the knowledge and technical revolution in all areas of knowledge [4]. effective decision for enabling senior management to enhance its role in investing in technical and knowledge developments is very important to face the turbulent environment and its requirements [5]. it is necessary to identify theoretical foundations and theoretical structures that are capable of achieving the goals of the organization [6]. 
business organizations in general and industrial companies in particular are affected by the dramatic change in the business environment and its drive toward the use of information technology in which information has become a key resource knowledge management functions applied in jordanian industrial companies: study the impact of regulatory overload muzhir shaban al-ani1, shawqi n. jawad2, suha abdelal2 1department of information technology, university of human development, college of science and technology, sulaymaniyah, krg, iraq, 2department of management, amman arab university, college of business, amman, jordan o r i g i n a l r e s e a r c h a r t i c l e uhd journal of science and technology a b s t r a c t this research aims to study the impact of electronic information overload on knowledge management functions in jordanian industrial companies. the research population included all jordanian industrial companies listed on the amman stock exchange. a simple random sample of 30% of the research population of 1242 seniors and middle managers in the research population was done to 373 individuals. 206 questionnaires are successfully retrieved to be analyzed. descriptive and heuristic statistical methods such as simple and multiple regression analysis were applied using spss.16 program. the obtained result indicated that there is a statistically significant impact of the electronic information overload (organizational overload) on the knowledge management functions (acquisition, generation, transmission, sharing, and application of knowledge) in jordanian industrial companies. in the scope of the results, this work made a number of recommendations, including: adopting an organizational aspect that suits the nature of the tasks that the industrial companies operate in jordan, in addition to providing technical capabilities to reduce the electronic information overload faced by the industrial companies in jordan while practicing their tasks. index terms: knowledge management, organizational overload, statistical analysis, jordanian industrial companies corresponding author’s e-mail: muzhir shaban al-ani, department of information technology, university of human development, college of science and technology, sulaymaniyah, krg, iraq. muzhir.al-ani@uhd.edu.iq received: 14-09-2021 accepted: 28-10-2021 published: 04-11-2021 access this article online doi: 10.21928/uhdjst.v5n2y2021.pp 47-56 e-issn: 2521-4217 p-issn: 2521-4209 copyright © 2021 al-janabi, et al. this is an open access article distributed under the creative commons attribution non-commercial no derivatives license 4.0 (cc by-nc-nd 4.0) al-ani, et al.: the impact of regulatory overload in jordanian companies 48 uhd journal of science and technology | jul 2021 | vol 5 | issue 2 for the growth and progress of these organizations [7]. the information becomes the most important in terms of its accessibility and storage in electronic databases and then reemployment in which generated the overload of electronic information [8]. knowledge management is one of the most recent topics in the world of management and it is of great interest to stakeholders in business organizations sekaran, [9]. in addition, this increased of interest in the rush of organizations of various types toward the possibility of applying knowledge [10]. knowledge management is gateway to the development of contemporary organizations to enable them to meet future challenges [10]. 
the significance of knowledge management in business organizations is not in the knowledge itself, but in the value added to these organizations, in addition to the role it plays in transferring organizations to the knowledge economy that emphasizes investment in knowledge capital [11]. the business environment of organizations is characterized by rapid change, dominated by the infor mation and communication revolution [12]. knowledge is the weapon adopted by organizations to ensure their growth and sustainability [13]. participatory knowledge is widespread and increased by practice and use. knowledge is an important resource that contributes to the success of various organizations [14]. modern business organizations constantly strive to adapt in every stage of development in the knowledge economy and keep pace with the requirements of the era [12]. electronic information systems have become the basis for management and productivity processes in business organizations of all kinds (bawden and robinson, 2008) [13]. these systems are playing a clear role in the processes related to the objectives, business, marketing, and productivity of the organizations [14]. this study is characterized by the fact that it verified the regulatory overload in electronic information within the jordanian industrial companies. this was done through a survey questionnaire to study the reality of these companies within this new environment. this study is the first of its kind in this field and within the jordanian industrial sector. 2. statement of the problem the organization and its staff face with the phenomenon of information overload that requires attention, study, and treatment. since the companies in general and the jordanian industrial companies in particular deal with the large amount of information that is available as electronic data. this weakens their position in making various decisions and makes mistakes due to the excessive overload in the aspects of knowledge information. this requires companies to find a new mechanism to enable them to meet these overloads with the importance of finding a form of control over the application of this mechanism to determine the prospects for dealing with their data. therefore, the research seeks to measure the “knowledge management functions applied in jordanian industrial companies: study the impact of organizational overload.” 3. research questions the following questions are achieved to perform the research: • what is the level of managers’ perceptions of the regulatory overload in two dimensions (channels of communication and regulatory environment) in jordanian industrial companies? • what is the level of managers’ perceptions of the regulatory overload in two dimensions (communication channels and regulatory environment) in the jordanian industrial companies? • what is the level of perceptions of managers in the impact of the regulatory overload on the functions of knowledge management dimensions (acquisition, generation, transport, sharing, and application of knowledge) in the jordanian industrial companies? 4. 
research objectives the research hopes to achieve the following objectives: • measuring the impact of variables related to the electronic infor mation (regulator y overload) on knowledge management functions in the researched companies • identify the positive aspects that help to improve knowledge management functions and the negative aspects that limit the effectiveness of these functions • measuring the level of application of knowledge management functions by industrial companies in jordan; to reach appropriate recommendations that can be made to deal with electronic information (regulatory overload). al-ani, et al.: the impact of regulatory overload in jordanian companies uhd journal of science and technology | jul 2021 | vol 5 | issue 2 49 5. research hypothesis there is no statistically significant effect at the level (α ≤ 0.05) of the regulatory overload (channels of communication and regulatory environment) in the knowledge management functions (acquisition, generation, transmission, sharing, and application of knowledge) in jordanian industrial companies. 6. research model the study model is designed with its variables regarding to the problem of the study and its hypothesis and to achieve its purpose and reach its specific objective (fig. 1). 7. electronic information overload the overload of electronic information in businesses companies is very important and requires a series of creative measures to be achieved. the use of electronic data warehouse and the application of electronic knowledge in the organization are constantly re-using knowledge within the organization [16]. noted that technology caused the explosion of information due to the lower costs of multimedia technology, which simplified access to information and helped in their publication. as whelan and teigland, 2010, [16] explained, the information overload is a problem facing contemporary organizations. according to himma, 2007, [17], there is a difference in meaning between these two terms, contrary to what some think that the overload means an increase. this was made clear when the excess quantity was seen as a precautionary measure. therefore, this quantity had no negative effect and could be dealt without incurring high costs. if this increase becomes negative for individuals and becomes problematic, in which there is an overload. eppler and mengis, 2003, [15] indicated that the overload of information appears on the receipt of a large amount of information beyond the capacity of individuals to deal with a process, which reflects negatively on the quality of the decision. grise and gallupe [18] defined the overload of information as having a lot of information with the inability of the individuals concerned with that vast amount of information. mulder et al., 2006, [19] added that the information overload increases the sense of tension when the volume of information exceeds the capacity to be processed. kim et al., 2007, [20] considered that the information overload meant confusion in the information received that impeded learning and impaired individuals’ decision-making capacity. farhoomand and drury, 2002, [21] have warned that the information overload may be created by situations such as the web and the internet, which are the main reasons for the overloads because of the abundant information you provide from external resources. 
the complexity of organizational tasks and the lack of sufficient number of people to complete the tasks, which all lead to an information burden, as well as the huge amount of information coming to the offices of managers daily. not to mention the availability of a lot of information that is not understood by individuals or not knowing whether this information serves their orientation or not (dubosson and fragniere, 2009) [22]. lesa, 2009, [23] reported that the regulatory overload is endemic in today’s fast-track environment. inadequate regulatory environment for organizational learning impedes the flow of information, ideas, and knowledge into the organization, resulting in an information overload. lesa, 2009, [23] mentioned that there are a number of strategies for dealing with the information overload at the organizational level, including: establishing task-specific task forces, building informal relationships across the organizational structure and applying modern information management systems. the organizational overload (wilson, 2001) [24] is also a situation in which the information overload flowing widely from individuals across the organizational structure is reflected in a reduction in the overall effectiveness of the organization’s operations management. filippov and iastrebova, 2010, [25] explained that the regulatory overload implies an imbalance between the requirements for processing organizational information and the ability to process that information the burden of electronic information (independent variable) regulatory overload (communication channels, regulatory environment) knowledge management functions (dependent variable) (acquisition, generation, transmission, sharing and application) fig. 1. research model. al-ani, et al.: the impact of regulatory overload in jordanian companies 50 uhd journal of science and technology | jul 2021 | vol 5 | issue 2 within the organization. the organizational structure helps facilitate the collection, processing, and dissemination of information and the protection of individuals from the burden of information. below some concepts that clarify knowledge will be addressed: 7.1. knowledge information combined with experience and intuition (yan, 2009) [26]. that is, knowledge is information that has been processed, organized, and structured to be applicable (hester, 2009) [27]. knowledge consists of a combination of values, contextual information, and expertise, as well as new information and expertise that exists in knowledgeable minds, organizational routines, documents, rules, processes, and practices in organizations (haytham, 2005) [28]. knowledge is also seen as an intellectual capital and a critical component of today’s organizations and is growing with increasing practice and learning (najm, 2005) [29]. as knowledge is a combination of experience, practice, judgments, and values of both the individual and the organization that are reflected in the work of employing knowledge for the desired goals (salwa, 2008) [30]. there are two types of knowledge: tacit knowledge and explicit knowledge. tacit knowledge is the knowledge stored in human minds and behavior, and what is generated by learning from past experiences that are difficult to document and transfer to others. explicit knowledge is knowledge that can be shared among individuals, groups, and organizations and can be documented, stored electronically, transmitted and used through various means (jazar and talaat, 2005) [31]. 7.2. 
knowledge management fernandez et al., 2004, [32] defined knowledge management as doing what is needed to maximize the benefit of knowledge resources. since knowledge management is the gateway to adding and generating value by mixing knowledge elements to create better knowledge combinations, this will change the role of data, information and knowledge to flow individually (najm, 2005) [29]. mohammed and ziad, 2010, [33] described knowledge management as the organization’s knowledge resources and assets, adaptability and learning, increasing the creative process, sharing and optimal use of these assets. ashoc, 2004, [34] also noted that knowledge management is effective learning processes associated with the exploration, exploitation, and sharing of human knowledge (explicit and tacit), which applies appropriate civilization, technology, and culture to extract performance and intellectual capital. regarding to the organizational implications of knowledge management, it was pointed out that knowledge management contributes to the generation of knowledge that seeks to improve the performance of organizations through four dimensions (fernandez et al., 2004) [32]. these dimensions are influencing individuals, influencing processes, influencing product, and influencing organizational performance. there are five knowledge management functions as follows: • knowledge acquisition: knowledge acquisition is a function that seeks to gain knowledge and obtain it from a variety of documented sources, as well as the acquisition of undocumented sources that stored in the minds of individuals and issued through their behavior. knowledge can be gained from experts and stakeholders and information technology plays an important role in supporting knowledge acquisition through its role in data capture, classification, processing, and harnessing to build the competitive advantage of an organization (kamel, 1999) [35]. the function of knowledge acquisition goes through four stages (kamel, 1999) [35]: collection, interpretation, analysis, and design • knowledge generation: knowledge generation comes from a variety of sources and channels to expand the repositories of organizational memory and enable the organization to creatively solve solutions to its problems leading to innovation. it is individuals who generate knowledge within the organization. this will be done through four processes of knowledge transfer: social participation, embodied external knowledge, integrated internal knowledge, and synthetic knowledge • knowledge transfer: knowledge transfer depends on several factors that need to be considered: leadership, support for organizational structures, absorptive capacity, degree of privacy, degree of complexity, and dependability of knowledge vocabulary. knowledge is transferred through the use of management information systems, training and e-learning systems using the internet (ashoc, 2004; alavi and leidner, 2001) [10], [34] • knowledge sharing: knowledge sharing is an important element in production, responding to environmental changes, promoting opportunities, outperforming competitors, and maintaining the effectiveness of modern organizations. the knowledge base of the organization is increasingly shared by individuals with their knowledge and experience formally through regular formal meetings. the sharing of knowledge with individuals prevents the loss, fading and erosion of that knowledge over time. 
in addition, sharing knowledge between organization and other organizations al-ani, et al.: the impact of regulatory overload in jordanian companies uhd journal of science and technology | jul 2021 | vol 5 | issue 2 51 enhances the knowledge storage that will be available in organization repositories from those of other organizations [36] • knowledge application: it is a knowledge management purpose where modern organizations apply knowledge using web-based technology systems and knowledge retrieval techniques. furthermore, these systems provide the ability to access, transfer and use information in a timely and appropriate manner and to communicate with the right person [36]. zainab, 2009, [37] presented a case study of king abdul aziz university, which measured the readiness of organizations to apply knowledge management through the four dimensions of knowledge management (human dimension, technological dimension, strategic dimension, and operations dimension). he concluded that kau has a readiness to manage knowledge with a medium degree. carlevale, 2010, [38] concluded that technology causing the information overload caused by the huge amount of e-mail that managers are exposed to every day in which generating more pressure. that e-mail is the biggest cause of the burden of information on these managers on a daily basis, hinders the decision-making process and that incoming emails need to be filtered and managed. hodge, 2010, [39] noted that there is a positive correlation between knowledge management processes (capturing, storing, classifying, and applying) and knowledg e management capabilities (lessons learned, experiences, and knowledge documents). salwa, 2008, [30] emphasized the role of knowledge management and information technolog y in achieving competitive advantages that concluded the banks in question apply the knowledge management technology system in all units and departments within banks, although there is no organizational unit or special department for knowledge management and information technology. inside any bank (ismail and yusof, 2010) [40] showed that there is a positive relationship between individual factors (awareness, confidence, and personality), and the quality of knowledge sharing. personal style (extrovert and introvert) is the most important for the quality of knowledge sharing followed by trust and awareness. 8. methodology 8.1. research: questionnaire design the overload of electronic information in businesses companies and the relevant factors and tasks that have been explained in the previous sections. this section explains how the questionnaire is designed and what are their paragraphs. the questionnaire is designed based on the likert scale of five fields: strongly agree, agree, neutral, disagree, and strongly disagree. regarding to the research model the designed questionnaire is divided into: independent variable (regulatory overload) and dependent variable (knowledge management functions). • independent variable (regulatory overload) (fig. 2): this field is divided into two parts: communication channels and regulatory environment. communication channels related to loss of boundaries between roles, tasks and work duties, which affects the movement and exchange of information within the departments of a single organization due to the inadequacy or lack of clarity of the organizational structure. 
regulatory environment related to the availability of work requirements in the work environment through which the organization can control the variables of its environment of humans, devices and the administrative decisions • dependent variable (knowledge management functions) (fig. 3): this field is divided into five parts: knowledge acquisition, knowledge generation, knowledge transfer, knowledge sharing, and knowledge application. knowledge acquisition according to obtain knowledge from various internal sources (such as collaboration, learning, feedback from staff, workshops, training programs, and databases within which knowledge is stored) and external (such as competitors, customers, consultants, attract experienced and competent personnel, and establish relationships with partners and allies). knowledge generation regarding to derive and create new creative knowledge from existing knowledge within the organization to secure various types of knowledge for the benefit of future decisions which are concerned with equipping workers in the knowledge field with graphics and analysis, and this is done through teaching, learning, research, and development. knowledge transfer related to communicate the right knowledge to the right person in the appropriate manner (communications, bulletins, reports, staff movements, and use of technological means to facilitate knowledge transfer), and unintentional (informal meetings of individuals) at the right cost and at the right time. knowledge sharing regarding to circulate and exchange of various types of knowledge among individuals. interacting with others’ dialogues inside and outside the organization, securing collective cooperation among them, reaching out and working simultaneously on the same document and from different locations to form new creative mental ideas. knowledge application al-ani, et al.: the impact of regulatory overload in jordanian companies 52 uhd journal of science and technology | jul 2021 | vol 5 | issue 2 dealing with the utilizing of knowledge to support innovation, development of people and resource, business improvement, using specific technolog y systems and knowledge dissemination channels at all organizational levels. 8.2. research: population and sample the study population consisted of the industrial companies listed in the amman stock exchange, and the inspection and analysis unit included all directors working in the higher managements (general managers, their assistants, or their representatives), as well as managers working in the middle managements (managers of the main departments and heads of departments). a relatively random sample of (1242) members of the sampling and analysis unit, (30%) was selected to become the sample (373). the number of questionnaires valid for statistical analysis (206) questionnaires (55%) of the total questionnaires distributed. regarding to data sources to achieve the objective of the study, secondary sources were adopted, which include those data and information published in various library sources for review of previous literatures. the primary sources of data were the questionnaire built for this purpose which was aimed at obtaining the raw data to complete the applied aspect of the study in terms of handling the study questions and testing the hypotheses. fig. 2. regulatory overload questioner. fig. 3. knowledge management functions questioner. 
9. results and discussion
descriptive statistical analysis of the study variables was applied to the obtained data. cut-offs of low (less than 2.33), middle (2.33–3.66), and high (3.77 and above) were used to determine the relative importance of the respondents' perceptions of the study questions based on the 5-point likert scale; the arithmetic means and standard deviations are shown in fig. 4.

question 1: what is the level of perceptions of managers regarding the regulatory overload (channels of communication and regulatory environment) in jordanian industrial companies? the results in fig. 4 show that the respondents' level of perception of the regulatory overload was high, and the researchers attribute this to the fact that the administrative organization in the industrial companies serves the nature of the work performed in the presence of electronic systems. regarding the regulatory environment, the result indicates an average level of influence of the regulatory environment, as one dimension of the regulatory overload, on knowledge management, where the general arithmetic mean for the type of technology was (3.35) and the standard deviation (0.67).

question 2: what is the level of perceptions of managers regarding the knowledge management functions (acquisition, generation, transmission, sharing, and application of knowledge) in jordanian industrial companies? the results in fig. 4 indicate that the respondents' level of response was high, with an arithmetic mean of 3.71 and a standard deviation of 0.57. the roles of the work they do and this knowledge are carried out in accordance with the surrounding environment, as they derive from the external environment that reflects the companies' relations with customers as well as the relationships between companies at the level of the industry.

multiple regression and the accompanying tests are used to verify the validity of the hypotheses. the f-test is used for the significance of the regression model, the t-test for the significance of the effect, and the coefficient of determination r2 to determine the percentage of variance in the dependent variable explained by the independent variables, depending on the statistical significance values extracted with the statistical software. the hypothesis tested in this study is that there is no statistically significant effect at the (α ≤ 0.05) level of the regulatory overload (communication channels and regulatory environment) on the knowledge management functions (acquisition, generation, transmission, sharing, and application of knowledge) in jordanian industrial companies.

fig. 4. arithmetic mean and standard deviation (regulatory overload).
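the regression analysis described above was carried out in spss; purely for illustration, a minimal python sketch of the same kind of test (model f-test, coefficient t-tests, and the coefficient of determination r2) using statsmodels is given below. the data generated here are random placeholders, not the study's survey responses, and the variable names are assumed.

import numpy as np
import pandas as pd
import statsmodels.api as sm

# hypothetical respondent-level scores on the 5-point likert scale (n = 206, as in the study sample)
rng = np.random.default_rng(0)
overload = rng.uniform(1, 5, 206)
km = 1.5 + 0.4 * overload + rng.normal(0, 0.5, 206)
df = pd.DataFrame({"regulatory_overload": overload, "km_functions": km})

# simple regression: knowledge management functions ~ regulatory overload
X = sm.add_constant(df[["regulatory_overload"]])
model = sm.OLS(df["km_functions"], X).fit()

print(model.rsquared)    # coefficient of determination r2
print(model.fvalue)      # f statistic for the significance of the regression model
print(model.tvalues)     # t statistics for the significance of each coefficient
print(model.pvalues)     # significance levels, compared against alpha = 0.05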
table 1: the impact of regulatory overload on knowledge management functions dependent variable coefficient of determination standard deviation f-value degree of freedom regression coefficients independent variable β standard error t-test significance level acquisition of knowledge 0.318 0.101 22.966 (1,204) regulatory overload 0.373 0.078 4.792 0.000 generation of knowledge 0.339 0.115 26.416 (1,204) regulatory overload 0.419 0.081 5.140 0.000 transmission of knowledge 0.422 0.178 44.278 (1,204) regulatory overload 0.499 0.075 6.654 0.000 sharing of knowledge 0.337 0.113 26.091 (1,204) regulatory overload 0.422 0.083 5.108 0.000 application of knowledge 0.311 0.097 21.813 (1,204) regulatory overload 0.337 0.072 4.670 0.000 functions of knowledge management 0.398 0.158 38.396 (1,204) regulatory overload 0.410 0.066 6.196 0.000 al-ani, et al.: the impact of regulatory overload in jordanian companies 54 uhd journal of science and technology | jul 2021 | vol 5 | issue 2 table 1 shows that the simple regression model is applied to measure the impact of the organizational burden on the dimensions of knowledge management functions, in which have significant impact. this effect is significant based on the test value (t = 6.196) when compared with the value of the significance level (sig = 0.000 ≤ 0.05). table 2 indicates that multiple regression model to measure the effect of both dimensions (communication channels and organizational environment) in knowledge management functions is significant, where the value of (f = 46.863) at the level of significance (sig = 0.000). together, the two variables explain that r2 = 31.6% of the differences in the values of knowledge management functions are reinforced by this result (t = 9.170). 10. conclusions focusing of the previous discussions, the study reached a number of conclusions as below: • the results of the study pointed to the relative importance of the regulatory overload in relation to the communication channels. this reflects that the roles of individuals are not clearly defined. the results of the study concur with (lesa, 2009) [23] in the regulatory section, considering that regulatory factors are the primary cause of the information burden phenomenon according to (lesa, 2009) [23]. the study also agreed with (raoufi, 2003) [41] in terms of organizational factors, especially on the leadership side and their impact on the information overload created especially with those working in the field of knowledge • the results in the level of importance of the regulatory overload in relation to the channels of communication in the jordanian industrial companies coincided with the results of the analysis of the regulatory environment. 
(manovas, 2004) [42], particularly in the field of knowledge transfer, learning culture, sharing, and incentive systems as elements of infrastructure in the regulatory environment • the results of the study show that there is a high level of interest in knowledge generation due to the ability of managers to diversity knowledge sources and their focus on research and development and bridging knowledge gaps as a result of developments in the work environment and attention to the internal organizational dimension in the generation of knowledge through table 2: multiple regression analysis to test the effect of organizational burden dimensions on the dimensions of knowledge management functions dependent variable coefficient of determination standard deviation f-value degree of freedom significance level regression coefficients independent variable β standard error t-test significance level acquisition of knowledge 0.493 0.243 32.556 (203,2) 0.000 communication channels 0.631 0.081 7.835 0.000 regulatory environment 0.090 0.057 1.568 0.118 generation of knowledge 0.492 0.242 32.456 (203,2) 0.000 communication channels 0.655 0.085 7.701 0.000 regulatory environment 0.067 0.061 1.114 0.267 transmission of knowledge 0.521 0.271 37.776 (203,2) 0.000 communication channels 0.612 0.080 7.689 0.000 regulatory environment 0.024 0.057 0.421 0.674 sharing of knowledge 0.485 0.235 31.232 (203,2) 0.000 communication channels 0.652 0.087 7.534 0.000 regulatory environment 0.063 0.062 1.024 0.307 application of knowledge 0.457 0.209 26.741 (203,2) 0.000 communication channels 0.534 0.076 7.011 0.000 regulatory environment 0.059 0.054 1.083 0.280 functions of knowledge management 0.562 0.316 46.863 (203,2) 0.000 communication channels 0.617 0.067 9.170 0.000 regulatory environment 0.051 0.048 7.835 0.288 al-ani, et al.: the impact of regulatory overload in jordanian companies uhd journal of science and technology | jul 2021 | vol 5 | issue 2 55 brainstorming processes. this result is consistent with the findings of the zakia, 2009, [43] that studied on knowledge sources, acquisition and transmission. the results were also consistent with the zainab, 2009, [37] which examined the dimensions and processes of knowledge management (acquisition, generation, transmission, distribution, and application) at king abdulaziz university • the results of the study demonstrated the impact of the regulatory overload (channels of communication and regulatory environment) on knowledge management functions (acquisition, generation, transmission, sharing, and application of knowledge) in jordanian industrial companies from the point of view of managers working in jordanian industrial companies. the results of the current study are consistent with the (carlevale, 2010) [25] study, as reliance on communications technology creates a burden, especially e-mail, which creates a burden for managers and that e-mail needs to be filtered. this was also agreed with (dubosson and fragniere, 2009) [27] and (lesa, 2009) [23]. 11. 
recommendations regarding to the results reached through the research, the researchers provide a number of recommendations to adopt them by the jordanian industrial companies in the course of the research, so as to adapt them in reducing the burden of electronic information in term of regulatory overload and these recommendations as follows: • adopting the type of technology appropriate to the environment in which jordanian industrial companies operate in such a way as to reduce the burden of information in term of regulatory overload that may be exposed in carrying out their decision-making tasks • the jordanian industrial companies should conduct a strategic analysis of the strengths and weaknesses that are reflected in the performance of the company’s departments and departments and determine their impact on increasing or decreasing the regulatory overload in the departments • developing companies in their regulatory environment to achieve effective communication systems based on the concept of reducing the regulatory overload in an attempt to restructure their systems to achieve their effectiveness • companies continue to filter and exclude unnecessary information in a way that reduces the large amount of information that restricts the capabilities of the employees of the initiative and this does not negatively affect the capabilities of the public in providing initiatives in the field of electronic work, and not affected by the capabilities of employees in solving electronic problems. references [1] s. n. jawad, a. a. m. shaban, h. h. ali and i. husen. “small business management, a technology entrepreneurial perspective”. safa publishing house, amman, jordan, 2010. [2] national information technology council (nitc). “malaysia, (k-economy-introduction and background)”. 2004. available from: http://www.nitc.org. [last accessed on 2017 dec 15]. [3] m. song, h. bij and m. weggeman. “factors for improving the level of knowledge generation in new product”. r & d management, vol. 36, no. 2, pp. 173-187, 2006. available from: http://www.ssrn. com. [last accessed on 2017 dec 15]. [4] t. asmahan and m. ibrahim. “requirements for sharing knowledge and obstacles facing its application in jordanian telecommunication companies, presented to the scientific conference”. applied science university, amman, jordan, 2007. [5] a. a. m. shaban and j. s. naji. “management process and information technology”. al-ethaa publishing house, amman, jordan, 2008. [6] a. a. m. shaban and j. s. naji. “business intelligence and information technology”. amman, jordan, safa publishing house, 2012. [7]. a. m. sami. “measuring the impact of organizational culture factors on the implementation of knowledge management in the jordan telecom group (orange): case study, unpublished master thesis, graduate school of administrative and financial studies”. amman, jordan, amman arab university for graduate studies, 2008. [8] n. abboud. “knowledge management concepts, strategies and operations”. dar al warraq, amman, jordan, 2005. [9] u. sekaran. “research methods for business”. 4th ed. john wiley & sons, ltd., new york, united states, 2003. [10] m. alavi and d. leidner. “review: knowledge management and knowledge management systems: conceptual foundation and research issues”. mis quarterly, vol. 25, no. 1, p. 107-136, 2001. available from: http://www.ebsco.host.com. [last accessed on 2017 dec 15]. [11] z. mohammed. “contemporary trends in knowledge management”. safaa publishing and distribution house, amman, jordan, 2008. [12] m. 
molecular detection of enterotoxigenic escherichia coli toxins and colonization factors from diarrheic children in pediatric teaching hospital, sulaymaniyah, iraq

hezhan faeq rasul, sirwan muhsin muhammed, huner hiwa arif, paywast jamal jalal
department of biology, college of science, university of sulaymaniyah, sulaymaniyah, iraq

abstract
enterotoxigenic escherichia coli (etec) is a well-established causative agent of diarrhea among young children in developing countries. this prospective study was performed at the laboratories of the university of sulaimani (sulaymaniyah city, iraq) from september to october 2021 and aimed to determine the prevalence of etec among children and the most prevalent colonization factor (cfa/i) among etec. one hundred and twenty-five fresh stool samples were collected from hospitalized children with diarrhea at dr. jamal ahmed rashid's pediatric teaching hospital. the collected samples were cultured on macconkey and eosin methylene blue agar as selective and differential media for gram-negative bacteria. colonies were identified through gram staining and biochemical tests including indole, methyl red, and catalase reaction tests; the vitek-2 system was used to confirm some of the obtained isolates. most of the isolates (60%) were positive for e. coli, and of these, 14 (18.66%) were positive for etec using a polymerase chain reaction assay identifying the heat-stable (st) and heat-labile (lt) toxin genes. all of the etec isolates were stable toxin producers, whereas lt-producing isolates were not identified. colonization factors were detected in three etec isolates (21.42%), whereas 11 isolates (78.57%) did not express colonization factors at all.

index terms: escherichia coli, enterotoxigenic e. coli, stable toxin, labile toxin
corresponding author's e-mail: hezhan faeq rasul, department of biology, college of science, university of sulaymaniyah, sulaymaniyah, iraq. e-mail: hezhan.rasul@univsul.edu.iq
received: 16-06-2022; accepted: 31-08-2022; published: 22-09-2022
doi: 10.21928/uhdjst.v6n2y2022.pp49-57; e-issn: 2521-4217; p-issn: 2521-4209
copyright © 2022 rasul, et al. this is an open access article distributed under the creative commons attribution non-commercial no derivatives license 4.0 (cc by-nc-nd 4.0)

1. introduction
bacterium coli commune was initially reported as a commensal gram-negative rod from the healthy individual's intestinal flora by theodor escherich, a german pediatrician, in 1885, and in his honor these rods were named escherichia coli [1]. the genus escherichia is widely distributed, and e. coli is the most common facultative anaerobe found in the large intestine of humans and warm-blooded animals [2]. depending on the combination of virulence determinants present, specific pathotypes are defined, which are collectively recognized as diarrheagenic e. coli (dec) [3]. dec pathotypes are classified into enteropathogenic e. coli (epec), enterotoxigenic e. coli (etec), enteroinvasive e. coli (eiec), enterohemorrhagic (shiga toxin-producing) e. coli (ehec), and enteroaggregative e. coli (eaec). these pathotypes vary widely in terms of preferred host colonization sites, virulence mechanisms, clinical symptoms, and outcomes [4]-[6]. in developing countries, etec is still one of the most common causes of infectious diarrhea in travelers and children [7]. watery diarrhea, vomiting, stomach cramps, and,
in some circumstances, a decline in body temperature are the common symptoms of etec infections [8]. infections can be self-limiting in normally healthy people, but they may be fatal among children and young adults as well as among immunocompromised patients [9]. etec causes over 200 million cases of diarrhea and 380,000 deaths per year, mostly among children below the age of 5 [10]. etec's ability to adhere to and colonize the intestinal epithelium is critical for its pathogenicity, as is its ability to produce heat-labile (lt) enterotoxin and/or heat-stable (st) enterotoxin, both of which can produce diarrhea. st is a small peptide made up of 18–19 amino acid residues, whereas lt is a high-molecular-weight (84 kda) enterotoxin with an active a subunit surrounded by five identical binding b subunits [11]. the two main genotypes of st are sta and stb; typically, etec strains isolated from people produce sta (sti or sth), which is encoded by the esta gene, whereas stb (stii or stp) is primarily produced by animal etec strains and is encoded by estb [12]. the lts that etec strains produce are likewise a diverse category of toxins. there are two main lt families, known as lt-i and lt-ii [13]. the lt genes elta and eltb produce lt-i and lt-ii, respectively. the st genes can be expressed independently or in tandem with the lt genes elta and eltb [13]. etec strains can express seven different toxin combinations: sth, stp, sth/lt, stp/lt, lt, and, less typically, sth/stp and sth/stp/lt [14]. the existence of colonization factors (cfs) on the membrane of the bacterial cell, which normally form pili, also known as fimbriae, is necessary for colonization [15]. depending on antigenic specificity and/or the n-terminal amino-acid sequence of the main (pilin) subunit, different forms of colonization factor antigens (cfa) and putative colonization factors have been identified, such as cfa/i, cs1, cs2, cs3, cs4, cs5, cs6, cs7, cs14, cs17, cf19, cf21, and cf22 [16].

there is limited research available on etec colonization and the prevalence of diarrheagenic etec among humans, particularly children below the age of six, in this region. to fill this gap, this study has been conducted to determine e. coli, etec toxin (st and lt) producers, and colonization factors from children under 6 years suffering from watery diarrhea in the pediatric teaching hospital in sulaimani city.

2. materials and methods
2.1. sample collection
during the period from september to october, 125 stool samples were collected from children <6 years at dr. jamal ahmed rashid's pediatric teaching hospital and smart private hospital in sulaymaniyah city. both sexes were included (63 females and 62 males). the necessary information about the patients was taken from the hospitals, and the collected samples were transferred from the hospitals to the advanced bacteriology laboratory of the biology department of the university of sulaimani within less than 3 hours in an ice box for culturing.
2.2. bacterial cultivation and characterization
all samples were first cultured on differential and selective media for presumptive isolation of gram-negative enteric rods. these included macconkey agar for primary isolation and eosin methylene blue (emb) agar for confirmation, as described in [17], [18]. all lactose-fermenting, deeply pink, circular, medium-sized colonies were subcultured on emb agar (neogene, uk). all plates were incubated at 37°c for 18–20 h, and colonies showing a green metallic sheen on emb agar were considered e. coli strains. the e. coli isolates utilized in the present study were identified by gram staining [19] and initial biochemical tests including indole [20], methyl red [21], and catalase [22]; further bacteriologic characterization using the vitek-2 system (vitek®2 gn id card; biomerieux, france) was performed for some of them [23].

2.3. dna extraction and purification
the dna of the isolates under test was isolated and purified using the following procedures.

2.3.1. colony extraction
two colonies from a fresh bacterial culture were mixed with 40 μl of ddh2o and heated at 95°c for 10 min in the thermocycler; the lysate was then centrifuged at 12,000 rpm for 1 min and the supernatant was used as a polymerase chain reaction (pcr) template [24].

2.3.2. dna extraction with kit
overnight fresh colonies from nutrient broth (neogene, uk) were utilized. genomic dna from e. coli isolates was extracted and purified using a dneasy kit (addprep genomic, korea) according to the manufacturer's protocol.

2.4. pcr method
the pcr mixture contained the dna template, forward/reverse primers (macrogen, korea), master mix (taq master, 2× conc.; addbio, korea), and deionized water (accumax, korea).

2.4.1. 16s rrna
pcr was performed for 75 e. coli samples to identify the 16s rrna gene. the reaction included an initial denaturation for 5 min at 94°c, followed by 35 cycles of amplification (1 min at 94°c, 1 min at 56–58°c, and 1 min at 72°c), and a final extension of 7 min at 72°c [25].

2.4.2. st and lt
96-well plates were used to amplify the stable and labile toxin genes. the pcr procedure included pre-incubation at 95°c for 1 min, followed by 35 cycles of 1 min at 95°c, 1.10 min at 45°c, and 1.30 min at 72°c, with a final incubation at 72°c for 5 min [26]. the products were run on a 2% agarose gel (transgen, china).

2.4.3. colonization genes
to identify the colonization factor genes cfa/i, cs1, cs2, cs3, cs4, cs5, cs6, cs14, and cs17, the same procedure was followed with a different primer pair for each gene. the genes were amplified by an initial denaturation at 94°c (1 min), followed by 35 cycles of amplification (94°c for 30 s, 52°c for 30 s, and 72°c for 1 min), and finally 5 min at 72°c [27]. the amplicons were separated on 3% agarose gels by gel electrophoresis (cleaver-cs-300v, uk) and then visualized with ethidium bromide (transgen, china). the specificity of the primers was tested by blast search, and the primers are illustrated in table 1.

table 1: reference, primer sequence, and product size (bp) for the 16s rrna, st and lt toxin, and colonization factor primers
target (primer name) | bp | forward primer (5'-3') | reverse primer (5'-3') | reference
16s rrna (16s-f/16s-r) | 426 | gacgtactcgcagaataagc | ttagtcttgcgaccgtactc | [25]
st toxin (stf/str) | 186 | tctgtattgtctttttcacc | ttaatagcacccggtacaagc | [26]
lt toxin (ltf/ltr) | 273 | acggcgttactatcctctc | tggtctcggtcagatatgtg | [27]
cfa/i | 170 | gcttattctcccgcatcaaa | acttgtcctccccatgacac | [27]
cs1 | 243 | tccgttcggctaagtcagtt | ccgcacatttcctgtgttct | [27]
cs3 | 100 | ctagctttgccaccaccatt | ggcaactgactcccatttgt | [27]
cs5 | 226 | tccgctcccgttactcag | gaaaagcgttcacactgtttatatt | [27]
cs4 | 198 | acctgcggcaagtcgttt | tctgcaggttcaaaagtcaca | [27]
cs6 | 152 | ctgtgaatccagtttcgggt | caggaacttccggagtggta | [27]
cs14 | 162 | tttgcaaccgacatctacca | ccggatgtagttgctccaat | [27]
cs17 | 130 | ggagacgctgaatacaactga | ctcaggcgcagttccttgt | [27]
cs2 | 368 | agtggtggcagcgaaactat | ttcctctgtgggttctcagg | [27]

2.5. dna sequencing
sequencing was performed for 10 samples amplified with the 16s rrna-f and 16s rrna-r forward and reverse primers (10 pmol). dna sequencing was achieved by sanger sequencing (abi 3500, macrogen genome center, korea) using a bigdye kit.
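as an illustration only (not part of the original study's workflow), the primer panel of table 1 can be written as a small python dictionary that maps each pcr target to its expected amplicon size, together with a helper that matches an observed gel band to the closest expected product. the dictionary values are taken from table 1; the helper function and its tolerance are hypothetical conveniences.

# illustrative sketch: expected amplicon sizes (bp) for the panel in table 1
PRIMER_PANEL_BP = {
    "16s rrna": 426, "st": 186, "lt": 273, "cfa/i": 170, "cs1": 243,
    "cs2": 368, "cs3": 100, "cs4": 198, "cs5": 226, "cs6": 152,
    "cs14": 162, "cs17": 130,
}

def match_band(observed_bp, tolerance=8):
    """return the panel targets whose expected product size lies within
    +/- tolerance bp of an observed band (empty list = no match)."""
    return [target for target, size in PRIMER_PANEL_BP.items()
            if abs(size - observed_bp) <= tolerance]

print(match_band(225))  # ['cs5']  (226 bp product)
print(match_band(185))  # ['st']   (186 bp product)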
2.6. phylogenetic tree
evolutionary analysis was conducted with the mega7 program. the evolutionary history was deduced using the kimura 2-parameter model and the maximum likelihood technique [28]. the tree with the greatest log likelihood (−505.1852) is shown. next to each branch is the proportion of trees in which the related taxa clustered together. the starting tree(s) for the heuristic search were automatically generated by applying the neighbor-joining and bionj algorithms to a matrix of pairwise distances calculated using the maximum composite likelihood technique and then picking the topology with the best log likelihood value. to represent evolutionary rate differences across sites, a discrete gamma distribution was utilized (5 categories [+g, parameter = 0.0500]). branch lengths are measured in the number of substitutions per site, and the tree is drawn to scale. a total of 18 nucleotide sequences were examined. the codon positions included were 1st + 2nd + 3rd + noncoding. all positions containing gaps and missing data were eliminated, leaving a total of 320 positions in the final dataset [29].
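for readers who want to reproduce the general workflow of section 2.6, the outline below is a rough sketch only: the study built its tree in mega7 with the maximum likelihood method and the kimura 2-parameter model, whereas this biopython sketch builds a simple neighbour-joining tree from an already-aligned set of 16s rrna sequences ("16s_alignment.fasta" is a hypothetical file name), just to show the distance-matrix-to-tree steps.

# illustrative distance-based alternative, not the authors' mega7 pipeline
from Bio import AlignIO, Phylo
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor

alignment = AlignIO.read("16s_alignment.fasta", "fasta")   # pre-aligned 16s sequences
calculator = DistanceCalculator("identity")                # simple p-distance model
distance_matrix = calculator.get_distance(alignment)

constructor = DistanceTreeConstructor()
tree = constructor.nj(distance_matrix)                     # neighbour-joining topology

Phylo.draw_ascii(tree)                                     # quick text rendering
Phylo.write(tree, "16s_nj_tree.nwk", "newick")             # save for a tree viewer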
3. results and discussion
3.1. detection of dec
out of 125 tested samples, 83 (66.4%) were gram-negative bacteria after culture on emb and macconkey agar, showing a green metallic sheen and pink colonies, respectively. seventy-five (60%) of them were purple rods on gram staining; when grown in peptone water, they formed a pink ring at the top of the tubes after the addition of kovac's reagent, and the color of the broth culture changed to red after adding methyl red indicator during the methyl red test. when h2o2 was added to fresh colonies, bubble formation indicated a positive catalase test. the vitek-2 similarity to an ideal e. coli profile, determined for samples 70, 23, 60, and 35, was 99%, 93%, 87%, and 94%, respectively. seventy-five isolates were positive by 16s rrna-based pcr, as shown in fig. 1.
fig. 1. 16s rrna gene pcr of escherichia coli: m is the 100 bp ladder; lanes 2–6 are the 16s rrna gene.
our study showed that the most common diarrheagenic pathogen among the gram-negative isolates in sulaimani is e. coli, which is compatible with the results reported in a local study by hasan et al. (2020) done in duhok city [30]. shatub et al. (2021) found similar results (61.3%) [31], whereas khalil (2015) in baghdad reported lower positive rates (38.6%) [32], as did other investigators who showed lower positive results [33]-[35]. several studies from around the world revealed varying dec detection rates in e. coli among children under 5 years old, ranging between 4% and 87%: in africa 22.9%, 7.4%, 55.9%, and 86.5%; in asia 45.2%, 4.7%, and 6.82%; and in america 5.5%. these variations could be related to changes in dec pathotype distribution from one region to another, and also between countries in the same region [36]. according to many reports around the world, various factors may be the primary causes of diarrheal outbreaks, including traveling to tropical zones, consuming contaminated food or water, and lack of personal hygiene [35]. these differences should, however, be considered in light of the fact that prior studies have focused on particular aspects such as geographical conditions, sampling period, study population, hygienic level of the region, and detection technologies [37]. the proportion of infected males, 44 (58.6%), was relatively higher than that of females, 31 (41.3%); this male predominance is similar to the result reported by hasan et al. (2020), who reported 87.4% among males and 87.0% among females [30]. the current observations also agree with the results mentioned by amir et al. (2020) in iran, who showed 53.01% for males and 46.99% for females [35]. similarly, our observations were parallel to the results concluded by ochien and atieno (2021) in west kenya (55.9% and 44.1% for males and females, respectively) [38], whereas the current results did not agree with the result of abbasi in iran (2020), who reported higher rates among females than males [34].

3.1.1. dna sequencing
all accession numbers are shown in fig. 2. ay342058.1 is the accession number of the st gene for which sequencing was performed. a phylogenetic tree based on the 16s ribosomal rna sequences was extracted from a representative set of 10 enterobacteriaceae genomes and compared with several other strains (fig. 2). all strains fell into one clade, and the e. coli clade was most similar to shigella flexneri. the numbers next to the branches are the proportions with which the related taxa clustered together, with bootstrap percentages from 100 replications. some of the tree nodes are uncertainly predicted. it has been concluded that the analysis of variable genes identifies interstrain relationships that may be correlated with the lifestyle of the organisms [39].
fig. 2. molecular phylogenetic analysis by the maximum likelihood method.

3.2. identification of etec toxins
etec characterization using pcr was performed on the isolates from the tested children who suffered from acute non-bloody diarrhea due to e. coli (n = 75). all e. coli isolates were evaluated by monoplex pcr for the st gene and the lt gene. a total of 14 (18.66%) were proved to be etec. the majority toxin profile among the selected strains was st, whereas lt was not recorded in the present study, as shown in fig. 3, while all other pathotypes of e. coli accounted for 81.33%. results of the present study agreed with observations reported in studies done in other parts of kurdistan and iraq; pcr-based studies showed different percentage rates of etec in stool samples ranging from 18% to 26% [30], [32], [36], [40], [41].
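purely as an illustration (this calculation is not part of the original analysis), the reported detection rate of 14 etec among 75 e. coli isolates can be given a 95% confidence interval, which helps show how the 18.66% estimate sits against the 18-26% range cited from other iraqi and kurdistan studies.

# illustrative re-check of the reported counts; not from the original study
from statsmodels.stats.proportion import proportion_confint

etec_positive, e_coli_isolates = 14, 75
rate = etec_positive / e_coli_isolates
low, high = proportion_confint(etec_positive, e_coli_isolates,
                               alpha=0.05, method="wilson")
print(f"etec detection rate: {rate:.1%} (95% ci {low:.1%} - {high:.1%})")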
results of this study were not parallel to conclusions mentioned by other investigators who reported lower percentage rates of positive results [34], [37]. etec is more common in low- and middle-income countries, where it is a prominent pathogenic strain in travelers' diarrhea, with a large burden on these countries [42]. etec was recognized as a major pathotype among children below 5 years old. the high percentage rates of etec-positive results could be attributed to the families' poor hygiene and artificial feeding [30], [41], [43]. variations between our results and other research could be related to differences in the primers utilized, geographical considerations, the population targeted, and sample size [44]. in nine of the 12 studies conducted in africa, 22 of 34 investigations in asia, and three of six studies in latin america and the caribbean, etec was the first or second most often isolated pathogen [45]. the proportions of infected males and females with etec were different: in our study, 9 (64.3%) etec isolates were from males, while the remaining 5 (35.7%) were from females, which is parallel to results mentioned by other studies from west kenya [38], whereas other investigators [46] found a higher percentage of positive results among females, which does not agree with our results.

3.3. distribution of etec among children according to age
the etec pathotype was identified in all children according to their ages, with a slightly higher number of etec-infected children under 12 months [47]. our result highlighted the significance of etec as a cause of childhood diarrhea between the ages of 1 and 12 months, as shown in fig. 4. the current observations agreed with results reported by shatub et al. (2021) from tikrit/iraq [31], whereas our results differed from the outcome of khalil (2015) [32]. our findings are backed up by a review article that looked at etec infection from 1984 to 2005, stratified infants by age, and found that the peak incidence occurs after 6 months and can last until 18 months [48]. this might be related to the duration of breastfeeding, the source of drinking water, cleanliness, sanitation, age, and the level of maternal education. in contrast, in the finding by abdul-hussein et al. (2018) in wasit/iraq, the highest etec prevalence was found among children aged 3–24 months [33].
fig. 4. distribution of etec among children according to their age.

3.4. stable and lt
in our results, it was noticed that 14 (18.66%) st genes were present among the isolates, while no isolate was identified carrying lt alone or the st-lt toxin combination, as shown in fig. 3. st was thus the most common toxin gene. in other regions of kurdistan and the rest of the world, there are several reports showing differences in the prevalence of the etec pathotype. the present study differed from a study in duhok city by hasan et al. (2020), in which lt toxin alone was identified in 37% of the cases [30], whereas seven strains positive for the lt gene and three strains positive for st were identified by khalil (2015) in baghdad [32]. on the other hand, our study was close to a study by shahbazi et al. (2021) that found three st and one lt gene in their results.
fig. 3. agarose gel electrophoresis for st gene pcr products of outpatient isolates of etec. lane m: 100 bp dna ladder, followed by lanes 1-14 for the st gene; lane 15: negative control (distilled water).
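as a side note (the paper reports only the raw proportions, so the test below is illustrative rather than part of the original analysis), the male/female split of etec positivity (9 of 44 e. coli-positive males versus 5 of 31 e. coli-positive females) can be checked with fisher's exact test on the corresponding 2x2 table.

# illustrative comparison of etec positivity by sex; not from the original study
from scipy.stats import fisher_exact

table = [[9, 44 - 9],    # males:   etec-positive, etec-negative
         [5, 31 - 5]]    # females: etec-positive, etec-negative
odds_ratio, p_value = fisher_exact(table)
print(f"odds ratio = {odds_ratio:.2f}, p = {p_value:.3f}")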
besides, the prevalence of the lt and st genes in several other studies was as follows: a study by alizade et al. (2014) found 11.97% st and 9.86% lt, saka et al. (2019) concluded with 19 st and 14 lt, and nazarian et al. (2014) reported 4 st and 1 lt [36], [37], [46], [49]. st, a peptide of only 18 or 19 amino acids, becomes antigenic only when combined with a carrier protein; as a result, after infection with st-producing etec, immunological responses to st are not produced. the percentage of strains that produce lt alone, st alone, or lt/st varies by geographic region; in general, 30–50% of clinical etec isolates appear to produce st solely [9], [45]. for pathogenic strains such as etec, it has been shown that specific host conditions might increase or decrease bacterial virulence. the impact of glucose and bile on the gene expression and protein level of st generated by different etec isolates was investigated by joffre et al. (2016), who discovered that there are unique sta amino acid variants that respond differently to environmental signals such as bile [50]. a substantial amount of literature highlights the effect of the seasons on the prevalence of etec-associated infections, whereby this type of infection was repeatedly identified in late spring and throughout the summer [9], [51]-[54]. however, in our study, the incidence of etec-associated infection is lower than the expected rate when compared to other studies; this may be due to the period of data collection, which was performed in september and october.

3.5. detection of colonization factors
among the 14 st-producing etec isolates, nine primers were chosen for the most common cfs; only 3 (21.42%) etec isolates showed a cf, while 11 (78.57%) etec isolates were without a cf. in our results, only 1 isolate (7%) possessed cfa/i, 1 (7%) showed cs4, and 1 (7%) carried cs5, as shown in fig. 5. among clinically important etec strains, over 22 antigenically different cfs have been identified, but only a handful are frequently present in diarrheic patient samples [9]. the current results are compatible with the result reported by peruski jr et al. (1999): among their st-positive strains, cfa/i, cs4, and cs5 were reported, and 77% of isolates failed to express any of 9 cfs [55]. our result is also close to the findings of shaheen et al. (2004), namely cfa/i (9.7%), cs4 (2 strains), and cs5 (2 strains) [56], and of kipkirui et al. (2021) [44]. the discrepancy between our findings and those of other studies could be attributable to variation in cf expression by etec in different geographical regions, as well as differences in the laboratory methods/primers used to identify cfs [46]. in addition, due to repeated subculturing or long-term storage, the plasmid containing the cf genes may have been lost [44]. other explanations include decreased expression of cf genes, a mutation inside the genetic locus, and expression of a cf not covered by the primers used in the pcr panel [6], [57], [58]. some cf antigens are only produced in vivo, or a small percentage of strains simply do not generate cfs [59]; the cf antigens may also differ from the 9 cfs screened for in this study. a lack of cfs has been reported to be associated with lt strains [41], [60], which is similar to our results, where the lt toxin was lacking. however, some studies have reported that cfs are almost equally associated with lt- and st-positive etec strains [61].
fig. 5. agarose gel electrophoresis for etec colonization factor gene identification. first lane (m): 100 bp dna ladder; lane 1: cs5 (226 bp); lane 3: cs4 (198 bp); lane 5: cfa/i (170 bp); lanes 2, 4, and 6: negative controls containing distilled water.

4. conclusion
this study illustrated etec in children below 6 years old with acute non-bloody diarrhea.
pcr-based detection of etec revealed that all isolated etecs carried the st toxin-producing gene, with no etec isolate identified carrying the lt gene. of all the etec isolates, only three showed a cf, namely cs4, cs5, and cfa/i, each on a different strain. this finding can provide evidence on the prevalent etec pathotypes in this region; furthermore, it can serve as a platform from which vaccine development can be adapted.

acknowledgment
we thank the staff of dr. jamal ahmed rashid's pediatric teaching hospital for helping us, and we thank all persons who contributed to this study.

references
[1] a. d. mare, c. n. ciurea, a. man, b. tudor, v. moldovan, l. decean, et al. enteropathogenic escherichia coli-a summary of the literature. gastroenterology insights, vol. 12, pp. 28-40, 2021. [2] s. ramos, v. silva, m. l. e. dapkevicius, m. caniça, m. t. tejedor-junco, g. igrejas, et al. escherichia coli as commensal and pathogenic bacteria among food-producing animals: health implications of extended spectrum β-lactamase (esbl) production. animals (basel), vol. 10, p. 2239, 2020. [3] j. p. nataro and j. b. kaper. diarrheagenic escherichia coli. clinical microbiology reviews, vol. 11, pp. 142-201, 1998. [4] g. d. christensen, w. a. simpson, j. j. younger, l. m. baddour, f. f. barrett, d. m. melton, et al. adherence of coagulase-negative staphylococci to plastic tissue culture plates: a quantitative model for the adherence of staphylococci to medical devices. journal of clinical microbiology, vol. 22, pp. 996-1006, 1985. [5] y. zhou, x. zhu, h. hou, y. lu, l. yu, l. mao, et al. characteristics of diarrheagenic escherichia coli among children under 5 years of age with acute diarrhea: a hospital based study. bmc infectious diseases, vol. 18, p. 63, 2018. [6] s. m. turner, a. scott-tucker, l. m. cooper and i. r. henderson. weapons of mass destruction: virulence factors of the global killer enterotoxigenic escherichia coli. fems microbiology letters, vol. 263, pp. 10-20, 2006. [7] j. m. fleckenstein, k. roy, j. f. fischer and m. j. i. burkitt. identification of a two-partner secretion locus of enterotoxigenic escherichia coli. infection and immunity, vol. 74, pp. 2245-2258, 2006. [8] c. k. porter, m. s. riddle, d. r. tribble, a. louis bougeois, r. mckenzie, s. d. isidean, et al. a systematic review of experimental infections with enterotoxigenic escherichia coli (etec). vaccine, vol. 29, pp. 5869-5885, 2011. [9] f. qadri, a. m. svennerholm, a. s. g. faruque and r. b. sack. enterotoxigenic escherichia coli in developing countries: epidemiology, microbiology, clinical features, treatment, and prevention. clinical microbiology reviews, vol. 18, pp. 465-483, 2005. [10] t. p. madhavan and h. sakellaris. colonization factors of enterotoxigenic escherichia coli. advances in applied microbiology, vol. 90, pp. 155-197, 2015. [11] y. zhang, p. tan, y. zhao and x. ma. enterotoxigenic escherichia coli: intestinal pathogenesis mechanisms and colonization resistance by gut microbiota. gut microbiota, vol. 14, p. 2055943, 2022. [12] i. bölin, g. wiklund, f.
qadri, o. torres, a. l. bourgeois, s. savarino, et al. enterotoxigenic escherichia coli with sth and stp genotypes is associated with diarrhea both in children in areas of endemicity and in travelers. journal of clinical microbiology, vol. 44, pp. 3872-3877, 2006. [13] m. a. lasaro, j. f. rodrigues, c. mathias-santos, b. e. c. guth, a. balan, m. e. sbrogio-almeida, et al. genetic diversity of heatlabile toxin expressed by enterotoxigenic escherichia coli strains isolated from humans. journal of bacteriology, vol. 190, pp. 240010, apr 2008. [14] a. sjoling, g. wiklund, s. savarino, d. cohen and a. m. svennerholm. comparative analyses of phenotypic and genotypic methods for detection of enterotoxigenic escherichia coli toxins and colonization factors. journal of clinical microbiology, vol. 45, pp. 3295-3301, 2007. [15] c. m. müller, a. aberg, j. straseviçiene, l. emody, b. e. uhlin and c. balsalobre. type 1 fimbriae, a colonization factor of uropathogenic escherichia coli, are controlled by the metabolic sensor crp-camp. plos pathogens, vol. 5, p. e1000303, 2009. [16] o. puiprom, s. chantaroj, w. gangnonngiw, k. okada, t. honda, t. taniguchi, et al. identification of colonization factors of enterotoxigenic escherichia coli with pcr-based technique. epidemiology and infection, vol. 138, pp. 519-524, 2010. [17] t. dadheech, r. vyas and v. rastogi. prevalence, bacteriology, pathogenesis and isolation of e. coli in sick layer chickens in ajmer region of rajasthan, india. international journal of current microbiology and applied sciences, vol. 5, pp. 129-136, 2016. [18] g. y. lee, h. i. jang, i. g. hwang and m. s. rhee. prevalence and classification of pathogenic escherichia coli isolated from fresh beef, poultry, and pork in korea. international journal of food microbiology, vol. 134, pp. 196-200, 2009. [19] r. a. pollack, l. findlay, w. mondschein and r. r. modesto. laboratory exercises in microbiology. john wiley and sons, hoboken, 2018. [20] m. p. macwilliams. indole test protocol. american society for microbiology, washington, d.c, 2012. [21] s. mcdevitt. methyl red and voges-proskauer test protocols. vol. 8, american society for microbiology, washington, d.c, 2009. [22] k. reiner. catalse test protocol. american society for microbiology, washington, d.c, 2014. [23] m. ligozzi, c. bernini, m. g. bonora, m. de fatima, j. zuliani and r. fontana. evaluation of the vitek 2 system for identification and antimicrobial susceptibility testing of medically relevant grampositive cocci. journal of clinical microbiology, vol. 40, pp. 16811686, 2002. [24] i. espinosa, m. báez, m. i. percedo and s. martínez. evaluation of simplified dna extraction methods for streptococcus suis typing. revista de salud animal, vol. 35, pp. 59-63, 2013. [25] l. lin, b. d. ling and x. z. li. distribution of the multidrug efflux pump genes, adeabc, adede and adeijk, and class 1 integron genes in multiple-antimicrobial-resistant clinical isolates of acinetobacter baumannii-acinetobacter calcoaceticus complex. international journal of antimicrobial agents, vol. 33, pp. 27-32, 2009. [26] m. yavzori, n. porath, o. ochana, r. dagan, r. orni-wasserlauf and d. cohen. detection of enterotoxigenic escherichia coli in stool specimens by polymerase chain reaction. diagnostic microbiology and infectious disease, vol. 31, pp. 503-509, 1998. [27] c. rodas, v. iniguez, f. qadri, g. wiklund, a. m. svennerholm and a. sjöling. 
development of multiplex pcr assays for detection of enterotoxigenic escherichia coli colonization factors and toxins. journal of clinical microbiology, vol. 47, pp. 1218-1220, 2009. [28] s. c. pawar, a, t. a. p. devi, c. setti, r. gajula, s. srikanth, s. kalyan. molecular evolution of pathogenic bacteria based on rrsa gene. journal of biotechnology and biomaterials, vol. 2, pp. 12-18, 2018. [29] r. mulchandani, f. massebo, f. bocho, c. l. jeffries, t. walker and l. a. messenger. a community-level investigation following a yellow fever virus outbreak in south omo zone, south-west ethiopia. peer journal, vol. 7, p. e6466, 2019. [30] h. k. hasan, n. a. yassin and s. h. eassa. bacteriological and molecular characterization of diarrheagenic escherichia coli pathotypes from children in duhok city, iraq. science journal of university of zakho, vol. 8, pp. 52-57, 2020. [31] t. w. shatub, n. a. h. jafar and a. k. krekor melconian. detection of diarrheagenic e. coli among children under 5's age in tikrit city of iraq by using single multiplex pcr technique. plant archives, vol. 21, pp. 1230-1237, 2021. [32] z. k. khalil. isolation and identification of different diarrheagenic (dec) escherichia coli pathotypes from children under five years old in baghdad. iraqi journal of medical sciences, vol. 28, pp. 126-132, 2015. [33] z. k. abdul-hussein, r. h. raheema and a. i. inssaf. molecular diagnosis of diarrheagenic e. coli infections among the pediatric patients in wasit province, iraq. journal of pure and applied microbiology, vol. 12, pp. 2229-2241, 2018. [34] e. abbasi, m. mondanizadeh, a. van belkum and e. j. i. ghaznavirad. multi-drug-resistant diarrheagenic escherichia coli pathotypes in pediatric patients with gastroenteritis from central iran. infection and drug resistance, vol. 13, p. 1387, 2020. [35] a. emami, n. pirbonyeh, f. javanmardi, a. bazargani, a. moattari, a. keshavarzi, et al. molecular diversity survey on diarrheagenic escherichia coli isolates among children with gastroenteritis in fars, iran. future microbiology, vol. 16, pp. 1309-1318, 2021. [36] h. k. saka, n. t. dabo, b. muhammad, s. garcía-soto, m. ugarte-ruiz and j. alvarez. diarrheagenic escherichia coli pathotypes from children younger than 5 years in kano state, nigeria. frontiers in public health, vol. 7, p. 348, 2019. [37] g. shahbazi, m. a. rezaee, f. nikkhahi, s. ebrahimzadeh and f. hemmati. characteristics of diarrheagenic escherichia coli pathotypes among children under the age of 10 years with acute diarrhea. asian j med sci, vol. 25, p. 101318, 2021. [38] g. ochien and l. atieno. prevalence of enterotoxigenic escherichia coli among children under five years in siaya county, western kenya. maseno university, kenya, 2021. [39] o. lukjancenko, t. m. wassenaar and d. w. ussery. comparison of 61 sequenced escherichia coli genomes. microbial ecology, vol. 60, pp. 708-720, 2010. [40] s. k. arif and l. i. f. salih. identification of different categories of diarrheagenic escherichia coli in stool samples by using multiplex pcr technique. asian j med sci, vol. 2, pp. 237-243, 2010. [41] c. i. c. ifeanyi, n. f. ikeneche, b. e. bassey, n. al-gallas, a. a. casmir and i. r. nnennaya.
characterization of toxins and colonization factors of enterotoxigenic escherichia coli isolates from children with acute diarrhea in abuja, nigeria. jundishapur journal of microbiology, vol. 11, p. e64269, 2018. [42] s. eybpoosh, s. mostaan, m. m. gouya, h. masoumi-asl, p. owlia, b. eshrati, et al. frequency of five escherichia coli pathotypes in iranian adults and children with acute diarrhea. plos one, vol. 16, p. e0245470, 2021. [43] s. zheng, f. yu, x. chen, d. cui, y. cheng, g. xie, et al. enteropathogens in children less than 5 years of age with acute diarrhea: a 5-year surveillance study in the southeast coast of china. bmc infectious diseases, vol. 16, pp. 434-434, 2016. [44] e. kipkirui, m. koech, a. ombogo, r. kirera, j. ndonye, n. kipkemoi, et al. molecular characterization of enterotoxigenic escherichia coli toxins and colonization factors in children under five years with acute diarrhea attending kisii teaching and referral hospital, kenya. tropical diseases travel medicine and vaccines, vol. 7, pp. 1-7, 2021. [45] a. m. svennerholm. from cholera to enterotoxigenic escherichia coli (etec) vaccine development. indian journal of medical research, vol. 133, p. 188, 2011. [46] s. nazarian, s. l. m. gargari, i. rasooli, m. alerasol, s. bagheri and s. d. alipoor. prevalent phenotypic and genotypic profile of enterotoxigenic escherichia coli among iranian children. japanese journal of infectious diseases, vol. 67, pp. 78-85, 2014. [47] h. zeighami, f. haghi, f. hajiahmadi, m. kashefiyeh and m. memariani. multi-drug-resistant enterotoxigenic and enterohemorrhagic escherichia coli isolated from children with diarrhea. journal of chemotherapy, vol. 27, pp. 152-155, 2015. [48] s. gupta, j. keck, p. k. ram, j. a. crump, m. a. miller and e. d. mintz. part iii. analysis of data gaps pertaining to enterotoxigenic escherichia coli infections in low and medium human development index countries, 1984-2005. epidemiology and infection, vol. 136, pp. 721-738, 2008. [49] h. alizade, r. ghanbarpour and m. r. aflatoonian. molecular study on diarrheagenic escherichia coli pathotypes isolated from under 5 years old children in southeast of iran. asian pacific journal of tropical disease, vol. 4, pp. s813-s817, 2014. [50] e. joffré, a. von mentzer, a. m. svennerholm and å. sjöling. identification of new heat-stable (sta) enterotoxin allele variants produced by human enterotoxigenic escherichia coli (etec). international journal of medical microbiology, vol. 306, pp. 586594, 2016. [51] j. m. fleckenstein, p. r. hardwidge, g. p. munson, d. a. rasko, h. sommerfelt and h. steinsland. molecular mechanisms of enterotoxigenic escherichia coli infection. microbes and infection, vol. 12, pp. 89-98, 2010. [52] o. torres, w. gonzález, o. lemus, r. a. pratdesaba, j. a. matute, g. wiklund, et al. toxins and virulence factors of enterotoxigenic escherichia coli associated with strains isolated from indigenous children and international visitors to a rural community in guatemala. epidemiology and infection, vol. 143, pp. 1662-1671, 2015. [53] s. x. zhou, l. p. wang, m. y. liu, h. y. zhang, q. b. lu, l. s. shi, et al. characteristics of diarrheagenic escherichia coli among patients with acute diarrhea in china, 2009-2018. journal of infection, vol. 83, pp. 424-432, 2021. [54] m. paredes-paredes, p. c. okhuysen, j. flores, j. a mohamed, r. s padda, a. gonzalez-estrada, et al. seasonality of diarrheagenic escherichia coli pathotypes in the us students acquiring diarrhea in mexico. journal of travel medicine, vol. 
18, pp. 121-125, 2011. [55] l. f. peruski jr., b. a. kay, r. a. el-yazeed, s. h. el-etr, a. cravioto, t. f. wierzba, et al. phenotypic diversity of enterotoxigenic escherichia coli strains from a community-based study of pediatric diarrhea in periurban egypt. journal of clinical microbiology, vol. 37, pp. 2974-2978, 1999. [56] h. i. shaheen, s. b. khalil, m. r. rao, r. a. elyazeed, t. f. wierzba, l. f. peruski jr, et al. phenotypic profiles of enterotoxigenic escherichia coli associated with early childhood diarrhea in rural egypt. journal of clinical microbiology, vol. 42, pp. 5588-5595, 2004. [57] d. g. evans, d. j. evans jr and w. j. i. tjoa. hemagglutination of human group a erythrocytes by enterotoxigenic escherichia coli isolated from adults with diarrhea: correlation with colonization factor. infection and immunity, vol. 18, pp. 330-337, 1977. [58] m. g. jobling and r. k. holmes. type ii heat-labile enterotoxins from 50 diverse escherichia coli isolates belong almost exclusively to the lt-iic family and may be prophage encoded. plos one, vol. 7, p. e29898, 2012. [59] f. qadri, s. k. das, a. s. faruque, g. j. fuchs, m. j. albert, r. b. sack, et al. prevalence of toxin types and colonization factors in enterotoxigenic escherichia coli isolated during a 2-year period from diarrheal patients in bangladesh. journal of clinical microbiology, vol. 38, pp. 27-31, 2000. [60] b. juma, p. waiyaki, w. bulimo, e. wurapa, m. mutugi and s. kariuki. molecular detection of enterotoxigenic escherichia coli surface antigens from patients in machakos district hospital, kenya. ecamj, vol. 1, pp. 62-68, 2014. [61] m. r. c. nunes, f. penna, r. franco, e. mendes and p. magalhaes. enterotoxigenic escherichia coli in children with acute diarrhoea and controls in teresina/pi, brazil: distribution of enterotoxin and colonization factor genes. journal of applied microbiology, vol. 111, pp. 224-232, 2011.
impact of technological burden on knowledge management functions in jordanian industrial companies

muzhir shaban al-ani1, shawqi n. jawad2, suha abdelal2
1department of information technology, college of science and technology, university of human development, sulaymaniyah, krg, iraq, 2department of management, college of business, amman arab university, amman, jordan

abstract
the goal of this study is to see how electronic information overload affects knowledge management functions in jordanian businesses. all jordanian industrial enterprises registered on the amman stock exchange were included in the study's population. three hundred and seventy-three people were chosen at random as a simple random sample of 30% of the study population of 1242 senior and intermediate managers in the research community. following the retrieval of the surveys, 206 questionnaires were found to be valid for analysis. descriptive and inferential statistical procedures, such as simple and multiple regression analysis, were applied using the spss 16 application. the study ends with the following finding: electronic information overload (technological overload) has a statistically significant influence on knowledge management functions (acquisition, generation, transmission, exchange, and application) in jordanian industrial companies. this work made a number of recommendations as a result of its findings, including adopting an organizational aspect that suits the nature of the tasks that industrial companies in jordan perform, as well as providing technical capabilities to reduce the electronic information overload that these companies face while performing their tasks.

index terms: knowledge management, organizational overload, statistical analysis, jordanian industrial companies
corresponding author's e-mail: muzhir shaban al-ani, department of information technology, college of science and technology, university of human development, sulaymaniyah, krg, iraq. e-mail: muzhir.al-ani@uhd.edu.iq
received: 21-01-2022; accepted: 27-06-2022; published: 01-07-2022
doi: 10.21928/uhdjst.v6n2y2022.pp1-10; e-issn: 2521-4217; p-issn: 2521-4209
copyright © 2022 al-ani, et al. this is an open access article distributed under the creative commons attribution non-commercial no derivatives license 4.0 (cc by-nc-nd 4.0)

1. introduction
these days, there has been a rapid acceleration of change toward the knowledge economy and the information economy. knowledge is an essential ingredient for driving economic growth in countries [3]. knowledge is already an intangible asset of the organization, leading organizations to reprioritize their efforts [49]. as a result, many technological applications were developed that strengthened organizational capabilities and created a massive flow of information and its use in organizations [55]. this led to the development of new trends in the management of organizations based on these ideas. as a result of the information and technical revolution in all sectors of knowledge, today's business organizations confront several obstacles [10]. therefore, senior management must be able to strengthen its role in investing in contemporary technology and expertise to improve its capacity to respond to the unpredictable environment and its demands [11]. as a result, the foundations of thinking and theoretical frameworks capable of fulfilling the organization's aims must be identified [12]. the rapid shift in the business environment has an impact on business organizations, particularly industrial enterprises [13].

the factors that led to the burden of electronic information were how easy it was to get and store information in electronic databases, how often it was used, and how long it was kept [15]. knowledge management is one of the modern topics in the field of management and business and is of great interest to those involved in business organizations [54]. this interest has also grown with the adaptation of various types of organizations to knowledge application [16]. knowledge management is also important for the growth of current businesses and their ability to handle future problems [16]. knowledge management's importance in corporate organizations lies not in the knowledge itself, but in the value it adds to these firms. it also helps firms transition to a knowledge economy that prioritizes knowledge capital investment [6].
due to the rapid technological development, the business environment of organizations is characterized by rapid change and is dominated by the ict revolution [34]. knowledge is the strength of organizations and ensures their growth and sustainability [32]. knowledge is shared through participation and increased by practice and use [38]. knowledge is an important resource that contributes to the success of different organizations and is governed by three fundamental characteristics, according to which knowledge is an important economic resource [42]. it is a leading sector of the contemporary economy, and it can be traded indefinitely within one organization and among others in general [42]. modern corporate organizations strive to adapt at every step of the knowledge economy's evolution to meet the demands of the time [48]. electronic information systems now serve as the foundation for management and productivity operations in all types of businesses [58]. these systems are important for areas such as business, marketing, and productivity [58].

1.1. statement of the problem
the organization and its employees are burdened with information that needs attention, research, and treatment. businesses, particularly jordanian industrial companies, deal with a vast volume of electronic data and information, which weakens their position in making various judgments and leads to mistakes due to the excessive weight placed on understanding information elements. this necessitates that businesses develop new and inventive ways to collaborate constructively to tackle these challenges, together with the need to determine control techniques for implementing such a mechanism and establishing its possibilities. "knowledge management functions used in jordanian industrial companies: a study of the technological burden" is therefore the focus of the study.

1.2. research questions
this research is implemented by addressing the following questions:
• according to managers, what kind of technology and what kind of technical skills do jordanian industrial businesses have?
• what do managers think about the effects of the technological load on the functions of knowledge management?
• how do managers in jordanian industrial companies assess the technological load in its two dimensions, the type of technology available and the technical potential to use it?

1.3. research objectives
the research aims to accomplish the following goals:
• assessing the influence of electronic information (technological load) factors on knowledge management functions in the firms studied.
• identifying the strengths and weaknesses of the knowledge management functions in these firms and how they can be improved.
• measuring the level of use of knowledge management functions by jordanian industrial firms in order to propose ways of handling electronic data (the technological burden).

1.4. research hypothesis
in jordanian industrial organizations, there is no statistically significant influence of the technological load (technology type and technical potential) on the knowledge management functions (acquisition, generation, transmission, sharing, and application of knowledge) at the level (α ≤ 0.05).

1.5. research model
the variables in the study model are chosen in accordance with the research's problem and hypothesis, as well as the study's purpose and particular goal (fig. 1).
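the hypothesis above is tested in the study with simple and multiple regression in spss 16, as stated in the abstract. purely as an illustration, and with a hypothetical data file and hypothetical column names standing in for the questionnaire composite scores, an analogous model could be fitted in python as in the sketch below; this is not the authors' actual analysis script.

# minimal regression sketch under assumed column names; not the study's spss run
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("survey.csv")   # one row per respondent, composite likert scores

# technological burden facets as predictors, overall km-function score as outcome
model = smf.ols("km_functions ~ technology_type + technical_potential", data=df).fit()
print(model.summary())

# the null hypothesis of "no statistically significant influence" is rejected
# for a predictor when its p-value falls below the 0.05 significance level
print(model.pvalues < 0.05)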
2. electronic information burden
the electronic information burden in business companies is very important and needs a series of effective measures to overcome it. frank defined the burden of information as the point at which individuals' processing of information reaches its highest level and, therefore, the ability of individuals to process that information is reduced [52]. bryant emphasizes that the information burden occurs because there is more information than knowledge workers can absorb and determine what they need [28]. it has been shown that the information burden is attributed to the following elements: multiple channels of information, time limits, noise, and the volume of incoming information. the burden of information is more pronounced in the fields of business in general and commercial business in particular. this confirms that the burden is a natural and inevitable condition and has several causes stemming from many of the developments and discoveries of the global era [21]. several things show the burden of having too much information, such as how people communicate, how they store and retrieve information, and how they make financial decisions. individuals are constantly exposed to a large amount of information that they obtain through their daily work, prompting them to refuse to receive this information and not allocate sufficient time to resolve the communication content. choi et al. also showed that fear of dealing with information and the inability to concentrate on memory-related problems may result in distracting thoughts and a lack of attention [22]. such symptoms are reflected in the effectiveness of both the individual and the organization in their handling of information. according to himma, the information burden arises from individuals' or the organization's management's frustration with not having access to the required information [37]. filej et al. reported in their study that new information and communication technologies aim to facilitate rapid access to information [21]; therefore, they cause a high overload of information, especially with push systems, which provide information to the user without any request for such information [50], [59]. choi et al. explained that information technology plays an important role in accomplishing tasks, especially in business, and is an integral part of the manager's work [22]. friedrich et al. (2020) reported that icts have increased access to information, processed it, and produced new information, resulting in a heavy information burden for managers [31]. the role of information technology in helping to reduce the burden through large-scale information processing should, however, not be overlooked. mengis and eppler emphasize that the development of technology has helped to increase the amount of information flowing to stakeholders until it has become the main reason for generating the overload of information, directly or indirectly [27].

below are a number of concepts that illustrate the definition of knowledge and related matters. knowledge: knowledge is the product of data, intuition, and experience [61]. as a result, knowledge is information that has been organized, digested, and structured in such a way that it may be applied [36]. knowledge involves a mix of contextual information, values, expertise, new information, and new experience that exists in knowing minds and is embedded in organizational routines, rules, processes, documents, and practices [4].
knowledge is also recognized as a crucial component and a source of intellectual capital in today's enterprises, and it grows as a result of learning and practice [5]. as knowledge is a product of both the organization's and the individual's practice, experience, judgments, and values, it is expressed in the process of applying knowledge to specific goals [8]. two types of knowledge exist: explicit and tacit. explicit knowledge is information that can be shared across organizations, groups, and individuals, and it may be kept electronically, documented, conveyed, and used in a variety of ways, including knowledge maps. tacit knowledge is knowledge that comes from past experience and is stored in human thoughts and behavior, making it difficult to record and transmit to others [1].

knowledge management: knowledge management is defined as "doing what it takes to maximize the value of knowledge resources" [30]. because knowledge management is the gateway to adding and producing value by synthesizing knowledge pieces to generate top knowledge combinations, the purpose of data, information, and knowledge will change [2], [5]. knowledge management encompasses learning and adaptation, improving the creative process, sharing, and making the best use of these assets [40]. knowledge management has also been defined as "successful learning processes linked with the exploitation, investigation, and sharing of human knowledge" (explicit and tacit); it improves performance and intellectual capital by utilizing proper technology, civilization, and culture. knowledge management contributes to the development of knowledge and attempts to enhance the success of companies through four dimensions, according to the organizational cooperation of knowledge management [30]: processes, products, and people, as well as organizational performance, are all influenced by these characteristics.

fig. 1. research model: the burden of electronic information, that is, the technological burden (technology type, technical potential), as the independent variable and the knowledge management functions (acquisition, generation, transmission, sharing, and application) as the dependent variable.

knowledge management functions can be divided into the five functions below:
• knowledge acquisition: a function that attempts to collect and gain knowledge from a range of recorded sources as well as undocumented knowledge that is stored in people's thoughts and issued through their actions. knowledge can be acquired from stakeholders and experts, with information technology playing a key role in data capture, classification, processing, and harnessing to generate a competitive advantage for the organization [25], [29], [41], [51], [56], [60], [62], [64]. it takes a lot of work to acquire new knowledge [41].
• knowledge generation: it implies that information is created from a variety of sources and channels to expand organizational memory vaults and enable the company to discover innovative solutions to its issues, resulting in innovation. individuals are involved in the creation of knowledge within the company.
social involvement, embodied external knowledge, integrated internal knowledge, and synthetic knowledge are the four ways that knowledge will be passed on to new people [60], [61], [62], [63], [64]. • knowledge transfer: leadership, absorptive capability, organizational structure support, degree of complexity, degree of privacy, and reliability of knowledge vocabulary are all elements to consider while transferring knowledge [16], [19], [23], [40], [44], [53]. • knowledge sharing: knowledge sharing is critical in modern businesses for production, adapting to external factors, outperforming rivals, promoting opportunities, and sustaining their efficacy. individuals are sharing their expertise and experience formally through frequent formal meetings, which is expanding the organization’s knowledge base. as well as people’s involvement in knowledge with others, which avoids the loss of such information and its fading with time. furthermore, sharing knowledge between organizations increases the amount of knowledge stored in organization warehouses [17], [20], [35], [39], [43], [45], [47], [56], [57], [63], [64]. • knowledge application: the purpose of knowledge management is for modern enterprises to use knowledge retrieval techniques and web-based technology platforms to apply knowledge. furthermore, these systems allow for the timely and appropriate access, use, and transfer of information, as well as communication with the appropriate individual [18], [33], [63], [64]. tashkandi and zakia examined the impor tance of knowledge management and the extent of its application in the management of education in the city of mecca and concluded that the members of the study community recognize the importance of knowledge management and employment, but their management does not give priority to knowledge management [9]. the reason for this is that knowledge management is a modern area that organizations seek to adopt. dubosson and fragniere emphasized that the information burden affects the efficiency of organizations [26]. the burden is a curse and a real concern for the management of the organization, perhaps because the new it trends are not fully absorbed, also showed that the information burden affects managers differently [22]. the role of the director changed as he spent more time processing information and dealing with technology and less time managing staff, and he considered that the regulatory environment was the primary cause of the phenomenon of information burden, followed by technology and personal factors. manovas concluded that the successful transfer of knowledge in an it project must have a solid knowledge base and practical knowledge capabilities to ensure successful transfer of knowledge and that a culture of learning, sharing, collaboration technology, and incentive systems is important elements of the structure [46]. to measure the impact of organizational culture on knowledge management, the study of the convicted rawluk et al. showed the impact of organizational culture factors individually and collectively in the management of knowledge as a whole, and its individual processes (knowledge generation, sharing, and application) [14]. leadership was the most influential factor in organizational culture in implementing knowledge management. on the impact of knowledge management in achieving organizational creativity, rawluk et al. 
[14] showed that there is a clear awareness among employees of the need to adopt creative ideas in the technical fields to improve and develop. the organizational units in the research organizations want to adopt knowledge in all fields, through the adoption of expansion strategies in the scientific fields and the creation of new scientific departments. 3. methodology 3.1. research: questionnaire design in the preceding sections, the electronic information overload in enterprises and its relation to their functions was discussed. this section will look at how the questionnaire that is necessary for this study was created. the questionnaire employs a five-point likert scale, with strongly agree, agree, neutral, disagree, and strongly disagree as the options. according to the study paradigm, the questionnaire is divided into two parts: an independent variable (technological burden) and a dependent variable (knowledge management functions). • independent variable (technological burden) (fig. 2): this field is divided into two parts: type of technology and technical potential. type of technology covers hardware, software, networks, and information systems, which are necessary to process data and information to accomplish various tasks, and varied types of technology may not suit the type of task required. technical potential covers human capabilities and the supporting it infrastructure, such as databases and information processing systems, as well as specialists in data collection and analysis, maintenance workers, and equipment operators. • dependent variable (knowledge management functions) (fig. 3): knowledge acquisition, knowledge generation, knowledge transfer, knowledge sharing, and knowledge application are the five components of this field. knowledge acquisition refers to obtaining knowledge from both internal sources (such as learning, cooperation, employee feedback, training programs, workshops, and knowledge databases) and external sources (such as customers, consultants, competitors, competent personnel, attracting experienced staff, and establishing relationships with allies and partners). knowledge generation is concerned with creating and deriving new creative knowledge from existing knowledge throughout the organization to secure knowledge types for the benefit of future decisions, and with equipping workers in the knowledge field with graphical analysis; this is done through learning, teaching, research, and development. (fig. 2. technological burden questionnaire. fig. 3. knowledge management functions questionnaire.) knowledge transfer refers to the proper conveyance of specific knowledge to a specific individual (through communications, reports, bulletins, staff movements, and the use of technical tools to enhance knowledge transfer) in formal and informal ways (such as individual informal meetings) at the correct cost and at the right time. in the context of knowledge sharing, individuals share and circulate numerous sorts of knowledge,
securing collective collaboration among them, interacting with others' conversations both within and outside the company, and reaching out and working on the same document concurrently from many locations to develop fresh creative ideas everywhere in an organization. knowledge application refers to how people and resources are developed, how businesses are made better, how technology is used to get information, and how people can share information about what they know. 3.2. research: population and sample industrial enterprises registered on the amman stock exchange make up the study's population. all managers in senior management (general managers, their assistants, or their representatives) as well as managers in middle management (directors of the main departments and heads of departments) were included in the sample and analysis unit. the sampling and analysis unit has a population of 1242 people. a random selection of 30% of them was chosen to form the sample (373 people), and the questionnaires were then sent out to that group. there were 206 questionnaires that could be used for statistical analysis, accounting for 55% of the total distributed questionnaires. secondary sources, which contain data and material published in various library sources, were used for the assessment of the literature and prior works to meet the study's goal. in this case, the questionnaire was the main source of data; it was used to manage the questions and test the hypotheses in the applied part of the study. 4. results and analysis the relative importance of the respondents' perceptions of the study questions was determined based on the likert five-point scale, with low (<2.33), middle (2.33–3.66), and high (3.67 and above) bands, as shown in fig. 4. question 1: how do managers working in jordanian industrial enterprises perceive the technological load (type of technology and technical potential)? fig. 4 shows that the sample's perceptions of the technological burden were high in both dimensions of the type of technology and technical possibilities, with a general arithmetic mean for the type of technology of (4.20) and a standard deviation of (0.55). this is because businesses recognize the significance of the technology they employ as the foundation for data and information filtering. it is very important for businesses to have electronic network operating systems because they have to deal with a lot more information. question 2: what are the perspectives of managers in jordanian industrial businesses regarding knowledge management functions (acquisition, generation, transfer, sharing, and application of knowledge)? the results show that the respondents' level of response was high, with an arithmetic mean of (3.71) and a standard deviation of (0.57). the results also revealed a significant impact of the electronic burden on knowledge sharing, reflected in the high degree of knowledge sharing. the researchers argue that the availability of technological devices and equipment hinders the establishment of a favorable regulatory environment, beginning with the speed with which information is delivered across numerous communication networks both internally and externally. finally, respondents gave a positive response to the application of knowledge. according to the researchers, this is due to the capacity of the investigated firms to apply current techniques in the application of information and to strive to communicate it to all people, as well as the use of fresh knowledge that is relevant and produces excellent results. fig. 4. arithmetic mean and standard deviation (technological burden).
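as a small illustration of the relative-importance calculation described above, the following python sketch (an illustrative example, not the authors' actual analysis code; the item names and scores are hypothetical) computes the arithmetic mean and standard deviation of likert-scale items and assigns the low/middle/high band used in this study.

```python
import pandas as pd

# hypothetical 5-point likert responses (rows = respondents, columns = questionnaire items)
responses = pd.DataFrame({
    "type_of_technology_q1": [5, 4, 4, 5, 3],
    "technical_potential_q1": [4, 4, 5, 4, 4],
})

def relative_importance(mean_score: float) -> str:
    """band the arithmetic mean as in the study: low (<2.33), middle (2.33-3.66), high (>=3.67)."""
    if mean_score < 2.33:
        return "low"
    if mean_score <= 3.66:
        return "middle"
    return "high"

summary = pd.DataFrame({
    "mean": responses.mean(),
    "std": responses.std(),
})
summary["importance"] = summary["mean"].apply(relative_importance)
print(summary)
```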
in this section, the hypotheses of the study are tested, where simple regression, multiple regression, and other tests are used to validate the hypotheses. these tests are the f-test for the significance of the regression model, the t-test for the significance of each effect, and the value of r2 (coefficient of determination), which gives the percentage of variation in the dependent variable explained by the independent variables, depending on the statistical significance values extracted using the statistical software, as below. the hypothesis of the study: in jordanian industrial organizations, there is no statistically significant influence of technological load (type of technology and technical skills) on knowledge management functions (acquisition, generation, transmission, sharing, and application of knowledge) at the level of (0.05). table 1 findings show that technological load is the most important variable that determines knowledge management functions.

table 1: the influence of technological strain on the dimensions of knowledge management functions
dependent variable | r | r2 (coefficient of determination) | f-value | degrees of freedom | sig. | independent variable | β | standard error | t-test | sig.
acquisition of knowledge | 0.515 | 0.266 | 73.790 | 1, 204 | 0.000 | technological burden | 0.510 | 0.059 | 8.590 | 0.000
generation of knowledge | 0.535 | 0.286 | 81.709 | 1, 204 | 0.000 | technological burden | 0.572 | 0.063 | 9.039 | 0.000
transmission of knowledge | 0.601 | 0.361 | 115.331 | 1, 204 | 0.000 | technological burden | 0.678 | 0.063 | 10.739 | 0.000
sharing of knowledge | 0.535 | 0.286 | 81.733 | 1, 204 | 0.000 | technological burden | 0.577 | 0.064 | 9.041 | 0.000
application of knowledge | 0.498 | 0.248 | 67.182 | 1, 204 | 0.000 | technological burden | 0.569 | 0.069 | 8.196 | 0.000
functions of knowledge management | 0.618 | 0.382 | 126.324 | 1, 204 | 0.000 | technological burden | 0.581 | 0.052 | 11.239 | 0.000

table 2: the influence of technological load characteristics on knowledge management function dimensions
dependent variable (model: r, r2, f-test, degrees of freedom, sig.) | independent variable | β | standard error | t-test | sig.
acquisition of knowledge (r = 0.540, r2 = 0.292, f = 41.850, df = 2, 203, sig. = 0.000) | type of technology | 0.091 | 0.079 | 0.901 | 0.368
acquisition of knowledge | technical capabilities | 0.445 | 0.480 | 5.495 | 0.000
generation of knowledge (r = 0.610, r2 = 0.372, f = 60.137, df = 2, 203, sig. = 0.000) | type of technology | 0.077 | 0.101 | 0.769 | 0.443
generation of knowledge | technical capabilities | 0.549 | 0.080 | 6.835 | 0.000
transmission of knowledge (r = 0.536, r2 = 0.287, f = 40.825, df = 2, 203, sig. = 0.000) | type of technology | 0.269 | 0.103 | 2.625 | 0.009
transmission of knowledge | technical capabilities | 0.321 | 0.082 | 3.916 | 0.000
sharing of knowledge (r = 0.498, r2 = 0.248, f = 33.464, df = 2, 203, sig. = 0.000) | type of technology | 0.193 | 0.112 | 1.729 | 0.085
sharing of knowledge | technical capabilities | 0.369 | 0.089 | 4.136 | 0.000
application of knowledge (r = 0.518, r2 = 0.268, f = 37.137, df = 2, 203, sig. = 0.000) | type of technology | 0.269 | 0.095 | 2.821 | 0.005
application of knowledge | technical capabilities | 0.261 | 0.076 | 3.431 | 0.001
functions of knowledge management (r = 0.619, r2 = 0.383, f = 63.088, df = 2, 203, sig. = 0.000) | type of technology | 0.180 | 0.083 | 2.166 | 0.031
functions of knowledge management | technical capabilities | 0.389 | 0.066 | 5.861 | 0.000
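to make the testing procedure above concrete, the following python sketch (an illustrative example on synthetic data, not the study's actual dataset or code; the variable names are hypothetical) shows how a simple and a multiple regression of the kind reported in tables 1 and 2 can be fitted, and how the f-test, t-tests, and r2 are obtained with statsmodels.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 206  # same sample size as the study; the data themselves are synthetic

# hypothetical questionnaire means per respondent
df = pd.DataFrame({
    "type_of_technology": rng.normal(4.2, 0.55, n),
    "technical_capabilities": rng.normal(4.0, 0.60, n),
})
df["technological_burden"] = df[["type_of_technology", "technical_capabilities"]].mean(axis=1)
# synthetic dependent variable (knowledge management functions score)
df["km_functions"] = 1.5 + 0.55 * df["technological_burden"] + rng.normal(0, 0.45, n)

# simple regression (as in table 1): km functions on the overall technological burden
simple = sm.OLS(df["km_functions"], sm.add_constant(df["technological_burden"])).fit()
print(simple.rsquared, simple.fvalue, simple.f_pvalue)   # r2 and f-test of the model
print(simple.params, simple.tvalues, simple.pvalues)     # beta coefficients and t-tests

# multiple regression (as in table 2): km functions on the two burden dimensions
X = sm.add_constant(df[["type_of_technology", "technical_capabilities"]])
multiple = sm.OLS(df["km_functions"], X).fit()
print(multiple.summary())
```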
all of the regression models are consistent with the statistical tests, including the f-values and the t-test, implying that the null hypothesis is rejected and the alternative hypothesis is accepted. the influence of technological strain on the dimensions of knowledge management functions is shown in table 1. table 2 shows that the multiple regression model used to assess the impact of the dimensions of the independent variable technological burden (type of technology and technical capabilities) on the variable of knowledge management functions is significant, and that the two variables together explain a substantial share (r2 = 38.3%) of the variation in knowledge management function values. table 2 also shows the significance of the multiple regression model when it comes to the impact of type of technology and technical abilities on the individual dimensions of knowledge management functions. technical capabilities and type of technology have a big impact on the two dimensions of knowledge transfer and application, but not so much on the dimension of knowledge acquisition. 5. conclusions in light of the preceding discussion, the study came to the following conclusions: • the high importance of the technological burden in terms of the type of technology in jordanian industrial companies, particularly in terms of the companies' use of appropriate devices and tools to obtain the required and appropriate information through their reliance on activating local networks and the web. • the findings of this study agreed with those of the ching-chiao et al. [24] study, which found that technology produces an information burden and that the sort of technology used is critical in decreasing that burden. the present study's findings also agreed with those of dubosson and fragniere [26] in that information technology is an information burden that reduces company efficiency. • the study found that the technological burden is becoming more important in terms of technical skills in jordanian industrial enterprises, particularly in terms of their capacity to have a supporting infrastructure for information technology, networks, servers, and other peripherals. this study supported the conclusions of the choi et al. [22] study, especially in terms of the human component, where the larger the information load, the more time managers spend on analysis and audits, and the less time they spend on staff management. in the technological part, the previous study of choi et al. [22] differed from the current one, since there technology was only the second source of the information load phenomenon. • the study revealed that information generation attracts people's attention. this is due to managers' capacity to diversify information sources, their focus on research and development, and an effort to bridge knowledge gaps as a result of changes in the workplace, as well as attention to the organizational and technical dimensions of knowledge creation. this result matched that of research on knowledge sources, acquisition, and transmission [9]. the findings are also in line with the research of suhaimi [7], which looked at the characteristics and processes of knowledge management (acquisition, generation, transmission, distribution, and application). 6.
recommendations as a result of their research, the researchers make a number of recommendations for jordanian industrial companies to follow. these recommendations are: adopting the type of technology appropriate to the environment in which jordanian industrial companies operate in such a way as to reduce the burden of information in terms of the technological burden that may be exposed to them in carrying out their decision-making tasks. • jordanian industrial businesses should undertake a strategic study of the company’s strengths and weaknesses as reflected in the performance of their departments and departments and decide their influence on raising or lowering the technical load in those departments and departments. • building organizations in their regulatory environments to make communication systems that work well, based on the idea of cutting down on technology so that they can reorganize their systems to be more effective. • organizations are anxious to guarantee that employees are aware of how to use previously acquired information and how to apply new knowledge developed by these companies based on their unique characteristics. recognize that managers have a knowledge asset that has yet to be invested in and support this trend by offering training courses on how to use and apply that expertise to achieve certain goals. al-ani et al.: impact of technological burden on knowledge management functions uhd journal of science and technology | july 2022 | vol 6 | issue 2 9 references [1] a. jazar and a. talaat. “proposed project for knowledge management in jordanian public universities”. unpublished doctoral thesis, faculty of higher education studies. amman jordan, amman arab university for graduate studies, 2005. [2] m. bataineh and z. mashaqbeh. “knowledge management between theory and practice”. jalis al-zaman publishing house, amman, 2010. [3] s. n. jawad, m. s. al-ani, h. a. hijazi and h. irshaid. “small business management, a technology entrepreneurial perspective”. safa publishing house, amman, jordan, 2010. [4] h. hijazi. “measuring the impact of knowledge management perception on employment in jordanian organizations: a comparative analytical study between the public and private sectors towards building a model for knowledge management employment”. unpublished doctoral thesis, faculty of administrative and financial studies, amman arab university for graduate studies, amman, jordan, 2005. [5] n. hamidi. “management information systems: contemporary entrance”. wael publishing house, amman, jordan, 2005. [6] m. ziadat. “contemporary trends in knowledge management”. safaa publishing and distribution house, amman, jordan, 2008. [7] z. suhaimi. “the readiness of public organizations for knowledge management: a case study of king abdul aziz university in jeddah”. an introduction to the international conference on administrative development: towards distinguished performance of the government sector, riyadh, 2009. [8] s. al. sharfa. “the role of knowledge management and information technology in achieving competitive advantages in banks operating in gaza strip”. master of business administration, islamic university, gaza, 2008. [9] z. tashkandi. “knowledge management: the importance and extent of application of its operations from the point of view of the supervisors and administrator’s departments of the department of education in makkah and jeddah”. master thesis, umm al-qura university, 2009. [10] a. taher and i. al-mansour. 
“requirements for sharing knowledge and obstacles facing its application in jordanian telecommunication companies”. presented to the scientific conference, applied science university, amman, jordan, 2007. [11] m. s. al-ani and s. n. jawad. “management process and information technology”. al-ethaa publishing house, amman, jordan, 2008. [12] m. s. al-ani and s. n. jawad. “business intelligence and information technology”. safa publishing house, amman, jordan, 2012. [13] a. a. eniola, g. k. olorunleke, o. o. akintimehin, j. d. ojeka and b. oyetunji. “the impact of organizational culture on total quality management in smes in nigeria”. heliyon, vol. 5, no. 8, p. e02293, 2019. [14] a. rawluk, r. m. ford, l. little, s. draper and k. j. h. williams. “applying social research: how research knowledge is shaped and changed for use in a bushfire management organization”. environmental science and policy, vol. 106, pp. 201-209, 2020. [15] a. r. said, h. abdullah, j. uli and z. a. mohamed. “relationship between organizational characteristics and information security knowledge management implementation”. procedia social and behavioral sciences, vol. 12320, pp. 433-443, 2014. [16] a. m. abubakar, h. elrehail, m. a. alatailat and a. elçi. “knowledge management, decision-making style and organizational performance”. journal of innovation and knowledge, vol. 4, no. 2, pp. 104-114, 2019. [17] s. almahamid, a. awwad and m. mcadams. “effects of organizational agility and knowledge sharing on competitive advantage: an empirical study in jordan”. international journal of management, vol. 27, no. 3, pp. 387-404, 2010. [18] m. ariffin, n. arshad, a. r. s. shaarani and s. u. shah. “implementing knowledge transfer solution through web-based help desk system”. world academy of science engineering and technology, vol. 21, pp. 78-82, 2007. [19] e. awad and h. ghaziri. “knowledge management”. pearson education inc., prentice hall, united states, 2004. [20] k. bartol and a. srivastava. “encouraging knowledge sharing: the role of organizational reward systems”. journal of leadership and organizational studies, vol. 9, no. 1, pp. 64-76, 2002. [21] b. choi, s. k. poon and j. g. davis. “effects of knowledge management strategy on organizational performance: a complementarity theory-based approach”. omega, vol. 36, no. 2, pp. 235-251, 2008. [22] b. filej, b. skela-savič, v. h. vicic and n. hudorovic. “necessary organizational changes according to burke–litwin model in the head nurses system of management in healthcare and social welfare institutions-the slovenia experience”. health policy, vol. 90, no. 2-3, pp. 166-174, 2009. [23] j. bou-liusar and m. segarra-cipres. “strategic knowledge transfer and its implications for competitive advantage: an integrative conceptual framework”. journal of knowledge management, vol. 10, no. 4, pp. 100-112, 2006. [24] y. ching-chiao, p. b. marlow and c. s. lu. “knowledge management enablers in liner shipping”. transportation research part e: logistics and transportation review, vol. 45, no. 6. pp. 893903, 2009. [25] w. cohen and d. levinthal. “absorptive capacity: a new perspective on learning and innovation”. administrative science quarterly, vol. 35, no. 1, pp. 128-152, 1990. [26] m. dubosson and e. fragniere. “the consequences of information overload in knowledge based service economies: an empirical research conducted in geneva”. service science, vol. 1. no. 1, pp. 56-62, 2009. [27] m. eppler and j. mengis. 
“a framework for information overload research in organizations: insights from organization science, accounting, marketing, mis, and related disciplines”. ica working paper, university of lugano, lugano, 2003. [28] b. furlow. “information overload and unsustainable workloads in the era of electronic health records”. the lancet respiratory medicine, vol. 8, no. 3. pp. 243-244, 2020. [29] j. feliciano. “the success criteria for implementing knowledge management systems in an organization”. doctoral dissertation, pace university, usa, 2006. [30] a. ferraris, c. giachino, f. ciampi and j. couturier. “r&d internationalization in medium-sized firms: the moderating role of knowledge management in enhancing innovation performances”. journal of business research, vol. 128, pp. 711-718, 2019. [31] j. friedrich, m. becker, f. kramer, m. wirth and m. schneider. “incentive design and gamification for knowledge management”. journal of business research, vol. 106, pp. 341-352, 2020. [32] s. goh. “managing effective knowledge transfer: an integrative framework and some practice implications”. journal of knowledge al-ani et al.: impact of technological burden on knowledge management functions 10 uhd journal of science and technology | july 2022 | vol 6 | issue 2 management, vol. 6, no. 1, pp. 23-30, 2002. [33] r. grant and c. baden-fuller. “a knowledge accessing theory of strategic alliances”. journal of management studies, vol. 41, no. 1, pp. 61-84, 2004. [34] m. l. grise and b. gallupe. “information overload: addressing the productivity paradox in face-to-face electronic meetings”. journal of management information systems, vol. 16, no. 3, pp. 157-186, 2000. [35] d. gurteen. “creating a knowledge sharing culture”. vol. 2. knowledge management magazine, 1999. [36] h. biemans and c. siderius. “advances in global hydrology-crop modelling to support the un’s sustainable development goals in south asia”. current opinion in environmental sustainability, vol. 40, pp. 108-116, 2019. [37] k. himma. “a preliminary step in understanding the nature of a harmful information-related condition: an analysis of the concept of information overload”. ethics and information technology, vol. 9, no. 4. pp. 4, 2007. [38] j. hodge. “examining knowledge management capability: verifying knowledge process factors and areas in an educational organization”, doctoral dissertation, northcentral university, 2010. [39] m. ismail and z. yusof. “the impact of individual factors on knowledge sharing quality”. journal of organizational knowledge management, vol. 2010, pp. 13, 2010. [40] a. jashapara. “knowledge management an integrated approach”. pearson education, prentice-hall, hoboken, 2004. [41] k. mellahi and d. g. collings. “the barriers to effective global talent management: the example of corporate élites in mnes”. journal of world business, vol. 45, no. 2, pp. 143-149, 2010. [42] y. l. kim and w. van biesen. “fluid overload in peritoneal dialysis patients”. seminars in nephrology, vol. 37, no. 1, pp. 43-53, 2017. [43] n. leung and s. kang. “ontology-based collaborative interorganizational knowledge management network”. interdisciplinary journal of information knowledge and management, vol. 4. p. 699. 2009. [44] l. lin, x. geng and a. whinston. “a sender-receiver framework for knowledge transfer”. mis quarterly, vol. 29, no. 2, pp. 197-219, 2005. [45] k. mahesh and j. suresh. “knowledge criteria for organization design”. journal of knowledge management, vol. 13, no. 4, pp. 4151, 2009. [46] m. 
manovas, “investigating the relationship between knowledge management capability and knowledge transfer success”. mastery degree, concordia university, canada, 2004. [47] m. mohayidin, n. azirawani, m. kamaruddin and m. idawati. “the application of knowledge management in enhancing the performance of malaysian universities”. journal of knowledge management, vol. 5, no. 3, pp. 301-312, 2007. [48] g. b. mulder. management, husbandry, and colony health. in: the laboratory rabbit, guinea pig, hamster, and other rodents. ch. 28. academic press, cambridge, pp. 765-777, 2012. [49] g. b. mulder. “perception as information processing”. urban ecology, vol. 4, no. 2, pp. 103-118, 1979. [50] m. raoufi. “avoiding information overload-a study on individual’s use of communication tools”. proceeding of the 36th hawaii international conference on system sciences, 2003. [51] e. reiter, a. cawsey, l. osman and y. roff. “knowledge acquisition for content selection”. in: proceedings of the sixth european workshop on natural language generation, 1997, pp. 117-126. [52] f. ruff. “the advanced role of corporate foresight in innovation and strategic management-reflections on practical experiences from the automotive industry”. technological forecasting and social change, vol. 101, pp. 37-48, 2015. [53] w. seidman and m. mccauley. “optimizing knowledge transfer and use”. cerebyte, inc., lake oswego, oregon, 2005. [54] n. k. sekaran and g. b. seymann. “hospital-based quality improvement initiatives”. hospital medicine clinics, vol. 3, no. 3, pp. e441-e456, 2014. [55] j. song, h. zhan, j. yu, q. zhang and y. wu. “enterprise knowledge recommendation approach based on context-aware of time-sequence relationship”. procedia computer science, vol. 107, pp. 285-290, 2017. [56] a. tiwana. “the knowledge management toolkit: orchestrating it, strategy and knowledge platform”. 2nd ed. prentice hall, upper saddle river, 2002. [57] s. wang. “to share or not to share: an examination of the determinants of sharing knowledge via knowledge management systems”. doctoral dissertation, ohio state university, united states, 2005. [58] e. whelan and r. teigland. “transactive memory systems as a collective filter for mitigating information overload in digitally enabled organizational groups”. information and organization, vol. 23, no. 3, pp. 177-197, 2013. [59] t. wilson. “information overload: implications for healthcare services”. health informatics journal, vol. 7, no. 2, pp. 112-117, 2001. [60] r. wong and t. tiainen. “are you ready for right knowledge management strategy: identifying the potential restrains using the action space approach”. frontiers of e-business research, pp. 480-490, 2004. [61] x. xie, h. zou and g. qi. “knowledge absorptive capacity and innovation performance in high-tech companies: a multi-mediating analysis”. journal of business research, vol. 88, pp. 289-297, 2018. [62] s. zahra and g. george. “absorptive capacity: a review, reconceptualization, and extension”. academy of management review, vol. 27, no. 2, pp. 185-203, 2002. [63] x. zhang. “understanding conceptual framework of knowledge management in government (condensed version)”. presentation on un capacity-building workshop on back office management for e/m-government in asia and the pacific region, shanghai, china, 2008. [64] m. s. al-ani, s. n. jawad and s. abdelal. “knowledge management functions applied in jordanian industrial companies: study the impact of regulatory overload”. uhd journal of science and technology, vol. 5, no. 2. pp. 47-56, 2021. 
real-time twitter data analysis: a survey
hakar mohammed rasul, alaa khalil jumaa
technical college of informatics, sulaimani polytechnic university, sulaimani 46001, kurdistan region, iraq
a b s t r a c t
internet users are used to a steady stream of information in the contemporary world. numerous social media platforms, including twitter, facebook, and quora, are plagued with spam accounts, posing a significant problem. these accounts are created to trick unwary real users into clicking on dangerous links or to keep publishing repetitious messages using automated software. this may significantly affect the user experience on these websites. effective methods for detecting certain types of spam have been intensively researched and developed. doing sentiment analysis on these postings might aid in effectively resolving this issue. hence, this research provides a background study on twitter data analysis and surveys existing papers on twitter sentiment analysis and fake account detection and classification. the investigation is restricted to the identification of social bots on the twitter social media network. it examines the methodologies, classifiers, and detection accuracies of the several detection strategies now in use.
index terms: twitter, data analysis, twitter streaming application programming interface, sentiment analysis, bot detection
corresponding author's e-mail: hakar.mohammed.r@spu.edu.iq
received: 17-06-2022 accepted: 11-11-2022 published: 21-12-2022
access this article online doi: 10.21928/uhdjst.v6n2y2022.pp147-155 e-issn: 2521-4217 p-issn: 2521-4209
copyright © 2022 rasul and jumaa. this is an open access article distributed under the creative commons attribution non-commercial no derivatives license 4.0 (cc by-nc-nd 4.0)
survey
1. introduction
the availability of data has increased dramatically since the big data era began, and it is predicted that this trend will continue in the years to come. thorough research is being done to make appropriate use of this knowledge. big data and big data analytics have created opportunities for businesses and scholars that were previously unthinkable. research in artificial intelligence on how to leverage readily available data is producing fascinating and important results. there are several sources of big data, and one of the most well-known ones is social networks, including twitter. twitter is a microblogging social network that enables users to post short messages (up to 280 characters) called tweets. users may interact with one another on twitter by responding to tweets, referencing other users in their tweets, or retweeting another user's message. users can also follow each other to keep up with what other people are saying on twitter. all registered users have access to the social network's services through a web page, mobile apps, and an application programming interface (api). the latter method of access has produced an ecosystem of applications that enhance the user's experience of information consumption and aggregation. however, it has also aided the development of systems for account management and automated tweet publishing [1]. fully automated accounts are called bots. they can retweet exciting and relevant material for specific communities or aggregate tweets about a topic. one area that requires attention is bot detection analysis, since around 48 million twitter accounts have been maintained by automated programs dubbed bots, accounting for up to 15% of all twitter accounts [2]. certain bots are helpful for numerous tasks, including automatically publishing news and academic articles and aiding in emergency circumstances. nonetheless, twitter bots have been used for malicious
purposes, such as spreading malware or manipulating public opinion on a certain subject. bot identification software is predicated on the premise that the behavior of a human account is distinct from that of a bot. to quantify these discrepancies, representative factors including the statistical distribution of the terms used in tweets, the frequency of daily posts, and the number of individuals who follow the user may be employed [3]. apache spark will be required for this kind of data analysis on twitter. as a result, the sections that follow in this paper examine related efforts on twitter data analysis employing apache spark and bot identification, as well as the available tools. 2. background information 2.1. twitter twitter is a microblogging and social networking website that enables users to post and receive 280-character messages called "tweets." registered users may send tweets and follow other users. unregistered users may browse public tweets on twitter without having an account [4]. over 300 million individuals use twitter on a regular basis, and more than 500 million tweets are sent each day in 33 different languages [5]. one of twitter's main benefits is the capacity for communication and sharing with other users. by sharing links, pictures, and videos with their followers, people and businesses may interact with them [6]. this section explains some of twitter's features:
1. follow: to follow someone on twitter is to subscribe to their tweets or site updates. another twitter user who has followed you is referred to as a "follower," while other twitter users you have decided to follow on the platform are referred to as "following" [7].
2. @: in tweets, the @ symbol is used to identify usernames. the @ sign before a username (like @hakarrasul) creates a link to that twitter user's profile [8].
3. reply: a tweet posted in response to another user's tweet. to reply to a tweet, users often click the "reply" box or icon adjacent to it. a reply always begins with @username [9].
4. retweet: the act of forwarding another user's tweet is denoted as "retweeting." in essence, you are sharing another user's tweet on your profile while properly acknowledging the message's original writer [10].
5. mention: this term refers to tweets that contain a username; @replies are a type of mention as well [11].
6. hashtag: the # symbol is used in tweets to denote topics or keywords. hashtags are limited to letters and numbers (no punctuation). other twitter users may search for a hashtag you tweet, and any twitter user may create a hashtag at any moment [12].
7. direct messages: these tweets, sometimes referred to as direct messages or simply "messages," are confidential between the sender and the recipient. when you start a tweet with "d username" to identify the recipient, the tweet becomes a direct message (dm). you should be following someone to send them a direct message [13].
8. trends: a subject recognized by twitter's algorithm as being among the hottest subjects on the network right now [14].
9. favorites: to add a tweet to your favorites, click the yellow icon next to the tweet. tweets you have favorited will stay in your list until you delete them [15].
2.2. twitter streaming api the twitter api now includes a streaming api in addition to two separate rest apis. the streaming api provides real-time access to tweets that are sampled and filtered. the api is http-based, with data accessible through get, post, and delete requests. the streaming api allows you to access subsets of public status descriptions, such as answers and mentions from public accounts, in near-real time. protected users' status descriptions and direct messages are not viewable. the streaming api may filter status descriptions based on quality criteria, which are influenced, among other things, by frequent and repeated status updates [16]. the api requires a valid twitter account and employs simple http authentication. data may be obtained in both xml and the shorter json format. the parsing of json data obtained from the streaming api is straightforward: each object is delivered on a separate line, with a carriage return at the end [17]. twitter streaming data allow every user to learn about what is going on around the globe at any given moment, and the twitter streaming api provides access to a huge quantity of tweets in real time [18]. a java library called twitter4j is available to access the streaming api and download twitter data for analysis. these data are filtered using a list of provided keywords. this research will use apache spark, a distributed data processing system with master and worker nodes. this cluster can manage millions of records and is scalable. map-reduce operations on spark might be used to filter out the massive amount of data. for each tweet in the data, a json object is included in the input file. this file is loaded into a spark data frame structure. once the spark data frame has been replicated and distributed across several nodes, the mapper classifies all files in the directory according to the specified filter. these cleaned tweets then go through data mining techniques, allowing for a one-to-one analysis of data that will be useful for making difficult judgments [19]. 2.3. analysis process on twitter data there are several steps to be performed to analyze twitter data. fig. 1 illustrates the phases of analyzing twitter data. (fig. 1. twitter data analysis process [20].) 2.3.1. dataset collection to gather real-time data, an application must be developed that uses the twitter api to capture the information of people who recently tweeted about the issue and construct a user-based feature set data frame [21]. 2.3.2. processing tweets this stage involves removing unnecessary material from tweets using regular expressions [22]. 2.3.3. feature selection here, some features should be considered, such as user-based and content-based features. these features have to be selected to enhance the detection and classification process [21]. 2.3.4. classification in this step, the user must be checked in real time to determine whether it is a bot or a human, which may be accomplished by training and testing the proposed model. the proposed model can be built using one of the machine learning algorithms. after applying the machine learning algorithm on a preprocessed, existing, labeled dataset, a model can be created. then, this model can be used to predict whether a streamed tweet obtained from twitter comes from a human or a bot [22].
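as an illustration of the filtering and cleaning workflow described in sections 2.2–2.3, the following pyspark sketch (a hypothetical example: the input file name, json field names, and keyword list are placeholders, not details taken from this paper) loads a file of tweet json objects, keeps only tweets matching given keywords, and strips urls, mentions, and punctuation with regular expressions; the cleaned tweets would then go on to feature extraction and the classifier of section 2.3.4.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lower, regexp_replace

# hypothetical input: one json object per line, one tweet per object
TWEETS_PATH = "tweets.json"
KEYWORDS = ["covid", "election"]  # placeholder filter keywords

spark = SparkSession.builder.appName("tweet-filtering").getOrCreate()

# load the raw tweets into a spark dataframe (distributed across the worker nodes)
tweets = spark.read.json(TWEETS_PATH)

# keep only tweets whose text mentions at least one of the keywords
keyword_filter = None
for kw in KEYWORDS:
    cond = lower(col("text")).contains(kw)
    keyword_filter = cond if keyword_filter is None else (keyword_filter | cond)
filtered = tweets.filter(keyword_filter)

# basic preprocessing: strip urls, user mentions, and non-alphanumeric characters
cleaned = filtered.withColumn(
    "clean_text",
    regexp_replace(
        regexp_replace(lower(col("text")), r"(https?://\S+)|(@\w+)", " "),
        r"[^a-z0-9#\s]", " ",
    ),
)

# the cleaned tweets would next go to feature extraction and to the
# trained bot/sentiment classifier described in section 2.3.4
cleaned.select("id", "clean_text").show(5, truncate=False)
```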
3. methodology this section clarifies the searching, filtering, and selection stages that were employed throughout this paper's research. 3.1. research sections in section i, questions like (what is big data, what is twitter, and what is the connection between twitter and big data?) have been answered. then, section ii explained twitter and its important elements, the twitter api and its use, and the phases of twitter data processing. section iii gives a methodology for how this paper has been organized and the methods that have been used to gather information. section iv provides the survey methodology and is divided into two parts: a survey of articles about sentiment analysis and a survey of articles about bot detection and classification. 3.2. search query this paper aims to summarize the current state of the real-time twitter data analysis topic and discuss the findings presented in recent research papers. hence, the following keywords have been used: ("twitter data") and ("real-time" or "bots") and ("sentiment analysis" or "bot classification" or "data extraction" or "preprocessing" or "text-mining" or "web-mining") and ("challenges" or "problems" or "patterns"). 3.3. selection of sources google scholar and elsevier have been used for applying the search queries, and the databases that have been considered were the ieeexplore digital library, springerlink journal, elsevier, and science direct. 3.3.1. selection phases each article chosen to be used in this paper has gone through these processes: the first phase of article selection is applying the search queries. then, only articles published between 2016 and 2021 were selected. after that, the title of the research and the list of index terms were considered to see whether they include the keywords "twitter, data analysis." the next step was reading the abstract and the conclusion of the paper, and selecting the paper according to its abstract and conclusion and the relevance of its body to them. finally, the last phase was considering the journal's indexing and whether it is peer-reviewed or not. 4. survey methodology when it comes to twitter data analysis, there are various types of analysis that might be done on the collected data, such as sentiment analysis, tweet classification, and fake tweet detection. hence, this survey is categorized into two sections: (a) sentiment analysis and (b) tweet classification and bot detection. table 1 lists the studies that have been surveyed in this section.
table 1: list of studies that have been reviewed in section iv
title | author(s) | technique(s) | result(s) | year
sentiment analysis:
"sentiment analysis and classification of indian farmers' protest using twitter data" | ashwin sanjay neogi, kirti anilkumar garg, ram krishn mishra, yogesh k. dwivedi | bag of words and tf-idf | bag of words was more effective than tf-idf | 2022
"an optimal deep learning-based lstm for stock price prediction using twitter sentiment analysis" | t. swathi, n. kasiviswanath, a. ananda rao | tlbo-lstm | precision: 0.95, recall: 0.85, accuracy: 0.94, f1-score: 0.90 | 2022
"twitter sentiment analysis during covid-19 outbreak" | akash dutt dubey | nrc emotion lexicon | the majority of individuals around the globe are optimistic | 2020
"detection of fake tweets using sentiment analysis" | c. monica, n. nagarathna | rule-based prediction | accuracy: 0.97, f1-score: 0.73, precision: 1.00, recall: 0.97 | 2020
"sentiment analysis of twitter data during critical events through bayesian networks classifiers" | gonzalo a. ruz, pablo a. henríquez, aldo mascareño | bayes factor | accuracy: 0.85, precision: 0.92, recall: 0.77, f1-score: 0.82 | 2020
"twitter sentiment analysis based on ordinal regression" | shihab elbagir saad, jing yang | multinomial logistic regression (softmax), support vector regression (svr), decision trees (dts), and random forest (rf) | accuracy: 0.91, f1-score: 0.85 using decision tree | 2019
classification and bot detection:
"the rise of social bots," "online human-bot interactions: detection, estimation, and characterization," "deep neural networks for bot detection," "evolution of bot and human behavior during elections," "measuring bot and human behavioral dynamics" | emilio ferrara et al. | session features | accuracy: 0.97 | 2016, 2017, 2018, 2019, 2020
"a deep learning model for twitter spam detection" | zulfikar alom, barbara carminati, elena ferrari | deep learning | accuracy: 0.99, recall: 0.98, f1-score: 0.93 | 2020
"twitter bot detection using bidirectional long short-term memory neural networks and word embeddings" | feng wei, uyen trang nguyen | recurrent neural networks, specifically bidirectional long short-term memory (bilstm) | accuracy: 0.92, precision: 1.00, recall: 0.85, f1-score: 0.92 | 2019
"social network polluting contents detection through deep learning techniques" | fabio martinelli, francesco mercaldo, antonella santone | combination of word embedding and deep learning | precision: 0.79, recall: 0.73, f1-score: 0.76 | 2019
"deepscan: exploiting deep learning for malicious account detection in location-based social networks" | qingyuan gong, yang chen, xinlei he, zhou zhuang, tianyi wang, hong huang, xin wang, xiaoming fu | long short-term memory (lstm) neural network | precision: 0.95, recall: 0.97, f1-score: 0.96 | 2018
"measuring bot and human behavioral dynamics" | iacopo pozzana, emilio ferrara | extra trees (et), dt, random forests (rf), adaptive boosting (ab), and knn | et and rf had the greatest cross-validated average performance, 0.86 | 2018
"deep neural networks for bot detection" | sneha kudugunta, emilio ferrara | deep neural network based on contextual long short-term memory (lstm) | accuracy: 0.96, precision: 0.96, recall: 0.96, f1-score: 0.96 | 2018
"classification of twitter accounts into automated agents and human users" | zafar gilani, ekaterina kochmar, jon crowcroft | random forests classifier | accuracy: 0.86, precision: 0.85, recall: 0.82, f1-score: 0.83 | 2017
"detecting automation of twitter accounts: are you a human, bot, or cyborg?" | zi chu, steven gianvecchio, haining wang, sushil jajodia | bayesian classification | overall system accuracy: 96.0 | 2012
"detecting spam bots in online social networking sites: a machine learning approach" | alex hai wang | decision tree (dt), neural network (nn), support vector machines (svm), naive bayesian (nb), and k-nearest neighbors (knn) | accuracy: 0.91, precision: 0.91, recall: 0.91, f1-score: 0.91 using nb | 2010
4.1. sentiment analysis on twitter determining the mood of a word, phrase, or tweet automatically is quite challenging for a computer. to ascertain the polarity of the words and perform sentiment analysis, human participation is required. since it is used to evaluate people's sentiments, views, and emotions, this form of analysis is sometimes referred to as "opinion mining." it is done by evaluating each word's attitude and classifying it as either positive, negative, or neutral.
in addition, there are other ways to do sentiment analysis, including by employing a lexicon, machine learning, deep learning, or a combination of machine learning and lexiconbased approaches. in the lines that follow, recent studies on sentiment analysis on twitter will be reviewed. neogi et al. [23] acquired data from the microblogging website twitter on farmer protests to comprehend the global views shared by the public. they categorized and analyzed the attitudes based on over 20,000 tweets about the demonstration using algorithms. using bag of words and tf-idf for their investigation, they observed that bag of words performed better than tf-idf. in addition, they used naive bayes, decision trees, random forests (rfs), and support vector machines and found that rf provided the most accurate categorization. given that millions of individuals shared their thoughts about the protests, one of the study’s limitations is that they may have retrieved a rather high number of tweets. a greater quantity of tweets may have been useful in revealing a variety of emotions. using twitter data, swathi et al. [24] provide a novel teaching and learning-based optimization (tlbo) model with long short-term memory (lstm)-based sentiment analysis for stock price prediction. due to the short length and peculiar grammatical patterns of tweets, data pre-processing is required to eliminate irrelevant information and put it into a readable format. in addition, the lstm model is used to categorize tweets into positive and negative opinions about stock values. they help explore the correlation between tweets and stock market values. the adam optimizer is used to set the learning rate of the lstm model to enhance its prediction performance. in addition, the tlbo model is used to properly adjust the output unit of the lstm model. on twitter data, experiments are conducted to improve the forecasting ability of the tlbo-lstm model for stock prices. the experimental results of the tlbo-lstm model outperform the state-of-the-art approaches in a variety of respects. the tlbo-lstm model gave an excellent result, with a maximum accuracy of 95.33%, a recall of 85.28%, and an f-score of 90%. the tlbo-lstm model outperformed the competition by attaining a superior accuracy of 94.73%. dubey [25] used twitter sentiment analysis to ascertain how residents in different countries are coping with the covid-19 outbreak. the research analyzed tweets from 12 different countries. these tweets were gathered between march 11, and march 31, 2020, and are associated to covid-19 in some manner. the tweets were acquired, pre-processed, and then subjected to sentiment and text mining analysis. the study’s findings show that, although the most people worldwide are optimistic and hopeful, there are instances of fear, sadness, and disdain around the globe. the study analyzed tweets from the selected nations using the nrc emtoion lexicon. the nrc lexicon of word-emotion associations has 10,170 lexical units that examine not just positive and negative polarity, but also the eight emotions established by plutchik. on average, 50,000 tweets were used in the study from each nation every 4 days. the collection was conducted using the r package rtweet. covid-19, coronavirus, corona, stay home stay safe, and covid-19 pandemic were the keywords used to gather the tweets. while collecting the tweets, the retweets and responses were filtered out to prevent repetition. 
when the whole database was in hand, data cleaning was done, which included the removal of white spaces, punctuation, and stop words, and the conversion of tweets to lower case. following data cleansing, the tweets were analyzed with the nrc emotion lexicon using the get_nrc_sentiment function. after scoring tweets on feelings and emotions, a corpus was built to generate a word cloud for each nation. however, a drawback of the study is that the nrc emotion lexicon does not include sarcasm and irony as emotions. in another study, monica and nagarathna [26] give users who have recently written about a certain topic a model that analyzes how they feel about it based on real-time data. they use this algorithm to create a sentiment score for each user based on content-based criteria to detect twitter spam. the suggested method applies a custom rule-based algorithm for bot detection and compares it to a number of different algorithms, such as mlp, decision tree, and rf, to establish the model's effectiveness in detecting spam accounts. the twitter api was used to collect real-time data for this investigation. the data extraction procedure includes extracting the characteristics required for the research, preprocessing, and sentiment analysis. then, using the fake prediction algorithm, mlp, decision tree, and rf, the data are categorized to determine how many of them are authentic and legitimate users. they found that the rule-based fake prediction system achieved an accuracy of 0.97, which was superior to the existing machine learning classifiers. the study has two major limitations. first, the group of users from which data had been collected is small. second, english was the only language examined for analysis. using data from the 2010 chile earthquake and the 2017 catalan independence vote, ruz et al. [27] examined five classifiers (one of which is a variation of the tan model) and evaluated their effectiveness on two twitter datasets. they consider bayesian network classifiers for sentiment analysis on two spanish-language datasets: the 2010 chilean earthquake and the 2017 catalan independence vote. to automatically manage the number of edges supported by training instances in the bayesian network classifier, they employ a bayes factor technique, resulting in networks that are more realistic. given a significant number of training instances, the findings demonstrate the efficacy of the bayes factor measure and its competitive prediction performance when compared to support vector machines and rfs. in addition, the generated networks enable the identification of word-to-word relationships, providing valuable qualitative information for understanding the key characteristics of event dynamics from a historical and social perspective. even when there are not enough training examples, the research shows that the event dynamics may be understood using qualitative information from tan and bbf tan. furthermore, the generated networks may be applied to tell a story about the important event that was studied. however, this study could be enhanced by applying the bayesian network classifier together with grounded theory.
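the machine-learning sentiment classifiers surveyed above follow the same basic recipe: vectorize the cleaned tweet text and train a supervised classifier on labeled examples. the following scikit-learn sketch is an illustrative stand-in, using a multinomial naive bayes model rather than the bayesian-network or rule-based classifiers of the cited studies, and a tiny hypothetical labeled dataset; it only shows the shape of that recipe end to end.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# hypothetical labeled tweets (in practice thousands of labeled examples are needed)
tweets = [
    "i love how fast the relief teams responded",
    "this is terrible news, the city is flooded again",
    "the vote results will be announced tomorrow",
    "so happy with the new policy announcement",
    "awful service, never using this airline again",
    "the meeting is scheduled for 10 am",
]
labels = ["positive", "negative", "neutral", "positive", "negative", "neutral"]

# tf-idf features feeding a multinomial naive bayes classifier
model = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True, ngram_range=(1, 2))),
    ("clf", MultinomialNB()),
])
model.fit(tweets, labels)

# classify new, unseen tweets
print(model.predict(["what a wonderful day for the protest",
                     "the earthquake destroyed so many homes"]))
```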
along the same line, saad and yang [28] attempt to undertake a complete twitter sentiment analysis using machine learning techniques and ordinal regression. the suggested technique comprises pre-processing tweets and then generating relevant features using a feature extraction method. the scoring and balancing aspects come next, and they may be categorized in a number of different ways. the suggested system uses rf, multinomial logistic regression (softmax), decision trees (dts), and support vector regression (svr) methods for sentiment analysis categorization. this system's actual implementation relies on a twitter dataset made available through the nltk corpus resources. according to the experimental data, the proposed solution can reliably perform ordinal regression using machine learning methods. furthermore, the results suggest that decision trees outperform all other algorithms in terms of delivering the best outcomes. the proposed system consists of four key components. the first module is data acquisition, which is the method of gathering labeled tweets for sentiment analysis; the second module is preprocessing, which is the method of converting and refining tweets into a data set that might easily be used for further analysis. the third module emphasizes the extraction of relevant features for classification model construction. following that, the method for balancing and evaluating tweets is presented. the final module sorts tweets into high positive, moderate positive, neutral, moderate negative, and high negative categories using a number of machine learning classifiers. based on the study results, svr and rf have almost the same accuracy, which is superior to the multinomial logistic regression classifier. the decision tree, however, is the most accurate, with a score of 91.81%. based on the findings of the trials, the suggested model can accurately perform ordinal regression on twitter data using machine learning methods. 4.2. fake account detection and classification twitter bots are software-controlled automated twitter accounts that are trained to perform duties similar to those carried out by regular twitter users, such as liking tweets and following other users. twitter bots can be applied for a number of beneficial reasons, including broadcasting critical material such as weather crises in real time, publishing useful content in bulk, and producing automated direct message responses. however, twitter bots might also be used for negative purposes such as spreading fake news campaigns, spamming, compromising others' privacy, and sock-puppetry. the following paragraphs provide a survey of recent research on twitter bot detection and classification. kudugunta and ferrara [29] used both conventional machine learning classifiers and deep learning techniques to identify bots on twitter, both at the account and tweet levels. they used smote with data augmentation using (1) edited nearest neighbors (enn) and (2) tomek links to address the unbalanced dataset.
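the class-imbalance handling mentioned above (smote combined with enn or tomek-link cleaning) can be reproduced with the imbalanced-learn package. the sketch below is a hypothetical illustration on synthetic data, not the cited authors' code; it only shows where the resampling step sits before training a classifier.

```python
from collections import Counter

from imblearn.combine import SMOTEENN, SMOTETomek
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# synthetic, imbalanced account-level features (roughly 5% bots, 95% humans)
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=42)
print("before resampling:", Counter(y))

# smote oversampling followed by edited-nearest-neighbours cleaning
X_enn, y_enn = SMOTEENN(random_state=42).fit_resample(X, y)
print("after smote + enn:", Counter(y_enn))

# smote oversampling followed by tomek-link removal
X_tomek, y_tomek = SMOTETomek(random_state=42).fit_resample(X, y)
print("after smote + tomek links:", Counter(y_tomek))

# the balanced data would then be used to train the bot/human classifier
clf = RandomForestClassifier(random_state=42).fit(X_enn, y_enn)
```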
a collection of classifiers, including logistic regression, an sgd classifier, an rf classifier, an adaboost classifier, and an mlp, was first trained using a minimal set of features. second, they proposed a deep learning architecture, contextual lstm, to discriminate between tweets made by actual people and those generated by bots. the contextual lstm design incorporates both the tweet text and account metadata; it is a multi-input, multi-output system that produces accurate classification results.

alom et al. [30] also proposed two deep learning techniques using convolutional neural networks (cnns) for identifying spam on twitter at both the account and tweet levels. first, they developed a text-based classifier composed of an embedding layer and a cnn layer to determine whether or not a particular tweet belongs to a spammer. next, they proposed a combined classifier that uses both the text-based classifier and a neural network over users’ information to identify spammers at the account level on twitter. for their tests, they used two twitter datasets and compared the performance of their proposed machine learning and deep learning-based techniques to that of current state-of-the-art machine learning and deep learning-based approaches.

wei and nguyen [31] used a deep learning architecture consisting of an embedding layer, three bidirectional lstm layers, and a fully connected layer that produces the final output, in order to identify whether tweets on twitter were created by actual individuals or bots. they attained performance comparable to that of current cutting-edge bot detection systems. martinelli et al. [32] developed a simplified deep learning method for determining whether a single tweet was produced by a spammer or not. in their tests, the authors built several mlp classifiers with between zero and four hidden layers. word embeddings were used as features (inputs to the mlp classifiers): after loading pre-trained word embeddings, they converted each word to a numerical vector and then averaged the vectors of all words in a tweet.

gong et al. [33] proposed a more complex deep learning architecture and feature extraction approach for detecting fraudulent users on dianping, a location-based social network. first, they retrieved information that can be categorized into five major groups: time-series, spatial-temporal, user-generated content, social, and demographic features. the time-series characteristics were then used as input to the deep learning model, which consisted of a bilstm layer followed by a fully connected layer with a softmax activation function. this model’s output consists of two probabilities (the probability of being legitimate and the probability of being malicious). these probabilities were then combined with the other data (traditional features) to train machine learning algorithms and obtain the final result. multiple machine learning techniques, including xgboost, rf, c4.5 decision tree, and svm, were trained; according to the f1-score, xgboost produced the best classification results.

in their research, gilani et al. [34] classified twitter accounts into two categories: automated bots and real users. they gathered data using their own platform, stweeler, collecting 2.5-3 million tweets every day, and divided their data into four subsets (10 million, one million, one hundred thousand, and one thousand), each representing a level of account popularity based on the number of followers.
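before continuing the survey, the “tweet text plus account metadata” idea behind the contextual lstm of [29] (and the similar combination of bilstm outputs with traditional features in [33]) can be illustrated as a small two-input keras model. the vocabulary size, sequence length, metadata dimension, and layer widths below are illustrative assumptions, not values taken from the papers.

```python
# sketch only: a two-input model in the spirit of the contextual lstm of [29];
# vocabulary size, sequence length, and metadata dimension are assumptions.
from tensorflow.keras import Model, layers

VOCAB_SIZE, MAX_LEN, N_META = 20000, 50, 10   # illustrative sizes

# branch 1: the tokenised tweet text, encoded by an lstm
text_in = layers.Input(shape=(MAX_LEN,), name="tweet_tokens")
x = layers.Embedding(VOCAB_SIZE, 64)(text_in)
x = layers.LSTM(64)(x)

# branch 2: account metadata (e.g. follower count, account age), dense-encoded
meta_in = layers.Input(shape=(N_META,), name="account_metadata")
m = layers.Dense(32, activation="relu")(meta_in)

# merge both views and predict bot vs. human
merged = layers.concatenate([x, m])
out = layers.Dense(1, activation="sigmoid", name="is_bot")(merged)

model = Model(inputs=[text_in, meta_in], outputs=out)
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()
```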
for the tagging procedure, they employed human annotation and used cohen’s kappa coefficient to ensure that the annotator judgments were reliable. in all, 3,536 accounts across the four bands were used in the testing phase. the authors extracted 15 features and, after a statistical computation, used the rf classifier. they performed 5-fold cross-validation, training and testing in three different sets of experiments. the accuracy was 86.44%, the precision 85.44%, the recall 82.24%, and the f-measure 83.4%. among the 15 features, they found that six ranked the highest. there are two issues with this study. first, it relies on human annotation. second, it did not use tweet content as one of the attributes, although applying nlp to analyze the content might enhance the accuracy of the system.

in a further recent work, pozzana and ferrara [35] examined four tweet metrics to determine how bots behave during a single activity session: the number of mentions per tweet, the length of the text in the tweet, the fraction of retweets, and the fraction of replies. this study identified behavioral distinctions between human users and bot accounts that can be used to improve bot detection algorithms. for example, humans are continually exposed to tweets and messages from other users when engaged in online activity, which increases their chance of engaging in social interaction. the authors employed five machine learning methods (extra trees (et), dt, rf, adaptive boosting (ab), and knn) to assess whether tweets were created by a bot or a human. the studies used a dataset of over 16 million tweets posted by over 2 million unique individuals. et and rf had the greatest cross-validated average performance (86%), followed by dt and ab (83%) and knn (81%). however, the research’s failure to categorize whether a bot is harmful or not might be seen as a limitation.

to detect spam bots, wang [36] employed three graph-based and three tweet-based features. the graph-based features (such as the user’s number of friends, number of followers, and follower ratio) are retrieved from the user’s social network, whereas the tweet-based features (such as the number of duplicate tweets, http links, and replies/mentions) are retrieved from the user’s most recent 20 tweets. the dataset used to evaluate this approach includes 25,847 users, around 500k tweets, and approximately 49m followers/friends taken from publicly accessible twitter data. several classification techniques, including decision tree (dt), neural network (nn), support vector machines (svm), naive bayesian (nb), and k-nearest neighbors (knn), were used to detect spam bots. with 91% accuracy, 91% recall, and 91% f-measure, the nb classifier achieved the best results.

chu et al. [37] classified twitter users into three groups, human, bot, and cyborg, based on attributes retrieved from tweet content, tweeting behavior, and account properties. the authors assumed that bot behavior is less sophisticated than human behavior. they used an entropy rate to quantify the complexity of a process, with low rates indicating a regular process, medium rates indicating a complex process, and high rates indicating a random process. the body of the tweet is used to match text patterns of known spam on twitter.
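the entropy intuition in [37] can be illustrated with a deliberately simplified calculation: chu et al. use a corrected conditional entropy over inter-tweet delays, whereas the sketch below only computes a first-order shannon entropy over binned delays on made-up timestamps. it shows that perfectly regular (bot-like) posting yields near-zero entropy, while irregular (human-like) posting yields a noticeably higher value.

```python
# sketch only: first-order shannon entropy of binned inter-tweet delays.
# this simplifies the corrected conditional entropy used in [37];
# the timestamps below are made up.
import numpy as np

def interval_entropy(timestamps, bins=10):
    """shannon entropy (bits) of the distribution of inter-tweet delays."""
    delays = np.diff(np.sort(np.asarray(timestamps, dtype=float)))
    counts, _ = np.histogram(delays, bins=bins)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log2(p)).sum())

bot_like = np.arange(0, 600, 60)                      # one tweet exactly every 60 s
rng = np.random.default_rng(0)
human_like = np.cumsum(rng.exponential(60, size=10))  # irregular gaps, mean 60 s

print("bot-like entropy:  ", interval_entropy(bot_like))    # ~0 (regular)
print("human-like entropy:", interval_entropy(human_like))  # noticeably larger
```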
other account-related factors, for example the percentage of external urls, the safety of links, and the date of account registration, are also used in the classification. the rf machine-learning algorithm is applied to these factors to determine whether a twitter account is a human, a bot, or a cyborg. the classifier’s effectiveness is tested using a dataset of 500,000 different twitter users; the overall true positive rate for this strategy was 96.0% on average.

in addition, following multiple experiments, ferrara et al. [38] developed an artificial intelligence program to spot bots on twitter based on differences in activity patterns between legitimate and fake accounts. they examined two distinct datasets of twitter users who were labeled as bots or humans, manually and by pre-existing methods. the manually validated data collection included 8.4 million tweets from 3,500 human accounts and 3.4 million tweets from 5,000 bots. according to the study, human users reacted to other tweets 4 to 5 times more often than bots. over the course of an hour, genuine users become more engaged, with the proportion of replies growing, while the length of human users’ tweets decreases as the session goes on. according to ferrara, the quantity of information conveyed decreases over the session; the author suggests that this change is related to cognitive fatigue, which makes individuals less inclined to exert mental effort in producing new material over time. bots, in contrast, exhibit no change in their engagement or in the quantity of material they tweet over time.

5. conclusion

this research covers papers on the analysis of real-time twitter data, including the classification and identification of bots and real-time sentiment analysis. to do this, the literature on twitter sentiment analysis and on bot identification and classification was analyzed. in addition, the research reviews twitter’s platform characteristics, the streaming api, and the stages of data analysis. according to the publications examined for the sentiment analysis part of this study, several researchers have used opinion analysis to determine the negative and positive feelings of twitter users; however, according to the studied articles, sarcasm and irony were never effectively evaluated. according to the publications examined in this article, the length of tweets and a decrease in the amount of information communicated, which can be evaluated by measuring the tweet’s interactivity, are behavioral patterns that can be used to distinguish between genuine and fraudulent twitter accounts. this paper also offers researchers information on the categories of twitter bots, analyzes current twitter analytic techniques, and reviews the latest twitter bot detection systems. as a follow-up to this study, our future research will use twitter sentiment analysis to enhance bot detection and classification.

references

[1] d. m. kancherla. “a hybrid approach for detecting automated spammers in twitter”. international educational applied research journal, vol. 3, no. 9, pp. 2707-2719, 2019.
[2] j. rodríguez-ruiz, j. i. mata-sánchez, r. monroy, o. loyola-gonzález and a. lópez-cuevas. “a one-class classification approach for bot detection on twitter”. computers and security, vol. 91, p. 101715, 2020.
[3] o. loyola-gonzález, r. monroy, j. rodríguez, a. l. cuevas and j. i. sánchez. “contrast pattern-based classification for bot detection on twitter”. ieee access, vol. 7, pp. 45800-45817, 2019.
[4] n. a. azeez, o. atiku, s. misra, a. adewumi, r. ahuja and r. damaševičius. “detection of malicious urls on twitter”. advances in electrical and computer technologies, vol. 672, pp. 309-318, 2020.
[5] j. chen. “twitter metrics: how and why you should track them”. sprout social, united states, 2021. available from: https://sproutsocial.com/insights/twitter-metrics/. [last accessed on 2021 nov 23].
[6] s. arifuzzaman and n. s. sattar. “covid-19 vaccination awareness and aftermath: public sentiment analysis on twitter data and vaccinated population prediction in the usa”. applied sciences, vol. 11, no. 14, p. 6128, 2021.
[7] o. inya. “egungun be careful, na express you dey go: socialising a newcomer-celebrity and co-constructing relational connection on twitter nigeria”. journal of pragmatics, vol. 184, pp. 140-151, 2021.
[8] h. piedrahita-valdés, d. piedrahita-castillo, j. bermejo-higuera, p. guillem-saiz, j. r. bermejo-higuera, j. guillem-saiz, j. a. sicilia-montalvo and f. machío-regidor. “vaccine hesitancy on social media: sentiment analysis from june 2011 to april 2019”. vaccines, vol. 9, no. 1, p. 28, 2019.
[9] a. c. breu. “from tweetstorm to tweetorials: threaded tweets as a tool for medical education and knowledge dissemination”. seminars in nephrology, vol. 40, no. 3, pp. 273-278, 2020.
[10] c. wukich. “connecting mayors: the content and formation of twitter information networks”. urban affairs review, vol. 58, pp. 3367, 2020.
[11] n. aguilar-gallegos, l. e. romero-garcía, e. g. martínez-gonzález, e. i. garcía-sánchez and j. aguilar-ávila. “dataset on dynamics of coronavirus on twitter”. data in brief, vol. 30, p. 105684, 2020.
[12] s. boon-itt and y. skunkan. “public perception of the covid-19 pandemic on twitter: sentiment analysis and topic modeling study”. jmir public health and surveillance, vol. 6, no. 4, p. e21978, 2020.
[13] v. cheplygina, f. hermans, c. albers, n. bielczyk and i. smeets. “ten simple rules for getting started on twitter as a scientist”. plos computational biology, vol. 16, no. 2, p. e1007513, 2020.
[14] r. chandrasekaran, v. mehta, t. valkunde and e. moustakas. “topics, trends, and sentiments of tweets about the covid-19 pandemic: temporal infoveillance study”. journal of medical internet research, vol. 22, no. 10, p. e22624, 2020.
[15] p. surowiec and c. miles. “the populist style and public diplomacy: kayfabe as performative agonism in trump’s twitter posts”. public relations inquiry, vol. 10, no. 1, pp. 5-30, 2021.
[16] i. a. mohammed and a. s. abbas. “twitter apis for collecting data of influenza viruses, a systematic review”. 2021 international conference on communication and information technology (icict), vol. 12, pp. 256-261, 2021.
[17] s. wu, m. a. rizoiu and l. xie. “variation across scales: measurement fidelity under twitter data sampling”. fourteenth international aaai conference on web and social media, vol. 14, no. 1, pp. 715-725, 2020.
[18] h. ledford. “how facebook, twitter and other data troves are revolutionizing social science”. nature, vol. 582, no. 7812, pp. 328-330, 2020.
[19] z. pehlivan, j. thièvre and t. drugeon. “archiving social media: the case of twitter”. the past web, springer, cham, pp. 43-56, 2021.
[20] i. nazeer, s. k. gupta, m. rashid and a. kumar. “use of novel ensemble machine learning approach for social media sentiment analysis”. analyzing global social media consumption, information science reference, hershey, pp. 61-28, 2020.
[21] r. al bashaireh, m. zohdy and v. sabeeh. “twitter data collection and extraction: a method and a new dataset, the utd-mi”. icisdm 2020: proceedings of the 2020 4th international conference on information system and data mining, pp. 71-76, 2020.
[22] r. p. mehta, m. a. sanghvi, d. k. shah and a. singh. “sentiment analysis of tweets using supervised learning algorithms”. first international conference on sustainable technologies for computational intelligence, vol. 1045, pp. 323-338, 2019.
[23] a. s. neogi, k. a. garg, r. k. mishra and y. k. dwivedi. “sentiment analysis and classification of indian farmers’ protest using twitter data”. international journal of information management data insights, vol. 1, no. 1, p. 100019, 2022.
[24] t. swathi, n. kasiviswanath and a. a. rao. “an optimal deep learning-based lstm for stock price prediction using twitter sentiment analysis”. applied intelligence, vol. 52, pp. 13675-13688, 2022.
[25] a. d. dubey. “twitter sentiment analysis during covid-19 outbreak”. jaipuria institute of management, vol. 9, pp. 71-76, 2020.
[26] c. monica and n. nagaraju. “detection of fake tweets using sentiment analysis”. sn computer science, vol. 1, no. 2, p. 89, 2020.
[27] g. a. ruz, p. a. henríquez and a. mascareño. “sentiment analysis of twitter data during critical events through bayesian networks classifiers”. future generation computer systems, vol. 106, pp. 92-104, 2020.
[28] s. e. saad and j. yang. “twitter sentiment analysis based on ordinal regression”. ieee access, vol. 7, pp. 163677-163685, 2019.
[29] s. kudugunta and e. ferrara. “deep neural networks for bot detection”. information sciences, vol. 467, pp. 312-322, 2018.
[30] z. alom, b. carminati and e. ferrari. “a deep learning model for twitter spam detection”. online social networks and media, vol. 18, p. 100079, 2020.
[31] f. wei and u. t. nguyen. “twitter bot detection using bidirectional long short-term memory neural networks and word embeddings”. 2019 first ieee international conference on trust, privacy and security in intelligent systems and applications (tpsisa), pp. 101-109, 2019.
[32] f. martinelli, f. mercaldo and a. santone. “social network polluting contents detection through deep learning techniques”. 2019 international joint conference on neural networks (ijcnn), pp. 1-10, 2019.
[33] q. gong, y. chen, x. he, z. zhuang, t. wang, h. huang, x. wang and x. fu. “deepscan: exploiting deep learning for malicious account detection in location-based social networks”. ieee communications magazine, vol. 56, no. 11, pp. 21-27, 2018.
[34] z. gilani, e. kochmar and j. crowcroft. “classification of twitter accounts into automated agents and human users”. association for computing machinery, vol. 17, pp. 489-496, 2017.
[35] i. pozzana and e. ferrara. “measuring bot and human behavioral dynamics”. human computer interaction, vol. 1, pp. 1-11, 2018.
[36] a. h. wang. “detecting spam bots in online social networking sites: a machine learning approach”. in: s. foresti and s. jajodia (eds.), data and applications security and privacy xxiv, vol. 6166, pp. 335-342, 2010.
[37] z. chu, s. gianvecchio, s. jajodia and h. wang. “detecting automation of twitter accounts: are you a human, bot, or cyborg?”. ieee transactions on dependable and secure computing, vol. 9, pp. 811-824, 2012.
[38] i. pozzana and e. ferrara. “measuring bot and human behavioral dynamics”. frontiers in physics, vol. 8, no. 125, p. 32, 2020.