Big Data in Telecom Industry: Effective Predictive Techniques on CDRs

Sara ElElimy and Samir Moustafa∗

Computational and Data Science and Engineering, Skolkovo Institute of Science and Technology

Abstract

Mobile network operators face many challenges in the digital era, especially high demand from customers. Mobile network operators are themselves a source of big data, and traditional techniques are not effective in the new era of big data, the Internet of Things (IoT), and 5G. As a result, handling different big datasets effectively becomes a vital task for operators as data continues to grow and networks move from Long Term Evolution (LTE) to 5G. There is therefore an urgent need for big data analytics that can predict future demand, traffic, and network performance in order to fulfill the requirements of the fifth generation of mobile network technology. In this paper, we introduce data science techniques using machine learning and deep learning algorithms: the auto-regressive integrated moving average (ARIMA), Bayesian-based curve fitting, and a recurrent neural network (RNN) are employed for a data-driven application for mobile network operators. The framework for each model covers parameter identification, estimation, prediction, and the final data-driven application of the prediction to business and network performance. These models are applied to the Telecom Italia Big Data Challenge call detail records (CDRs) datasets. The performance of the models is assessed using well-known evaluation criteria, which show that ARIMA (the machine learning-based model) is a more accurate predictive model on this dataset than the RNN (the deep learning model).

Received on 20 April 2020; accepted on 29 May 2020; published on 04 June 2020

Keywords: Big Data Analytics, Machine Learning, CDRs, 5G.

Copyright © 2020 Sara ElElimy et al., licensed to EAI.
This is an open access article distributed under the terms of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/), which permits unlimited use, distribution and reproduction in any medium so long as the original work is properly cited.

doi:10.4108/eai.13-7-2018.164919

EAI Endorsed Transactions on Smart Cities, Research Article
EAI Endorsed Transactions on Smart Cities 06 2020 - 07 2020 | Volume 4 | Issue 11 | e1

∗Corresponding author. Email: samir.mohamed@skoltech.ru

1. Introduction

Operators of mobile networks have begun to move from the fourth generation to the fifth generation, an upcoming and promising solution for meeting the requirements of wireless broadband. They have also started looking for innovative solutions to face these challenges and to provide a satisfying customer experience while managing a complex network through efficient handling of backhaul resources [1]. Telecom organizations and researchers have been studying a variety of techniques for managing big data adequately, so as to discover unknown knowledge and patterns in the information collected from operators and to help organizations provide smart services with reduced expenditure and resources.

With the fast uptake of mobile applications and services, the demands on wireless network infrastructure are growing. The 5G requirements and KPIs call for supporting the explosion in mobile traffic and providing low latency. This raises the need for real-time decision-making and for network resource management and optimization in order to increase customer satisfaction and enhance the user experience. Achieving these requirements and overcoming the associated problems with traditional methods has become a challenge for telecoms.
Traditional techniques are becoming ineffective in this area, so industry and academia have started to search for and create more effective techniques to deal with this tremendous increase in data. This raises the question of how telecoms deal with:

1. Enormous data sizes (various systems generate a huge amount of log data, reaching gigabytes to terabytes).
2. Different sources (data generated from different sources, e.g., routers, switches, applications, operating systems, etc.).
3. Heterogeneity (different formats, structures, terminology, etc.).

These questions and challenges form the main problem statement of this work, together with how telecoms benefit from applying ML/DL to different datasets and what kinds of applications these techniques enable compared with existing, traditional ones.

In this paper, we investigate big data-driven analysis and applications in the telecommunication industry for mobile network operators, covering the operational and business aspects of current and fifth-generation networks. We implement different big data-driven ML/DL techniques on data gathered from a telecommunication network and apply different prediction models to predict traffic. Finally, we discuss the results and applications that big data analytics brings in comparison with traditional methods, how they benefit the business and operational activities of companies, and in which types of applications they can be used.

2. Analytic Tools and Data Sources for Telecoms

2.1. Telecom Data Sources

Operators of mobile networks are both a source and a carrier of big data because the penetration of mobile users has increased significantly [2]. Before the transition to big data analytics, organizations relied on traditional techniques. These techniques pay less attention to operational data, and they do not concentrate significantly on transactional data.
Big data analytics is essential in several ways in comparison with traditional methods. For instance, when data are compressed for transmission, big data analytics defines which data are useful [3]. In many applications, a benefit of big data analytics is real-time decision-making based on monitoring the network infrastructure and its performance. By analyzing the sources and types of their data, mobile network operators (MNOs) can support and provide several smart services [4]. Data sources for telecoms have been classified as operator and subscriber data or as external and internal data sources [3], and, in a deeper classification for different networks, by core network, cell, subscriber, and KPI levels [5].

As for analytic tools, the main ones identified by previous studies include machine learning modeling, data mining, and statistical modeling [6]. With the current development and improvement of data analytics, networks based on big data have become an attractive area of research for numerous researchers around the globe [7], [8]. In the industrial sector, researchers have also recently developed and studied frameworks for managing big data efficiently in mobile networks.

2.2. Contribution of Call Detail Records (CDRs)

For mobile operators, CDRs have traditionally been considered essential for financial purposes. In the big data era, however, applications driven by CDRs are attracting the attention of researchers in both industry and science, because CDR datasets are rich in information about the communication among numerous users: how, when, and with whom they communicate.
The analysis of CDR datasets has become a significant and exciting research area [9] because these datasets serve numerous research purposes, which has led to improved dataset management techniques, new analytic techniques, and analyses from several perspectives using big data methods. Orange is recognized as one of the biggest telecom operators, and it launched its first challenge, the "D4D Challenge", in 2013. Through this challenge it invited participants from around the globe and gave them access to massive CDR datasets, with the aim of improving customer satisfaction and infrastructure as a source of additional revenue. The successful outcomes resulted in scientific work, which encouraged the organization to launch a second challenge during the NET mobile conference in April 2015 [9]. In Europe, Telecom Italia is also a recognized mobile operator that faces the same big data challenges, and in 2014 it launched the first edition of its Big Data Challenge [10].

3. Techniques and Methodology

Different techniques and methods are used to analyze these datasets. The techniques used in this work include data visualization, prediction, and clustering. We followed a framework for obtaining the best outcomes from the datasets. Pre-processing is the first step; it is essential when working with massive data and for understanding the hidden patterns in the data. The next step defines the type of analysis and the tools it requires, the type of application it drives, and the type of information it may need. Finally, based on the results, the best applications for this analysis are determined.

3.1. Data Set

The dataset contains millions of records collected between November and December 2013. In 2014, these datasets were part of the Telecom Italia Big Data Challenge. The collection was quite rich and included different types of data alongside telecommunications, including electricity data, weather forecasts, news, and social networking. Telecom Italia built the original dataset in collaboration with several labs and institutes:

• Fondazione Bruno Kessler.
• EIT ICT Labs.
• Trento and Trento RISE Institute.
• Milan Polytechnic University.
• MIT Media Lab.

The dataset was first released with the challenge participants in mind. Demand for the datasets nevertheless kept increasing after the competition ended, which turned the release into an initiative towards "Open Big Data." Following [10], the datasets were published freely to encourage their use in the community. Telecom Italia generated the dataset by computation over the call detail records of subscribers in the city of Milan. CDRs record user activity for billing and network management, but our research focuses on using the dataset for other applications rather than for these traditional purposes.

The dataset described in [10] consists of eight main variables:

• Square ID: the ID of the square that is part of the Milan grid.
• Time Interval: the start of the time interval, expressed as the number of milliseconds elapsed since the Unix Epoch (1 January 1970, UTC). The end of the interval is obtained by adding 10 minutes (600000 milliseconds) to this value.
• Country Code: the phone country code of a nation.
• SMS-in Activity: the activity of SMSs received inside the square during the time interval.
• SMS-out Activity: the activity of SMSs sent inside the square during the time interval.
• Call-in Activity: the activity of calls received inside the square during the time interval.
• Call-out Activity: the activity of calls issued inside the square during the time interval.
• Internet Traffic Activity: the Internet traffic activity performed inside the square during the time interval. For all of these activities, the country code identifies the nation of the user.

Several types of call detail records related to these activities are used to generate the dataset:

• Received SMS: generated every time a user receives an SMS.
• Sent SMS: generated every time a user sends an SMS.
• Incoming Call: generated every time a user receives a call.
• Outgoing Call: generated every time a user issues a call.
• Internet: generated every time a user starts or ends an Internet connection, or when, during the same connection, one of the following limits is reached:
  • 15 minutes since the last CDR was generated.
  • 5 MB since the last CDR was generated.

The dataset was formed by aggregating the records above into SMS, call, and Internet traffic activities. These activities measure the level of interaction between users and the mobile network. For instance, the more SMSs a user sends, the higher the SMS-out activity. The SMS and call activities have the same scale of measurement and are therefore comparable with each other.
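To make the record layout concrete, the following minimal Python sketch decodes the Time Interval field and sums Internet activity per square and per hour across country codes. It is only an illustration under stated assumptions: the tuple layout `(square_id, start_ms, country_code, internet_activity)`, the function names, and the sample rows are hypothetical and are not taken from the published files.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

INTERVAL_MS = 600_000  # 10 minutes, as stated in the dataset description


def interval_bounds(start_ms):
    """Return (start, end) of a 10-minute interval as UTC datetimes."""
    start = datetime.fromtimestamp(start_ms / 1000, tz=timezone.utc)
    return start, start + timedelta(milliseconds=INTERVAL_MS)


def hourly_internet_traffic(records):
    """Sum Internet activity per (square_id, hour), across country codes.

    `records` is an iterable of hypothetical tuples:
    (square_id, start_ms, country_code, internet_activity).
    """
    totals = defaultdict(float)
    for square_id, start_ms, _country_code, internet in records:
        start, _end = interval_bounds(start_ms)
        # Truncate the interval start to the hour it belongs to.
        hour = start.replace(minute=0, second=0, microsecond=0)
        totals[(square_id, hour)] += internet
    return dict(totals)


# Made-up rows for illustration only (not taken from the real dataset).
rows = [
    (4456, 1383260400000, 39, 10.0),
    (4456, 1383260400000, 33, 0.5),
    (4456, 1383261000000, 39, 12.0),
    (4456, 1383264000000, 39, 7.0),
]
hourly = hourly_internet_traffic(rows)
for (square_id, hour), total in sorted(hourly.items()):
    print(square_id, hour.isoformat(), total)
```

Aggregating to hourly totals first is one possible design choice; it matches the hourly granularity of the predictions reported later, but the 10-minute resolution could be kept instead.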
According to (Data Telecom, 2014), the datasets are spatially aggregated in a grid of square cells, as shown in Figure 1.

Figure 1. “The area of a Milan is composed of a grid overlay, which is 1,000 squares having the size of 235*235 meters.” The grid is projected with the WGS84 (EPSG:4326) standard.

3.2. Methods and Models

This section explains the adopted methods:

• Data visualization: Using the right type of visualization brings insight into the data analysis process. Exploratory Data Analysis (EDA) was carried out in a proper order to study and explain the dataset. The aim of this analysis is to discover the limitations of the data, the patterns in the data, and which variables are unavailable or missing.

Figure 2. Diagrams to show the Exploratory Data Analysis (EDA) relations.

• Clustering: In the data mining field, clustering procedures are important methods [11] because of their strong ability to infer connections among different data objects. Researchers have primarily used them to investigate mobile trace datasets. On datasets acquired from mobile networks, K-means is the most commonly implemented algorithm, and in other works, including [12] and [13], it provides satisfactory results. Clustering techniques follow either a partitional or a hierarchical approach. Hierarchical techniques arrange items into a hierarchical structure, which can be represented visually as a diagram, and hierarchical algorithms can be agglomerative or divisive. In contrast, partitional clustering algorithms, e.g., ISODATA and K-means, directly group objects into a number of categories K. Hierarchical algorithms can also be used to categorize objects into a definite number of categories by stopping the algorithm at the required level. In all cases, there is no fixed rule for determining the number of categories; the decision relies either on clustering quality measures, such as inner-cluster distances, or on knowledge about the data.

• Standardization: Standardizing a vector most often means subtracting a measure of location and dividing by a measure of scale. For example, if the vector contains random values with a Gaussian distribution, one might subtract the mean and divide by the standard deviation, thereby obtaining a “standard normal” random variable with mean 0 and standard deviation 1. Standardizing the Internet traffic before modeling therefore helps prediction.

• Prediction: For mobile operators, prediction is necessary for making decisions associated with network optimization, and it is a part of ML. The ARIMA model is one of the most renowned prediction algorithms, as explained in [14]. It is significant for time series data both statistically and practically. The auto-regressive model is

y_t = C + \sum_{i=1}^{p} \phi_i y_{t-i} + \epsilon_t    (1)

Here y_t and \epsilon_t are respectively the actual value and the random error at time t, \phi_i (i = 1, 2, 3, ..., p) are the model parameters, C is a constant, and the integer p is known as the order of the model [15]. Special cases of the ARIMA model are listed in Table 1.

Table 1. Special cases of the ARIMA model and their parameters.

Model                     ARIMA parameters
White noise               ARIMA(0,0,0)
Random walk               ARIMA(0,1,0) with no constant
Random walk with drift    ARIMA(0,1,0) with constant
Auto-regression           ARIMA(p,0,0)
Moving average            ARIMA(0,0,q)

The RNN is the other adopted model; a multi-layer model based on long short-term memory is referred to as an LSTM. A common LSTM unit is composed of a cell, an input gate, an output gate, and a forget gate. The cell remembers values over arbitrary time intervals, and the three gates regulate the flow of information into and out of the cell [16]. The model consists of memory blocks, and it can be trained with backpropagation. In this model, the vanishing gradient problem is reduced [17].

Figure 3. A standard long short-term memory (LSTM) memory block. The cell output is calculated by multiplying the cell state by the activation of the output gate.

f_t = \sigma(X_t * U_f + H_{t-1} * W_f)    (2)
\tilde{C}_t = \tanh(X_t * U_c + H_{t-1} * W_c)    (3)
I_t = \sigma(X_t * U_i + H_{t-1} * W_i)    (4)
O_t = \sigma(X_t * U_o + H_{t-1} * W_o)    (5)
C_t = f_t * C_{t-1} + I_t * \tilde{C}_t    (6)
H_t = O_t * \tanh(C_t)    (7)

where X_t is the input vector, H_{t-1} is the previous cell output, C_{t-1} is the previous cell memory, H_t is the current cell output, C_t is the current cell memory, and W and U are the weight vectors for the forget gate (f_t), the candidate (\tilde{C}_t), the input gate (I_t), and the output gate (O_t) [18].

Both ARIMA and RNN perform better than other methods for time series prediction [19].

3.3. Analysis of Data and Prediction Process

Our analysis is based on a data-intensive approach, and different machine learning techniques are applied to the CDR datasets because doing so adds value in both business and scientific respects. Three analyses have been performed in our work.

First analysis: This analysis identifies the day with the highest activity and the peak hours within a day. The results were derived from the total activity over time: the peak hours are 9, 10, and 11 AM, whereas 3 AM is not a peak activity hour. This result is beneficial for business and network development because it helps identify which areas need to be developed or require more resources.

Figure 4. Daily activity.
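As an illustration of how such peak hours could be identified, the minimal Python sketch below averages activity by hour of day and ranks the hours. It is a sketch under stated assumptions rather than the procedure used in this work: the function names `hourly_profile` and `peak_hours` and the sample values are hypothetical.

```python
from collections import defaultdict


def hourly_profile(samples):
    """Average activity for each hour of the day (0-23).

    `samples` is an iterable of hypothetical (hour_of_day, activity) pairs,
    for example one pair per cell and per hour of the observation period.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for hour, activity in samples:
        sums[hour] += activity
        counts[hour] += 1
    return {hour: sums[hour] / counts[hour] for hour in sums}


def peak_hours(profile, top_n=3):
    """Return the `top_n` hours with the highest average activity."""
    return sorted(profile, key=profile.get, reverse=True)[:top_n]


# Made-up numbers for illustration only; they are not the paper's data.
samples = [
    (3, 5.0), (3, 7.0),
    (9, 40.0), (9, 44.0),
    (10, 50.0), (10, 54.0),
    (11, 60.0), (11, 58.0),
    (15, 30.0), (15, 34.0),
]
profile = hourly_profile(samples)
print("busiest hours first:", peak_hours(profile))
print("quietest hour:", min(profile, key=profile.get))
```

Using the mean rather than the sum keeps hours comparable when some hours have fewer observations, which is an assumption of this sketch and not a statement about the original analysis.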
The result also helps determine which country code or grid square generates more traffic, so that companies can gain more revenue by targeting customers based on their geo-location. Combined with resource management, it also decreases costs and expenses.

Second analysis: This analysis compares and illustrates the weekly Internet usage in November for three cell IDs representing different categories of area in the city of Milan: a nightlife area, a university area, and a downtown area. The results indicate that the downtown area peaks earlier than the nightlife area and that the volume of phone calls in the university area decreases on weekends. These observations help optimization and resource allocation by showing which area is fully loaded and at what time, and they can help define temporary solutions for different peak hours, such as the deployment of pico cells.

Certain tests were carried out on the dataset to identify and select proper and effective models for time series data. It is essential to discover the trends, seasonality, and stationarity of the data. Residual analysis indicates whether the data are statistically stationary: if the residuals are truly random noise, as in Figure 5, they can be classified as statistically stationary.

Figure 5. Analysis of Residuals

Another testing method is the Dickey-Fuller stationarity test, a quantitative test for residual analysis whose null hypothesis is that the residuals are not statistically stationary. The test statistic is about -7.4, which is below the 1% critical value of about -3.47 (Table 2), so the null hypothesis is rejected and the residuals are confirmed to be statistically stationary.

Table 2. Result of the Dickey-Fuller stationarity test.

Test Statistic                  -7.405407e+00
p-value                          7.367220e-11
# Lags Used                      1.000000e+00
Number of Observations Used      1.660000e+02
Critical Value (1%)             -3.470370e+00
Critical Value (5%)             -2.879114e+00
Critical Value (10%)            -2.576139e+00

Third analysis: In this analysis, three methods are implemented for prediction and modeling based on Internet usage. The first is the ARIMA model, the second is the LSTM, and the third is built on a model that was used in a Kaggle competition. The models were validated on different weeks of data to determine whether modeling a single week is enough to obtain similar results and whether the approach can be applied to datasets collected over different time intervals.

• ARIMA
For the one-week datasets, the applied model is ARIMA(2, 1, 0). Three cell IDs in the central regions are considered first, and the results are shown in Figures 6 and 7. Next, 9998 cells were targeted, as illustrated in Figure 8.

Figure 6. ARIMA Hourly Prediction of Internet Traffic for Cell ID 4456
Figure 7. ARIMA Hourly Prediction of Internet Traffic for Cell ID 5060.
Figure 8. For all cells, Internet Traffic Hourly Prediction using ARIMA

• LSTM
The model has a visible layer with one input, a hidden layer with four LSTM blocks, and an output layer with a single value. The weekly Internet traffic prediction for cell ID 4456 is shown in Figure 9.

Figure 9. For 4456 cell ID, Internet Traffic Hourly Prediction using ARIMA

• Third Prediction Model
This model was used in a Kaggle competition, where it was applied to several periods without additional information. In general, it relies on data that repeat periodically every twenty-four hours, and Internet traffic indeed exhibits sinusoidal behavior, as portrayed in Figure 10. The model is applied to the three areas categorized in our analysis. Prediction results for the nightlife and downtown areas are shown in Figure 11, and those for the university area in Figure 12.

Figure 10. Downtown Area Results of Internet Traffic
Figure 11. Nightlife and Downtown Areas and Internet Traffic Data
Figure 12. Universities Area and Internet Traffic Data

Three models were applied to predict Internet traffic based on hourly and weekly data. The results show that the ARIMA prediction model is accurate for the selected cells with a 3 percent test set and a 70 percent data set. A split of 21 percent test set and 69 percent training set was found to be insufficient for some cell IDs. For the third model, the results indicate that it is accurate and suitable for all the selected datasets, with the university area being an exception. This area still shows some issues, which might be associated with the mobility patterns of the community. The same conclusion as in previous works was obtained for different dataset periods. Thus, it was determined that this model is suitable for all datasets.

The results indicate that applying predictive models and intelligent data analysis to traffic prediction is significant and plays a vital role for mobile operators, and that it will be useful for traffic routing. It can also provide yearly predictions to support network optimization, resource allocation, self-organizing networks, and investment planning.

4. Discussion

This research is dedicated to helping MNOs manage big data and apply ML/DL techniques efficiently in data-driven applications and the telecommunication sector.
Understanding the available data, which analytic tools are suitable and should be implemented, and which type of information should be collected is important for any service provider that wants to obtain the best results from its data. Big data is selected and applied in this work, and it is vital to recognize that machine learning and deep learning techniques contribute significantly to both the industrial and academic sectors and play a significant role in wireless network applications such as network traffic prediction. Using different clustering techniques, it is possible to cluster mobile users based on CDRs and to build a location-based recommendation system. Mining CDRs with these techniques, rather than with existing ones, expands their role beyond financial usage: extracting extensive and important knowledge from this dataset introduces different applications for telecoms:

1. Analyzing CDR data can provide demographic information such as gender and age, where an RNN or CNN can be used to predict these features of mobile users.
2. RNNs are employed to determine metro density from massive CDR data: the trajectory of a customer is represented as a sequence of locations and used as input to an RNN model that handles this sequential data.
3. From the country code information in CDRs, it is possible to predict tourists' locations and create business packages.

This practical work has shown how the telecommunication industry can obtain business and operational benefits by applying big data techniques effectively instead of traditional techniques. Models such as LSTM and ARIMA were applied to predict traffic, and the results were found to be beneficial for the operator's strategic and short-term plans.
For the practical part, the CDR dataset was selected because of its significance for the MNO: our results indicate that CDR analysis is significant, now and in the future, in areas such as investment planning based on network optimization, fault detection, traffic prediction, network optimization, and resource allocation. For future work, we will apply ML/DL techniques to different unlabeled datasets, since most data generated in wireless network systems have these challenging features, which require specific techniques.

Acknowledgement

This research is developed on the basis of a master thesis, “Methods to Efficiently Handle Big Data in 5G Networks” (2018) [footnote 1], within the double degree Erasmus+ program between Higher School of Economics and UAS Technikum Wien. I would like to express my sincere gratitude to my academic supervisors and the professors and lecturers at the Big Data Systems program. Improvements were made during study at Skoltech.

Footnote 1: https://www.hse.ru/en/edu/vkr/219430036

References

[1] Zeng, D., Gu, L. and Guo, S. (2015) Cost minimization for big data processing in geo-distributed data centers. In Cloud Networking for Big Data (Springer), 59–78.
[2] Bi, S., Zhang, R., Ding, Z. and Cui, S. (2015) Wireless communications in the era of big data. IEEE Communications Magazine 53(10): 190–199.
[3] He, Y., Yu, F.R., Zhao, N., Yin, H., Yao, H. and Qiu, R.C. (2016) Big data analytics in mobile cellular networks. IEEE Access 4: 1985–1996.
[4] Zheng, K., Yang, Z., Zhang, K., Chatzimisios, P., Yang, K. and Xiang, W. (2016) Big data-driven optimization for mobile networks toward 5G. IEEE Network 30(1): 44–51.
[5] Imran, A., Zoha, A. and Abu-Dayya, A. (2014) Challenges in 5G: how to empower SON with big data for enabling 5G. IEEE Network 28(6): 27–33.
[6] Boccardi, F., Heath, R.W., Lozano, A., Marzetta, T.L. and Popovski, P. (2014) Five disruptive technology directions for 5G. IEEE Communications Magazine 52(2): 74–80.
[7] Ramaprasath, A., Srinivasan, A. and Lung, C.H. (2015) Performance optimization of big data in mobile networks. In 2015 IEEE 28th Canadian Conference on Electrical and Computer Engineering (CCECE) (IEEE): 1364–1368.
[8] Samulevicius, S., Pedersen, T.B. and Sorensen, T.B. (2015) MOST: Mobile broadband network optimization using planned spatio-temporal events. In 2015 IEEE 81st Vehicular Technology Conference (VTC Spring) (IEEE): 1–5.
[9] Blondel, V.D., Decuyper, A. and Krings, G. (2015) A survey of results on mobile phone datasets analysis. EPJ Data Science 4(1): 10.
[10] Italia, T. (2015) Telecom Italia Big Data Challenge. URL https://dandelion.eu/datamine/open-big-data/.
[11] Xu, R. and Wunsch, D. (2005) Survey of clustering algorithms. IEEE Transactions on Neural Networks 16(3): 645–678.
[12] Soto, V. and Frías-Martínez, E. (2011) Automated land use identification using cell-phone records. In Proceedings of the 3rd ACM International Workshop on MobiArch: 17–22.
[13] Liu, J., Chang, N., Zhang, S. and Lei, Z. (2015) Recognizing and characterizing dynamics of cellular devices in cellular data network through massive data analysis. International Journal of Communication Systems 28(12): 1884–1897.
[14] Zhang, G.P. (2003) Time series forecasting using a hybrid ARIMA and neural network model. Neurocomputing 50: 159–175.
[15] Adhikari, R. and Agrawal, R.K. (2013) An introductory study on time series modeling and forecasting. arXiv preprint arXiv:1302.6613.
[16] Hochreiter, S. and Schmidhuber, J. (1997) Long short-term memory. Neural Computation 9(8): 1735–1780.
[17] Sundermeyer, M., Schlüter, R. and Ney, H. (2012) LSTM neural networks for language modeling. In Thirteenth Annual Conference of the International Speech Communication Association.
[18] Staudemeyer, R.C. and Morris, E.R. (2019) Understanding LSTM: a tutorial into long short-term memory recurrent neural networks. arXiv preprint arXiv:1909.09586.
[19] Ho, S.L., Xie, M. and Goh, T.N. (2002) A comparative study of neural network and Box-Jenkins ARIMA modeling in time series prediction. Computers & Industrial Engineering 42(2-4): 371–375.