Civil and Environmental Science Journal Vol. I, No. 01, pp. 027-033, 2018

Data generation in order to replace lost flow data using Bootstrap method and regression analysis

Gatot Eko Susilo1
1Civil Engineering Dept., Universitas Lampung, Bandar Lampung, 35145, Indonesia
gatot89@yahoo.ca

Received 28-02-2018; revised 23-03-2018; accepted 06-04-2018

Abstract. This paper aims to find a method to generate data to replace lost flow data in the discharge series of Sungai Seputih River, Lampung Province. Bootstrap simulation is used to estimate the discharge data and complete the existing discharge record. Regression analysis is also used to find the pattern of the data distribution. The results show that both methods are able to generate a new series of flow data whose distribution is similar to that of the available field data. The results also show that statistical methods are one way to tackle data limitations caused by missing or unrecorded data. The weakness of data generation using a combination of the Bootstrap method and regression analysis is the disappearance of extreme values from the data series: existing extreme values are modified into ideal values that satisfy certain distributions. Careful analysis is therefore required when using statistical methods, so that the results of the analysis do not deviate from field conditions.

Keywords: data generation, flow data, Bootstrap method, regression analysis.

1. Introduction
Hydrology is the study of the earth's water, including its occurrence, its movement, its distribution, and its relation to the environment and living creatures. An understanding of hydrology is very useful for understanding the concept of water balance at the global scale on the surface of the earth. Hydrological events such as rain and flow are recorded as information referred to as hydrological data.
Almost all water resources development activities require hydrological information for basic planning and design. If the hydrological information used is unsuitable or does not meet the requirements, it may result in incorrect and inaccurate planning and design. A hydrological phenomenon can be interpreted properly only if sufficient data are available. Adequate data collection tools and consistent data collection activities are essential for generating good hydrological data. The most important hydrological data are river flow data. Essentially all water resource planning requires flow data in its calculations. However, since flow recorder stations are not always available on rivers in Indonesia, discharge is often calculated by converting rainfall to discharge. Rainfall data are more widely available in Indonesian watersheds, and rainfall recorder stations are easier to find than flow recorder stations. This is because rain stations are installed not only by the department of public work but also by the department of agriculture and the department of transportation, with various objectives.

Various methods have been developed to convert rainfall into discharge, but the accuracy of each method is still being debated. Calibration and verification of a hydrological model are indispensable for determining the validity of a model or method. The problem is that data for calibration and verification are rarely available in Indonesia. Automated flow recorders and rain gauges have not been evenly distributed across the country. Consequently, most discharge or rainfall data in Indonesia are not reliable enough to be used in water resource planning. In an effort to validate the data, planners perform validation analysis with various statistical methods.
Water resource planners frequently face scarcity of hydrological data. Some of them attempt to generate data to add to or supplement lost data. One of the known data generation methods is Monte Carlo simulation. Monte Carlo simulation draws random numbers from sample data that follow a particular distribution. Its goal is to find a value close to the real value, or the value that will occur, based on the distribution of the sampling data [1]. Monte Carlo simulation uses random numbers to model a system in which time does not play a substantive role. It is carried out by generating artificial data with a pseudo-random number generator. Basically, a Monte Carlo simulation is performed on the basis of a particular sampling distribution, so the key is to identify the distribution of the existing sample data. Numbers are then simulated at random until the combination that best fits the distribution is found. Monte Carlo simulations have been used to generate rainfall data in stochastic hydrological modelling studies in the Czech Republic [2]. The method has also been used to quantify drainage discharge components in a stochastic drainage discharge model in South Africa [3].

In addition to Monte Carlo simulation, the Bootstrap method is often used. The Bootstrap method estimates the parameters of a population from statistics obtained from a sample of that population. It is often used because it does not rely on particular distribution assumptions: Bootstrap can work without distribution assumptions because the original sample is treated as the population [4]. Bootstrap was first introduced by Efron in 1979. It is a method based on data simulation for statistical inference purposes [5]. The Bootstrap method is performed by random re-sampling with replacement.
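The re-sampling with replacement described above can be illustrated with a minimal Python sketch. The flow values, the function name `bootstrap_means`, the choice of the mean as the statistic, and the number of re-samples are illustrative assumptions, not values or procedures taken from the paper.

```python
import random
import statistics


def bootstrap_means(sample, n_resamples=1000, seed=0):
    """Return the mean of each bootstrap re-sample of `sample`.

    Each re-sample has the same size as the original sample and is drawn
    with replacement, so a value may appear more than once.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(sample)
    means = []
    for _ in range(n_resamples):
        resample = [rng.choice(sample) for _ in range(n)]
        means.append(statistics.mean(resample))
    return means


# Illustrative monthly flows in m3/s (hypothetical input for the sketch).
sample = [4.85, 4.03, 5.23, 3.34, 3.21, 4.58]
means = bootstrap_means(sample)

# The spread of the bootstrap means indicates the uncertainty of the sample mean.
print(round(statistics.mean(means), 2), round(statistics.stdev(means), 2))
```

Because every re-sample is drawn from the observed values only, each bootstrap mean stays between the smallest and largest observation.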
Some sources state that the bootstrap sample size (d) may be smaller or larger than the sample size (n). However, parameters are estimated most effectively when the bootstrap sample size equals the size of the sample data. The Bootstrap method re-samples the sample data with replacement to compute the statistics of a sample, in the expectation that the sample represents the actual population. Re-sampling is usually repeated thousands of times to represent the population data. The method works well for relatively small samples [6]. The Bootstrap method has been used to quantify uncertainty in sediment loads in Germany [7]. Earlier, it was used to estimate the uncertainties related to sample size in a study estimating the future discharge of the Rhine River [8]. Most recently, it was used in China to analyze the influence of rainfall spatial uncertainty on hydrological simulations [9].

This paper aims to generate data to replace lost flow data in the discharge series of Sungai Seputih River, Lampung Province. The resulting flow data will be used to calculate the availability of water in the Way Seputih River for irrigation purposes in the Seputih Irrigation Area. Because of the lack of data, Bootstrap simulation is used to estimate the discharge data and complete the existing discharge record. Regression analysis is also used to find the pattern of the data distribution.

2. Material and Methods
In this research, the Bootstrap method is used to supplement lost discharge data in order to calculate the dependable flow used in allocating irrigation water from the Pengubuan River. The river is located in the Central Lampung region, Indonesia. The river currently supplies water to the Way Pengubuan Irrigation Area.
The irrigation area covers 5,000 ha of potential and 3,500 ha of functional paddy field. The calculation of irrigation water allocation balances water availability against water requirement in the irrigation area. To calculate irrigation water availability, a dependable flow with 80% reliability is calculated. This calculation requires monthly values of daily average discharge covering 10 years. In fact, daily average discharge data at Way Pengubuan Dam are available only for 2011 to 2017, and even this record is incomplete because some data are missing or undocumented. The available discharge data are shown in Table 1.

Table 1. Existing flow data (in m³/s) of Way Pengubuan River at Way Pengubuan Dam.
Month  2011  2012  2013  2014  2015  2016  2017
Jan    -     4.85  4.03  5.23  3.34  3.21  4.58
Feb    -     5.04  5.06  4.36  5.11  4.35  3.77
Mar    -     3.86  5.05  3.16  4.27  5.69  5.40
Apr    -     4.39  4.99  4.41  3.67  5.87  4.76
May    -     3.17  5.41  4.96  3.92  5.37  -
Jun    -     -     -     -     -     -     -
Jul    -     1.82  3.00  2.45  0.83  2.11  -
Aug    -     -     -     -     -     -     -
Sep    -     -     -     -     -     -     -
Oct    -     -     -     -     -     -     -
Nov    -     -     -     -     -     -     -
Dec    3.60  3.46  3.78  4.15  1.13  4.69  -
Source: BBWS Mesuji Sekampung (2018)

Calculating the dependable flow requires a flow record of at least 10 years. To complete the missing data, the Bootstrap method is used to generate data. The data generation procedure with the Bootstrap method is implemented as follows:
• Calculating the maximum and minimum values of the monthly data over the years of record. This step yields the maximum and minimum values of the January, February, March, April, May, July, and December data for 2011 to 2017.
• Generating data using the Bootstrap method to complete the flow data of January, February, March, April, May, July, and December for the years 2011, 2017, 2018, 2019, and 2020.
• Generating data using the Bootstrap method to complete the June flow data for 2011 to 2020. Each June value is a random value between the May and July values.
• Completing the flow data of September, October, and November for 2011 to 2020 by regression analysis. A regression equation is first formed from the data of each year, and the shape of the regression curve is then modified to find an ideal equation for the data of that year. The final regression equation is used to generate the new data series.

Once the new data series is obtained, the dependable flow with 80% reliability is calculated using the Weibull probability equation [10]. The probability of each monthly value of daily average discharge is calculated with the following formula:

$$P = \frac{m}{n+1} \tag{1}$$

where P is the probability that expresses the reliability percentage of the data, m is the rank of the value after sorting from maximum to minimum, and n is the number of data.

3. Result and Discussion
The result of the first step of the procedure is given in Table 2. It shows the maximum and minimum values of the January, February, March, April, May, July, and December data for 2011 to 2017.

Table 2. Maximum and minimum flow data (in m³/s) of Way Pengubuan River
Month  Jan   Feb   Mar   Apr   May   Jun  Jul   Aug  Sep  Oct  Nov  Dec
Max    5.23  5.11  5.69  5.87  5.41  -    3.00  -    -    -    -    4.69
Min    3.21  3.77  3.16  3.67  3.17  -    0.83  -    -    -    -    1.13
Source: Calculation

The Bootstrap method is then used to generate data and complete the flow data of January, February, March, April, May, July, and December for the years 2011, 2017, 2018, 2019, and 2020. The value for a particular month is a random value between the maximum and minimum values of the corresponding month.
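The generation rule stated above (a random value between the monthly minimum and maximum, and a June value between the May and July values) can be sketched in Python as follows. The bounds are taken from Table 2; the uniform distribution, the use of May and July values from the same generated year, the seed, and the names `MONTHLY_RANGE` and `generate_year` are assumptions made for illustration, since the paper does not specify them.

```python
import random

# (minimum, maximum) flow in m3/s for each month with observations (Table 2).
MONTHLY_RANGE = {
    "Jan": (3.21, 5.23), "Feb": (3.77, 5.11), "Mar": (3.16, 5.69),
    "Apr": (3.67, 5.87), "May": (3.17, 5.41), "Jul": (0.83, 3.00),
    "Dec": (1.13, 4.69),
}


def generate_year(rng):
    """Generate one year of monthly flows for the months listed in MONTHLY_RANGE.

    Assumption: values are drawn uniformly between the monthly bounds.
    June is then drawn between the generated May and July values
    (assumed here to be those of the same year).
    """
    flows = {
        month: round(rng.uniform(low, high), 2)
        for month, (low, high) in MONTHLY_RANGE.items()
    }
    low, high = sorted((flows["May"], flows["Jul"]))
    flows["Jun"] = round(rng.uniform(low, high), 2)
    return flows


rng = random.Random(42)  # fixed seed so the sketch is repeatable
generated = {year: generate_year(rng) for year in (2018, 2019, 2020)}

for year, flows in generated.items():
    print(year, flows)
```

Drawing from a bounded range in this way cannot produce values outside the observed extremes, which is consistent with the loss of extreme values discussed in the paper.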
A second generation step using the Bootstrap method completes the June flow data for 2011 to 2020. Each June value is a random value between the May and July values. The results are given in Table 3.

Table 3. Result of data generation using Bootstrap method
Month  2011  2012  2013  2014  2015  2016  2017  2018  2019  2020
Jan    4.16  4.85  4.03  5.23  3.34  3.21  4.58  4.70  3.65  4.00
Feb    4.75  5.04  5.06  4.36  5.11  4.35  3.77  5.29  4.26  3.66
Mar    5.42  3.86  5.05  3.16  4.27  5.69  5.40  5.17  4.33  3.61
Apr    5.34  4.39  4.99  4.41  3.67  5.87  4.76  4.51  3.99  3.82
May    4.31  3.17  5.41  4.96  3.92  5.37  5.05  3.50  3.39  4.10
Jun    3.50  2.97  3.95  4.66  1.08  4.34  3.27  2.34  2.66  4.26
Jul    5.00  3.64  6.00  4.89  1.66  4.22  2.50  1.20  1.94  4.18
Aug    -     -     -     -     0.89  2.81  -     0.27  1.37  3.90
Sep    -     -     -     -     -     -     -     0.00  1.08  3.69
Oct    -     -     -     -     -     -     -     0.00  1.22  3.80
Nov    -     -     -     -     -     2.76  -     0.62  1.92  4.17
Dec    3.60  3.46  3.78  4.15  1.13  4.69  3.60  2.40  3.32  4.13
Source: Calculation

Regression analysis is undertaken to form an ideal shape of the data distribution. An example of a regression curve formed by regression analysis is given for the year 2011 in Figure 1. The curve of the actual data is modified into a new curve formed by regression analysis. Using the equation of the new regression curve, a new data series is generated. This series is finally used as the data series for the dependable flow calculation.

Figure 1. Curve of actual data (black line and black dots) and regression curve formed by regression analysis (dashed line) for year 2011 data. Axes: month (1 to 12) against flow (m³/s). Fitted equation shown in the figure: y = 0.03x^3 - 0.567x^2 + 2.559x + 1.985, R² = 1.

Using the same procedure, the regression equation formed for each year is given in Table 4.

Table 4. Regression equations formed by modified data
Year  Regression equation formed
2011  y = 0.03x^3 - 0.567x^2 + 2.559x + 1.985
2012  y = 0.015x^3 - 0.239x^2 + 0.579x + 4.483
2013  y = 0.025x^3 - 0.484x^2 + 2.317x + 2.097
2014  y = 0.011x^3 - 0.181x^2 + 0.4x + 4.784
2015  y = 0.035x^3 - 0.646x^2 + 2.666x + 1.494
2016  y = 0.046x^3 - 0.893x^2 + 4.465x - 0.786
2017  y = 0.025x^3 - 0.464x^2 + 2.053x + 2.42
2018  y = 0.031x^3 - 0.545x^2 + 2.009x + 3.206
2019  y = 0.023x^3 - 0.409x^2 + 1.676x + 2.36
2020  y = 0.022x^3 - 0.411x^2 + 1.66x + 2.343
Source: Calculation

Using the equations above, the final generated data are obtained. The dependable flow is determined by sorting the monthly values of daily average flow, and the Weibull probability equation is used to calculate the reliability of each ranked value. The sorted series and the corresponding Weibull probabilities are given in Table 5 below.

Table 5. Serial flow data based on regression curve
m   Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov   Dec   P
1   5.01  5.29  5.81  5.73  4.96  4.26  4.18  3.90  3.69  3.80  4.17  4.13  0.09
2   4.84  5.08  5.37  5.22  4.71  3.98  3.18  2.46  1.97  1.87  2.30  3.69  0.18
3   4.70  5.00  5.37  5.07  4.36  3.79  2.63  2.03  1.74  1.68  2.13  3.44  0.27
4   4.03  4.95  5.17  4.81  4.21  3.43  2.49  1.95  1.54  1.55  1.92  3.41  0.36
5   4.01  4.94  5.08  4.51  3.69  3.41  2.49  1.53  1.27  1.37  1.92  3.32  0.45
6   3.81  4.87  4.65  4.19  3.63  3.04  2.41  1.50  1.08  1.22  1.90  2.94  0.55
7   3.96  4.81  4.62  4.06  3.50  2.66  1.97  1.37  0.96  0.88  1.50  2.89  0.64
8   3.65  4.52  4.47  3.99  3.39  2.59  1.94  1.33  0.60  0.56  1.46  2.53  0.73
9   3.55  4.26  4.33  3.94  3.28  2.34  1.20  0.27  0.00  0.00  0.62  2.40  0.82
10  2.83  3.66  3.61  3.85  3.05  1.79  0.51  0.00  0.00  0.00  0.00  0.94  0.91
Source: Calculation

The dependable flow with 80% reliability for each month, obtained by interpolation, is presented in Figure 2. The resulting dependable flow illustrates the pattern of rainy and dry seasons occurring in the study area.
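The ranking and interpolation described above can be sketched in Python using the Weibull plotting position P = m/(n+1) from Equation (1). The input is the January column of Table 5; linear interpolation between the two neighbouring ranks is an assumption, since the paper does not state which interpolation scheme was used, and the function names are hypothetical.

```python
def weibull_probabilities(n):
    """Weibull plotting positions P = m / (n + 1) for ranks m = 1..n."""
    return [m / (n + 1) for m in range(1, n + 1)]


def dependable_flow(flows, reliability=0.80):
    """Flow that is equalled or exceeded with the given reliability.

    Values are ranked from maximum to minimum, assigned Weibull
    probabilities, and linearly interpolated at the requested
    reliability (the interpolation scheme is an assumption).
    """
    ranked = sorted(flows, reverse=True)
    probs = weibull_probabilities(len(ranked))
    if reliability <= probs[0]:
        return ranked[0]
    if reliability >= probs[-1]:
        return ranked[-1]
    for i in range(1, len(ranked)):
        if probs[i] >= reliability:
            p0, p1 = probs[i - 1], probs[i]
            q0, q1 = ranked[i - 1], ranked[i]
            return q0 + (reliability - p0) * (q1 - q0) / (p1 - p0)


# January column of Table 5, in m3/s.
january = [5.01, 4.84, 4.70, 4.03, 4.01, 3.81, 3.96, 3.65, 3.55, 2.83]
q80 = dependable_flow(january, reliability=0.80)
print(f"80% dependable flow for January: {q80:.2f} m3/s")
```

With ten values, the 80% reliability level falls between ranks 8 and 9 (P = 8/11 and P = 9/11), so the result lies between the eighth and ninth largest flows. A different interpolation scheme would give slightly different values.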
The generated data can therefore be used statistically as input for calculating water allocation in the area concerned. For irrigation purposes, the dependable flow is usually calculated for periods of 15 days. This can be done by taking two random numbers whose average equals the value of the corresponding month. For example, to obtain the dependable flow for the periods January I and January II, two random numbers are taken whose average is the January dependable flow value.

The weakness of data generation with a combination of the Bootstrap method and regression analysis is that the extreme values of the distribution are diminished. Existing extreme values are modified into ideal values that satisfy certain distributions. The influence of climate anomalies such as El Nino must be watched closely, because extreme minimum values sometimes occur in such years, and extreme minimum values affect the dependable flow calculation. Negative numbers sometimes appear in the generated data. If a negative value appears in the data series, it must be replaced with zero, because physically there is no negative flow.

Figure 2. Dependable flow with 80% reliability of Pengubuan River. Axes: month (Jan to Dec) against discharge (m³/s). Data labels shown in the figure, Jan to Dec: 3.58, 4.39, 4.41, 3.98, 3.32, 2.46, 1.38, 0.57, 0.19, 0.14, 0.83, 2.46.

Basically, the best way to calculate dependable flow is to collect as much historical data as possible. Obtaining long and complete flow records, however, is not easy in Indonesia. Statistical analysis is therefore probably the best way to process a small amount of data, but it should be accompanied by empirical analysis to test the accuracy of the preceding analysis.

4. Conclusions
Data generation to replace lost flow data in the discharge series of Sungai Seputih River, Lampung Province has been analyzed. The results show that the combination of the Bootstrap method and regression analysis is able to generate new data whose distribution is similar to that of the available field data. Complete data are a key requirement in water resource planning, and statistical methods can be used to address the problem of data availability. However, careful analysis is required when using statistical methods so that the results of the analysis do not deviate from field conditions.

Acknowledgements
The author would like to express his deep gratitude to Mrs. Eka Desmawati and Mr. Ankavisi Nalaralagi from BBWS Mesuji Sekampung for their support of this research, especially in providing hydrological data.

References
[1] Huang, H. 2018. Monte Carlo simulation using Excel (in Bahasa Indonesia). Globalstats Academic Publication. http://www.globalstatistik.com/simulasi-monte-carlo-dengan-excel/. March 19th (19:05).
[2] Březková, L., Starý, S. and Doležal, P. The Real-time Stochastic Flow Forecast. Soil & Water Res., 5(2): 49–57.
[3] Flores, G. 2015. A stochastic model for sewer base flows using Monte Carlo simulation. Master thesis, Stellenbosch University, South Africa.
[4] Sungkono, J. 2015. Bootstrap re-sampling observation on estimation parameter regression using software R (in Bahasa Indonesia). Magistra No. 92 Year XXVII.
[5] Efron, B. and Tibshirani, R. J. 1993. An introduction to the Bootstrap. Chapman and Hall, New York.
[6] Monalisa, A. 2016. Use of Bootstrap resampling method for simulation data time series model using ARIMA. Undergraduate thesis (in Bahasa Indonesia), University of Jember.
[7] Slaets, J. I. F., Piepho, H., Schmitter, P., Hilger, T. and Cadisch, G. 2017. Quantifying uncertainty on sediment loads using Bootstrap confidence intervals. Hydrol. Earth Syst. Sci., 21: 571–588.
[8] Lenderink, G., Buishand, A. and Deursen, W. 2007. Estimates of future discharge of the river Rhine using two scenario methodologies: direct versus delta approach. Hydrol. Earth Syst. Sci., 11(3): 1145–1159.
[9] Zhang, A., Shi, H., Li, T. and Fu, X. 2018. Analysis of the influence of rainfall spatial uncertainty on hydrological simulations using the Bootstrap method. Atmosphere 9(71): 2–24.
[10] Weibull, W. 1951. A statistical distribution function of wide applicability. J. Appl. Mech.-Trans. ASME, 18(3): 293–297.