INTERNATIONAL JOURNAL OF COMPUTERS COMMUNICATIONS & CONTROL
Online ISSN 1841-9844, ISSN-L 1841-9836, Volume: 15, Issue: 2, Month: April, Year: 2020
Article Number: 3868, doi.org/10.15837/ijccc.2020.2.3868
CCC Publications

Weighted Random Search for CNN Hyperparameter Optimization

R. Andonie, A.-C. Florea

Razvan Andonie
1. Department of Computer Science, Central Washington University, USA
2. Department of Electronics and Computers, Transilvania University of Braşov, Romania
*Corresponding author: andonie@cwu.edu

Adrian-Cătălin Florea
Department of Electronics and Computers, Transilvania University of Braşov, Romania
acflorea@unitbv.ro

Abstract

Nearly all machine learning algorithms use two different sets of parameters: the training parameters and the meta-parameters (hyperparameters). While the training parameters are learned during the training phase, the values of the hyperparameters have to be specified before learning starts. For a given dataset, we would like to find the optimal combination of hyperparameter values in a reasonable amount of time. This is a challenging task because of its computational complexity. In previous work [11], we introduced the Weighted Random Search (WRS) method, a combination of Random Search (RS) and a probabilistic greedy heuristic. In the current paper, we compare the WRS method with several state-of-the-art hyperparameter optimization methods with respect to Convolutional Neural Network (CNN) hyperparameter optimization. The criterion is the classification accuracy achieved within the same number of tested combinations of hyperparameter values. According to our experiments, the WRS algorithm outperforms the other methods.

Keywords: Hyperparameter optimization, supervised learning, random search, convolutional neural networks.

1 Introduction

Machine learning models usually have two types of parameters: the training parameters and the hyperparameters. The training parameters are learned during the training phase, whereas the values of the hyperparameters have to be specified before learning starts. For instance, the hyperparameters of neural networks typically specify the architecture of the network (number and type of layers, number and type of nodes, etc.). For a given dataset, we would like to find the optimal combination of hyperparameter values in a reasonable amount of time. This process is called hyperparameter optimization.

Formally, the aim of hyperparameter optimization is to find the hyperparameters of a given model that return the best performance, as measured on a validation set. This optimization can be represented as [3]:

    x* = arg min_{x ∈ ℵ} f(x),    (1)

where f(x) is the objective function to minimize (such as RMSE or error rate), evaluated on the validation set; x* is the set of hyperparameters that yields the lowest value of the score, and x can take on any value in the domain ℵ. In simple terms, we want to find the model hyperparameters that yield the best score on the validation set metric. In more detail, Eq. (1) can be written as:

    x* = arg min_{x ∈ ℵ} f(x, γ*; S_validation)    (2)

Eq. (2) includes an inner optimization used to find γ*, the optimal value of γ for the current x value:

    γ* = arg min_{γ ∈ Γ} f(x, γ; S_train)    (3)

where S_validation and S_train denote the validation and training datasets, respectively; γ is the set of learned parameters in the domain Γ, obtained through minimization of the training error.
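To make the nested structure of Eqs. (1)-(3) concrete, the following minimal Python sketch enumerates a small finite domain ℵ. The objective callable is assumed to perform the inner optimization of Eq. (3) (training the model) and to return the validation score f; all names here are illustrative and are not part of the WRS implementation.

import itertools

def optimize_hyperparameters(objective, domain):
    """Outer problem of Eqs. (1)-(2): return the hyperparameter combination x*
    with the lowest validation score. `objective(x)` is assumed to solve the
    inner problem of Eq. (3), i.e., train the model (fit gamma) on S_train and
    evaluate f on S_validation."""
    best_x, best_score = None, float("inf")
    names = list(domain)
    for values in itertools.product(*(domain[n] for n in names)):
        x = dict(zip(names, values))
        score = objective(x)          # one trial: re-train + validate
        if score < best_score:
            best_x, best_score = x, score
    return best_x, best_score

# Toy usage with a synthetic objective; a real objective would re-train the model.
domain = {"learning_rate": [0.1, 0.01, 0.001], "batch_size": [32, 64, 128]}
x_star, f_star = optimize_hyperparameters(lambda x: x["learning_rate"] + 1.0 / x["batch_size"], domain)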
Eq. (1) refers to a "validation set", without further details. We have to evaluate the expectation of the score over an unknown distribution, and we usually approximate this expectation using a three-way split, dividing the dataset into a training, validation, and test dataset. Having a training-validation pair for hyperparameter tuning allows us to keep the test set "independent" for model evaluation. We will use a three-way split in our implementations and experiments, but we will omit such details in our presentation, since they are standard.

The validation process can also be generalized at the model level. In this case, at the meta-level we have several possible models (e.g., feedforward neural network, random forest, support vector machine, etc.), each with its own set of hyperparameters. The process of finding the best-performing model from a set of models that were produced by different hyperparameter settings is called model selection [32]. In this case, we search for the best model and, within the model, for the best combination of hyperparameter values. Many optimizers attempt to optimize both the choice of the model and the hyperparameters of the model: Auto-WEKA, Hyperopt-sklearn, AutoML, auto-sklearn, etc. For simplicity, we will consider here only the optimization of hyperparameters within the framework of the same model.

Evaluating the objective function to obtain the score can be computationally very expensive. Each time we try different hyperparameters, we typically re-train the model, make predictions on the validation data, and then calculate the validation metric. We call this re-training + evaluation for one combination of hyperparameter values a trial.

Lately, there has been high interest in the area of hyperparameter optimization, related to the importance and complexity of deep learning architectures. The challenge is that in deep learning we typically deal with large training sets. In addition, we may also encounter high-dimensional search spaces, which, according to the curse of dimensionality principle, have to be covered by an exponentially increasing number of points (in this case, combinations of hyperparameters). Therefore, deep learning hyperparameter optimization algorithms are computationally intensive. For instance, optimizing a deep learning architecture for the CIFAR-10 and ImageNet datasets required 2,000 GPU days of reinforcement learning [43] or 3,150 GPU days of evolution [33].

The performance of a hyperparameter optimization method is determined by several factors:

• F1. The execution time of each trial. This time depends on the size and the dimensionality of the training set.

• F2. The number of evaluated combinations of hyperparameters (the number of trials).

• F3. The performance of the search procedure. Within the same number of trials, different optimization methods can achieve different scores, depending on how "smart" they are. This depends on which combinations of hyperparameter values are tested and in which order.

We restrict the discussion to F3; our focus is to increase the score of a classifier within the same computational budget.
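For concreteness, the three-way split and a single trial can be sketched as follows; DummyModel and the helper names are placeholders introduced here for illustration only, not part of our implementation.

import random

class DummyModel:
    """Placeholder model used only to make the sketch runnable; a real
    experiment would train an actual classifier here."""
    def __init__(self, **hyperparameters):
        self.h = hyperparameters
    def fit(self, train_set):
        pass  # training parameters would be learned here
    def accuracy(self, dataset):
        return random.random()  # stand-in for a real validation score

def three_way_split(data, val_frac=0.2, test_frac=0.2, seed=0):
    """Split a dataset into training, validation, and test parts; the test part
    is kept aside and only used for the final model evaluation."""
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    n_test, n_val = int(len(shuffled) * test_frac), int(len(shuffled) * val_frac)
    return shuffled[n_test + n_val:], shuffled[n_test:n_test + n_val], shuffled[:n_test]

def trial(hyperparameters, train_set, validation_set):
    """One trial: re-train the model for this hyperparameter combination and
    return its score on the validation set."""
    model = DummyModel(**hyperparameters)
    model.fit(train_set)
    return model.accuracy(validation_set)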
In [11] we introduced the WRS method, an improved RS method that can be used for hyperparameter optimization. Instead of a blind RS search, the WRS method uses information from previous trials to guide the search process toward the next, more promising trials. WRS combines RS with a probabilistic greedy heuristic. We formally proved that the WRS method finds the global optimum faster than RS: on average, WRS converges faster than RS. We also proved that WRS is convergent. Experimentally, for the same number of trials, the WRS algorithm produced significantly better results than RS on the CIFAR-10 dataset (http://www.cs.toronto.edu/~kriz/cifar.html). So far, the WRS method has not been compared with other optimization methods, except RS.

In the current contribution, we compare the WRS method with several state-of-the-art hyperparameter optimization methods. Since CNNs are presently by far the most popular deep neural architecture, our experiments focus on this architecture. We compare the classification accuracy achieved within the same number of trials. As a result, the WRS optimization algorithm proves to be competitive, even when compared with some of the best known methods.

The paper proceeds as follows. Section 2 presents the state of the art of CNN hyperparameter optimization methods, whereas Section 3 is a brief overview of the current software implementing these methods. Section 4 describes in an intuitive way the WRS algorithm. Our main contribution is Section 5, where we present experimental results, comparing WRS with several state-of-the-art methods. The paper is then concluded in Section 6.

2 Related work: Methods for CNN Hyperparameter Optimization

To set the stage for WRS, we first introduce the currently most used methods for hyperparameter optimization, with a focus on CNN architectures. An overview of fundamental methods for hyperparameter optimization can be found, for instance, in [4, 21]. The best known methods are: Grid Search (GS), Random Search (RS), Bayesian Optimization (BO), Nelder-Mead (NM), Simulated Annealing (SA), Particle Swarm Optimization, and Evolutionary Algorithms. Until recently, the most commonly used hyperparameter optimization strategy was a combination of GS and manual tuning. More complex techniques are now largely available in a variety of software packages.

RS consists of drawing samples from the parameter space following a particular distribution for each of the parameters. Using the same number of trials, RS generally yields better results than GS or more complicated hyperparameter optimization methods. Especially in higher-dimensional spaces, the computation resources required by RS methods are significantly lower than for GS [22]. Recent attempts to optimize the RS algorithm are: Li et al.'s Hyperband [23], Domhan et al. [9], and Florea et al. [10], where we introduced a dynamically computed stopping criterion for RS, reducing the number of trials without reducing the generalization performance.

By evaluating hyperparameters that appear more promising based on past results, Bayesian methods can find better model settings than RS or GS in fewer iterations. A review of BO can be found in [35]. Sequential Model-Based Optimization (SMBO) methods are a formalization of BO. The sequential optimization refers to running trials one after another, each time trying better hyperparameters and updating a probability model (a surrogate). There are several variants of SMBO methods, based on Gaussian Processes, Random Forest Regression, Tree-structured Parzen Estimators (TPE), etc.

In low-dimensional problems with numerical hyperparameters, the best available hyperparameter optimization methods use BO [36]. However, BO is restricted to problems of moderate dimension [35]. In current CNN architectures, the number of hyperparameters has increased significantly.
These hyperparameters include: number of layers, filter size, number of feature maps, stride, pooling regions and pooling sizes, and the number of units in each fully-connected layer. The number of these hyperparameters can be in the order of tens or even hundreds. Not only must a hyperparameter optimization algorithm optimize over variables which are discrete, ordinal, and continuous, but it must simultaneously choose which variables to optimize, a difficult task. Currently, no work covers the optimization of every hyperparameter in designing a CNN architecture [1]. In the following, we briefly refer to some recent CNN hyperparameter optimization results.

SMBO was recently used for CNN hyperparameter optimization [39]. By learning the hyperparameter response function, SMBO can pick the next promising hyperparameter combination. In [36], BO with Gaussian Processes is used to optimize nine hyperparameters of a CNN, including the learning rate, the number of epochs, the four weights of the CNN layers, and the parameters of response contrast normalization (width, scale, and power). Another approach of this type is [16]. Evolutionary approaches for CNN optimization can be found in [28, 41].

In high-dimensional hyperparameter optimization, tree-based models are presently considered more successful: TPE [4] and Random Forests [14]. Very recent deep learning hyperparameter optimization techniques are [17], [25], and [9]. These techniques attempt to reduce as much as possible the number of evaluated combinations of hyperparameter values. However, for large training sets, even if the number of evaluated combinations is reduced, the total training time still increases considerably, becoming the complexity bottleneck.

Hinz et al. applied several algorithms to optimize the architecture of CNNs for image inputs with increasing resolution [13]. That way, they conceptually used the same input data, but in different input dimensions. They found that the same hyperparameters are important for a given dataset, independent of the image resolution. Furthermore, the optimal value for most hyperparameters also seems to be independent of the image resolution. Consequently, they proposed an approach to speed up the hyperparameter optimization procedure by starting on smaller images and increasing the resolution during the optimization procedure.

Albelwi et al. recently proposed a general method for CNN hyperparameter optimization [2]. Their approach is based on a) a preprocessing step, in which they reduce the training set using instance selection, and b) the NM optimization method. For the CIFAR-10, CIFAR-100, and MNIST datasets, the following seven hyperparameters (each with its corresponding range of possible values) were considered: depth, number of fully-connected layers, number of filters, kernel sizes, number of pooling layers, pooling region sizes, and number of neurons in fully-connected layers.

Other recent deep learning hyperparameter optimization techniques are:

• Deterministic RBF surrogates [16].

• Covariance Matrix Adaptation Evolution Strategy [25].

• Acceleration of hyperparameter optimization by the extrapolation of learning curves [9].

• CoDeepNEAT [29], an evolutionary approach for hyperparameter optimization.

• Relaxation of the search space, which allows a gradient-based optimization [26], [24], and [40].

Companies today invest massively in hyperparameter optimization for deep learning.
A recent result from Google Brain [43] employs reinforcement learning to train a Long Short-Term Memory (LSTM) architecture generator that produces descriptions of new deep learning architectures. The LSTM is trained via a policy gradient method to maximize the expected performance of the generated architectures. The paper explores this method to generate CNNs. Another recent paper from Google Brain [34] employs an evolutionary algorithm using mutation operators that are inspired by "rules of thumb" coming from various deep learning papers. The evolutionary algorithm uses repeated pairwise competitions of random individuals and selects the better-performing individual from each pair. These results show that Google is also looking into unconventional approaches, relying on its impressive computing capabilities.

3 Software for Hyperparameter Optimization

Currently, there are many software packages which include hyperparameter optimization methods. The user can choose between several methods (sometimes called solvers), or continue with the default one. Popular packages in this category are [3]:

• LIBSVM [8], BayesianOptimization (https://github.com/fmfn/BayesianOptimization), Spearmint (https://github.com/HIPS/Spearmint), and pyGPGO (https://github.com/hawk31/pyGPGO).

• Hyperopt-sklearn, which provides automatic algorithm configuration of the scikit-learn library. It can be used for both model selection and hyperparameter optimization. Hyperopt-sklearn has the following implemented solvers [6]: RS, SA, and TPE.

• Optunity (http://optunity.readthedocs.io/en/latest/), a Python library containing various optimizers for hyperparameter tuning: GS, RS, PSO, NM, Covariance Matrix Adaptation Evolutionary Strategy, TPE, and Sobol sequences.

Following the idea of Auto-WEKA [19], several automated machine learning (AutoML) tools were recently developed. Their ultimate goal is to automate the end-to-end process of applying machine learning to real-world problems, even for people with no major expertise in this field. Examples of AutoML tools are: MLBox, auto-sklearn, Tree-Based Pipeline Optimization Tool, H2O, AutoKeras, and TransmogrifAI. Some top companies offer commercial cloud-based AutoML services:

• Google Cloud AutoML (https://cloud.google.com/automl/), which enables developers with limited machine learning expertise to train high-quality models specific to their business needs.

• Microsoft Azure ML (https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/tune-model-hyperparameters), which can be used to streamline the building, training, and deployment of machine learning models.

• Amazon SageMaker Automatic Model Tuning (https://aws.amazon.com/sagemaker/), which can launch multiple training jobs, with different hyperparameter combinations, based on the results of completed training jobs.

In our experiments, we use the Optunity and Hyperopt packages.

4 The WRS Method

This section describes the WRS algorithm and offers some insight into the intuition behind the proposed approach. Full details can be found in [11]. The code is publicly available on GitHub (https://github.com/acflorea/goptim/tree/keras).

In the RS approach, hyperparameter optimization translates into the optimization of an objective function F of d variables by generating random values for its parameters and evaluating the function for each of these values [5]. The function computes some quality measure or score of the model (e.g., accuracy), and the variables correspond to the hyperparameters. The assumption is that we want to maximize F by executing a given number N of trials (steps). RS generates random combinations of hyperparameter values at each iteration k, by generating independent random values for each of the d dimensions, each in its own range of values, according to a probability distribution P_i (i = 1, ..., d), known beforehand.
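As a minimal sketch (not the authors' implementation), plain RS under this formulation can be written as follows; the distributions P_i are represented by simple sampler functions, and the objective F is a stand-in for the actual train + evaluate trial.

import random

def random_search(F, samplers, n_trials):
    """Plain RS: at every iteration draw a fresh, independent value for each of
    the d dimensions from its distribution P_i, then evaluate the objective F.
    `samplers` maps each dimension name to a function returning one random value."""
    best_x, best_score = None, float("-inf")
    for _ in range(n_trials):
        x = {name: sample() for name, sample in samplers.items()}
        score = F(x)                  # one trial: train + evaluate the model
        if score > best_score:        # we maximize F (e.g., accuracy)
            best_x, best_score = x, score
    return best_x, best_score

# Toy usage with a synthetic objective; in practice F re-trains the CNN.
samplers = {
    "num_conv_layers": lambda: random.choice([3, 4, 5, 6]),
    "filters_1": lambda: random.randint(100, 1024),
}
best, score = random_search(lambda x: -abs(x["filters_1"] - 512), samplers, n_trials=50)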
WRS generates the values for each dimension using a probability of change p_i (i = 1, ..., d): the current best value for dimension i is used with probability 1 - p_i. If (x_1^{k+1}, x_2^{k+1}, ..., x_{d-1}^{k+1}, x_d^{k+1}) is the combination of hyperparameter values evaluated at step k + 1 in the WRS algorithm, we define the random variable X^{k+1} with the values below and probabilities 1 = p_1 > p_2 > ... > p_{d-1} > p_d:

    X^{k+1} =
        (x_1^{k+1}, x_2^{k+1}, ..., x_{d-1}^{k+1}, x_d^{k+1})   with probability p_d
        (x_1^{k+1}, x_2^{k+1}, ..., x_{d-1}^{k+1}, x_d^{k})     with probability p_{d-1} - p_d
        ...
        (x_1^{k+1}, x_2^{k+1}, ..., x_{d-1}^{k}, x_d^{k})       with probability p_2 - p_3
        (x_1^{k+1}, x_2^{k}, ..., x_{d-1}^{k}, x_d^{k})         with probability 1 - p_2        (4)

A value x_i^k in the definition of X^{k+1} means that x_i^{k+1} = x_i^k, i.e., for dimension i we do not generate a new value at step k + 1.

The main idea behind the WRS algorithm is that a subset of candidate values that already produced a good result has the potential, in combination with new values for the remaining dimensions, to lead to better values of the objective function. Instead of always generating new values (as in RS), the WRS algorithm reuses, for a certain number of dimensions, the best values obtained so far. The exact number of dimensions that actually change at each iteration is controlled by the probabilities of change assigned to each dimension.

Since we want WRS to have a good coverage of the variation of the objective function, we determine the importance (the weight) of each hyperparameter by computing the variation of the objective function. In order to do this, we first gather some test values by evaluating the function F for a set of randomly chosen inputs. This is basically equivalent to running RS for a predefined number of steps N_0, significantly smaller than N. We use the obtained data to run an instance of fANOVA [15], which gives us the weights of the d variables of F with respect to the variation of this function. In order to cover as much as possible of the variation of the objective function, we assign a greater probability of change to the variables with greater weight. For a parameter that is responsible for only a small part of the variation of F, we consider that keeping a temporary optimum, once identified, has the potential to be a more efficient approach. Hence, the probability of change for such a variable will be small.

Algorithm 1 describes one WRS step, corresponding to a trial, applied to the maximization of a function F. X^k is the best point identified after k iterations, while F(X^k) is the value of F for this input. The probabilities p_i (i = 1, ..., d) are the probabilities of change for each of the variables of F, and k_i (i = 1, ..., d) are the minimum numbers of values to be generated for each variable. P_i(x) (i = 1, ..., d) are the probability distributions used to generate new values for each of the variables. Our assumption in RS and WRS is that the input variables are statistically independent. Computing the value of F is usually the most time-consuming part of the algorithm. In the specific case of hyperparameter optimization, this involves a trial: a train + test iteration on the model.
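A minimal Python sketch of one WRS step under this scheme is given below; the formal description follows in Algorithm 1, and the authors' actual implementation is the one linked above. The mapping from fANOVA weights to probabilities of change is an assumption here, chosen to match the scaling visible later in Table 1.

import random

def change_probabilities(weights):
    """Map hyperparameter importance weights (e.g., from fANOVA) to probabilities
    of change by scaling them so that the most important dimension always changes
    (p = 1). The exact mapping used by the authors is an assumption here."""
    w_max = max(weights.values())
    return {name: w / w_max for name, w in weights.items()}

def wrs_step(best_x, best_score, F, samplers, p_change, k, min_samples):
    """One WRS step: for each dimension, generate a new random value if its
    probability of change is met or if fewer than the required minimum number of
    values has been generated so far; otherwise keep the best value found so far."""
    p = random.random()
    x = {}
    for name, sample in samplers.items():
        if p_change[name] >= p or k <= min_samples[name]:
            x[name] = sample()          # draw a fresh value from P_i
        else:
            x[name] = best_x[name]      # reuse the so-far best value
    score = F(x)                        # one trial (train + evaluate)
    return (x, score) if score >= best_score else (best_x, best_score)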
To avoid recomputing the function value for the same point, we check that at each iteration at least one variable has changed. This explains why at least one of the p_i values is set to one. If we choose all p_i equal to one, we obtain in particular the classical RS implementation. Therefore, WRS is a generalization of RS.

Algorithm 1 A WRS Step - Objective Function Maximization [11]
Input: F; (X^k, F(X^k)); p_i, k_i, P_i(x), i = 1, ..., d
Output: (X^{k+1}, F(X^{k+1}))
 1: Randomly generate p, uniform in (0, 1)
 2: for i = 1 to d do
 3:   if (p_i ≥ p or k ≤ k_i) then
 4:     // either the probability condition is met or more samples are needed
 5:     Generate x_i^{k+1} according to P_i(x)
 6:   else
 7:     x_i^{k+1} = x_i^k
 8:   end if
 9: end for
10: // usually this is the most time consuming step
11: Compute F(X^{k+1})
12: if F(X^{k+1}) ≥ F(X^k) then return (X^{k+1}, F(X^{k+1}))
13: else return (X^k, F(X^k))
14: end if

Algorithm 2 WRS - Objective Function Maximization [11]
Input: F; N; P_i(x), i = 1, ..., d
Output: (X^N, F(X^N))
 1: // Phase 1 - Run RS
 2: for k = 1 to N_0 < N do
 3:   Perform an RS step, compute (X^k, F(X^k))
 4: end for
 5: // Intermediate phase, determine input for WRS
 6: Determine the probability of change p_i, i = 1, ..., d
 7: Determine the minimum number of required values k_i, i = 1, ..., d
 8: // Phase 2 - Run WRS
 9: for k = N_0 + 1 to N do
10:   Perform the WRS step described in Algorithm 1, compute (X^k, F(X^k))
11: end for
12: return (X^N, F(X^N))

Algorithm 2 describes the entire WRS flow. For a well-chosen set of k_i (i = 1, ..., d) values, WRS has a greater probability than RS to find the global optimum at any step n. In [11], we determined how to obtain such a set. On average, within the same computational budget (number of iterations), WRS identifies the global optimum faster than RS. In other words, on average, WRS converges faster than RS. The theoretical convergence of WRS is proved in [11].

5 WRS for CNN Hyperparameter Optimization

CNN architecture optimization is currently an unofficial benchmark for testing hyperparameter optimization methods. In the following, we describe the application of the WRS method to a CNN architecture optimization problem. The main goal is to compare its performance against some of the best known hyperparameter optimization algorithms. The choice of a CNN architecture is justified by the popularity of CNNs [31] and the ever-increasing interest in identifying optimal CNN architectures [12, 20, 38, 42]. Since a complete train-test cycle for a CNN architecture can be extremely time consuming, identifying better algorithms for this class of problems is a hot topic. When comparing two algorithms, "better" means at least one of the following: a) using fewer trials for a similar score; b) improving the score within the same computational budget (same number of trials).

The base architecture of the network used in our experiments is shown in Fig. 1. We attempt to optimize the following hyperparameters (a possible encoding of this search space is sketched after the list):

• The number of convolution layers (C) - an integer value in the set {3, 4, 5, 6};

• The number of fully-connected layers (F) - an integer value in the set {1, 2, 3, 4};

• The number of output filters for each convolution layer (C1, ..., C6) - an integer value in the range [100, 1024];

• The number of neurons for each fully-connected layer (F1, ..., F4) - an integer value in the range [1024, 2048].
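As referenced above, this search space could be encoded, for example, as a dictionary of sampler functions compatible with the RS/WRS sketches shown earlier; the encoding below is illustrative, not the one used in our code.

import random

# Hypothetical encoding of the CNN search space described above. The number of
# convolution / fully-connected layers determines how many of the C_i / F_i
# values are actually used when the network is built.
cnn_samplers = {
    "C": lambda: random.choice([3, 4, 5, 6]),        # number of convolution layers
    "F": lambda: random.choice([1, 2, 3, 4]),        # number of fully-connected layers
    **{f"C{i}": (lambda: random.randint(100, 1024)) for i in range(1, 7)},   # output filters
    **{f"F{i}": (lambda: random.randint(1024, 2048)) for i in range(1, 5)},  # neurons per FC layer
}

candidate = {name: sample() for name, sample in cnn_samplers.items()}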
Figure 1: The CNN architecture [11] (input; stacked convolution layers with max-pooling; fully-connected layers; output)

We deliberately use the same problem setup as the one described in [11]. We perform 300 trials for all algorithms, including WRS. We use CIFAR-10 without data augmentation to train the CNN models and compare them in terms of achieved accuracy. The experiments run on an IBM S822LC cluster with IBM POWER8 nodes, NVLink, and NVidia Tesla P100 GPUs (http://www.cwu.edu/faculty/turing-cwu-supercomputer). On average, the execution of a trial takes approximately 37 minutes, which leads to slightly less than eight days of running time for each of the seven tested algorithms.

We compare the obtained results against the following standard techniques:

• Random Search (RS)

• Nelder-Mead (NM) [30]

• Particle Swarm Optimization (PSO) [18]

• Sobol Sequences (SS) [37]

• Bayesian Optimization (BO) [27]

• Tree-structured Parzen Estimator (TPE) [4]

For RS, NM, PSO, and SS we use their implementations from the Optunity [44] package, BO comes with its own implementation, whereas for TPE we use the Hyperopt [7] package. For WRS, we run an initial phase of 110 trials and use the obtained values to rank the hyperparameters and determine the probability of change for each hyperparameter. The obtained values are listed in Table 1.

Table 1: Parameter weights (first line) and probabilities (second line) for CNN [11]

Hyperparameter   C      F      C1     C2     C3     C4     C5     C6     F1     F2     F3     F4
Weight           7.4    11.85  0.51   0.79   1.62   0.73   2.26   1.26   26.28  0.87   3.22   1.75
Probability      0.28   0.45   0.02   0.03   0.06   0.03   0.09   0.05   1.00   0.03   0.12   0.07

Fig. 2 shows the least-squares fifth-degree polynomial fit on the accuracy results obtained for each of the 300 trials using WRS, RS, NM, PSO, SS, BO, and TPE. The plot depicts the values obtained at each iteration. Table 2 lists the best result obtained by each of the tested algorithms, the average accuracy, and the standard deviation across all 300 iterations and (in parentheses) across the last 100 iterations.

Figure 2: Least-squares fifth-degree polynomial fit on RS, NM, PSO, SS, BO, TPE vs. WRS accuracy for CIFAR-10 on 300 trials. The plot shows the values obtained at each iteration.

Table 2: Obtained CNN accuracy on CIFAR-10

Optimizer   Best Result   Average       SD
WRS         0.85 (0.85)   0.79 (0.79)   0.09 (0.09)
RS          0.81 (0.81)   0.75 (0.80)   0.04 (0.04)
NM          0.81 (0.81)   0.77 (0.77)   0.03 (0.03)
PSO         0.83 (0.82)   0.78 (0.79)   0.03 (0.03)
SS          0.82 (0.82)   0.75 (0.72)   0.05 (0.04)
BO          0.83 (0.82)   0.75 (0.75)   0.04 (0.03)
TPE         0.81 (0.79)   0.71 (0.71)   0.04 (0.04)

The CNN architecture that yields the best accuracy, according to WRS, has six convolution layers and one fully-connected layer, with C1 = 736, C2 = 508, C3 = 664, C4 = 916, C5 = 186, C6 = 352, and F1 = 1229. According to Table 2, WRS clearly outperforms all other considered algorithms with respect to accuracy, within the same computational budget (number of trials). Among the tested algorithms, WRS yields the highest standard deviation of the results across all 300 iterations, which means that it covers the variation of the objective function better than the other algorithms. This validates the greedy hyperparameter selection procedure used in the WRS algorithm.
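For illustration, this best configuration could be instantiated in Keras roughly as follows. The kernel sizes, the placement of the max-pooling layers, the activations, and the training settings are not specified above and are therefore assumptions; the sketch is not the exact network used in our experiments.

from tensorflow.keras import layers, models

# Filter counts found by WRS for the six convolution layers and the single
# fully-connected layer (C1..C6 and F1 above); everything else is assumed.
conv_filters = [736, 508, 664, 916, 186, 352]
fc_units = [1229]

model = models.Sequential()
for i, filters in enumerate(conv_filters):
    kwargs = {"input_shape": (32, 32, 3)} if i == 0 else {}  # CIFAR-10 images
    model.add(layers.Conv2D(filters, (3, 3), padding="same", activation="relu", **kwargs))
    if i % 2 == 1:
        model.add(layers.MaxPooling2D((2, 2)))  # assumed: pooling after every two conv layers
model.add(layers.Flatten())
for units in fc_units:
    model.add(layers.Dense(units, activation="relu"))
model.add(layers.Dense(10, activation="softmax"))  # 10 CIFAR-10 classes

model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])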
From Fig. 2 it can be observed that, in the case of WRS, the accuracy has the tendency to improve as the number of iterations increases. This validates the hypothesis that using the best known value for some of the hyperparameters, instead of generating a new one at each iteration, tends to achieve increasingly better results. The same hypothesis is validated by looking at the results for the last 100 iterations in Table 2. We also notice from Fig. 2 that, whereas for some of the baseline algorithms (e.g., RS, SS, BO) the accuracy seems to decrease or show little variation as the number of iterations increases, the accuracy of the WRS algorithm has the tendency to increase with an increasing number of iterations.

6 Conclusions and Open Problems

According to our experimental results, the WRS optimization algorithm outperforms several state-of-the-art hyperparameter optimization methods. Since we made our code publicly available, we hope that further experiments on other architectures will confirm these results.

However, we are aware that determining the optimal architecture for machine learning models has an additional challenge: the architecture differs for each dataset and therefore requires adjustments for each individual case. Does this mean that in general we cannot optimize the hyperparameters of a given model? For some given problems, recent automatic methods have produced results exceeding those accomplished by human experts [4, 36]. We should be optimistic, but we should also be aware of the scalability limits, since in the most general case we are dealing with a computationally hard optimization problem.

References

[1] Albelwi, S.; Mahmood, A. (2016). Analysis of instance selection algorithms on large datasets with deep convolutional neural networks. In 2016 IEEE Long Island Systems, Applications and Technology Conference (LISAT), 1–5, 2016.

[2] Albelwi, S.; Mahmood, A. (2017). A framework for designing the architectures of deep convolutional neural networks. Entropy, 19(6), 2017.

[3] Andonie, R. (2019). Hyperparameter optimization in learning systems. Journal of Membrane Computing, 2019.

[4] Bergstra, J.; Bardenet, R.; Bengio, Y.; Kégl, B. (2011a). Algorithms for hyper-parameter optimization. In 25th Annual Conference on Neural Information Processing Systems (NIPS 2011), volume 24 of Advances in Neural Information Processing Systems, Granada, Spain. Neural Information Processing Systems Foundation, 2011.

[5] Bergstra, J.; Bardenet, R.; Bengio, Y.; Kégl, B. (2011b). Algorithms for hyper-parameter optimization. In Shawe-Taylor, J., Zemel, R. S., Bartlett, P. L., Pereira, F. C. N., and Weinberger, K. Q., editors, NIPS, 2546–2554, 2011.

[6] Bergstra, J.; Komer, B.; Eliasmith, C.; Yamins, D.; Cox, D. D. (2015). Hyperopt: a Python library for model selection and hyperparameter optimization. Computational Science and Discovery, 8(1), 014008, 2015.

[7] Bergstra, J.; Yamins, D.; Cox, D. D. (2013). Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures. In Proceedings of the 30th International Conference on International Conference on Machine Learning - Volume 28, ICML'13, JMLR.org, I–115–I–123, 2013.

[8] Chang, C.-C.; Lin, C.-J. (2011). LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011.

[9] Domhan, T.; Springenberg, J. T.; Hutter, F. (2015). Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves.
In Proceedings of the 24th International Conference on Artificial Intelligence, IJCAI'15, AAAI Press, 3460–3468, 2015.

[10] Florea, A.-C.; Andonie, R. (2018). A dynamic early stopping criterion for random search in SVM hyperparameter optimization. In Iliadis, L., Maglogiannis, I., and Plagianakos, V., editors, Artificial Intelligence Applications and Innovations, Cham. Springer International Publishing, 168–180, 2018.

[11] Florea, A.-C.; Andonie, R. (2019). Weighted random search for hyperparameter optimization. International Journal of Computers Communications & Control, 14(2), 154–169, 2019.

[12] He, K.; Zhang, X.; Ren, S.; Sun, J. (2016). Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 770–778, 2016.

[13] Hinz, T.; Navarro-Guerrero, N.; Magg, S.; Wermter, S. (2018). Speeding up the hyperparameter optimization of deep convolutional neural networks. International Journal of Computational Intelligence and Applications, 17(02), 1850008, 2018.

[14] Hutter, F.; Hoos, H.; Leyton-Brown, K. (2014a). An efficient approach for assessing hyperparameter importance. In Proceedings of the 31st International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, 754–762, 2014.

[15] Hutter, F.; Hoos, H.; Leyton-Brown, K. (2014b). An efficient approach for assessing hyperparameter importance. In Proceedings of the International Conference on Machine Learning 2014 (ICML 2014), 754–762, 2014.

[16] Ilievski, I.; Akhtar, T.; Feng, J.; Shoemaker, C. A. (2016). Hyperparameter optimization of deep neural networks using non-probabilistic RBF surrogate model. CoRR, abs/1607.08316, 2016.

[17] Ilievski, I.; Akhtar, T.; Feng, J.; Shoemaker, C. A. (2017). Efficient hyperparameter optimization for deep learning algorithms using deterministic RBF surrogates. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA, 822–829, 2017.

[18] Kennedy, J.; Eberhart, R. C. (1995). Particle swarm optimization. In Proceedings of the IEEE International Conference on Neural Networks, 1942–1948, 1995.

[19] Kotthoff, L.; Thornton, C.; Hoos, H. H.; Hutter, F.; Leyton-Brown, K. (2017). Auto-WEKA 2.0: Automatic model selection and hyperparameter optimization in WEKA. Journal of Machine Learning Research, 18(25), 1–5, 2017.

[20] Krizhevsky, A.; Sutskever, I.; Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1, NIPS'12, USA. Curran Associates Inc., 1097–1105, 2012.

[21] Larson, J.; Menickelly, M.; Wild, S. M. (2019). Derivative-free optimization methods. Acta Numerica, 28, 287–404, 2019.

[22] Lemley, J.; Jagodzinski, F.; Andonie, R. (2016). Big holes in big data: A Monte Carlo algorithm for detecting large hyper-rectangles in high dimensional data. In 2016 IEEE 40th Annual Computer Software and Applications Conference (COMPSAC), 1, 563–571, 2016.

[23] Li, L.; Jamieson, K. G.; DeSalvo, G.; Rostamizadeh, A.; Talwalkar, A. (2016). Efficient hyperparameter optimization and infinitely many armed bandits. CoRR, abs/1603.06560, 2016.

[24] Liu, H.; Simonyan, K.; Yang, Y. (2018). DARTS: differentiable architecture search. CoRR, abs/1806.09055, 2018.

[25] Loshchilov, I.; Hutter, F. (2016). CMA-ES for hyperparameter optimization of deep neural networks. CoRR, abs/1604.07269, 2016.
[26] Luo, R.; Tian, F.; Qin, T.; Chen, E.; Liu, T.-Y. (2018). Neural architecture optimization. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R., editors, Advances in Neural Information Processing Systems 31, Curran Associates, Inc., 7816–7827, 2018.

[27] Martinez-Cantin, R. (2014). BayesOpt: A Bayesian optimization library for nonlinear optimization, experimental design and bandits. Journal of Machine Learning Research, 15, 3915–3919, 2014.

[28] Miikkulainen, R.; Liang, J.; Meyerson, E.; Rawal, A.; Fink, D.; Francon, O.; Raju, B.; Shahrzad, H.; Navruzyan, A.; Duffy, N.; Hodjat, B. (2019). Chapter 15 - Evolving deep neural networks. In Kozma, R., Alippi, C., Choe, Y., and Morabito, F. C., editors, Artificial Intelligence in the Age of Neural Networks and Brain Computing, Academic Press, 293–312, 2019.

[29] Miikkulainen, R.; Liang, J. Z.; Meyerson, E.; Rawal, A.; Fink, D.; Francon, O.; Raju, B.; Shahrzad, H.; Navruzyan, A.; Duffy, N.; Hodjat, B. (2017). Evolving deep neural networks. CoRR, abs/1703.00548, 2017.

[30] Nelder, J. A.; Mead, R. (1965). A simplex method for function minimization. Computer Journal, 7, 308–313, 1965.

[31] Patterson, J.; Gibson, A. (2017). Deep Learning: A Practitioner's Approach. O'Reilly Media, Inc., 1st edition, 2017.

[32] Raschka, S. (2018). Model evaluation, model selection, and algorithm selection in machine learning. CoRR, abs/1811.12808, 2018.

[33] Real, E.; Aggarwal, A.; Huang, Y.; Le, Q. V. (2018). Regularized evolution for image classifier architecture search. CoRR, abs/1802.01548, 2018.

[34] Real, E.; Moore, S.; Selle, A.; Saxena, S.; Suematsu, Y. L.; Le, Q. V.; Kurakin, A. (2017). Large-scale evolution of image classifiers. CoRR, abs/1703.01041, 2017.

[35] Shahriari, B.; Swersky, K.; Wang, Z.; Adams, R. P.; de Freitas, N. (2016). Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104(1), 148–175, 2016.

[36] Snoek, J.; Larochelle, H.; Adams, R. P. (2012). Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems 2012. Proceedings of a meeting held December 3-6, 2012, Lake Tahoe, Nevada, United States, 2960–2968, 2012.

[37] Sobol, I. (1976). Uniformly distributed sequences with an additional uniform property. USSR Computational Mathematics and Mathematical Physics, 16(5), 236–242, 1976.

[38] Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. (2015). Going deeper with convolutions. In Computer Vision and Pattern Recognition (CVPR), 2015.

[39] Talathi, S. S. (2015). Hyper-parameter optimization of deep convolutional networks for object recognition. In 2015 IEEE International Conference on Image Processing (ICIP), 3982–3986, 2015.

[40] Wu, B.; Dai, X.; Zhang, P.; Wang, Y.; Sun, F.; Wu, Y.; Tian, Y.; Vajda, P.; Jia, Y.; Keutzer, K. (2018). FBNet: Hardware-aware efficient ConvNet design via differentiable neural architecture search. CoRR, abs/1812.03443, 2018.

[41] Young, S. R.; Rose, D. C.; Karnowski, T. P.; Lim, S.-H.; Patton, R. M. (2015). Optimizing deep learning hyper-parameters through an evolutionary algorithm. In Proceedings of the Workshop on Machine Learning in High-Performance Computing Environments, MLHPC '15, New York, NY, USA. ACM, 4, 1–5, 2015.

[42] Zeiler, M. D.; Fergus, R. (2014).
Visualizing and understanding convolutional networks. In Fleet, D., Pajdla, T., Schiele, B., and Tuytelaars, T., editors, Computer Vision – ECCV 2014, Cham. Springer International Publishing, 818–833, 2014.

[43] Zoph, B.; Le, Q. V. (2016). Neural architecture search with reinforcement learning. CoRR, abs/1611.01578, 2016.

[44] [online] Optunity. Available: http://optunity.readthedocs.io/en/latest/. Accessed: 2017-09-01.

Copyright © 2020 by the authors. Licensee Agora University, Oradea, Romania. This is an open access article distributed under the terms and conditions of the Creative Commons Attribution-NonCommercial 4.0 International License.

Journal's webpage: http://univagora.ro/jour/index.php/ijccc/

This journal is a member of, and subscribes to the principles of, the Committee on Publication Ethics (COPE). https://publicationethics.org/members/international-journal-computers-communications-and-control

Cite this paper as: Andonie, R.; Florea, A.-C. (2020). Weighted Random Search for CNN Hyperparameter Optimization, International Journal of Computers Communications & Control, 15(2), 3868, 2020. https://doi.org/10.15837/ijccc.2020.2.3868