SJET | P-ISSN: 2616-7069 | E-ISSN: 2617-3115 | Vol. 5 No. 1 January – June 2022

Multi-Scale Pooling in Deep Neural Networks for Dense Crowd Estimation

Ali Raza Radhan1, Fareed Ahmed Jokhio2, Ghulam Hussain1,3, Kamran Javed4, Arsalan Ahmed1

Abstract: State-of-the-art methods for counting people in densely crowded places fail to estimate crowd density accurately for the following reasons. They typically apply the same filters over a complete image or over large image patches. Perspective distortion can then be compensated only by estimating the local scale, which is achieved by training an additional classifier with the optimal kernel size chosen from a limited set of choices. Such methods are restricted to the context they are applied to because they are not end-to-end trainable; they cannot handle quick scale changes because they assign a single scale to large image patches; and they can only use a narrow range of receptive fields if the networks are to remain of feasible size. In this study, we introduce an end-to-end trainable deep architecture that merges features obtained from multiple kernels of different sizes and learns essential properties such as quick scale changes and the right context at each image location. This technique flexibly encodes scale-related information to predict crowd density precisely. The training and validation loss of the proposed approach is 5% and 4% lower, respectively, than that of the state-of-the-art context-aware method.

Keywords: perspective distortion, local scale, image patches, crowd counting, deep learning.

1 Introduction

Crowd counting has become an interesting topic for researchers in recent years due to its broad applications, including crowd monitoring, traffic control, public safety, event planning, video surveillance, and city management.
1Dept. of Electronic Engineering, Quaid-e-Awam University, Larkana, Pakistan
2Dept. of Computer System Engineering, Quaid-e-Awam University, Nawabshah, Pakistan
3Dept. of Electronic Engineering, Sungkyunkwan University, Suwon, South Korea
4National Centre of Artificial Intelligence (NCAI), Saudi Data and Artificial Intelligence
Corresponding author: fajokhio@quest.edu.pk

Over the last few years, density map generation methods have been developed to count people in a scene by training regressors to estimate the people density per unit area; the total count is then obtained by integration, without detecting people individually. Deep learning methods [28-33] have currently become the prevailing tool for crowd counting due to the powerful learning ability of convolutional neural networks (CNNs). Although crowd counting algorithms have been broadly examined by previous methods [20-22, 28-33], handling the large density variance in crowd images, which causes occlusion and perspective distortion, remains a challenging problem. As illustrated in Fig. 1, crowd densities vary significantly from low (e.g. the Venice dataset) to extremely dense (the ShanghaiTech Part A dataset). Such large variation in the density of people is a great challenge for CNN models and makes it difficult to predict an accurate density map. Most deep learning-based approaches [30-36] use the same filters and pooling operations on a complete image and therefore depend upon fixed receptive field sizes. However, one ought to vary the receptive field size over an image to get better results.
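The counting-by-integration idea above can be made concrete with a minimal sketch (numpy only; the map and blob values here are made up for illustration, not taken from any dataset): a density map assigns an expected person count per unit area, so summing it over the whole image yields the total count with no individual detection step.

```python
import numpy as np

# Hypothetical 240x320 predicted density map; each person appears as a
# small blob whose values integrate to ~1.
density_map = np.zeros((240, 320))

# Three illustrative "people": 10x10 blobs of value 0.01 (sum = 1 each).
for top, left in [(50, 40), (100, 150), (180, 260)]:
    density_map[top:top + 10, left:left + 10] = 0.01

# The crowd count is the integral (sum) of the density map.
count = density_map.sum()
print(round(count))  # 3
```

In practice the regressor is trained so that its output integrates to the annotated head count, which is why the loss later in the paper compares whole density maps rather than scalar counts.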
Stunning advancement has been attained in learning density maps by designing multi-scale structures [39] or accumulating multi-scale features [40, 41], which shows that coping with density variation is tough for crowd counting approaches and remains a huge challenge. Figure 2 compares the Mean Absolute Error of several techniques on three standard benchmark crowd counting datasets with different crowd densities. The results indicate the robustness of the proposed method to high scale variance.

Fig. 1. Crowd types: low, medium, and high density.

In this paper, we introduce a deep learning method that obtains features from different receptive field sizes, learns the importance of each feature at different image locations, and accounts for rapid scale changes. Our method alleviates the problems of density variation, occlusion, and perspective distortion by using a multi-scale pooling operation and gives better results than other state-of-the-art methods.
This is in contrast to crowd counting methods that address density variation, as in [19, 42]; our approach differs in the loss function, as we obtain accurate multi-scale features with minimum error. Experiments are performed on several standard crowd counting benchmark datasets, such as ShanghaiTech Part A, Part B, and Venice, which show large density variation, as illustrated in Figure 1.

2 Literature Review

Most of the initial research centered on detection-based crowd counting, where rectangular bounding boxes were used to detect persons in the scene [1] and this information was used to count them [2]. People are counted either by whole-body detection or by detection of different body parts. Whole-body detection techniques [3, 4] are generally conventional pedestrian detection methods that train a classifier using attributes (such as Haar wavelets [5], histograms of oriented gradients [4], edgelets [6], and shapelets [7]) extracted from a full body. Several learning methods, such as Support Vector Machines, boosting [8], and random forests [9], have been used with different levels of success. Though fruitful in small-scale crowd scenes, these methods perform badly in dense crowd places. Researchers have attempted to address this issue by adopting part-based detection methods [10, 11], in which boosted classifiers for particular body parts, such as the head and shoulders, are applied to approximate the people count in a specific area [2]. Though part-based detectors reduce the problems of overlapping, these mechanisms did not yield fruitful results in the presence of highly congested crowds and heavy background clutter.

Fig. 2.
The performance comparison of the state-of-the-art methods.

To solve these problems, researchers tried counting by regression, learning a mapping between features extracted from local image patches and their counts [12, 13]. By counting using regression, these strategies avoid dependence on training detectors, which can be a rather complex task. In recent research, Idrees et al. [23] recognized that in congested crowds no single detection method is effective enough to provide precise data for counting people, due to overlapping, low-quality images, and perspective distortion. Furthermore, they noticed that there exists a spatial relationship that can be used to constrain the count estimates in adjacent local regions. With these ideas in mind, they suggested computing features using various methods that gather different information. Treating densely populated crowds as an irregular and non-homogeneous surface, they applied Fourier analysis along with head detections and Scale-Invariant Feature Transform (SIFT) interest-point-based counting in local adjacent regions. The three sources, i.e. Fourier analysis, interest points, and head detection, are then merged with their respective confidences, and counts at localized patches are determined independently. While these methods were successful in addressing the problems of occlusion and clutter, they mostly neglected important spatial information because they regressed on the global count. In contrast, Lempitsky et al. [24] proposed learning a linear mapping between local patch features and object density maps, thereby absorbing spatial information into the learning procedure. Observing that it is hard to learn a linear mapping, Pham et al.
[25] proposed learning a non-linear mapping between local patch features and density maps: they take multiple image patches and use random forest regression to vote for the densities of various target objects. Similarly, Wang and Zou [26] proposed a fast method for density estimation based on subspace learning, motivated by the computational complexity of the existing methods. In a more recent approach, Xu and Qiu [27] observed that existing crowd density estimation strategies use a small set of features, which limits their ability to perform well, and put forward a method that increases the accuracy of crowd density estimation by utilizing a large set of features. As the regression methods used previously (based on Gaussian process regression or ridge regression) are computationally complex and cannot handle very high-dimensional features, they used random forests as the regression model, whose tree structure is fast and flexible. Unlike conventional approaches to random forest construction, they inserted random projections in the tree nodes to tackle the problem of dimensionality and to introduce randomness into the tree structure. Density-based counting methods are now mostly superseded by convolutional neural network-based methods, where, instead of looking at patches of an image, researchers form an end-to-end regression method using CNNs. Wang et al. [28] were the first to apply CNNs to the task of crowd density estimation, giving an end-to-end deep CNN regression model for counting persons in images of highly dense crowds. In addition, to minimize false responses to background such as trees and buildings, the training data were augmented with extra negative samples whose ground-truth count is fixed as zero. In another approach, Fu et al.
proposed classifying an image into one of five categories, very high, high, medium, low, and very low density, rather than estimating density maps. They utilized a combination of two classifiers to obtain boosting, in which the first classifier samples misclassified images and the second reclassifies rejected samples. The approaches of [29, 30] use image patches extracted at multiple scales as input to a multi-stream network. They then either combine the features for the final density prediction [29] without modeling continuous scale changes, or introduce an ad hoc term in the training loss function [30] to enforce prediction consistency across scales. This, however, does not inject contextual information into the features learned by the algorithm, and so has limited effect. The approach of [31] learns multi-scale features using various receptive fields and merges all of these features to estimate the density. While these earlier approaches account for scale, they neglect the fact that the suitable scale changes smoothly over the image and should be handled dynamically. This was proposed in [32] by weighting various density maps generated from input images at different scales. However, the density map at each scale relies only on features extracted at that specific scale, and is consequently harmed by the absence of adaptive-scale reasoning. Here, we contend that one should instead extract features at various scales and learn to merge them adaptively. While this was essentially also the motivation of [33], which trains an additional classifier to assign the best receptive field to each image patch, these techniques remain restricted in several significant ways.
Firstly, they depend on classifiers that require pre-training the network before training the classifier, so the pipeline is not end-to-end trainable. Secondly, they commonly assign a single scale to a whole image patch, which can still be large, and therefore do not represent quick scale changes. Lastly, the range of receptive field sizes they rely on remains limited, in part because using much bigger ones would require more complex architectures, which may be difficult to train given the sort of networks being used. In this study, we introduce an end-to-end trainable deep architecture that merges features obtained from multiple kernels of different sizes and learns essential properties such as quick scale changes and the right context at each image location. This technique flexibly encodes scale-related information to predict crowd density precisely. The training and validation losses are lower than those of the above methods.

3 Method

3.1 Overview

As discussed above, we target the perspective distortion problem in crowd counting, where the density of people varies from low to extreme. To decrease the training and validation loss and to improve the generalization of the model under density variation, we trained it with an optimal number of epochs and batch size to obtain the final density map.

Fig. 3. The proposed methodology.

3.2 Scale Learning Module

We aim to generate an estimated density map that is very similar to the ground-truth density map for each of a given set of M training images with corresponding ground-truth density maps. Images are fed to the front end of our network, the first ten layers of a pre-trained VGG-16, to obtain VGG feature maps, as shown in Fig. 3. These VGG features are the base of our scale learning model. As discussed in Section 2, the VGG-16 network uses the same filters and pooling operations on a complete image and depends upon fixed receptive field sizes, so it is less efficient in rapid scale change scenarios.
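To make the fixed-receptive-field limitation concrete, the following is a rough numpy sketch of the kind of multi-scale average-pooling pyramid the next subsection builds, with window sizes 1, 2, 3, and 6 as used there. This is our simplified illustration, not the trained network: it omits the 1×1 convolutions and learned weighting, and uses nearest-neighbour rather than linear upsampling.

```python
import numpy as np

def adaptive_avg_pool(feat, out):
    """Average-pool a (C, H, W) feature map into an (C, out, out) grid.

    Bins are roughly equal-sized; assumes H, W >= out.
    """
    C, H, W = feat.shape
    pooled = np.zeros((C, out, out))
    hs = np.linspace(0, H, out + 1).astype(int)
    ws = np.linspace(0, W, out + 1).astype(int)
    for i in range(out):
        for j in range(out):
            pooled[:, i, j] = feat[:, hs[i]:hs[i + 1], ws[j]:ws[j + 1]].mean(axis=(1, 2))
    return pooled

def upsample_nearest(feat, H, W):
    """Nearest-neighbour upsampling of (C, h, w) back to (C, H, W)."""
    C, h, w = feat.shape
    rows = np.arange(H) * h // H
    cols = np.arange(W) * w // W
    return feat[:, rows][:, :, cols]

def scale_pyramid(feat, sizes=(1, 2, 3, 6)):
    """Pool the backbone features at several window sizes, upsample each
    branch back to the original resolution, and concatenate with the
    original features along the channel axis."""
    C, H, W = feat.shape
    branches = [upsample_nearest(adaptive_avg_pool(feat, s), H, W) for s in sizes]
    return np.concatenate([feat] + branches, axis=0)

feat = np.arange(4 * 12 * 12, dtype=float).reshape(4, 12, 12)
pyramid = scale_pyramid(feat)
print(pyramid.shape)  # (20, 12, 12): 4 original + 4 x 4 pooled channels
```

Each pooled branch summarizes the image at a different context size, so a later layer can pick, per location, whichever scale suits the local crowd density.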
To solve this, we vary the receptive field size over the image to obtain scale-learning features by applying multi-scale pooling with window sizes 1×1, 2×2, 3×3, and 6×6 on the VGG features, forming a pyramid structure. Each pooled feature yields a density map, so four different densities extract information from the given image. These multi-scale densities are then upsampled to the size of the convolution layers by linear interpolation and combined to give the weighted features. Finally, the weighted features are concatenated with the VGG features to predict the final density map of the underlying image, as shown in Figure 3. The Scale Learning Network is illustrated in Figure 3: RGB images are input to a front-end network that contains the first 10 layers of the VGG-16 network. The resulting VGG features are pooled into blocks of various sizes by average pooling followed by a 1×1 convolutional layer. They are then upsampled back to the original feature size to form the weighted features. The contrast, or weighted, features are used to learn the weights for the scale-learning features, which are then fed to a back-end network to deliver the final density map.

3.3 Training Details and Loss Function

Our network is end-to-end trainable and is trained with the L2 loss, defined as

L(\sigma) = \frac{1}{2B} \sum_{i=1}^{B} \left\| D_i^{gt} - D_i^{pre} \right\|_2^2,

where B is the batch size and \sigma denotes the parameters of the non-linear mapping from an input image I_i to a predicted density map D_i^{pre}. D_i^{gt} is the ground-truth density map, obtained as in [19] by convolving an image containing ones at people's head locations and zeros elsewhere with a Gaussian kernel N^{gt}(p \mid \mu, \delta^2), so that

\forall p \in I_i, \quad D_i^{gt}(p \mid I_i) = \sum_{j=1}^{c_i} N^{gt}(p \mid \mu = P_i^j, \delta^2),

where \mu and \delta are the mean and standard deviation of the normal distribution, c_i is the number of annotated heads in I_i, and P_i^j is the position of the j-th head.
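The ground-truth construction and loss above can be sketched directly in numpy. This is an illustration with a fixed \delta (as used for Part_B and Venice; the function and variable names are ours): each annotated head contributes a Gaussian normalized to unit mass, so the density map integrates to the head count, and the batch loss averages the squared L2 distances.

```python
import numpy as np

def ground_truth_density(head_points, shape, sigma=4.0):
    """D_gt(p) = sum_j N(p; mu = P_j, sigma^2), one Gaussian per head,
    each normalised so that it integrates to 1 person."""
    H, W = shape
    ys, xs = np.mgrid[0:H, 0:W].astype(float)
    density = np.zeros(shape)
    for px, py in head_points:              # (x, y) head annotations
        g = np.exp(-((xs - px) ** 2 + (ys - py) ** 2) / (2 * sigma ** 2))
        density += g / g.sum()              # each head contributes mass 1
    return density

def l2_loss(pred_maps, gt_maps):
    """L = 1/(2B) * sum_i ||D_gt_i - D_pre_i||_2^2 over a batch of B maps."""
    B = len(pred_maps)
    return sum(np.sum((g - p) ** 2) for p, g in zip(pred_maps, gt_maps)) / (2 * B)

gt = ground_truth_density([(20, 30), (40, 10)], (64, 64))
print(round(gt.sum(), 6))  # 2.0: the density integrates to the head count
```

The normalization step is what makes "count by integration" exact: summing the ground-truth map recovers the number of annotated heads, so a network trained to reproduce the map also learns to count.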
To reduce the loss, we apply Stochastic Gradient Descent with batch size 1 for the ShanghaiTech Part_A dataset, which contains images of varying sizes, and obtained impressive results after training the model for 150 epochs. For the other two datasets, Venice and ShanghaiTech Part_B, which have fixed image sizes, we use Adam with batch size 16 and trained the model for 100 epochs.

4 Experiments

4.1 Evaluation Metrics

We apply two standard error measures, the Mean Absolute Error (MAE) and the Mean Squared Error (MSE), for evaluation and compare our results with other methods. These are defined as

MAE = \frac{1}{K} \sum_{i=1}^{K} | x_i - \hat{x}_i |, \quad MSE = \sqrt{ \frac{1}{K} \sum_{i=1}^{K} ( x_i - \hat{x}_i )^2 },

where K is the total number of test images, x_i is the ground-truth count, and \hat{x}_i is the predicted number of people in the i-th image.

4.2 Benchmark Datasets and Ground-Truth Data

We use three standard benchmark datasets, ShanghaiTech Part_A, ShanghaiTech Part_B, and Venice, to compare our method with other approaches.

ShanghaiTech [27]. The ShanghaiTech crowd counting dataset comprises 1198 images with approximately 330,165 people in them. It is divided into two parts: ShanghaiTech Part_A, with 482 images randomly taken from the internet, and ShanghaiTech Part_B, with 716 images taken from busy streets of a metropolitan area of Shanghai. In Part_A, 300 images are reserved for the training set, and in Part_B, 400 images; the remaining images of both parts are used for testing. The ground-truth density maps for Part_A were generated with adaptive Gaussian kernels and those for Part_B with fixed kernels.

Venice [19]. The Venice dataset contains 167 fixed-size images of 1280×720 resolution from Piazza San Marco in Venice. In this dataset, 80 images form the training set and 87 images are used for testing. The images of the Venice dataset are better calibrated than those of ShanghaiTech. The ground-truth density maps are generated with fixed Gaussian kernels, as in the ShanghaiTech Part_B dataset.

4.3 Ablation Study

The ablation study is mainly performed on the ShanghaiTech and Venice datasets, as they exhibit the greatest problems of rapid scale change due to density variation. Here we confirm the benefits of specific features. Table 1 compares the Mean Absolute Error and Root Mean Square Error of different approaches on the three standard benchmark crowd counting datasets with varying people density. The results in Figure 4 indicate the strength of the proposed method on low-, medium-, and high-density crowds.

TABLE I. COMPARATIVE RESULTS ON THE VENICE AND SHANGHAITECH DATASETS.

                       Venice          Shanghai Part-A   Shanghai Part-B
Method                 MAE    RMSE     MAE    RMSE       MAE    RMSE
Zhang et al. [32]      -      -        181.8  277.7      32.0   49.8
MCNN [33]              145.4  147.3    110.2  173.2      26.4   41.3
Switch-CNN [34]        52.8   59.5     90.4   135.0      21.6   33.4
CP-CNN [36]            -      -        73.6   106.4      20.1   30.1
ACSCP [42]             -      -        75.7   102.7      17.2   27.4
Liu et al. [37]        -      -        73.6   112.0      13.7   21.4
D-ConvNet [41]         -      -        73.5   112.3      18.7   26.0
IG-CNN [31]            -      -        72.5   118.2      13.6   21.1
Ic-CNN [43]            -      -        68.5   116.2      10.7   16.0
CSRNet [38]            35.8   50.0     68.2   115.0      10.6   16.0
SANet [30]             -      -        67.0   104.5      8.4    13.6
Context-aware [19]     20.5   29.9     62.3   100.0      7.8    12.2
Our approach           19.8   29.3     61.5   100.0      7.2    12.0

Fig. 4. The performance of the proposed approach on three different density-level scenes for crowd density estimation.

5 Conclusion and Future Perspectives

In this paper, we propose a scale learning model that applies a multi-scale pooling network to address the problem of density variation, combining four different-scale density maps of an image to generate the final one.
Experiments were performed on three standard benchmark datasets exhibiting density variation, and the generalization ability of the model was more impressive than that of other state-of-the-art methods. These datasets were captured by fixed cameras, so in future we will work on images taken from moving cameras, e.g. drones. We will also enhance our model to process consecutive images and their ground-truth data simultaneously.

(Fig. 4 panel counts — VGG: 132, ours (MSPN): 174, GT: 184; VGG: 223, ours (MSPN): 273, GT: 285; VGG: 2810, ours (MSPN): 2221, GT: 2333.)

Funding: This research received no external funding.

Conflicts of Interest: The authors declare no conflict of interest.

REFERENCES

[1] Dollar, P., Wojek, C., Schiele, B., Perona, P., 2012. Pedestrian detection: An evaluation of the state of the art. IEEE Transactions on Pattern Analysis and Machine Intelligence 34, 743–761.
[2] Li, M., Zhang, Z., Huang, K., Tan, T., 2008. Estimating the number of people in crowded scenes by MID-based foreground segmentation and head-shoulder detection, in: 19th International Conference on Pattern Recognition (ICPR 2008), IEEE. pp. 1–4.
[3] Leibe, B., Seemann, E., Schiele, B., 2005. Pedestrian detection in crowded scenes, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2005), IEEE. pp. 878–885.
[4] Tuzel, O., Porikli, F., Meer, P., 2008. Pedestrian detection via classification on Riemannian manifolds. IEEE Transactions on Pattern Analysis and Machine Intelligence 30, 1713–1727.
[5] Viola, P., Jones, M.J., 2004. Robust real-time face detection. International Journal of Computer Vision 57, 137–154.
[6] Wu, B., Nevatia, R., 2005. Detection of multiple, partially occluded humans in a single image by Bayesian combination of edgelet part detectors, in:
Tenth IEEE International Conference on Computer Vision (ICCV 2005), IEEE. pp. 90–97.
[7] Sabzmeydani, P., Mori, G., 2007. Detecting pedestrians by learning shapelet features, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2007), IEEE. pp. 1–8.
[8] Viola, P., Jones, M.J., Snow, D., 2005. Detecting pedestrians using patterns of motion and appearance. International Journal of Computer Vision 63, 153–161.
[9] Gall, J., Yao, A., Razavi, N., Van Gool, L., Lempitsky, V., 2011. Hough forests for object detection, tracking, and action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 33, 2188–2202.
[10] Felzenszwalb, P.F., Girshick, R.B., McAllester, D., Ramanan, D., 2010. Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence 32, 1627–1645.
[11] Wu, B., Nevatia, R., 2007. Detection and tracking of multiple, partially occluded humans by Bayesian combination of edgelet based part detectors. International Journal of Computer Vision 75, 247–266.
[12] Chan, A.B., Vasconcelos, N., 2009. Bayesian Poisson regression for crowd counting, in: 2009 IEEE 12th International Conference on Computer Vision, IEEE. pp. 545–551.
[13] Ryan, D., Denman, S., Fookes, C., Sridharan, S., 2009. Crowd counting using multiple local features, in: Digital Image Computing: Techniques and Applications (DICTA 2009), IEEE. pp. 81–88.
[14] Babu Sam, D., Surya, S. and Venkatesh Babu, R., 2017. Switching convolutional neural network for crowd counting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 5744-5752).
[15] Zhang, L., Shi, M. and Chen, Q., 2018. Crowd counting via scale-adaptive convolutional neural network. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV) (pp. 1113-1121). IEEE.
[16] Chen, J., Su, W. and Wang, Z., 2020. Crowd counting with crowd attention convolutional neural network. Neurocomputing, 382, pp. 210-220.
[17] Zhang, Y., Zhou, D., Chen, S., Gao, S. and Ma, Y., 2016. Single-image crowd counting via multi-column convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 589-597).
[18] Sam, D.B. and Babu, R.V., 2018. Top-down feedback for crowd counting convolutional neural network. In Thirty-Second AAAI Conference on Artificial Intelligence.
[19] Liu, W., Salzmann, M. and Fua, P., 2019. Context-aware crowd counting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 5099-5108).
[20] Zhang, C., Li, H., Wang, X. and Yang, X., 2015. Cross-scene crowd counting via deep convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 833-841).
[21] Hu, Y., Chang, H., Nian, F., Wang, Y. and Li, T., 2016. Dense crowd counting from still images with convolutional neural networks. Journal of Visual Communication and Image Representation, 38, pp. 530-539.
[22] Boominathan, L., Kruthiventi, S.S. and Babu, R.V., 2016. CrowdNet: A deep convolutional network for dense crowd counting. In Proceedings of the 24th ACM International Conference on Multimedia (pp. 640-644).
[23] Idrees, H., Saleemi, I., Seibert, C., Shah, M., 2013. Multi-source multi-scale counting in extremely dense crowd images, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2547–2554.
[24] Lempitsky, V., Zisserman, A., 2010. Learning to count objects in images, in: Advances in Neural Information Processing Systems, pp. 1324–1332.
[25] Pham, V.Q., Kozakaya, T., Yamaguchi, O., Okada, R., 2015.
Count forest: Co-voting uncertain number of targets using random forest for crowd density estimation, in: Proceedings of the IEEE International Conference on Computer Vision, pp. 3253–3261.
[26] Wang, Y., Zou, Y., 2016. Fast visual object counting via example-based density estimation, in: 2016 IEEE International Conference on Image Processing (ICIP), IEEE. pp. 3653–3657.
[27] Xu, B., Qiu, G., 2016. Crowd density estimation based on rich features and random projection forest, in: 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), IEEE. pp. 1–8.
[28] Wang, C., Zhang, H., Yang, L., Liu, S., Cao, X., 2015. Deep people counting in extremely dense crowds, in: Proceedings of the 23rd ACM International Conference on Multimedia, ACM. pp. 1299–1302.
[29] Daniel Onoro-Rubio and Roberto J. López-Sastre. Towards perspective-free object counting with deep learning. In European Conference on Computer Vision, pages 615–629, 2016.
[30] Xinkun Cao, Zhipeng Wang, Yanyun Zhao, and Fei Su. Scale aggregation network for accurate and efficient crowd counting. In European Conference on Computer Vision, 2018.
[31] Deepak Babu Sam, Neeraj N. Sajjan, R. Venkatesh Babu, and Mukundhan Srinivasan. Divide and grow: Capturing huge diversity in crowd images with incrementally growing CNN. In Conference on Computer Vision and Pattern Recognition, 2018.
[32] Cong Zhang, Hongsheng Li, Xiaogang Wang, and Xiaokang Yang. Cross-scene crowd counting via deep convolutional neural networks. In Conference on Computer Vision and Pattern Recognition, pages 833–841, 2015.
[33] Yingying Zhang, Desen Zhou, Siqin Chen, Shenghua Gao, and Yi Ma. Single-image crowd counting via multi-column convolutional neural network. In Conference on Computer Vision and Pattern Recognition, pages 589–597, 2016.
[34] Deepak Babu Sam, Shiv Surya, and R. Venkatesh Babu. Switching convolutional neural network for crowd counting. In Conference on Computer Vision and Pattern Recognition, page 6, 2017.
[35] Feng Xiong, Xinjian Shi, and Dit-Yan Yeung. Spatiotemporal modeling for crowd counting in videos. In International Conference on Computer Vision, pages 5161–5169, 2017.
[36] Vishwanath A. Sindagi and Vishal M. Patel. Generating high-quality crowd density maps using contextual pyramid CNNs. In International Conference on Computer Vision, pages 1879–1888, 2017.
[37] Xialei Liu, Joost van de Weijer, and Andrew D. Bagdanov. Leveraging unlabeled data for crowd counting by learning to rank. In Conference on Computer Vision and Pattern Recognition, 2018.
[38] Yuhong Li, Xiaofan Zhang, and Deming Chen. CSRNet: Dilated convolutional neural networks for understanding the highly congested scenes. In Conference on Computer Vision and Pattern Recognition, 2018.
[39] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott E. Reed, Cheng-Yang Fu, and Alexander C. Berg. SSD: Single shot multibox detector. In European Conference on Computer Vision, 2016.
[40] Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. arXiv preprint, 2015.
[41] Zenglin Shi, Le Zhang, Yun Liu, and Xiaofeng Cao. Crowd counting with deep negative correlation learning. In Conference on Computer Vision and Pattern Recognition, 2018.
[42] Zan Shen, Yi Xu, Bingbing Ni, Minsi Wang, Jianguo Hu, and Xiaokang Yang. Crowd counting via adversarial cross-scale consistency pursuit. In Conference on Computer Vision and Pattern Recognition, 2018.
[43] Viresh Ranjan, Hieu Le, and Minh Hoai. Iterative crowd counting. In European Conference on Computer Vision, 2018.