Measuring Customer Behavior with Deep Convolutional Neural Networks

Veaceslav Albu
Institute of Mathematics and Computer Science, 5 Academiei, Chisinau, Republic of Moldova, MD 2028
vaalbu@googlemail.com

Abstract
In this paper we propose a neural network model for human emotion and gesture classification. We demonstrate that the proposed architecture represents an effective tool for real-time processing of customer behavior for distributed on-land systems, such as information kiosks, automated cashiers, and ATMs. The proposed approach combines recent biometric techniques with the neural network approach for real-time emotion and behavioral analysis. In a series of experiments, emotions of human subjects were recorded, recognized, and analyzed to give statistical feedback on the overall emotions of a number of targets within a certain time frame. The result of the study allows automatic tracking of user behavior based on a limited set of observations.

Keywords: Deep Neural Networks, Computer Vision, Emotion Classification, Gesture Classification

1. Introduction
Recognition of human behavior can be achieved most efficiently by visually detecting facial features and specific body movements, such as gestures. Using computer vision and machine learning algorithms to process these features, recorded by infrared cameras, we can classify emotional states and behavioral patterns of multiple targets.

The aim of this paper is to provide statistical observations and measurements of human behavior during standard interaction with the user interface of commonly used software. For academic purposes, we have chosen a very limited number of emotional states and behavioral patterns by studying only one type of such standard interaction: the interaction of a user with typical ATM equipment, since it provides us with very distinctive patterns of 'typical' and 'non-typical' behavior and facial expressions. During this study, we observed the behavior of human subjects during standard versus non-standard interaction with the ATM. Automated analysis of these behaviors with machine learning techniques allowed us to train a complex convolutional neural network (CNN) to classify the behavior of a user by classifying both body movements and facial features. Such feedback can provide important measures of user response to an interaction with any chosen system that involves a limited number of gestures.

We use infrared cameras to automatically detect facial features and the movements of the limbs in order to classify user behavior as typical or atypical for the kind of task being performed. We restrict ourselves to only one type of interaction; however, this kind of classification task is very useful in a number of applications where the number of human gestures is limited, such as:
• Customers at various types of automated machines. For this category of users, the algorithm can be used to detect unusual or fraudulent behavior and thereby decrease the workload of the closed-circuit television (CCTV), or video surveillance, operators who monitor users of these machines:
  o a customer at an ATM;
  o a customer at a ticket machine in the underground;
  o a customer at an automated cashier in countries where such payment is widely used;
• Drivers.
For this category of users, the algorithm can be used to detect dangerous actions and prevent unwanted consequences such as falling asleep or loss of attention:
  o a train driver on a railway line or in the underground;
  o a truck driver;
  o an automobile driver;
• Workers. Here, we could classify correct vs. incorrect actions and identify such unwanted states as loss of attention, sickness, and tiredness:
  o assembly line workers;
  o construction workers (e.g., on high buildings, underground, in mines).

The aim of the current paper is to analyze a person's actions during interaction with a user interface and to implement an algorithm able to classify human behavior (normal vs. abnormal) in real time (Perez-Sala et al., 2014). The processing of facial features with infrared cameras for academic and industrial purposes is developing rapidly: it is used in the gaming domain (MacCormick, 2013; Vera et al., 2011), as well as in security systems such as the Aurora face recognition system (http://www.facerec.com/deep-learning/). In this study, we likewise use infrared cameras, embedded in the Microsoft Kinect system; however, other types of infrared cameras could also be used. The output of the infrared camera (a point cloud) is used as input to the neural network architecture, which classifies the user's behavior based on gestures and facial expressions. The user's movements are classified into two categories (typical and non-typical), whereas facial expressions are classified into six basic emotions: anger, disgust, fear, happiness, sadness, and surprise (Ekman & Friesen, 1978).

To solve the problem of emotion and gesture recognition, we use convolutional neural networks (CNNs), which have proved to be extremely effective for classification of large amounts of data. Despite the similarity between a generic artificial neural network and a convolutional neural network, the CNN is more effective for this task because it alternates convolutional and subsampling layers. The contribution of the current study lies mainly in the application of the algorithms: we combine this particular type of NN with infrared input for recognition and classification of facial features and body movements. As far as we are aware, this type of application has not been described in the literature so far.

2. Background

2.1. Literature review
The type of convolutional network described in this study was first proposed by Fukushima in the theoretical model called the "Neocognitron" (Fukushima, 1980). One of the earliest representatives of this class of models, the Neocognitron is a hierarchical network in which feature complexity and translation invariance are alternately increased in different layers of simple and complex cells. Recognition occurs at different levels of the processing hierarchy through a template match and a pooling operation over units tuned to the same feature at different positions. Starting with the Neocognitron for translation-invariant object recognition, several hierarchical models employing pooling mechanisms and convolutional layers have been proposed. The concept of pooling units tuned to transformed versions of the same object or feature was subsequently proposed by Perrett and Oram (Perrett & Oram, 1993). They proposed a scheme that provides 3D object recognition through 2D shape description and involves a viewer-centred description of objects. Shape components were used as features for object comparison.
First, these components were used to activate representations of the approximate appearance of one object type at one view, orientation, and size. Invariance was achieved through a series of independent analyses with a subsequent pooling of results, performed at each pooling stage. The system therefore performed parallel processing, with computations carried out in a series of hierarchical steps. A similar type of neural network was proposed by Logothetis et al. (Logothetis et al., 1994), who constructed a regularization network for 3D object recognition based on radial basis functions (RBFs). This model was later extended by Riesenhuber and Poggio into what is widely known as the HMAX model. The structure of the HMAX model is similar to Fukushima's Neocognitron, with its feature-complexity-increasing simple-cell (S) layers and invariance-increasing complex-cell (C) layers (Riesenhuber & Poggio, 1999; Riesenhuber & Poggio, 2000). HMAX uses another type of pooling mechanism, called the MAX operation, which increases invariance in the complex-cell layers. It follows the principle that the most strongly activated afferent of a C-cell determines the response of the pooling unit, providing the ability to isolate an essential feature from the background and thus build feature detectors invariant to scale and translation changes. More complex features in the higher levels of HMAX are built from simpler features, and tolerance to deformations in their local arrangement is achieved by the invariance properties of the lower-level units. From the point of view of biological plausibility, the visual areas of the primate cortex are considered in this model as a number of modules (V1-V4, IT, PFC), modelled as a hierarchy of increasingly sophisticated representations, naturally extending the simple-to-complex-cell model of Hubel and Wiesel (Hubel & Wiesel, 1962).

An important breakthrough for CNNs came with the widespread use of the backpropagation learning algorithm for multi-layer feed-forward NNs. LeCun et al. (LeCun et al., 1989) presented the first CNN trained by backpropagation and applied it to the problem of handwritten digit recognition. The term Convolutional Neural Network refers to NN models similar to the one proposed by LeCun et al., which is actually a simpler model than the Neocognitron and its extensions mentioned above. Since 2012, when deep neural networks first demonstrated their performance, they have been used in a large number of computer vision applications. Krizhevsky et al. trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images of the ImageNet LSVRC-2010 contest into 1,000 different classes (Krizhevsky, Sutskever, & Hinton, 2012). On the test data, they achieved extremely small error rates, considerably better than the previous state of the art. The proposed neural network had 60 million parameters and 650,000 neurons and consisted of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. The softmax function, or normalized exponential, is a generalization of the logistic function that converts a K-dimensional vector z of arbitrary real values to a K-dimensional vector sigma(z) of real values in the range (0, 1) that sum to 1: sigma(z)_j = exp(z_j) / sum_k exp(z_k). In neural network simulations, the softmax function is often implemented at the final layer of a network used for classification.
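As a small illustration of this definition, a numerically stable softmax takes only a few lines of NumPy. The snippet below is our own sketch of the standard formula, not code from the works cited above:

```python
import numpy as np

def softmax(z):
    """Map a K-dimensional vector of real scores to a probability
    distribution: each output lies in (0, 1) and the outputs sum to 1."""
    # Subtracting the maximum does not change the result but avoids
    # overflow in exp() for large scores.
    e = np.exp(z - np.max(z))
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])
print(softmax(scores))          # approximately [0.659 0.242 0.099]
print(softmax(scores).sum())    # 1.0
```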
2.2. Related work
The application of deep learning to sensor and infrared camera processing is evolving rapidly. Recently, a similar algorithm was applied to night vision systems in vehicles (Wang et al., 2016). In that study, vehicle candidates for classification are detected in the infrared frame, and contours are generated using a local adaptive threshold based on maximum distance; the obtained vehicle candidates are then verified using a deep belief network (DBN) based classifier. Deep neural networks have also recently been applied to face detection in airports for robust feature recognition (the Aurora face recognition system, http://www.facerec.com/deep-learning/). That system recognizes faces and compares them with a database of passengers; it is deployed at Heathrow and Manchester airports.

3. Methodology

3.1. Model architecture
A deep neural network (DNN) is an artificial neural network (NN) with multiple hidden layers of units between the input and output layers. Like shallow ANNs, DNNs can model complex non-linear relationships. DNN architectures, e.g., for object detection and parsing, generate compositional models in which the object is expressed as a layered composition of image primitives (LeCun et al., 1989). The extra layers enable the composition of features from lower layers, giving the potential to model complex data with fewer units than a similarly performing shallow network.

Figure 1. CNN architecture (adapted from Krizhevsky et al., 2012)

The architecture of a CNN can be described as follows. A small pixel region feeds the input neurons, which connect to the first convolutional hidden layer (Figure 1). This layer contains a set of learnable filters, each of which is activated when a particular type of feature appears in its pixel region of the input. At this stage the CNN acquires shift invariance, which is carried by the feature map. A subsampling layer follows, performing two processes, local averaging and sampling, which reduce the resolution of the feature map. This task requires supervised learning: before starting the experiment, we provided a set of labeled videos with different emotional expressions. The system analyzes the images and finds similar features, then builds a map in which the videos are arranged according to those features, so that images with similar emotions form a class. To test the system, we add further videos and correct the system when it assigns them improperly.

The proposed model consists of four convolutional layers, each followed by a max-pooling layer, and three fully-connected layers with a final classifier implemented as an MLP (six outputs corresponding to the basic emotions for emotion classification, and two outputs, typical and non-typical, for motion classification). The input data is the infrared camera output.
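To make the layer stack concrete, the sketch below outlines a model of this shape using Theano's conv2d and pool_2d operations (the library used for the implementation; see Section 3.2). It is a simplified illustration under our own assumptions: the filter counts, filter sizes, 64x64 single-channel input resolution, and weight initialization are placeholders rather than the values used in the study, and biases and training code are omitted.

```python
import numpy as np
import theano
import theano.tensor as T
from theano.tensor.nnet import conv2d
from theano.tensor.signal.pool import pool_2d

rng = np.random.RandomState(0)

def shared_weights(shape):
    """Small random initial weights (biases omitted for brevity)."""
    w = rng.normal(0, 0.01, shape).astype(theano.config.floatX)
    return theano.shared(w)

def conv_pool(inp, n_in, n_out, fsize=3):
    """One convolutional layer followed by 2x2 max-pooling."""
    W = shared_weights((n_out, n_in, fsize, fsize))
    return pool_2d(T.nnet.relu(conv2d(inp, W)), (2, 2), ignore_border=True)

def dense(inp, n_in, n_out, activation):
    """Fully-connected layer."""
    return activation(T.dot(inp, shared_weights((n_in, n_out))))

x = T.tensor4('x')                    # (batch, 1, 64, 64) infrared frames
h = x
for n_in, n_out in [(1, 16), (16, 32), (32, 64), (64, 64)]:
    h = conv_pool(h, n_in, n_out)     # four conv + max-pool blocks
h = h.flatten(2)                      # 64 maps of 2x2 -> (batch, 256)
h = dense(h, 64 * 2 * 2, 128, T.nnet.relu)
h = dense(h, 128, 64, T.nnet.relu)
p_emotion = dense(h, 64, 6, T.nnet.softmax)   # six basic emotions

predict = theano.function([x], p_emotion.argmax(axis=1))
```

A second two-output softmax head of the same form could share the convolutional features for the typical vs. non-typical motion classification.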
3.2. Model implementation and training
The computations were performed in Python, using the Theano CNN library. Theano is a Python library that allows defining, optimizing, and evaluating mathematical expressions involving multi-dimensional arrays efficiently (Bastien et al., 2012). The model was trained on the training data, and model evaluation was performed on the test data with k-fold cross-validation (for details, see the next subsection). The computations were performed on an Amazon EC2 machine (https://portal.aws.amazon.com).

3.3. Model evaluation
The validation of the neural network model was initially performed with the leave-one-out cross-validation (LOOCV) technique. The use of LOOCV was essential for appropriate estimation of the optimal level of regularization and of the parameters (connection weights) of the obtained neural network. Cross-validation is a model validation technique for assessing how the results of a statistical analysis will generalize to an independent data set. LOOCV is a particular case of leave-p-out cross-validation. Leave-p-out cross-validation (LpOCV) uses p observations as the validation set and the remaining observations as the training set. This is repeated for all ways to split the original sample into a validation set of p observations and a training set; LpOCV therefore requires the model to be learned and validated C(n, p) = n! / (p! (n - p)!) times, where n is the number of observations in the original sample. In leave-one-out cross-validation, p = 1.

However, for our purposes LOOCV turned out to be relatively slow. Therefore, the validation of the CNN results was performed with the k-fold cross-validation technique (Golub & Van Loan, 1996). In k-fold cross-validation, the original sample is randomly partitioned into k equal-sized subsamples. Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k - 1 subsamples are used as training data. The cross-validation process is then repeated k times (the folds), with each of the k subsamples used exactly once as the validation data. The k results from the folds can then be averaged (or otherwise combined) to produce a single estimate. The advantage of this method over repeated random sub-sampling is that all observations are used for both training and validation, and each observation is used for validation exactly once. 10-fold cross-validation is commonly used, but in general k remains an unfixed parameter. When k = n (the number of observations), k-fold cross-validation is exactly leave-one-out cross-validation.
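As a compact illustration of the procedure just described, the following sketch implements plain k-fold cross-validation in NumPy. It is our own illustration: the train and evaluate callables and the array layout are placeholders, not the study's actual training code.

```python
import numpy as np

def k_fold_score(data, labels, train, evaluate, k=10, seed=0):
    """Plain k-fold cross-validation over NumPy arrays: each sample is
    used for validation exactly once; the k fold scores are averaged."""
    n = len(data)
    idx = np.random.RandomState(seed).permutation(n)
    folds = np.array_split(idx, k)           # k near-equal subsamples
    scores = []
    for i in range(k):
        val = folds[i]                       # one fold held out
        trn = np.concatenate([f for j, f in enumerate(folds) if j != i])
        model = train(data[trn], labels[trn])
        scores.append(evaluate(model, data[val], labels[val]))
    return np.mean(scores)                   # single combined estimate
```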
4. Results and discussion

4.1. Infrared input processing
The approach to gesture recognition used in this study is body tracking: classification of body movements. One classification technique for this method is pattern recognition, in which a special video/infrared camera recognizes human actions: waving, jumping, hand gestures, etc. Among the first successful representatives of this technology is the Microsoft Kinect (MacCormick, 2013). The Kinect uses structured light and machine learning as follows:
• The depth map is constructed by analyzing a speckle pattern of infrared laser light.
• Body parts are inferred using a randomized decision forest, learned from over 1 million training examples.
• Training starts with 100,000 depth images with known skeletons (from a motion capture system).
• The depth image is transformed into a body-part image.
• The body-part image is transformed into a skeleton.

4.2. Psychological experiments
We conducted a series of experiments in order to evaluate how effectively the proposed system detects normal vs. abnormal behavior of a customer during interaction with an ATM. For the purposes of the experiment, we developed ATM simulation software that was used in a stand-alone terminal. During each interaction session, the users' body movements and facial expressions were recorded by a camera mounted on top of the terminal. These recordings were later evaluated by human observers, and the behavior displayed in them was classified as typical or non-typical. In order to preserve the uniformity of the data, we showed the videos on the same equipment that was used during the ATM experiment sessions. Thirty healthy subjects, aged 21-37, with normal or corrected-to-normal vision, participated in the experiment. Simultaneously, the data from the two series of experiments was processed with an infrared camera and used as input to the CNN algorithm. Each subject performed 10 sessions with the ATM simulation software and 5 video sessions. During each session, recognition of the upper-body movements (within the range of a camera mounted on top of a typical ATM machine) was performed together with facial feature classification and recognition. Among the thirty subjects, we used 22 as examples of 'normal' behavior and 8 as examples of 'abnormal' behavior.

4.3. Comparison to related work
The application of DNNs to sensor input recognition is relatively new but rapidly developing. A large number of studies exist, but to our knowledge this particular application has not been studied so far. Among similar works, we can mention our previous work, in which we applied an RBF-SOM model to the same problem (Veaceslav & Cojocaru, 2015; Veaceslav, 2016). In comparison to that type of architecture, we have managed to achieve better accuracy (by 1.5-2%), but the computational cost of applying DNNs is much higher.

4.4. Results
In this study, we developed a NN model for recognition of body movements of two types (typical and non-typical) and of facial expressions, with error rates of 18% and 38%, respectively. These results are obtained independently and combined afterwards with a simple classification algorithm (see the sketch below). To improve the system's results, the proposed model requires a large amount of training data, which cannot easily be obtained. The natural continuation of the current research is therefore to conduct further field tests to obtain more training data and improve performance.
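The paper does not detail the combining algorithm, so the sketch below should be read only as one plausible minimal scheme: late fusion, in which the two probability vectors are concatenated and a linear classifier is trained on the joint features. The function names and the choice of logistic regression are our own assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fuse(p_motion, p_emotion):
    """Concatenate per-sample outputs of the two classifiers:
    p_motion is (n, 2), p_emotion is (n, 6) -> joint (n, 8) features."""
    return np.concatenate([p_motion, p_emotion], axis=1)

def train_combiner(p_motion, p_emotion, y):
    """Fit a simple linear combiner; y: 0 = normal, 1 = abnormal."""
    clf = LogisticRegression()
    clf.fit(fuse(p_motion, p_emotion), y)
    return clf
```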
References

Perez-Sala, X., Escalera, S., Angulo, C., & Gonzalez, J. (2014). Survey on model based approaches for 2D and 3D visual human pose recovery. Sensors, 14, 4189-4210.
MacCormick, J. (2013). How does the Kinect work? Retrieved from http://users.dickinson.edu/~jmac/selected-talks/kinect.pdf
Vera, L., Gimeno, J., Coma, I., & Fernández, M. (2011). Augmented mirror: interactive augmented reality system based on Kinect. Human-Computer Interaction - INTERACT 2011, pp. 483-486.
Aurora face recognition system. http://www.facerec.com/deep-learning/
Ekman, P. & Friesen, W. (1978). Facial action coding system: A technique for the measurement of facial movement. Palo Alto, CA: Consulting Psychologists Press.
Fukushima, K. (1980). Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36(4), 193-202.
Perrett, D. I., & Oram, M. W. (1993). Neurophysiology of shape processing. Image and Vision Computing, 11(6), 317-333.
Riesenhuber, M. & Poggio, T. (1999). Hierarchical models of object recognition in cortex. Nature Neuroscience, 2, 1019-1025.
Riesenhuber, M. & Poggio, T. (2000). Models of object recognition. Nature Neuroscience, 3(supp.), 1199-1204.
Hubel, D. & Wiesel, T. (1962). Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. Journal of Physiology, 160(1), 106-154.
LeCun, Y.A., Bottou, L., Orr, G.B., & Müller, K.R. (1998). Efficient BackProp. Neural Networks: Tricks of the Trade, pp. 9-48.
Krizhevsky, A., Sutskever, I., & Hinton, G. (2012). ImageNet classification with deep convolutional neural networks. Proc. Neural Information Processing Systems. Available at http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf
LeCun, Y., Boser, B., Denker, J.S., Henderson, D., Howard, R.E., Hubbard, W., & Jackel, L.D. (1989). Backpropagation applied to handwritten zip code recognition. Neural Computation, 1, 541-551.
Bastien, F., Lamblin, P., Pascanu, R., Bergstra, J., Goodfellow, I., Bergeron, A., Bouchard, N., Warde-Farley, D., & Bengio, Y. (2012). Theano: new features and speed improvements. NIPS 2012 Deep Learning Workshop.
Golub, G.H. & Van Loan, C.G. (1996). Matrix computations (3rd ed.). Baltimore, MD: Johns Hopkins University Press.
Wang, H., Cai, Y., Chen, X., & Chen, L. (2016). Night-time vehicle sensing in far infrared image with deep learning. Journal of Sensors, vol. 2016.
Veaceslav, A. & Cojocaru, S. (2015). Measuring human emotions with modular neural networks and computer vision based applications. Computer Science Journal of Moldova, 23(1(67)), 40-61.
Veaceslav, A. (2016). Measuring human emotions with modular neural networks. Proceedings of the 7th International Multi-Conference on Complexity, Informatics and Cybernetics: IMCIC 2016, March 8-11, 2016, Orlando, Florida, USA.
Logothetis, N., Bricolo, E., & Poggio, T. (1994). 3D object recognition: A model of view-tuned neurons. Retrieved from http://papers.nips.cc/paper/1296-3d-object-recognition-a-model-of-view-tuned-neurons.pdf