International Journal of Interactive Mobile Technologies (iJIM) – eISSN: 1865-7923 – Vol. 17 No. 16 (2023)

Saisanthiya, D., Supraja, P. (2023). Heterogeneous Convolutional Neural Networks for Emotion Recognition Combined with Multimodal Factorised Bilinear Pooling and Mobile Application Recommendation. International Journal of Interactive Mobile Technologies (iJIM), 17(16), pp. 129–142. https://doi.org/10.3991/ijim.v17i16.42735

Article submitted 2023-05-01. Resubmitted 2023-06-09. Final acceptance 2023-06-17. Final version published as submitted by the authors. © 2023 by the authors of this article. Published under CC-BY. Online-Journals.org

PAPER

Heterogeneous Convolutional Neural Networks for Emotion Recognition Combined with Multimodal Factorised Bilinear Pooling and Mobile Application Recommendation

ABSTRACT

The field of emotion recognition has garnered considerable interest due to its diverse applications in mental health, personalised advertising and enhancing user experiences. This research paper introduces a unique and innovative method for emotion recognition by integrating heterogeneous convolutional neural networks (CNNs) with multimodal factorised bilinear pooling. Furthermore, the paper also incorporates the integration of mobile application recommendations as part of the overall approach. The proposed method leverages the power of CNNs to extract high-level features from different modalities, including facial expressions, speech signals and physiological signals. By using heterogeneous CNNs, each modality is processed independently to capture modality-specific emotional cues effectively.
To fuse the extracted features, multimodal factorised bilinear pooling is employed, which captures the complex interactions between different modalities while reducing the computational complexity. This pooling technique efficiently combines the modality-specific features, resulting in a compact and discriminative representation of the emotional state. In addition to emotion recognition, this paper also introduces the integration of mobile app recommendations. By leveraging the recognised emotion, the system recommends relevant mobile applications that are tailored to the user's emotional state. This integration enhances user experience and facilitates emotion regulation through the utilisation of appropriate mobile apps. Experimental evaluations are conducted on benchmark emotion recognition datasets, including the DEAP and MAHNOB-HCI datasets. The findings of the study highlight the effectiveness of the proposed methodology in terms of accuracy and robustness, surpassing existing approaches in the field. Additionally, the integration of the mobile app recommendation system showcases encouraging outcomes by offering personalised recommendations tailored to the user's emotional state.

D. Saisanthiya(*), P. Supraja

Department of Networking and Communications, School of Computing, SRM Institute of Science and Technology, Kattankulathur, Chennai, Tamil Nadu, India

saisantd@srmist.edu.in

https://doi.org/10.3991/ijim.v17i16.42735
KEYWORDS

heterogeneous CNN, bilinear pooling, mobile application, recommendation system, multimodal data

1 INTRODUCTION

Emotion, being an integral part of human experience, plays a crucial role in our daily lives, influencing our behaviour, decision-making processes and overall well-being. Emotion recognition has emerged as a fascinating field of study within the broader domain of affective computing, aiming to develop smart systems capable of recognising and responding to human emotions. In contrast to conventional methods that primarily rely on visual cues like facial expressions, there is an increasing focus on integrating physiological signals to improve the precision and reliability of emotion recognition systems. This shift in approach reflects the growing interest in leveraging physiological data to enhance the accuracy and robustness of such systems.

Among the various physiological signals that hold promise in the domain of emotion recognition, electroencephalography (EEG) has emerged as a powerful modality. EEG records the electrical activity of the brain and provides valuable insights into cognitive and affective states. Its non-invasiveness, high temporal resolution and direct measurement of neural activity make it an ideal candidate for capturing emotional responses in real time. However, emotion recognition using EEG signals poses significant challenges due to the complex and dynamic nature of emotions, the inherent variability across individuals and the presence of noise in the recordings. To overcome these challenges, recent research efforts have focused on adopting a multimodal approach that combines multiple sources of information, such as facial expressions, physiological signals and behavioural cues, to increase the accuracy and robustness of emotion recognition systems.
Traditional emotion recognition approaches have often relied on unimodal data, such as facial expressions or speech signals, leading to limited accuracy and robustness. Recognising that emotions are complex and multi-dimensional phenomena, we leverage multiple modalities, including facial expressions, speech signals and physiological data, to capture a comprehensive representation of human affective states. The proposed framework adopts heterogeneous CNNs that are tailor-made for each modality to effectively extract discriminative features from diverse input sources. To fuse the information from different modalities, we employ the multimodal factorised bilinear pooling (MFB) technique. It is a powerful technique that captures cross-modal interactions and exploits the complementary nature of the modalities. By modelling the interactions between features, this pooling method effectively integrates multimodal information while preserving the unique characteristics of each modality, thereby improving the discriminative power of the model. Additionally, we extend the scope of our framework beyond emotion recognition by incorporating mobile application recommendation. Leveraging the insights gained from emotion recognition, we propose a recommendation mechanism that suggests mobile applications tailored to the user's emotional state. By bridging the gap between affective computing and mobile application recommendation, we aim to enhance the user experience and promote personalised interaction between users and their mobile devices.

The contributions of this paper are threefold: First, we propose a novel framework that combines heterogeneous CNNs, multimodal factorised bilinear pooling
and mobile application recommendation to advance the field of emotion recognition. Second, we conduct extensive experiments on benchmark datasets to demonstrate the effectiveness of our approach in accurately recognising emotions across multiple modalities. Third, we present a comprehensive evaluation of the mobile application recommendation component, showcasing the potential of our framework to provide personalised recommendations based on the user's emotional state.

2 LITERATURE REVIEW

Emotion recognition has emerged as a significant research area due to its potential applications in various domains, including healthcare, human-computer interaction and entertainment. Traditional approaches to emotion recognition have primarily focused on unimodal data sources such as facial expressions or speech signals. However, the complex and multi-dimensional nature of emotions necessitates the integration of multiple modalities to achieve a comprehensive understanding of human affective states. In this literature review, we explore existing research in emotion recognition, highlight the limitations of unimodal approaches and emphasise the need for multimodal frameworks.

Unimodal approaches to emotion recognition have demonstrated promising results in specific contexts. Facial expression analysis, for instance, has been extensively studied, leveraging techniques such as facial action coding systems, geometric features, or deep learning-based methods [1]. These methods excel at capturing visual cues but often struggle with the high inter- and intra-subject variability in facial expressions, as well as the influence of external factors such as lighting conditions or occlusions.
Similarly, speech-based emotion recognition has been widely explored, utilising features like prosody, pitch and spectral content [2]. While speech provides valuable information about emotional states, it is susceptible to noise, variations in speech patterns and individual differences in vocal expression, making it challenging to achieve robust and accurate emotion recognition solely based on speech signals.

To overcome these constraints, scholars have progressively shifted their focus towards multimodal strategies that combine various data sources, including facial expressions, speech and physiological signals. One popular modality is electroencephalography (EEG), which directly measures brain activity and offers insights into cognitive and affective states. EEG-based emotion recognition has shown promise due to its high temporal resolution and non-invasiveness [3]. However, it suffers from challenges related to signal noise, individual differences and the need for advanced processing techniques. In recent years, CNNs have had a transformative impact on the domain of computer vision, showcasing exceptional capabilities in tasks related to image recognition. Researchers have extended CNNs to handle multimodal data, allowing for the fusion of information from different modalities. One effective technique for multimodal fusion is factorised bilinear pooling, which models interactions between features across modalities and captures their joint representations [4]. By combining the strengths of CNNs and factorised bilinear pooling, researchers have achieved significant improvements in multimodal emotion recognition, thereby enhancing the discriminative power of the models.

One area where heterogeneous CNNs have shown promise is in multimodal learning, where information from different modalities, such as images, text and audio, is combined to enhance the learning process.
For example, in image captioning, heterogeneous CNNs can integrate visual features extracted from images with textual features to generate more accurate and meaningful captions [5]. Similarly, in video analysis, combining visual and audio information using heterogeneous CNNs has led to improved action recognition and event detection [6]. Another application domain where heterogeneous CNNs have gained attention is medical image analysis. Medical images often come in different modalities, such as MRI, CT, or PET scans, and require specialised processing techniques. Heterogeneous CNNs have been employed to integrate information from multiple modalities to improve disease classification, tumour segmentation and diagnosis accuracy [7].

In a recent study, a novel multimodal fusion approach was introduced, which integrated audio and visual information using a linear latent-space mapping. The researchers utilised a Dempster-Shafer theory-based evidence fusion technique to project features into a cross-modal space and combine them with the textual modality. The evaluation conducted on the DEAP [8] dataset demonstrated the superiority of this approach compared to other comparative methods. Over the past few years, significant advancements have been made in multimodal emotion recognition, with CNNs playing a pivotal role. For example, an ensemble CNN (ECNN) was proposed to extract features from different modalities and utilise a voting strategy to create an ensemble model for fusing and classifying multimodal signals. Likewise, a hierarchical fusion CNN (HFCNN) [9] was developed to extract and combine emotion-related convolutional features from multimodal signals in an end-to-end manner. Additionally, a CNN architecture was applied to extract facial and EEG features, followed by a voting strategy for emotion classification. The continuous progress in integrating CNNs and other deep learning techniques has resulted in remarkable achievements in multimodal emotion recognition, particularly in improving the feature extraction and fusion processes. These advancements open up exciting opportunities for further exploration and advancement in the field of multimodal deep learning for emotion recognition [10].

Bilinear pooling has emerged as a prominent technique in various fields in recent years. For example, in the domain of acoustic scene classification, Kek et al. proposed a method that employed a dual-flow CNN structure incorporating both time and frequency information. By leveraging bilinear pooling, they successfully fused acoustic features extracted from two CNNs, showcasing the potential of this approach [11]. To effectively recognise drivers' emotions, Du et al. introduced CBLNN (Foldable Bidirectional Neural Network with Long-term and Short-term Memory), a novel deep learning framework. This framework utilised multimodal factorised bilinear pooling (MFB) to combine emotion information derived from geometric facial features and heart rate features. Their results demonstrated real-time and efficient emotion detection capabilities [12]. Moreover, Nguyen et al. presented a fusion model based on bilinear pooling that integrated feature vectors encompassing facial expression, posture, physical action and voice. Through their proposed fusion strategy, they achieved effective interaction among the elements of each component vector, capturing the complex and intrinsic relationships among the different modalities. Numerous studies have affirmed the efficacy of bilinear pooling in integrating multimodal signal characteristics, thereby enhancing the performance of multimodal emotion recognition systems.
These findings underscore the advantages of bilinear pooling in diverse multimodal emotion recognition tasks.

Furthermore, integrating emotion recognition with mobile application recommendation systems opens exciting possibilities for personalised user experiences. By leveraging the detected emotional states, these systems can adapt recommendations to match the user's current affective context [13]. This integration not only enhances user engagement but also provides opportunities for context-aware and emotionally intelligent mobile applications.

In summary, the literature review highlights the limitations of unimodal approaches to emotion recognition and emphasises the need for multimodal frameworks. The integration of various modalities, such as facial expressions, speech and EEG signals, enables a more comprehensive understanding of human emotions. Leveraging the power of CNNs and factorised bilinear pooling further enhances the discriminative capabilities of the models. Additionally, the combination of emotion recognition with mobile application recommendation systems allows for personalised and context-aware user experiences. The proposed framework in this paper aims to address these challenges by employing heterogeneous CNNs, multimodal factorised bilinear pooling and mobile application recommendation, contributing to the advancement of emotion recognition and its practical applications.

3 METHODOLOGY

The HC-MFB multimodal emotion recognition model, as depicted in Figure 1, consists of four sequential tasks: EEG signal channel selection, heterogeneous feature extraction, multimodal fusion and classification.
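Read as code, the four sequential tasks form a simple pipeline. The sketch below is a hypothetical skeleton of that data flow only; every function body, name and array shape here is a placeholder assumption, not the authors' implementation:

```python
import numpy as np

# Hypothetical stubs standing in for the four HC-MFB stages (placeholders only).
def select_channels(eeg):                  # 1. NMI-based EEG channel selection
    return np.arange(eeg.shape[0])         # placeholder: keep every channel

def extract_features(signal, band):        # 2. heterogeneous CNN feature extraction
    return signal.mean(axis=-1)            # placeholder: mean over time, per channel

def fuse(f_eeg, f_pps):                    # 3. multimodal fusion (MFB in the paper)
    return np.concatenate([f_eeg, f_pps])  # placeholder: concatenation

def classify(fused):                       # per-band weak learner
    return int(fused.sum() > 0)            # placeholder: sign of the feature sum

def recognise_emotion(eeg, pps):
    """Channel selection -> per-band features -> fusion -> majority vote."""
    channels = select_channels(eeg)
    votes = [classify(fuse(extract_features(eeg[channels], band),
                           extract_features(pps, band)))
             for band in ("theta", "alpha", "beta", "gamma")]
    return int(np.bincount(votes).argmax())  # 4. ensemble by majority voting
```

Only the ordering of the stages is taken from the text; each stub would be replaced by the corresponding component described in Sections 3.1 to 3.4.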
One of the key steps in this model is the utilisation of the normalised mutual information (NMI) method to identify the most relevant channels from the available EEG signal channels. Heterogeneous convolutional neural networks (HCNNs) are employed for extracting heterogeneous features from each modality. These features are then combined using multimodal factorised bilinear pooling (MFB) to capture the complementary information across different modalities. To classify emotions, an ensemble strategy is applied to leverage the distinctive characteristics present in various EEG signal bands. By incorporating these components, the HC-MFB model aims to effectively recognise emotions using multimodal inputs, enhancing the overall performance and accuracy of the emotion recognition system. Based on this classification, mobile application recommendations are then provided for the user's wellness.

Fig. 1. The proposed HC_MFB model

3.1 EEG selection of channels

The selection of EEG channels is an important step in emotion recognition, as EEG signals provide valuable insights into cerebral cortex activity. However, utilising the full set of EEG channels may result in redundant data, potentially leading to decreased accuracy in emotion recognition. To address this, we employ the NMI method proposed in reference [14] to identify a subset of EEG signal channels.

MI(X, Y) = H(X) + H(Y) − H(X, Y) (1)

The NMI measures the interdependence between two variables and is calculated based on mutual information. Specifically, the mutual information between two channels, X and Y, can be expressed as the difference between the sum of their individual entropies and their joint entropy (1).

NMI(X, Y) = MI(X, Y) / (H(X) + H(Y)) (2)

To normalise the mutual information to a bounded range, the NMI formula (2) is utilised.
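For illustration, Eqs. (1) and (2) can be estimated from channel histograms. The following numpy sketch is our own minimal version; the bin count of 32 is an assumption, since the paper does not specify how the entropies are estimated:

```python
import numpy as np

def entropy(x, bins=32):
    # H(X): plug-in Shannon entropy of one channel, from a 1-D histogram
    p, _ = np.histogram(x, bins=bins)
    p = p[p > 0] / p.sum()
    return -np.sum(p * np.log2(p))

def joint_entropy(x, y, bins=32):
    # H(X, Y): joint entropy of two channels, from a 2-D histogram
    p, _, _ = np.histogram2d(x, y, bins=bins)
    p = p[p > 0] / p.sum()
    return -np.sum(p * np.log2(p))

def nmi(x, y, bins=32):
    hx, hy = entropy(x, bins), entropy(y, bins)
    mi = hx + hy - joint_entropy(x, y, bins)   # Eq. (1)
    return mi / (hx + hy)                      # Eq. (2)
```

Note that with the normalisation exactly as printed in Eq. (2), a channel paired with itself yields 0.5 (MI equals the entropy), which is the largest value this estimator can reach.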
Gn = Σi=1..N NMIi(X, Y) (3)

To generate the connection matrix for channel selection, NMI is computed for each pair of channels across all samples. The connection matrix for the ith sample, denoted as NMIi, is summed across all samples to obtain the total connection matrix Gn (3). In this study, Gn is utilised for channel selection based on a predefined threshold. Optimal EEG channels are then selected based on the performance of each channel, aiming to enhance the accuracy of emotion recognition.

3.2 Feature extraction

In the past few years, deep learning has gained significant traction, with many approaches incorporating deep convolutional features to enhance their classification performance [15]. This study employed two distinct neural network architectures, specifically an EEG-based convolutional neural network (E-CNN) and a peripheral signals-based convolutional neural network (P-CNN), to extract emotion-related features. These CNNs were trained separately on the DEAP and MAHNOB-HCI datasets, respectively. To automatically extract crucial features, the E-CNN and P-CNN models underwent ten-fold cross-validation on the training sets of their respective datasets. Subsequently, the HC-MFB model was tested using the corresponding testing sets. Through numerous experiments, the internal structure parameters of the HCNNs were determined, albeit with variations in the learning rate and training epoch between the two datasets. The training of HCNNs on the DEAP dataset utilised a learning rate of 10⁻³, a batch size of 18 and 30 epochs. Conversely, for the MAHNOB-HCI dataset, the HCNNs were trained with a learning rate of 10⁻⁴, a batch size of 15 and 20 epochs. Detailed parameter descriptions for the two CNNs can be found in Table 1. For instance, in the first convolutional layer (Conv 1), there were 16 kernel mappings with a kernel size of 7 × 7 and a stride length of 1.
This was followed by a max-pooling operation with a kernel size of 2 × 2 and a stride of 2. These specific parameter configurations were selected to achieve optimal training outcomes for each dataset, considering their distinct characteristics and requirements.

Table 1. The structure of CNN

Layer    E-CNN     P-CNN
Conv 4   *         1,3,64
Pool 3   *         2,2
Conv 3   1,2,64    1,5,32
Pool 2   2,2       2,2
Conv 2   1,7,32    1,5,32
Pool 1   2,2       2,2
Conv 1   1,7,16    1,7,16

To enhance the model's performance, a dropout layer was introduced following the last convolutional layer. In a study referenced as [16], it was demonstrated that CNNs can serve as effective emotion classifiers in EEG-based emotion recognition. The researchers highlighted the importance of incorporating batch normalisation (BN) layers within the CNN architecture. Consequently, BN layers were added to each CNN utilised in this approach. Stochastic gradient descent (SGD) was employed as the optimisation method, and the loss function employed was cross-entropy. Post-training, the classification layer and softmax activation function were discarded to generate feature vectors. These feature vectors comprise the heterogeneous convolutional features extracted by HCNNs. Subsequently, a process called deep fusion was conducted using multimodal factorised bilinear pooling (MFB) to integrate the convolutional features effectively.

3.3 Multimodal fusion

In our experiment, we employed a novel fusion technique called MFB. The detailed process is illustrated in Figure 1. After passing through the fully connected layer, the heterogeneous convolution feature vectors were utilised as input for the MFB. This pooling operation involved two feature vectors of different forms: x ∈ Rm and y ∈ Rn.
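It may help to see the fusion of x and y numerically. In the numpy sketch below, the full bilinear form computes each output Zi = xᵀWi y with its own m × n projection matrix, and the factorised variant commonly used in MFB approximates each Wi by a rank-k product Ui Viᵀ, cutting the parameter count from o·m·n to o·k·(m + n); all dimensions and names here are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, o, k = 64, 48, 8, 4          # x-dim, y-dim, outputs, factor rank (illustrative)
x = rng.normal(size=m)             # e.g. an E-CNN feature vector
y = rng.normal(size=n)             # e.g. a P-CNN feature vector

# Full multimodal bilinear pooling: one m-by-n projection matrix W_i per output.
W = rng.normal(size=(o, m, n))
z_full = np.array([x @ W[i] @ y for i in range(o)])

# Factorised bilinear pooling: W_i is approximated by U_i V_i^T of rank k, so
# z_i = sum_k (U_i^T x)_k * (V_i^T y)_k  -- the same form, far fewer parameters.
U = rng.normal(size=(o, m, k))
V = rng.normal(size=(o, n, k))
z_fact = np.array([np.sum((x @ U[i]) * (y @ V[i])) for i in range(o)])

# Sanity check: when W_i is built exactly as U_i V_i^T, both paths agree.
W_lowrank = np.array([U[i] @ V[i].T for i in range(o)])
z_check = np.array([x @ W_lowrank[i] @ y for i in range(o)])
assert np.allclose(z_fact, z_check)
```

The factorisation is what makes the pooling tractable when m and n are the sizes of CNN feature vectors rather than small embeddings.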
In this approach, a multimodal bilinear model was employed, which can be mathematically described as follows:

Zi = xᵀWi y (4)

Here, Zi ∈ R represents the output of the bilinear model, while Wi ∈ Rm×n represents the projection matrix used in the process. This formulation allows for the effective fusion of the two feature vectors using the MFB technique.

3.4 Classification

This paper utilises ensemble learning for the classification of multimodal signals. Ensemble learning involves training and learning multiple base learners independently and then combining a subset of them based on their individual learning performance. This approach effectively mitigates the issue of overfitting that can arise when using a single base classifier. Various ensemble learning methods, such as boosting, bagging and stacking, have been developed and employed [17]. In this study, the fusion of the four bands of EEG signals with peripheral signals, or eye movement signals, is conducted. The weak supervised models from all three bands are combined, and a strong supervised model is obtained through majority voting to enhance the effectiveness of the recognition model.

3.5 Mobile application recommendation

The mobile app recommendation module plays a crucial role in the EEG-based emotion recognition recommendation system implemented on a mobile application. This module is responsible for leveraging recognised user emotions based on EEG data to provide personalised app recommendations. The following sections describe the key aspects of the mobile app recommendation module:

Recommendation engine. The recommendation engine is the core component of the module. It utilises the user's emotional state, determined through EEG-based emotion recognition, to generate app recommendations tailored to the user's preferences. The engine employs algorithms such as collaborative filtering, content-based filtering, or hybrid approaches to analyse user data and app characteristics for generating personalised recommendations.

User profile. To provide accurate and relevant app recommendations, the module maintains a user profile that includes information about the user's preferences, past app interactions and emotional states derived from the EEG data. The user profile is continuously updated as the user interacts with the recommended apps and provides feedback, enabling the recommendation engine to refine and adapt its recommendations over time.

App database. The module relies on a comprehensive app database that stores information about various mobile applications, including their features, categories, ratings, reviews and user feedback. This database serves as a knowledge base for the recommendation engine to match user preferences and emotional states with relevant apps.

Real-time processing. The mobile app recommendation module is designed to operate in real time, providing instant recommendations based on the user's current emotional state. It utilises efficient algorithms and data structures to handle the computational demands within the mobile application, ensuring a seamless and responsive user experience.

User interface. The recommendation module interfaces with the mobile application's user interface to present the app recommendations to the user. It may display recommended apps in a visually appealing manner, showcasing relevant information such as app names, icons, descriptions, ratings and user reviews. The user interface also allows users to provide feedback on the recommended apps, contributing to the continuous improvement of the recommendation system.

Privacy and data security. The module incorporates measures to ensure user privacy and data security.
It adheres to best practices for handling sensitive EEG data, encrypting user information and complying with relevant data protection regulations. User consent and transparency in data usage are emphasised to maintain user trust.

Performance monitoring and evaluation. To evaluate the effectiveness of the mobile app recommendation module, performance metrics such as recommendation accuracy, user engagement and user satisfaction can be monitored and analysed. These metrics provide insights into the performance of the module and guide further enhancements.

The mobile app recommendation module in the EEG-based emotion recognition recommendation system is responsible for leveraging user emotions derived from EEG data to generate personalised app recommendations. By considering the user's emotional state, preferences and the emotional context of apps, the module enhances the user experience by providing relevant and engaging app recommendations.

4 EXPERIMENT AND RESULTS

4.1 Dataset preparation

In this study, a comprehensive database (the DEAP dataset) was used, consisting of 32 subjects who underwent examination. To serve as trigger stimuli, 40 videos were carefully selected, each with a duration of 63 seconds. The database also included recordings of the participants' central nervous system activity, peripheral physiological systems and facial expressions. Following the viewing of each video, participants were requested to conduct self-assessments related to valence, arousal, dominance and liking. The objective of the MAHNOB-HCI database is to capture multimodal emotions by recording signals from 30 individuals as they watch a series of 20 videos.
These signals encompass various aspects, including central nervous system activity, peripheral physiological signals and eye movement signals. After each video, participants provide ratings for arousal, valence, control and predictability dimensions to describe their emotional experiences. To ensure data consistency, the middle 30 seconds of each video were chosen for analysis, accounting for variations in video duration. Unfortunately, three individuals encountered issues with equipment and recordings, resulting in corrupted data, while data from two individuals were incomplete. Consequently, the analysis focused on the remaining 25 participants. The signals were downsampled to 256 Hz, and the EEG channels AF3, FC1, F4, CP1, CP2 and PZ were utilised. The processing methods employed mirrored those of the DEAP dataset.

4.2 Results and discussions

In our research, we implemented the fusion technique discussed in Section 3 to combine the convolutional features of our models. Specifically, we utilised the HCNNs model to extract features from both the EEG signals and PPS signals. To examine the interactions between the bands of EEG signals for each emotion, we employed an ensemble classifier to perform emotion recognition on different combinations of these bands after fusing the multimodal signals. The classification accuracies achieved on the DEAP dataset and MAHNOB-HCI dataset are presented in Figures 2 and 3, respectively. Each point on the graphs represents the average accuracy obtained through ten-fold cross-validation on the respective datasets. The solid lines, displayed in different colours, illustrate the combinations of various wavebands, while the dotted lines depict the final average accuracy obtained from five 10-fold cross-validations.

Fig. 2.
The results of ten-fold cross-validation on the DEAP dataset for EEG and PPS for a) arousal b) valence

Fig. 3. The results of ten-fold cross-validation on the MAHNOB-HCI dataset for EEG and PPS for a) arousal b) valence

Based on the classification outcomes illustrated in Figure 3 for the MAHNOB-HCI dataset, the best three-band combinations resulted in the highest average accuracies for the arousal and valence dimensions, achieving 90.37% and 90.50%, respectively. Specifically, concerning the arousal dimension, the fusion of theta, alpha and gamma bands led to the highest accuracy, while for the valence dimension, the fusion of theta, alpha and beta bands yielded the highest accuracy.

Table 2 presents the classification results of various methods that fuse multimodal DEAP datasets for emotion classification. In our study, we employ multimodal bilinear pooling neural networks to conduct ensemble classification of emotional states. Our proposed method achieves the highest accuracy of 93.22% in the arousal dimension and an accuracy of 90.46% in the valence dimension. When comparing our method to the DCNN approach, we observe that the HCNNs and MFB models outperform the DCNN method specifically in the arousal dimension. This improvement can be attributed to the HCNNs and MFB models' capability to automatically extract and fuse deep features. Interestingly, our method achieves superior classification performance by exclusively utilising the combination of three bands.

Table 2.
Table 2. The comparison results on the DEAP dataset

Fusion Method    Arousal    Valence
MDBN             87.32%     83.69%
DCNN             92.92%     92.24%
HC_MFB           93.21%     90.46%

Table 3 presents the classification results of various methods that fuse the multimodal MAHNOB-HCI dataset for emotion classification. The best results, obtained with our proposed HC_MFB method, are shown for both the arousal and valence dimensions.

Table 3. The comparison results on the MAHNOB-HCI dataset

Fusion Method    Arousal    Valence
Multitask CNN    74.17%     75.21%
Deep Learning    80.41%     80.76%
HC_MFB           90.36%     90.49%

These improved recognition results can now be integrated with the following mobile applications designed for emotion-based recommendation.

Emotion Tracker. This mobile application uses machine learning algorithms to recognise and track emotions from user input such as facial expressions, voice recordings or text. By analysing these inputs, it provides detailed insights into emotional patterns and trends over time, helping users better understand their emotions.

MoodMeter. This app combines self-assessment with machine learning techniques to classify and track users' emotions. Users enter their emotional state through a simple interface and receive real-time feedback and suggestions for managing emotions effectively.

EmoSense. This application incorporates multimodal emotion recognition, including facial expressions, speech analysis and physiological signals, to build a comprehensive picture of the user's emotional state. Based on the classification results, it offers personalised recommendations and techniques for emotional well-being.

Feelings Diary. This app lets users keep a digital diary of their emotions throughout the day. By capturing text entries, images and audio recordings, it applies sentiment analysis and machine learning algorithms to classify and analyse emotional patterns, helping users identify triggers and manage their emotions effectively.

5 CONCLUSION

In conclusion, the combination of heterogeneous CNNs and multimodal factorised bilinear pooling has demonstrated promising results in emotion recognition, particularly when applied to EEG and eye movement data. By leveraging these techniques, mobile applications can harness EEG signals and eye movement data to provide users with accurate, real-time emotion recognition. Such applications can offer personalised insights into users' emotional states, enhance self-awareness and promote emotional well-being. Integrating EEG and eye movement data allows a more comprehensive understanding of users' emotional experiences, and mobile applications built on this approach can deliver feedback and recommendations tailored to each user's unique patterns of brain activity and eye movements.

Recommending mobile applications that incorporate heterogeneous CNNs and multimodal factorised bilinear pooling for EEG and eye movement data empowers users to actively monitor and manage their emotions. Such applications provide valuable tools for self-reflection, emotional tracking and building emotional resilience. As research in this field advances, mobile applications for emotion recognition using EEG and eye movement data are expected to become increasingly accurate, user-friendly and accessible.
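To illustrate how a recognised emotional state could drive the application recommendations listed in Section 4.2, the mapping could be sketched as follows. The quadrant thresholds on the usual 1–9 valence–arousal self-assessment scale and the emotion-to-app assignments are illustrative assumptions, not part of the paper's evaluated system.

```python
# Hypothetical mapping from a predicted (valence, arousal) pair to the
# applications discussed in Section 4.2. Thresholds are illustrative only.
APPS = {
    "calming": "MoodMeter",
    "reflection": "Feelings Diary",
    "tracking": "Emotion Tracker",
    "well_being": "EmoSense",
}

def recommend(valence: float, arousal: float) -> str:
    """Suggest an app for a (valence, arousal) estimate on a 1-9 scale."""
    if arousal >= 5 and valence < 5:   # agitated / stressed
        return APPS["calming"]
    if arousal < 5 and valence < 5:    # low mood
        return APPS["reflection"]
    if arousal >= 5 and valence >= 5:  # excited / positive
        return APPS["tracking"]
    return APPS["well_being"]          # calm / content

print(recommend(valence=3.0, arousal=7.5))  # MoodMeter
```

In a deployed system the two inputs would be the arousal and valence scores produced by the HC_MFB classifier, and the mapping itself could be personalised from user feedback.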
Such advancements have the potential to revolutionise the way individuals understand, monitor and regulate their emotions, ultimately contributing to improved mental well-being and emotional health.

6 AUTHORS

D. Saisanthiya received a B.Tech. degree in CSE from Arulmigu Meenakshi Amman College of Engineering, Thiruvannamalai, affiliated to Anna University, Tamil Nadu, in 2009, and an M.Tech. degree in CSE from Sastha Institute of Science and Technology, Chembarambakkam, affiliated to Anna University, Tamil Nadu, in 2011. She is currently working towards the Ph.D. degree at the Department of Networking and Communications, SRM University, Kattankulathur, Tamil Nadu, India. Her research interests include deep learning and machine learning algorithms (E-mail: saisantd@srmist.edu.in).

Dr. P. Supraja is currently working as an Associate Professor in the Department of Networking and Communications at SRM Institute of Science and Technology, Kattankulathur, India. She is a recipient of the AICTE Visvesvaraya Best Teacher Award 2020.
She completed the Indo-US WISTEMM research fellowship at the University of Southern California, Los Angeles, USA, funded by IUSSTF and DST, Govt. of India. She served as a Post-Doctoral Research Associate at Northumbria University, Newcastle, UK, and completed her Ph.D. at Anna University in 2017. She has published more than 50 research papers in reputed national and international journals and conferences, and received university-level Best Research Paper Awards in 2019 and 2022. She has also received funding from AICTE for conducting an STTP. Her research interests include cognitive computing, optimisation algorithms, machine learning, deep learning, wireless communication and IoT. She is a reviewer for IEEE, Inderscience, Elsevier and Springer journals, and a member of several national and international professional bodies, including IEEE, ACM and ISTE. In addition, she has received the Young Women in Engineering Award and the Distinguished Young Researcher Award from various international organisations (E-mail: suprajap@srmist.edu.in).