P-ISSN : 2715-2448 | E-ISSN : 2715-7199
Buana Information Technology and Computer Sciences (BIT and CS), Vol.4 No.1, January 2023

Detecting Harmful Activity in Pilgrimage Using Deep Learning

Musa Dima Genemo
Software Engineering Study Program, Gumushane University, Turkey
Email: musa.ju2002@gmail.com

Abstract—CCTV surveillance is the most extensively used recent intelligent innovation. The use of surveillance cameras has risen dramatically because of the convenience of monitoring from anywhere and the reduction of crime rates in public areas. In this paper, we introduce the idea of bad-vibe activity detection from live video to enhance the security and safety of pilgrims. The proposed bad-vibe activity recognition model is intended to be built in the most efficient manner possible using cutting-edge technologies such as TensorFlow and Keras. TensorFlow was chosen because the project could be deployed to a mobile environment in the future, with possible extension to other areas such as airport security, bus stations, and public areas that may deserve special attention for security checks. We chose MediaPipe Holistic for bad-vibe recognition in the model.

Keywords—Artificial Intelligence, Classification, Real-Time Object Recognition, Computer Vision.

I. INTRODUCTION
The use of surveillance cameras has risen dramatically because of the convenience of monitoring from anywhere and the reduction of crime rates in public areas. Hajj, one of the Five Pillars of Islam, is the pilgrimage to the holy city of Mecca in the Kingdom of Saudi Arabia; it takes place in the last month of the year in the Hijri calendar, and all Muslims are obligated to make it at least once in their lifetime if they can afford it [1]. Before COVID-19 emerged, 2.5 million people would travel to Saudi Arabia every year for Hajj. Because of this, the security of pilgrims needs special attention. New cutting-edge technology is required to ensure the safety of the people and of the city where the Hajj rituals take place, as well as the detection of forbidden activities and of the carrying of prohibited items such as guns, flames, sharp metals, and the like. Human activity recognition (HAR) is the ability to use sensors to analyze human body indicators or motion and identify human actions or events [2]. HAR is regarded as a significant component in various scientific research settings, such as health [3], human-robot interaction [4], and security [5].
Such technologies are in high demand during the Hajj festival to safeguard pilgrims' safety. Many people have become victims of Hajj scams in recent years, losing money, cellphones, and other valuables. Nowadays, terrorist acts pose the greatest danger to public safety [6]. Prohibited activities, such as carrying a gun, hurling a bomb, deceiving people, and threatening a suicide bombing, should be detected instantly. These challenges therefore demand models that generate a warning or alarm: if accurate predictions are provided in a timely manner, human lives can be saved by employing this newly introduced model. Interleaved actions, such as throwing stones at three walls (Ramy Al-Jamarat, also known as stoning the devil) and running between Mina and Muzdalifah, are pillars of the Hajj. The stoning of the devil may cause prediction ambiguity with throwing stones at people or away from the road, and running between Mina and Muzdalifah may be confused with a sudden run. A Recurrent Neural Network (RNN) is used to overcome such activity-overlap difficulties. Moreover, smart CCTV surveillance reduces labor expenses while also increasing the security and safety of pilgrims. To address these difficulties, this study proposes a deep feature extraction mechanism for forbidden motion and activity identification. We propose a new model named L4-Branched-ActionNet. Using this model, we extract features from the video frames and label each activity to its respective class, such as "needs special attention" or "safe move". A 64-layer deep CNN architecture is used for feature extraction, with its convolution layers pre-trained on public data such as CIFAR-100. To optimize the deep features that have been obtained, an ACO feature selection technique is applied.

II. METHOD
The proposed model is presented in its entirety in this section, including details of the proposed 64-layer classification algorithm. We used the CIFAR-100 dataset to pre-train the proposed model, followed by feature extraction from the action recognition dataset using the proposed CNN architecture, feature selection using Ant Colony Optimization (ACO), and prediction using a variety of algorithms. For autonomous feature extraction from video frames and classification of the events in each frame, the novel 64-layer CNN architecture is used. The physical architecture of the proposed L4-Branched-ActionNet is shown in Fig.3 and Fig.4.

Fig.3. Structure of the proposed model
Fig.4. Video frame generation

Table 1. Layer configuration of L4-Branched-ActionNet

Layer #  Layer name         Feature maps     Filter depth        Stride
1        Input              227 × 227 × 3    -                   -
2        Conv_1             55 × 55 × 96     11 × 11 × 3 × 96    [4 4]
3        ReLU_1             55 × 55 × 96     -                   -
4        Batch_Norm_3       55 × 55 × 96     -                   -
…        …                  …                …                   …
…        FC_20              1 × 1 × 100      [1 1]               same
62       Prob               1 × 1 × 100      -                   -
63       FC_21              1 × 1 × 100      [1 1]               same
64       Video description  -                -                   -

The data was collected using a script built with OpenCV and MediaPipe Holistic; as shown in Fig.5, frames of data are recorded for each captured action.

Fig.5. Keypoint extraction using OpenPose and MediaPipe Holistic [12]

NumPy arrays are used instead of pictures to hold the video frames. A minimal sketch of such a collection script is given below.
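The paper does not publish the collection script itself; the following is a minimal sketch of how the OpenCV + MediaPipe Holistic pipeline described above might look. The clip length, output path, and the extract_keypoints helper are illustrative assumptions, not the authors' code.

```python
# Sketch of a data-collection script using OpenCV and MediaPipe Holistic.
# Clip length, file names, and the helper below are illustrative assumptions.
import cv2
import numpy as np
import mediapipe as mp

mp_holistic = mp.solutions.holistic

def extract_keypoints(results):
    """Flatten pose and hand landmarks into one vector; zeros when a part is absent."""
    pose = (np.array([[lm.x, lm.y, lm.z, lm.visibility]
                      for lm in results.pose_landmarks.landmark]).flatten()
            if results.pose_landmarks else np.zeros(33 * 4))
    lh = (np.array([[lm.x, lm.y, lm.z]
                    for lm in results.left_hand_landmarks.landmark]).flatten()
          if results.left_hand_landmarks else np.zeros(21 * 3))
    rh = (np.array([[lm.x, lm.y, lm.z]
                    for lm in results.right_hand_landmarks.landmark]).flatten()
          if results.right_hand_landmarks else np.zeros(21 * 3))
    return np.concatenate([pose, lh, rh])

cap = cv2.VideoCapture(0)  # live camera / CCTV stream
frames = []
with mp_holistic.Holistic(min_detection_confidence=0.5,
                          min_tracking_confidence=0.5) as holistic:
    while len(frames) < 30:  # e.g., 30 frames per recorded action clip (assumed)
        ok, frame = cap.read()
        if not ok:
            break
        results = holistic.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        frames.append(extract_keypoints(results))
cap.release()
np.save("action_clip.npy", np.array(frames))  # stored as a NumPy array, not images
```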
We passed through three major steps to train the model. The details of the new model's operations are as follows. The first step is (1) the Conv layer. In the Conv layer, the output of filter $k$ for input $x^{j-1}$ is computed using equation (1):

$x_k^{j} = \sum_{i=1}^{p_j} f_{ik} * x_i^{j-1} + b_k, \qquad k = 1, \dots, \hat{p}_j$  (1)

where $p_j$ is the number of input channels, $\hat{p}_j$ represents the number of output channels, $j$ indexes the layers of the model, and $f$ is the filter. Equation (2) is used to calculate the max pool in the pooling layer:

$y_{u,v} = \max_{(l,m)} X_{u+l,\, v+m}^{p,\, j-1}$  (2)

where $u, v$ represent the matrix indices of frame $X^{p,j-1}$ and $l, m$ the matrix indices of the pooling window. The batch normalization layer calculates the mean and variance over fragments; the mean is derived, and the features are scaled using the standard deviation as follows:

$\mu = \frac{1}{w} \sum_{i=1}^{w} x_i, \qquad \sigma^2 = \frac{1}{w} \sum_{i=1}^{w} (x_i - \mu)^2, \qquad \hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}$  (3)

where $w$ is the number of feature maps in a batch. We used both ReLU and Leaky ReLU in the proposed model. The standard ReLU transforms all numbers less than 0 to 0 and is stated as [15]:

$f(u) = \max(0, u)$  (4)

For values less than zero, Leaky ReLU has a small slope rather than zero: a Leaky ReLU yields $v = 0.01u$ when $u$ is negative. CNNs can be studied further in depth in several works [16-19].
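As an illustration of the layer types used in step one, the following Keras sketch wires up the first rows of Table 1 (convolution, ReLU, batch normalization, max pooling) together with a Leaky ReLU branch. It is a reduced stand-in, not the full 64-layer L4-Branched-ActionNet; the layers after the first pooling stage are assumptions for demonstration.

```python
# Minimal Keras sketch of the layer types in step 1; only the first rows of
# Table 1 are mirrored, not the full 64-layer network.
from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(227, 227, 3))                 # Table 1, layer 1: Input
x = layers.Conv2D(96, kernel_size=11, strides=4)(inputs)  # Conv_1: 11x11x3x96, stride [4 4] -> 55x55x96
x = layers.ReLU()(x)                                      # ReLU_1, eq. (4)
x = layers.BatchNormalization()(x)                        # batch normalization, eq. (3)
x = layers.MaxPooling2D(pool_size=3, strides=2)(x)        # max pooling, eq. (2)
x = layers.Conv2D(256, kernel_size=5, padding="same")(x)  # deeper conv block (assumed)
x = layers.LeakyReLU(0.01)(x)                             # Leaky ReLU with 0.01 negative slope
x = layers.GlobalAveragePooling2D()(x)
outputs = layers.Dense(100, activation="softmax")(x)      # 100 classes for CIFAR-100 pretraining
model = keras.Model(inputs, outputs)
model.summary()
```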
The second step is (2) feature extraction. For feature extraction, the appropriate frame is retrieved from the video. The proposed approach extracts features from the deep-trained CNN pipeline. We trained the new model on a public dataset, CIFAR-100 [19], which contains images from 100 classes. The trained network is then used for feature extraction on the action recognition dataset, and the FC_18 layer is chosen for feature extraction. A total of 4096 features is obtained per frame from the FC_18 layer. The prepared dataset contains a total of 13250 video frames, which makes the feature set dimension of the whole dataset 13250 × 4096. Figure 6 illustrates the visualizations of the strongest feature maps at various convolution layers of L4-Branched-ActionNet.

Fig.6. Image visualizations of the strongest feature maps at various convolution layers: (a) Conv_1, (b) Conv_2, (c) Conv_5, (d) G_Conv_8, (e) Conv_10
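A minimal sketch of step two follows, assuming the pretrained network is available as a saved Keras model whose fully connected layer is named "FC_18"; the layer name, file paths, and batch size are illustrative assumptions.

```python
# Sketch of step 2: reuse the CIFAR-100-pretrained network as a feature extractor.
# The model file, frame file, and the layer name "FC_18" are assumptions.
import numpy as np
from tensorflow import keras

model = keras.models.load_model("l4_branched_actionnet.h5")
feature_extractor = keras.Model(inputs=model.input,
                                outputs=model.get_layer("FC_18").output)

frames = np.load("action_frames.npy")          # e.g., shape (13250, 227, 227, 3)
features = feature_extractor.predict(frames, batch_size=64)
print(features.shape)                          # expected: (13250, 4096)
np.save("deep_features.npy", features)
```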
The third step is (3) feature selection: after interpreting the obtained result, the extracted features are encoded by applying an entropy-coded ACO optimization operation [25] using equation (5):

$H(X) = -\sum_{i=1}^{n} p(x_i) \log p(x_i)$  (5)

where $x_1, \dots, x_n$ represent the features. We used ACO for feature optimization based on the likelihood at a given point at a certain time. The last step is classification, in which the ACO-selected features are finally passed to the predictor for categorization. Several SVM and KNN variants are used to assess model performance; Cubic SVM (Cub-SVM, CSVM in Table 2) emerges as the most effective.

Table 2. Performance of the model

Classifier  Sensitivity  Specificity  Precision  F-Measure  Accuracy (%)
LSVM        83.38        72.62        39.94      52.52      77.74
QSVM        89.11        91.53        61.79      76.01      86.14
FGSVM       57.29        51.78        25.02      32.80      54.39
MGSVM       90.52        92.35        62.56      76.58      86.28
CGSVM       68.47        64.45        31.80      41.75      66.33
CSVM        96.33        95.59        76.61      88.08      92.99

For testing, we employed random selection using scikit-learn's train_test_split function. Following that, Keras callback functions were used to improve the training's efficiency, and the accuracy on the test data was evaluated. We also used the public Weizmann dataset to compare our results to the current state of the art; the outcome is shown in Table 3.

Table 3. Performance evaluation on the Weizmann dataset

Method [reference]                                Year  Accuracy
DWT + KNN [21]                                    2020  0.93
CNN + ELM [22]                                    2020  0.94
Gabor-Ridgelet Transform [23]                     2020  0.93
LCF + MSVM [22]                                   2021  0.95
ANN [24]                                          2020  0.80
PCANet-XY-YT [25]                                 2021  0.91
Ours (L4-Branched-ActionNet + EntACS + Cub-SVM)   -     0.93

III. RESULTS AND DISCUSSION
The major goal of this study is to develop a CNN architecture that can recognize harmful actions during the Hajj festival. The proposed deep L4-Branched-ActionNet network is used to extract powerful features, with pretraining carried out on the publicly available CIFAR-100 dataset. For testing, we employed random selection using scikit-learn's train_test_split function; Keras callback functions were then used to improve the training's efficiency. To arrive at this design, many methods such as fine-tuning and adding and removing layers and neurons were tried, and the 64-layer architecture proved the most efficient in terms of performance. TensorFlow, Keras, OpenCV, and the NumPy library were used in all the experiments in this study. Table 4 gives the confusion matrix of the CSVM classifier; a sketch of the evaluation pipeline follows the table.

Table 4. Confusion matrix of the CSVM classifier

              Sudden run  Fighting  Throwing  Robbing
Sudden run    0.92021     0.00      0.01      0.02
Fighting      0.00        0.91221   0.00      0.01
Throwing      0.01        0.01      0.90021   0.00
Robbing       0.00        0.00      0.01      0.91002
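The following sketch outlines the evaluation pipeline: random train/test selection with scikit-learn, a cubic-kernel SVM standing in for Cub-SVM, and a row-normalized confusion matrix as in Table 4. The feature and label files, split ratio, and class ordering are illustrative assumptions.

```python
# Sketch of the evaluation pipeline: train/test split, cubic SVM, confusion matrix.
# File names, split ratio, and class order are assumptions for illustration.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix, accuracy_score

X = np.load("selected_features.npy")  # entropy-coded ACO-selected deep features
y = np.load("labels.npy")             # 0=sudden run, 1=fighting, 2=throwing, 3=robbing

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

clf = SVC(kernel="poly", degree=3)    # cubic polynomial kernel, analogous to Cub-SVM
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
print("accuracy:", accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred, normalize="true"))  # row-normalized, as in Table 4
```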
IV. CONCLUSION AND RECOMMENDATIONS
Detection of harmful vibes is critical for pilgrims' safety. To detect banned actions during the Hajj festival, we utilized a 64-layer CNN network called L4-Branched-ActionNet. The model is evaluated on freely available datasets, such as the CIFAR-100 object classification dataset. The features were extracted and subsequently reduced using entropy-coded ACO. To evaluate model performance, several SVM and KNN variants were utilized; with an accuracy of 0.91221, Cub-SVM emerges as the most effective. In future work, this system will be implemented on security personnel's mobile phones for convenient monitoring from any location.

REFERENCES
1. R. K. Tripathi, A. S. Jalal, and S. C. Agrawal, "Suspicious human activity recognition: a review," Artificial Intelligence Review, vol. 50, pp. 283-339, 2018.
2. A. Tapus, A. Bandera, R. Vazquez-Martin, and L. V. Calderita, "Perceiving the person and their interactions with the others for social robotics – a review," Pattern Recognition Letters, vol. 118, pp. 3-13, 2019.
3. A. Ilidrissi and J. K. Tan, "A deep unified framework for suspicious action recognition," Artificial Life and Robotics, vol. 24, pp. 219-224, 2019.
4. D. Konstantinidis, K. Dimitropoulos, and P. Daras, "Sign language recognition based on hand and body skeletal data," in 2018 3DTV-Conference: The True Vision – Capture, Transmission and Display of 3D Video (3DTV-CON), 2018.
5. S. J. Elias, S. M. Hatim, N. A. Hassan, L. M. A. Latif, R. B. Ahmad, M. Y. Darus, and A. Z. Shahuddin, "Face recognition attendance system using Local Binary Pattern (LBP)," Bulletin of Electrical Engineering and Informatics, vol. 8, 2019.
6. A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," Advances in Neural Information Processing Systems, vol. 25, pp. 1097-1105, 2012.
7. M. D. Genemo, "Suspicious activity recognition for monitoring cheating in exams," Proceedings of the Indian National Science Academy, pp. 1-10, 2022.
8. C. A. Devine and E. D. Chin, "Integrity in nursing students: A concept analysis," Nurse Education Today, vol. 60, pp. 133-138, 2018.
9. H. M. Abdulghani, S. Haque, Y. A. Almusalam, S. L. Alanezi, Y. A. Alsulaiman, M. Irshad, et al., "Self-reported cheating among medical students: An alarming finding in a cross-sectional study from Saudi Arabia," PLoS ONE, vol. 13, p. e0194963, 2018.
10. M. A. Lewis and C. Neighbors, "An examination of college student activities and attentiveness during a web-delivered personalized normative feedback intervention," Psychology of Addictive Behaviors, vol. 29, p. 162, 2015.
11. RWTH-PHOENIX-2014-T dataset, https://wwwi6.informatik.rwth-aachen.de/~koller/RWTH-PHOENIX-2014-T/
12. S. Balocco, M. González, R. Ñanculef, P. Radeva, and G. Thomas, "Calcified plaque detection in IVUS sequences: Preliminary results using convolutional nets," in International Workshop on Artificial Intelligence and Pattern Recognition, 2018, pp. 34-42.
13. Y. Liu, X. Wang, L. Wang, and D. Liu, "A modified leaky ReLU scheme (MLRS) for topology optimization with multiple materials," Applied Mathematics and Computation, vol. 352, pp. 188-204, 2019.
14. J. Bouvrie, "Notes on convolutional neural networks," Neural Nets, MIT CBCL Tech Report, pp. 47-60, 2006.
15. Y. Li, Z. Hao, and H. Lei, "Survey of convolutional neural network," Journal of Computer Applications, vol. 36, pp. 2508-2515, 2016.
16. A. Divakaran, Q. Yu, A. Tamrakar, H. S. Sawhney, J. Zhu, O. Javed, et al., "Real-time object detection, tracking and occlusion reasoning," Google Patents, 2018.
17. A. Booranawong, N. Jindapetch, and H. Saito, "A system for detection and tracking of human movements using RSSI signals," IEEE Sensors Journal, vol. 18, pp. 2531-2544, 2018.
18. A. B. Mabrouk and E. Zagrouba, "Abnormal behavior recognition for intelligent video surveillance systems: A review," Expert Systems with Applications, vol. 91, pp. 480-491, 2018.
19. A. Krizhevsky and G. Hinton, "Learning multiple layers of features from tiny images," Technical Report, University of Toronto, 2009.
20. D. K. Vishwakarma, "A two-fold transformation model for human action recognition using decisive pose," Cognitive Systems Research, vol. 61, pp. 1-13, 2020.
21. M. A. Khan, Y.-D. Zhang, S. A. Khan, M. Attique, A. Rehman, and S. Seo, "A resource conscious human action recognition framework using 26-layered deep convolutional neural network," Multimedia Tools and Applications, pp. 1-23, 2020.