Proceedings of Engineering and Technology Innovation, vol. x, no. x, 20xx, pp. xx-xx 

 
Non-Facial Video Spatiotemporal Forensic Analysis Using Deep 

Learning Techniques 

Premanand Ghadekar, Vaibhavi Shetty
*
, Prapti Maheshwari, Raj Shah, Anish Shaha,  

Vaishnav Sonawane 

Department of Information Technology, Vishwakarma Institute of Technology, Pune, India 

Received 11 July 2022; received in revised form 15 September 2022; accepted 04 October 2022 

DOI: https://doi.org/10.46604/peti.2022.10290 

Abstract 

Digital content manipulation software is working as a boon for people to edit recorded video or audio content. 

To prevent the unethical use of such readily available altering tools, digital multimedia forensics is becoming 

increasingly important. Hence, this study aims to identify whether the video and audio of the given digital content 

are fake or real. For temporal video forgery detection, the convolutional 3D layers are used to build a model which 

can identify temporal forgeries with an average accuracy of 85% on the validation dataset. Also, the identification 

of audio forgery, using a ResNet-34 pre-trained model and the transfer learning approach, has been achieved. The 

proposed model achieves an accuracy of 99% with 0.3% validation loss on the validation part of the logical access 

dataset, which is better than earlier models in the range of 90-95% accuracy on the validation set. 

 
Keywords: transfer learning, mel-spectrogram, forgery, data augmentation 

 
1. Introduction 

Surveillance cameras are now found in almost every location, such as banks and businesses, where the recordings are 

used to reduce crime. However, due to the availability of video editing software like Adobe, the process of video editing has 

become simple [1]. Currently, videos are commonly considered the most extensively utilized communication and 

entertainment medium. Hence, this kind of vogue surely emphasizes the utilization of automated perusal and video content 

understanding using technology. This is referred to as the major goal of computer vision [2]. 

Several methods have been developed for detecting image forgeries, most of which rely on the extraction of specific 

image modifications in the output image or examination of discrepancies compared to a regular camera pipeline [3]. Based 

on the modification domain, these adjustments can be classified as intra-frame or inter-frame [4]. This study concentrated on 

forgeries in the video along directions of inter-frame which are ubiquitous in surveillance videos and difficult to detect. A 

significant gap has been found in existing work in direction of inter-frame forgeries due to the lack of a temporal video 

forgery dataset. A dataset that consists of various temporal forgeries is created and published on Kaggle [5]. The main 

objective of the proposed research is to come up with a model trained on this dataset that can identify temporal forgeries 

with an average accuracy of 85% on the validation data. 

This study also helps identify false audio, reduce the spread of rumors and hate speech, make better-informed decisions, 

and master the art of fake audio detection. Digital authentication and forensics are the conformation and examination of 

audio for validation of its uniqueness (identify forgery, if any), and it also has a lot of applications [6]. Copy-move, deletion, 

insertion, replacement, and splicing are all methods for audio forgery [7]. For the audio forgery detection of text-to-speech 

                                                           
* Corresponding author. E-mail address: vaibhavi.shetty19@vit.edu 


2 Proceedings of Engineering and Technology Innovation, vol. x, no. x, 20xx, pp. xx-xx  

(TTS) and voice-conversion (VC) frauds, a ResNet-34 using the transfer learning approach is implemented. After successful 

training of the model, the prediction of audio files and their classification into three categories: real, spoof_TTS, and 

spoof_VC have been done. The ASVspoof 2019 dataset has data balance problems [8]. Hence, the proposed model 

introduces a framework to fill that gap, leading to help the models generalize to a wider range of inputs. 

2. Literature Review 

Richard and Roussev [9] discussed many digital forensic image/video analysis techniques, incorporating deep learning 

object identification structure using the YOLO method, and chromatic and pattern techniques for object recognition 

approaches. To accomplish digital video analysis in a forensic context, their study not only covers various forensic visual 

data analysis issues and resolutions but also describes several unique graphic data analytic techniques. Several experimental 

results for picture enhancing techniques and object recognition methods are shown, demonstrating how YOLO in particular 

may be used to find numerous criminal suspects and crime scene objects, then establish a link between some of them. 

“YOLOv3: An Incremental Improvement,” [10] made transparent YOLO upgrades. The findings of multistage and single-

stage object detectors are compared in this article. In terms of speed and accuracy, the numbers confirmed that the YOLOv3 

object detector outperforms other object detectors. 

The goal of virtual/digital media forensics is to establish systems that can automatically assess visual integrity. In the 

literature, feature-based [11-12] and convolutional neural network (CNN)-based [13-14] integrity analysis approaches had 

been investigated. Most of the proposed techniques for video-based digital forensics attempt to identify computationally 

cheap alterations, such as dropped or duplicated frames or copy-move operations [15]. Ways that differentiate computer-

found faces from genuine faces are used to detect face-based interventions. And a two-stream network was proposed to 

identify two distinct face-swapping manipulations [16]. A new dataset by Rossler et al. [17] was especially relevant to 

practitioners, which has around half a million modified photographs created via feature-based face editing. 

Hinton et al.[18] talked about the limits of CNNs for inverse graphics applications, laying the groundwork for a more 

vigorous “capsule” design. However, due to the absence of an optimization algorithm and the limits of technology at the 

time, this complicated architecture could not be executed properly. Instead, CNNs that are simple to create have become 

popular. Sharma and Singh [19] proposed a combined technique of image classification that employs transfer learning for 

feature selection and principal component analysis (PCA) for feature reduction. Capsule networks have now been created 

with impressive early results due to the introduction of the expectation-maximizing routing algorithm along with the 

dynamic routing algorithm [20]. According to Sabour et al. [20], stratified pose relationships amongst the pieces of objects 

are well characterized using the output of a dynamic routing algorithm, i.e., the accordance between capsules. 

Many machine learning algorithms have been specially designed for video forgery detection. To discover 

counterfeiting, Saddique et al. [21] proposed adopting discrete texture analysis in successive frames. Christlein et al. [22] 

analyzed the effectiveness of characteristics for copy-move watermarking that used a multitude of conventional feature sets, 

including scale-invariant feature transform (SIFT) and speed-up robust features (SURF), and block-based features such as 

PCA, discrete wavelet transform (DWT), discrete cosine transform (DCT), and kernel principal component analysis 

(KPCA). 

Using the concept of a near-neighbor-dense field, D’Amiano et al. [23] suggested a patch-match-based copy-move 

detection approach. However, while confronted with huge amounts of data, this method failed miserably. Wu et al. [24] 

suggested a method for detecting frame duplication and frame deletion in vector flow picture sequences by observing 

velocity and discontinuity peaks. In the moving picture expert group (MPEG) [25] videos, it presented a video forgery copy-

move detection algorithm. To determine the optical flow coefficient for each region, their method separates each video frame 


 Proceedings of Engineering and Technology Innovation, vol. x, no. x, 20xx, pp. xx-xx 3 

into suspected cleared sections. When an uncommon trend in the optical flow coefficient object is found, it indicates 

counterfeiting. Singh and Singh [26] proposed a passive blind approach that uses the correlation coefficient and coefficient 

of variation to detect duplicate frames. 

Wang et al. [27] offered discrete wavelet packet deconstruction and singular point analysis of speech data, to identify 

audio tampering of time-domain such as audio recognition, addition, replacement, and slicing. It provides a technique for 

measuring reverberation length for identifying indicators of tampering in audio tapes. They compared pitch to format 

sequences to detect copy-move forgeries in audio recordings. To identify places of copy-move fraud in a video, a histogram 

is computed using LBP and a comparison technique is applied. Detailed analysis of image and video forgery along with fake 

video datasets used for tampering has been demonstrated [28]. 

3. Dataset and Attributes 

This part describes the dataset used for both video and audio forgery detection. The attributes of the datasets such as the 

number of files, description of files, etc. are mentioned. The representation technique of the audio files is researched and an 

explanation for choosing the mel-spectrogram has been given. 

 
(a) Frame augmentation before insertion 

  
(b) Inserting augmented frames in a random position (c) Deleting the original frame from a random position 

Fig. 1 Temporal forgery techniques implemented in dataset creation 

(1) Video forgery: A custom dataset for temporal forgery detection has been developed by modifying the dashcam dataset 

containing 1544 videos [29]. 

(2) Creating a custom dataset: The creation of a dataset for forgery detection has been achieved by introducing seven types 

of important temporal forgeries in the dashcam dataset, like insertion, deletion, duplication, flipping, rotations, and 

zooming forgery. Fig. 1 shows some of the forgery techniques implemented during dataset creation. Training data 

contains 9448 videos of the dashcam dataset containing non-tampered and tampered videos that are forged by the 

mentioned forgery techniques. Test data contains 2904 videos of the same type. 

(3) Audio forgery: ASVspoof 2019 was created for the third automatic speaker verification spoofing and countermeasures 

challenge. Table 1 shows the number of audio files present in the ASVspoof 2019 logical access (LA) dataset according 

to their labels. 

Table 1 ASVspoof 2019 data distributions 

- Train Dev Eval 

Bonafide 2580 2548 7355 

Spoof 22800 22296 63882 

Total 25380 24844 71237 

Rotation Zooming Flipping 


4 Proceedings of Engineering and Technology Innovation, vol. x, no. x, 20xx, pp. xx-xx  

3.1.   Audio representation technique 

3.1.1.   Spectrogram 

Fourier transform is used to build spectrograms from sound sources. The Fourier transform displays the amplitude of 

each fundamental frequency after dividing a signal into its fundamental frequencies. A spectrogram breaks the length of a 

sound source into tiny window segments, which are then subjected to the Fourier transform to detect the frequencies 

contained within each window. Next, all of those windows’ Fourier transforms are then integrated into a single plot. It plots 

frequency (y-axis) vs. time (x-axis) and uses different colors to show each frequency amplitude. The brightness of the color 

that represents the signal is proportional to its energy. 

3.1.2.   Chroma features 

Chroma-based characteristics, also known as “pitch class profiles,” are a useful tool for analyzing music with usefully 

categorized pitches (typically into twelve categories) and tuning which approximates the equal-temperament scale. 

Chromatic and melodic aspects of music are captured by chroma features, which are resistant to changes in timbre and 

instrumentation. 

3.1.3.   Mel-spectrograms 

A mel-spectrogram happens to a spectrogram where the frequencies exist convinced to the mel scale. It remaps the 

principles in hertz to the mel scale as shown in Fig. 2. The mel scale can be termed as the scale of pitches perceived by 

listeners to have the same distance from one another. General frequency measurement has a common reference point which 

can be defined by equating a 1000 Hz tone, with a pitch of 1000 mel and 40 dB greater than the listener’s threshold.  

 
Fig. 2 Hertz vs. mel scale representation 

The formula to convert frequency from hertz into mel scale can be expressed by:  

10
2595 log 1

100

f
m

 
  

   
(1) 

where m represents mel on the mel scale and f represents frequency in hertz. 

A mel-spectrogram forms two influential changes relating to a normal spectrogram that plots frequency intersection 

time. It uses the mel scale as a suggestion of correction frequency in contact with the y-point around which something 

revolves. And the decibel scale is used as a suggestion of correction amplitude to signify banners. The proposed research 

uses the mel-spectrogram-based dataset and it is considered better than other audio representation techniques. The reason has 

also been demonstrated and experimented on in the experimental section. 

M
e
l 

sc
a
le

 
Hertz scale 


 Proceedings of Engineering and Technology Innovation, vol. x, no. x, 20xx, pp. xx-xx 5 

3.2.   Data augmentation 

The training, development, and assessment datasets of ASVspoof 2019 have a lot of data imbalance, as shown in Table 

1. As a result, this research shows that developing an augmentation framework that can generate mel-spectrograms from the 

current datasets while also addressing the dataset’s data imbalance problem, and allowing the network to learn more valuable 

features. 

SpecAugment changes the spectrogram by distorting it in time, masking blocks of successive frequency channels, and 

masking blocks of utterances in time. Time shifting, time masking, and frequency masking are the three primary methods for 

augmenting data. 

3.2.1.   Time shifting 

In time shifting, the audio is moved linearly from the left or right with a random second in time shifting. Here, the audio 

is fast-forwarded by a certain interval of x seconds, the first of these x seconds is marked as 0, i.e., silence. Then, the shift of 

the audio to the right (backward) for x seconds again, and the last x seconds are marked as 0, i.e., silence. Fig. 3 shows the 

spectrogram of the original audio and time-shifted audio. 

  
(a) Spectrogram of original audio (b) Spectrogram of time shifted audio 

Fig. 3 Spectrogram of time shifted audio 

3.2.2.   Frequency masking 

Masked frequency channels are [f0, f0 + f]. Here f0 is selected from (0, v-f), where v denotes the number of frequency 

channels, and the selection of f is made from a uniform distribution ranging from 0 to masking parameter F. Fig. 4 shows the 

masked mel-spectrogram using frequency masking. 

3.2.3.   Time masking 

As shown in Fig. 5, while doing time masking, the masking of t sequential steps of time [t0, t0 + t] is obtained. The t is 

selected within a uniform distribution ranging from 0 to a masking parameter T, and from [0, τ – t] t0 is selected. Here, τ is 

the length of the audio file. 

  
Fig. 4 Frequency masking Fig. 5 Time masking 

Frequency mask 
Time mask 

 
6 Proceedings of Engineering and Technology Innovation, vol. x, no. x, 20xx, pp. xx-xx  

4. Methodology 

4.1.   Video forgery 

 
Fig. 6 Proposed methodology flow diagram 

Step 1: The proposed model as shown in Fig. 6 takes data in the form of multiple sequences of images from videos. The 

dashcam dataset consists of 1181 videos for training and 363 videos for testing. As shown in Fig. 1, after applying 

the forgery techniques to the existing dashcam dataset, a new dataset with a total of 9448 videos for training and 

2904 videos for testing is produced. It is used for video forgery analysis, model training, and validation. 

Step 2: For training, the model takes data in the form of video and extracts some sequences of frames from the video. These 

multiple sequences of videos are labeled with the category of forged and original, depending upon the type of video 

from which these frames are extracted. 

Step 3: For making the operation of frame extraction and sequencing faster, pre-extracted frames from the videos are kept in 

storage. The starting point of the sequence of frames is randomly chosen to avoid overfitting on specific time 

instances. The length of the sequence or clip is a data-dependent hyperparameter (depends upon video length). The 

labeled data created in this procedure is passed to the model for training and validation purposes. 

Step 4: The model contains multiple convolutional 3D layers which convolve the sequence of frames to a 3D volume of 

features as shown in Fig. 7. From the convolved output, the model chooses important features using max pooling 3D 

layers. The dropout layers are added in between the series of max pooling 3D and convolutional 3D layers to avoid 

model overfitting. The model contains Relu as an activation function for the neurons. The output layer consists of 

two neurons that contribute two classes named forged and original. As the model is a classification problem, it uses 

a categorical cross-entropy loss function. A stochastic gradient descent optimizer is used to overcome the overhead 

of the gradient descent algorithm and a learning rate scheduler is used to decay the learning rate with an increase in 

the number of epochs. The Tesla K80 GPU is used for training this model. 

Step 5: Table 2 shows the observed metrics while training the model. The accuracy and loss are 85.55% and 30.74% 

respectively. The precision (positive predictive value) and recall (sensitivity) values are 86.51% and 74.75% 

respectively. Table 3 shows the validation metrics for the model. The accuracy and loss are 82.17% and 35.87% 

respectively. 

Temporal video 
forgery dataset 

generation 

Video data input 

Pre-processing 

 Extracting frames 

 Inserting forgery 

 Creating forged videos 

Prediction on evaluation 
dataset 

Model training on original 
and forged dataset 

 
 Proceedings of Engineering and Technology Innovation, vol. x, no. x, 20xx, pp. xx-xx 7 

 
Fig. 7 Proposed model of CNN architecture with hyperparameters 

 
Table 2 Training metrics            Table 3 Validation metrics 

Accuracy 0.8555  Accuracy 0.8217 

Loss 0.3074  Loss 0.3587 

Precision 0.8651    

Recall 0.7475    

  
The formulas for calculating accuracy and loss are shown below: 

TP TN
Accuracy

FP FN TP TN




    
(2) 

 
FP FN

Loss
FP FN TP TN




    
(3) 

a = filter size: 3×3×3, layer: Convolution 3D, input channel: 3 

Relu(a) activation function 

kernel size: 1×2×2, layer: Maxpooling 3D 

b = filter size: 3×3×3, layer: Convolution 3D, input channels: 64 

Relu(b) activation function 

kernel size: 2×2×2, layer: Maxpooling 3D 

c = filter size: 3×3×3, layer: Convolution 3D, input channel: 128 

Relu(c) activation function 

d = filter size: 3×3×3, layer: Convolution3D, input channel: 256 

Relu(d) activation function 

kernel size: 2×2×2, layer: Maxpooling 3D 

e = filter size: 3×3×3, layer: Convolution 3D, input channel: 256 

Relu(e) activation function 

f = filter size: 3×3×3, layer: Convolution3D, input channel: 512 

Relu(f) activation function 

kernel size: 2×2×2, layer: Maxpooling 3D 

j = input channels: 8192, layer: Linear, output channels: 4096 

k = Relu(j) activation function 

Dropout(k) 50% 

g = filter size: 3×3×3, layer: Convolution 3D, input channel: 512 

Relu(g) activation function 

h = filter size: 3×3×3, layer: Convolution 3D, input channel: 512 

Relu(h) activation function 

kernel size: 2×2×2, layer: Maxpooling 3D 

l = input channels: 4096, layer: Linear, output channels: 4096 

m = Relu(l) activation function 

Dropout(m) 50% 

input channels: 4096, layer: Linear, output channels: 2 


8 Proceedings of Engineering and Technology Innovation, vol. x, no. x, 20xx, pp. xx-xx  

The terms false positives (FP), false negatives (FN), true positives (TP), and true negatives (TN) are derived from the 

confusion matrix in Fig. 8. 

 
Fig. 8 Confusion matrix of the proposed model prediction on the validation dataset 

4.2.   Audio forgery 

The proposed study takes the training audio data files from the ASVspoof 2019 LA dataset section and develops a mel-

spectrogram representation of each audio file. Data augmentation has been used to address the problem of data imbalance in 

the dataset. Next, the ResNet-34 model applies transfer learning to the dataset and supplemented data. The data is then 

divided into three categories: real, spoofed_TTs, and spoofed_VC. The proposed methodology for the research is described 

in the next section. 

Step 1: This study proposes a voice classifier based on deep convolutional neural networks for detecting spoofing attempts 

with the help of the ASVspoof 2019 dataset and required pre-processing. 

Step 2: In the suggested technique, the train folder is used from the ASVspoof 2019 dataset and an audio time-frequency 

model of power spectral densities on the mel frequency scale (mel-spectrogram). 

Step 3: For deeper residual training, (80-20) train and validation split (for transfer learning on ResNet-34 architecture) are 

designed. The fastai package and the Tesla K80 GPU are used to implement this transfer learning approach. The 

proposed methodology is shown in Fig. 9. 

Transfer learning: Given the significant computing and time resources required to create neural network models for 

this challenge, and the significant improvements in the skill that they provide on related problems. It helps in improving the 

DL models using pre-trained models as a preliminary step in computer vision. 

A residual network, or ResNet for short: It is an artificial neural network that uses skip connections or shortcuts to 

bypass some layers in the creation of a deeper neural network. Skipping enables the creation of deeper network layers 

without having to deal with vanishing gradients. 

Step 4: On the validation split formed on the train folder files, the first epoch provides an accuracy of 92.47%. A total of 12 

epochs have been executed for audio forgery detection on the ASVspoof dataset. After every 4 epochs, the learning 

rate is identified to get minimum loss and then changed for the further epochs. 

Step 5: Better accuracy of 99% with 0.03% validation loss on the validation set is achieved after performing 12 epochs and 

fine-tuning the learning rate of the proposed model. Fig. 10 depicts the relationship between loss and learning rate. 

The final model metrics are shown in Fig. 11. 

 
Original Forged 

O
ri

g
in

a
l 

F
o

rg
e
d
 

 Proceedings of Engineering and Technology Innovation, vol. x, no. x, 20xx, pp. xx-xx 9 

 
Fig. 9 Proposed methodology flow diagram 

 
Fig. 10 Loss versus learning rate graph (for fine-tuning) 

 
Fig. 11 Metrics of the final model preparation 

5. Experiments 

5.1.   Video forgery  

(1) Error level analysis (ELA) 

ELA is used to recognize parts of the picture with varying compression rates. One of the major drawbacks of ELA is 

that it provides inaccurate recognition when low-quality JPEG images and recoloring are considered [30]. Using the ELA 

technique, ELA-processed images for the input images (CASIA2 dataset) are generated as shown in Fig. 12. 

These newly generated images which undergo ELA processing are passed to a 2D convolutional neural network with 

labels attached as “original” and “forged”. The model comes up with an accuracy of 90.28% for the validation part. This 

experiment is done to see the performance changes in the classification of forged videos frame-by-frame. The outcome is 

that the performance of this ELA metric-based model is not as efficient as the proposed model which works on convolutional 

 
Audio data input and 
pre-processing 

Mel-spectrogram 
generation 

Transfer learning over 
ResNet-34 

Training on generated 
spectrograms 

Evaluation on dev and 
validation dataset 

Tesla K80 GPU 

1e-06 1e-05 1e-04 1e-03 1e-02 

0.40 

0.35 

0.25 

0.30 

0.20 

0.15 

Learning rate 

L
o
s
s
 

10 Proceedings of Engineering and Technology Innovation, vol. x, no. x, 20xx, pp. xx-xx  

3D CNN layers. Here, CASIA 2 dataset has been used. By using ELA, forgery in the spatial domain is detected. This 

approach can be used to detect a spatial forgery in videos by separating the frames and doing ELA analysis frame-by-frame. 

 
Fig. 12 Input image versus forged image versus image after ELA processing 

(2) Discrete cosine transform (DCT) 

The DCT is a mathematical modification that is essential to the JPEG standard compaction. The primary objective of 

these procedures is to change a signal from one sort of interpretation to another. The DCT may be used to transform the 

signal (intra-frame information) into quantitative data (“frequency” or “spectral” information), allowing the image to be 

quantified and compressed. 

CASIA 2 dataset is used here. DCT is applied to the frame for compression. The difference between the original and 

compressed frame is obtained to identify the spatial forgery in the frame. DCT coefficients are used to identify irregularities 

due to the spatial domain analysis caused by superimposing an image over another one. 

(3) Using image processing (a non-AI approach) 

The structural similarity index measure (SSIM) checks the similarity between two images by the standard deviation of 

pixel values of the image. These become the factors that can be used to detect some types of forgeries such as insertion, 

duplication, copy-moving, and removal of the region of a frame in a video. 

SSIM is a perception-based model that analyzes image deterioration as a perceived change in structural information, as 

well as crucial perceptual appearances, such as contrast and luminance masking. Structural information means the 

assumption that pixels have a lot of interdependencies, especially when they’re close together in a space. 

In the proposed approach, SSIM and the standard deviation have been used to detect and analyze forgeries that can be 

embedded between consecutive video frames. This has been done by computing and analyzing the SSIM value between two 

consecutive frames of a video, along with calculating the difference between the standard deviation of pixel values of these 

two frames. 

A sudden change in standard deviation values of a frame in a video sequence with a very low similarity index between 

consecutive frames depicts a high probability of forgery. Meanwhile, a frame window is proportionally divided into several 

segments. SSIM and the standard deviations are calculated for each segment and compared with corresponding segments in 

the previous frame. This results in more accurate forgery detection and localization of forgery. Fig. 13 shows the entire flow 

diagram of the process. 

Forged portion 

Original image Forged image ELA processed image 


 Proceedings of Engineering and Technology Innovation, vol. x, no. x, 20xx, pp. xx-xx 11 

 
Fig. 13 Non-AI video forgery detection flow diagram 

 
(a) Frame 1: Untampered video frame segment (b) Frame 2: Tampered video frame segments in 

next consecutive frame 

Fig. 14 Non-AI approach illustration to identify video forgery 

In Fig. 14, “A” represents the difference between the standard deviation values of the current frame segment and the 

corresponding segment of the frame previous to it. “B” represents the SSIM value between the consecutive frame segments. 

Identification of tampered frame segments is made by setting a minimum threshold to SSIM values and a maximum 

threshold to the difference of standard deviation values between consecutive frame segments. The red highlighted segments 

indicate forgery in a particular section of the video. By using this technique, temporal forgery analysis of any video can be 

done. 

5.2.   Audio forgery 

Table 4 shows the comparison of the metrics such as accuracy and F1 scores with other chart types (spectrogram, 

chroma STFT). The proposed model achieves a better performance under experimental conditions. By observing the chart 

type comparison, chroma STFT gives the least accuracy and F1 scores, whereas the mel-spectrogram gives the best accuracy 

and F1 scores. 

Fig. 15 shows the various methods to convert the visual and audio media transmitted via radio wave signals to an 

image. It can be seen that the reason behind choosing the mel-spectrogram over the differing present methods is that the 

spectrogram gives a short “snapshot” of visual and audio media transmitted via radio waves. Therefore, it is suitable to 

recommend CNN-located architectures grown for management representation. Fig. 14 shows the confusion matrix generated 

on the validation part of the ASVspoof 2019 LA dataset. 

 
Video data input 

Segmenting video 
frame 

Applying video metrix: 

 Structure similarity index 
measure (SSIM) 

 Standard deviation 
(contrast) 

Detection and localization 
of forgery  


12 Proceedings of Engineering and Technology Innovation, vol. x, no. x, 20xx, pp. xx-xx  

The counts of FP and FN are quite less than TP and TN. The wrong predictions inform that type 1 errors are more than 

type 2. However, the error rate is very less on the whole and the errors observed in some samples were because a lot of 

tweaking/augmentation was done on the testing set. 

Table 4 ASVspoof 2019 validation set accuracy versus graphing approach 

Chart type Accuracy F1 score Classes 

Mel-spectrogram (proposed) 99% 

99% Real 

99% Spoof_TTS 

98% Spoof_VC 

Spectrogram 90.33% 

90.52% Real 

95.83% Spoof_TTS 

86.34% Spoof_VC 

Chroma STFT 82.50% 

73.32% Real 

93.52% Spoof_TTS 

80.10% Spoof_VC 

 
Fig. 15 Confusion matrix of the proposed model prediction on the validation dataset 

6. Conclusions 

To improve the accuracy and quality of video and audio forgery identifications, two models for their detection are 

proposed. The experiments lead to the following conclusions: 

(1) This research on video temporal forgery identification fills the gap in existing work on inter-frame forgery detection due 

to the lack of temporal video forgery detection. It proposes the use of convolutional 3D layer model architecture with an 

accuracy of 85.55%. Also, a non-AI technique has been developed using metrics like SSIM and the standard deviations 

of the video frame segments to identify runtime temporal forgery. A comprehensive dataset for temporal forgery 

identification has been created for future research. 

(2) In audio forgery identification, ASVspoof 2019 dataset using transfer learning is proposed. Moreover, it proposes a 

comparative study on various audio representation techniques and a study on why the mel-spectrogram is efficient for 

audio data. Augmentation of data has been done to handle the data imbalance problem. 

(3) The computational complexity in CNN models utilized in the audio and video forgery algorithms, the number of 

parameters in each feature map is limited to a constant (usually less than 1) multiplied by the input pixels n. Convolving 

a fixed length filter over an image with n pixels requires O(n) time because each output is just the sum-product of k 

pixels in the image and k weights in the filter, and k is constant with n. Similarly, every max or avg pooling operation 

takes no more than linear time in terms of input size. Hence, the entire runtime remains linear O(n). 

Confusion matrix 

Real Spoof_TTS Spoof_VC 

Predicted 

S
p

o
o

f_
T

T
S
 

S
p

o
o

f_
V

C
 

R
e
a
l 

A
c
tu

a
l 


 Proceedings of Engineering and Technology Innovation, vol. x, no. x, 20xx, pp. xx-xx 13 

(4) Both the video and audio forgery models do not incur any computational overhead. All the processing is done as the 

requirements of the proposed model. Overall, the proposed models achieved the optimal accuracy performance of 99% 

on the validation dataset with minimal loss. Future work of this research can be directed to combining the video and 

audio forgery detection works. One way of doing this is by extracting audio and visual parts of video and feeding them 

to respective models. Outputs of both models can be combined to generate the final result. 

Conflicts of Interest 

The authors declare no conflict of interest. 

References 

[1] S. Fadl, Q. Han, and Q. Li, “CNN Spatiotemporal Features and Fusion for Surveillance Video Forgery Detection,” 

Signal Processing: Image Communication, vol. 90, article no. 116066, January 2021. 

[2] Y. B. Deshmukh and S. K. Korde, “Forensic Video/Image Analytics – A Deep Learning Approach,” International 

Journal of Creative Research Thoughts (IJCRT), vol. 8, no. 9, pp. 411-418, September 2020. 

[3] D. Afchar, V. Nozick, J. Yamagishi, and I. Echizen, “MesoNet: A Compact Facial Video Forgery Detection Network,’’ 

IEEE International Workshop on Information Forensics and Security (WIFS), article no. 8630761, December 2018. 

[4] J. Xiao, S. Li, and Q. Xu, “Video-Based Evidence Analysis and Extraction in Digital Forensic Investigation,” IEEE 

Access, vol. 7, pp. 55432-55442, April 2019. 

[5] P. Ghadekar, P. Maheshwari, R. Shah, A. Shaha, V. Sonawane, and V. Shetty, “Video Forgery Dataset,” 

https://www.kaggle.com/datasets/rajshah1/video-forgery-dataset, September 10, 2022. 

[6] H. Malik and H. Farid, “Audio Forensics from Acoustic Reverberation,” IEEE International Conference on Acoustics, 

Speech and Signal Processing, pp. 1710-1713, March 2010. 

[7] C. Kraetzer, A. Oermann, J. Dittmann, and A. Lang, “Digital Audio Forensics: A First Practical Evaluation on 

Microphone and Environment Classification,” MM&Sec '07: Proceedings of the 9th Workshop on Multimedia & 

Security, pp. 63-74, September 2007. 

[8] J. Yamagishi, M. Todisco, M. Sahidullah, H. Delgado, X. Wang, N. Evans, et al., “ASVspoof 2019: The 3rd Automatic 

Speaker Verification Spoofing and Countermeasures Challenge database,” [sound]. University of Edinburgh. The Centre 

for Speech Technology Research (CSTR), https://doi.org/10.7488/ds/2555. 

[9] I. I. I. Richard and V. Roussev, “Digital Forensic Tools: The Next Generation,” Digital Crime and Forensic Science in 

Cyberspace, IGI Global, pp. 75-90, April 2006. 

[10] J. Redmon and A. Farhadi, “YOLOv3: An Incremental Improvement,” University of Washington, Technical Report, 

article no. 1804.02767, April 2018. 

[11] H. Farid, Photo Forensics, The MIT Press, February 2019. 

[12] D. Güera, Y. Wang, L. Bondi, P. Bestagini, S. Tubaro, and E. J. Delp, “A Counter-Forensic Method for CNN-Based 

Camera Model Identification,” IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 

pp. 1840-1847, July 2017. 

[13] D. Güera, F. Zhu, S. K. Yarlagadda, S. Tubaro, P. Bestagini, and E. J. Delp, “Reliability Map Estimation for Cnn-Based 

Camera Model Attribution,” IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 964-973, 

March 2018. 

[14] P. Bestagini, S. Milani, M. Tagliasacchi, and S. Tubaro, “Local Tampering Detection in Video Sequences,” IEEE 15th 

International Workshop on Multimedia Signal Processing (MMSP), pp. 488-493, September-October 2013. 

[15] D. Graupe, “Principles of Artificial Neural Networks,” Advanced Series in Circuits and Systems, Vol. 7, World 

Scientific, 2013. 

[16] P. Zhou, X. Han, V. I. Morariu, and L. S. Davis, “Two-Stream Neural Networks for Tampered Face Detection,” IEEE 

Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 1831-1839, July 2017. 

[17] A. Rössler, D. Cozzolino, L. Verdoliva, C. Riess, J. Thies, and M. Nießner, “Faceforensics: A Large-Scale Video 

Dataset for Forgery Detection in Human Faces,” arXiv preprint, article no. 1803.09179, March 2018. 

[18] G. E. Hinton, A. Krizhevsky, and S. D. Wang, “Transforming Auto-Encoders,” Artificial Neural Networks and Machine 

Learning – ICANN 2011, Lecture Notes in Computer Science, vol. 6791, pp. 44-51, 2011. 


14 Proceedings of Engineering and Technology Innovation, vol. x, no. x, 20xx, pp. xx-xx  

[19] R. Sharma and A. Singh, “An Integrated Approach towards Efficient Image Classification Using Deep CNN with 

Transfer Learning and PCA,” Advances in Technology Innovation, vol. 7, no. 2, pp. 105-117, April 2022. 

[20] S. Sabour, N. Frosst, and G. E. Hinton, “Dynamic Routing Between Capsules,” Proceedings of the 31st International 

Conference on Neural Information Processing Systems (NIPS'17), pp. 3859-3869, December 2017. 

[21] M. Saddique, K. Asghar, U. I. Bajwa, M. Hussain, and Z. Habib, “Spatial Video Forgery Detection and Localization 

Using Texture Analysis of Consecutive Frames,” Advances in Electrical and Computer Engineering, vol. 19, no. 3, pp. 

97-108, 2019. 

[22] V. Christlein, C. Riess, J. Jordan, C. Riess, and E. Angelopoulou, “An Evaluation of Popular Copy-Move Forgery 

Detection Approaches,” IEEE Transactions on Information Forensics and Security, vol. 7, no. 6, pp. 1841-1854, 

December 2012. 

[23] L. D’Amiano, D. Cozzolino, G. Poggi, and L. Verdoliva, “A PatchMatch-Based Dense-Field Algorithm for Video 

Copy–Move Detection and Localization,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 29, 

no. 3, pp. 669-682, March 2019. 

[24] Y. Wu, X. Jiang, T. Sun, and W. Wang, “Exposing Video Inter-Frame Forgery Based on Velocity Field Consistency,” 

IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2674-2678, May 2014. 

[25] G. Ulutas, B. Ustubioglu, M. Ulutas, and V. V. Nabiyev, “Frame Duplication Detection Based on Bow Model,” 

Multimedia Systems, vol. 24, no. 5, pp. 549-567, October 2018. 

[26] G. Singh and K. Singh, “Video Frame and Region Duplication Forgery Detection Based on Correlation Coefficient and 

Coefficient of Variation,” Multimedia Tools and Applications, vol. 78, no. 9, pp. 11527-11562, May 2019. 

[27] Z. Wang, Y. Yang, C. Zeng, S. Kong, S. Feng, and N. Zhao, “Shallow and Deep Feature Fusion for Digital Audio 

Tampering Detection,” EURASIP Journal on Advances in Signal Processing, vol. 2022, article no. 69, 2022. 

https://doi.org/10.1186/s13634-022-00900-4 
[28] F. H. Chan, Y. T. Chen, Y. Xiang, and M. Sun, “Anticipating Accidents in Dashcam Videos,” Computer Vision – 

ACCV 2016, vol. 10114, pp 136-153, 2016. 

[29] S. Tyagi and D. Yadav, “A Detailed Analysis of Image And Video Forgery Detection Techniques,” The Visual 

Computer, 2022, in press. https://doi.org/10.1007/s00371-021-02347-4 

[30] I. B. K. Sudiatmika, F. Rahman, T. Trisno, and S. Suyoto, “Image Forgery Detection Using Error Level Analysis and 

Deep Learning,” Telecommunication Computing Electronics and Control (TELKOMNIKA), vol. 17, no. 2, pp. 653-659, 

April 2019. 

 
Copyright©  by the authors. Licensee TAETI, Taiwan. This article is an open access article distributed 

under the terms and conditions of the Creative Commons Attribution (CC BY-NC) license 

(https://creativecommons.org/licenses/by-nc/4.0/). 

https://doi.org/10.1186/s13634-022-00900-4
https://doi.org/10.1007/s00371-021-02347-4