Mathematical Problems of Computer Science 54, 53–68, 2020.
UDC 004.8

Improving UAV Object Detection through Image Augmentation

Karen M. Gishyan
University of Bath, Bath, United Kingdom
e-mail: karen.gishyan@bath.edu

Abstract

Ground-image-based object detection algorithms have improved greatly over the years and provide good results on challenging image datasets such as COCO and PASCAL VOC. These models, however, are not as successful when it comes to unmanned aerial vehicle (UAV)-based object detection, where performance deterioration is commonly observed. The reason is that detecting and classifying small objects is a much harder task for the models than detecting medium-size or large-size objects, and drone imagery is prone to variances caused by different flying altitudes, weather conditions, camera angles and camera quality. This work explores the performance of two state-of-the-art object detection algorithms on the drone object detection task and proposes image augmentation¹ procedures to improve model performance. We compose three image augmentation sequences, propose two new image augmentation techniques, and further explore how their different combinations affect the performance of the models. The augmenters are evaluated for two deep learning models, which are trained with high-resolution images (1056 × 1056 pixels) to observe their overall effectiveness, and we provide a comparison of the augmentation techniques across each model. We identify two augmentation procedures that increase object detection accuracy more effectively than the others and obtain our best model using a transfer learning² approach, where the weights for the transfer are obtained by training the model with our proposed augmentation technique. At the end of the experiments, we achieve robust model performance and accuracy, and identify aspects for improvement as part of our future work.

Keywords: Computer Vision, Deep Learning, Image Processing.

¹ Image augmentation is a technique to expand the training dataset by creating modified versions of the images in the dataset through different methods of processing.
² Transfer learning is a research problem in machine learning that focuses on storing knowledge gained while solving one problem and applying it to a different but related problem.

1. Introduction

Deep learning-based computer vision algorithms for object detection in images and videos have had much success in the last decade. Object identification and detection from unmanned aerial vehicles (UAVs) have widely gained the attention of researchers, as UAV and computer vision-based applications can be successfully deployed in search-and-rescue, surveillance, and agricultural crop identification and counting [1, 2]. However, compared to object detection models trained on ground-based images, such models show a significant decline in detection performance when applied to UAV-based images. As drones fly at various altitudes and the viewing angle changes constantly, UAV-based models have to deal with many more visual appearances of the same object in order to work successfully. As an example, a UAV can move from a front view to a side view of an object in a very short period of time, which can result in arbitrary aspect ratios of the objects [3]. This work proposes image processing and augmentation techniques and evaluates their performance on the YOLOv3-SPP and YOLOv5-Large object detection models.
We compose augmentation methods such as (3 × 4) Grid Augmentation and Geometric Sequence that successfully improve performance across multiple models, and their combination through transfer learning helps to achieve the highest overall mean average precision (mAP) 0.5 accuracy with YOLOv3-SPP. We also identify procedures where giving the model significantly more annotations³ during training does not improve, and sometimes even impairs, performance. In addition, we verify the previous findings that augmentation techniques are effective when a certain number of images in the training dataset come from the same distribution as the validation set; otherwise, in most cases we do not observe a performance improvement. The augmentation procedures are evaluated on images with (1056 × 1056) input dimensions, which we regard as high-resolution images for our experiments. High-resolution images, however, translate to more computational resources, and an incremental increase in image size is also not guaranteed to improve performance. In spite of the significant number of annotations in the images of the dataset that we use, there still exists a class imbalance, meaning the model may see an instance of the car class much more frequently than an instance of the bicycle class, resulting in overfitting during training and decreasing the overall model accuracy during validation, which we aim to improve with our proposed augmentation methods. Our results show that (3 × 4) Grid Augmentation and (3 × 5) Grid Augmentation with Transfer Learning, both based on a random-resampling methodology, can successfully improve accuracy. As part of future work, instead of random sampling, we will explore Grid Augmentation variations with an oversampling methodology and specifically target the images containing imbalanced classes.

³ Annotation is the process of labeling images, which provides information to the computer vision model about the objects in the image. COCO bounding box annotations store the (x-top left, y-top left, width, height) values of a given object in the image in a JSON format, while Pascal VOC annotations store the (x-top left, y-top left, x-bottom right, y-bottom right) values of the object in an XML format.

2. Methodology

2.1. Models

In this work, for conducting our experiments, we use two model variants that follow the one-stage object detection paradigm. The models are YOLOv3 with Spatial Pyramid Pooling (YOLOv3-SPP) from the works of [4, 5, 6] and YOLOv5-Large [7]. To address the YOLOv2 backbone's limited feature extraction ability and its inability to make full use of multi-scale local region features, [5] propose DC-SPP-YOLO (Dense Connection and Spatial Pyramid Pooling Based YOLO) to improve the accuracy of YOLOv2. They design a new spatial pyramid pooling module that collects and concatenates multi-scale local region features for more comprehensive learning, and they show that DC-SPP-YOLO is more accurate than YOLOv2. We use the combination of spatial pyramid pooling and YOLOv3 by [6] and the recently developed YOLOv5-Large model by [7] for these experiments. [7] reports that YOLOv5-Large achieves an average precision (AP) 0.5 score of 66.5% on the COCO test-dev dataset. Although [7] is the official YOLOv5 repository, the model is fairly new, and the official paper is still to be released.
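To make the spatial pyramid pooling idea concrete, the sketch below shows a minimal SPP block in PyTorch. It is only an illustration of the general mechanism described above: the kernel sizes (5, 9, 13) and the tensor shapes are assumptions commonly associated with YOLOv3-SPP-style networks, not values taken from this paper or the referenced repositories.

```python
import torch
import torch.nn as nn

class SPPBlock(nn.Module):
    """Minimal spatial pyramid pooling block: the same feature map is max-pooled
    at several scales (stride 1, padded) and the results are concatenated along
    the channel axis, so multi-scale local context is collected in one tensor."""

    def __init__(self, kernel_sizes=(5, 9, 13)):
        super().__init__()
        self.pools = nn.ModuleList(
            [nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2) for k in kernel_sizes]
        )

    def forward(self, x):
        # Output channels = in_channels * (1 + len(kernel_sizes)); spatial size is unchanged.
        return torch.cat([x] + [pool(x) for pool in self.pools], dim=1)

if __name__ == "__main__":
    features = torch.randn(1, 512, 33, 33)   # e.g., a 1056 / 32 = 33-cell feature map
    print(SPPBlock()(features).shape)        # torch.Size([1, 2048, 33, 33])
```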
2.2. Data

This work uses a subset of the VisDrone-DET2019⁴ dataset as a starting point for the experiments and later scales upon it. The original dataset includes 10 categories: pedestrian, person, car, van, bus, truck, motor, bicycle, awning-tricycle and tricycle. For these 10 categories, there are about 540,000 bounding boxes across 6471 training, 548 validation and 1580 testing images [8]. We select 1000 images with a height of less than 768 pixels from the original training dataset, resize them to (1056 × 1056) input dimensions and use them for conducting our image augmentation experiments. These images are then divided into 700 training, 200 validation and 100 test images. The original 10 categories are reduced to 7, and our class list contains all default categories except motor, awning-tricycle and tricycle. Using image augmentation sequences and techniques, we later generate 700 new images, equal to the number of images in our training dataset, so our final dataset includes 1400 training, 200 validation and 100 test images. We keep 11.7% of our total images for validation, compared to the original dataset, where the number of validation images is 6.3% of the total.

⁴ VisDrone is a large drone-based dataset containing images and videos made specifically for object-detection and object-tracking tasks.

2.3. Image Transformations and Augmentations

We present three image augmentation sequences, which we name Blending Sequence, Blurring Sequence and Geometric Sequence, and two new image augmentation techniques, which we name (3 × 4) Grid Augmentation and (3 × 5) Grid Augmentation. They are applied to the training dataset to generate 700 new images, and for the main experiment we compare the performance of the augmentation techniques against the base results⁵, or simply the No Augmentation results. To test whether the obtained results can be further improved, we later experiment with a combination of sequences and grid augmentation for the model providing the best result across the experiments. The first three sequences are explored as data-warping augmentations, while the grid augmentations are explored as random-resampling augmentations intended to help decrease class imbalance. Each augmentation sequence includes from five to seven augmenters⁶ and is grouped according to its relevant augmentation type. Such an approach allows each image in our dataset to undergo a unique image transformation procedure obtained from each augmentation sequence. After the augmentation procedures, we mostly end up with training images that are somewhat complex in form and not always observed in real life; however, as will be seen from the results, most of them positively affect performance, and some do so much more effectively than the rest. Each augmenter from a sequence is selected with a certain probability, and in the end one new augmented image is generated per image in the original dataset. Depending on the augmentation sequence, the original image undergoes from zero to two image processing procedures, with the probabilities varying for each sequence. We leave a small chance that no augmenter is applied to a given image.

⁵ Base results are obtained by evaluating the models on the original 700 images without any augmentation, and the respective model is called the base model.
⁶ An augmenter is a distinctive image transformation technique.
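The following sketch illustrates this selection scheme with the imgaug library [9], which provides the kind of augmenters used throughout this section. The particular augmenters and probabilities shown here are placeholders for illustration, not the exact configuration used in the paper, and imgaug version 0.4 or later is assumed.

```python
import imgaug.augmenters as iaa

# One-or-two-augmenter selection scheme: SomeOf picks 1 or 2 augmenters from the
# list for every image, and Sometimes(0.9, ...) leaves each picked augmenter a 10%
# chance of being skipped, so a small share of images passes through unchanged.
example_sequence = iaa.SomeOf(
    (1, 2),
    [
        iaa.Sometimes(0.9, iaa.GaussianBlur(sigma=(1.0, 3.0))),
        iaa.Sometimes(0.9, iaa.LinearContrast((0.6, 1.4))),
        iaa.Sometimes(0.9, iaa.Affine(rotate=(-30, 30))),
    ],
    random_order=True,
)

# images: a list or array of HxWx3 uint8 images; bounding boxes can be passed
# alongside the images so that imgaug transforms the annotations as well, e.g.:
# augmented_images = example_sequence(images=images)
```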
The grid augmentations are stochastic, as they select images randomly, while the Blending, Blurring and Geometric sequences are deterministic, so for a given set of images they always produce the same results, making the experiments reproducible.

2.3.1. Blending Sequence

The Blending Sequence includes five different blending augmenters. For each image, either one or two of the five available augmenters are selected. Each selected augmenter has a 90% probability of actually being applied. The reason the augmenters are not assigned full probabilities of being applied is to prevent them from altering the default dynamics and over-damaging the image. This is easiest to see when two augmenters are selected: each of them is still skipped with a 10% probability, so there is a 1% chance that neither of them is applied and a small chance that only one of them is applied. The Blending Sequence includes Alpha Frequency Noise Blending with a nearest upscale method, Alpha Mask Blending, Alpha Checkerboard Blending with a hue upscale method, Alpha Linear Gradient Blending and Alpha Simplex Noise Blending with a linear upscale method. Alpha Frequency Noise Blending uses frequency noise masks for blending two images. The alpha masks are sampled from varying frequency noises, and as we use nearest-neighbour upsampling, this mostly results in the creation of blobs with sharp edges. The Alpha Mask Blending augmenter generates an (H, W) or (H, W, C) channel-wise mask for each image, where H is the height, W is the width and C is the number of channels of the image. The mask is then used to alpha-blend pixels between the foreground augmenter branch and the background branch. Clouds are added as a part of this augmenter, as can be observed at the bottom right-hand side of Fig. 1b. Alpha Checkerboard Blending uses a checkerboard pattern for blending the images. An (R × C) grid, where R is the number of rows and C is the number of columns, is placed on each image. For this experiment, R is 1, and C is a number uniformly drawn from the [1, 4] interval. The value for the hue of the images is randomly drawn from the [-80, 80] interval. Alpha Linear Gradient Blending blends two images along a vertical gradient. Here the gradient is applied to a pooled image, and the starting and ending Y-coordinates are chosen at random. This randomness causes the gradient to increase either at the top or at the bottom [9]. Alpha Simplex Noise Blending detects edges inside the image, which are marked black and white, and then uses simplex noise to alpha-blend the result with the original image. Here we use a linear upscaling method, which creates rectangles with smooth edges. A sample blending transformation example can be seen in Fig. 1.

Fig. 1. Blending Sequence transformation: (a) original image; (b) transformed image.
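As a rough illustration of how such a sequence could be assembled, the sketch below uses imgaug's alpha-blending augmenters. It is a partial, assumption-laden reconstruction: the augmenter classes are taken from imgaug 0.4, the pooling kernel and edge-detection settings are guesses, and the cloud-based Alpha Mask Blending augmenter is omitted because its exact mask configuration is not specified in the text.

```python
import imgaug.augmenters as iaa

# Partial sketch of the Blending Sequence: one or two augmenters are picked per
# image, each with a 90% chance of actually being applied once picked.
blending_sequence = iaa.SomeOf(
    (1, 2),
    [
        # Frequency-noise alpha masks with nearest-neighbour upsampling (sharp blobs).
        iaa.Sometimes(0.9, iaa.BlendAlphaFrequencyNoise(
            foreground=iaa.EdgeDetect(1.0), upscale_method="nearest")),
        # Checkerboard blending: 1 row, 1-4 columns, hue shifted within [-80, 80].
        iaa.Sometimes(0.9, iaa.BlendAlphaCheckerboard(
            nb_rows=1, nb_cols=(1, 4), foreground=iaa.AddToHue((-80, 80)))),
        # Vertical linear gradient applied to a pooled version of the image.
        iaa.Sometimes(0.9, iaa.BlendAlphaVerticalLinearGradient(
            iaa.AveragePooling(7), start_at=(0.0, 1.0), end_at=(0.0, 1.0))),
        # Simplex-noise blending of an edge-detected image, linear upscaling.
        iaa.Sometimes(0.9, iaa.BlendAlphaSimplexNoise(
            foreground=iaa.EdgeDetect(1.0), upscale_method="linear")),
    ],
)
```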
2.3.2. Blurring Sequence

This sequence includes four blur-related augmenters and three contrast-related augmenters, which together comprise the Blurring Sequence. For each image, one or two of the available augmenters are selected. As the Blurring Sequence includes more augmenters, each augmenter has a 95% probability of being executed when selected (see Fig. 2). The augmenters are: Additive Gaussian Noise, Gaussian Blur, Motion Blur, Mean Shift Blur, Histogram Equalization, Sigmoid Contrast and Linear Contrast. Additive Gaussian Noise adds Gaussian noise to the image. For each pixel, the noise is sampled from a normal distribution N(0, s), where s is sampled once per image; we set s to 70. Gaussian Blur uses a Gaussian kernel for blurring the images, with the standard deviation of the kernel selected randomly from the range [1, 3] for each image. The Motion Blur augmenter applies a motion blur with a kernel size of 10 × 10 and a blur angle from the range [-30, 30], selected randomly per image. The Mean Shift Blur augmenter applies a pyramidic mean shift filter to each image. The Histogram Equalization augmenter transforms the image into the RGB color space, applies histogram equalization, and transforms the input channels back to the original color space. For Sigmoid Contrast, we use a (3, 10) tuple: the multiplier for the sigmoid function is selected randomly from the interval [3, 10] per image, and the cutoff value, which defines the shift of the sigmoid function in the horizontal direction, is selected randomly from the range [0.4, 0.6] per image. Higher cutoff values result in darker pixels [9], and we select the range accordingly. Linear Contrast scales each pixel value v to 127 + α(v − 127), where α is selected randomly from the range [0.6, 1.4] per image; the default range is preserved.

Fig. 2. Blurring Sequence transformation: (a) original image; (b) transformed image.

2.3.3. Geometric Sequence

For each image, as before, either one or two augmenters are selected. For this sequence, each augmenter has a 93% probability of being applied to the image when selected. The Geometric Sequence has the following augmenters: Elastic Transformation, 90 Degree Rotation, Rotation from -30 to 30, Polar Warping, Shear Transformation in the Y direction and Shear Transformation in the X direction. Elastic Transformation uses displacement fields and moves pixels around the image. The augmenter has an alpha parameter, to which we pass a tuple of (20, 30), and each image is transformed with a random value from the range of the tuple. We specifically choose low values for the tuple to introduce pixel distortion. 90 Degree Rotation rotates an image clockwise by a multiple of 90 degrees, resulting in various flipping transformations. Rotation from -30 to 30 rotates an image by a value chosen randomly from the interval [-30, 30]. Polar Warping transforms an image into a polar representation, which results in circular-warped patterns in the image; this transformation can be partially observed in subfigure 3b. Shear Transformation in the Y direction applies a shear transformation along the Y axis, and Shear Transformation in the X direction applies a shear transformation along the X axis (Fig. 3b). The transformation intervals, from which the values are selected randomly, are [30, 50] and [30, 70], respectively. The values are chosen to replicate realistic viewing scenarios that a drone may be exposed to during a real-time flight. During a geometric transformation, however, some of the objects may end up lying outside of the image, and thus some of the bounding-box coordinates get dropped.

Fig. 3. Geometric Sequence transformation: (a) original image; (b) transformed image.
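Because geometric transformations move objects and can push them out of the frame, the bounding boxes have to be transformed together with the pixels and then filtered. The hedged sketch below shows one way this could be done with imgaug (version 0.4 assumed); the box coordinates and the choice of augmenters are illustrative only.

```python
import numpy as np
import imgaug.augmenters as iaa
from imgaug.augmentables.bbs import BoundingBox, BoundingBoxesOnImage

# A dummy image and one hypothetical "car" box, only to demonstrate the mechanics.
image = np.zeros((1056, 1056, 3), dtype=np.uint8)
boxes = BoundingBoxesOnImage(
    [BoundingBox(x1=100, y1=200, x2=180, y2=260, label="car")], shape=image.shape
)

geometric_sequence = iaa.SomeOf(
    (1, 2),
    [
        iaa.Sometimes(0.93, iaa.Affine(rotate=(-30, 30))),
        iaa.Sometimes(0.93, iaa.ShearY((30, 50))),
        iaa.Sometimes(0.93, iaa.ShearX((30, 70))),
    ],
)

# Pixels and boxes are augmented together; boxes pushed fully outside the frame
# are removed and the remaining ones are clipped, mirroring the annotation loss
# mentioned above.
image_aug, boxes_aug = geometric_sequence(image=image, bounding_boxes=boxes)
boxes_aug = boxes_aug.remove_out_of_image().clip_out_of_image()
```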
2.3.4. Grid Augmentations

2.3.5. (3 × 4) Grid and (3 × 5) Grid

[10], in their recent work on YOLOv4, use a new data augmentation technique, first proposed by [6], which combines four randomly chosen images during the model training process to increase image variability. They name this technique Mosaic Augmentation and demonstrate that it helps to increase the mAP accuracy score on the COCO dataset. We extend the idea of Mosaic Augmentation and propose two new data augmentation techniques, which we name (3 × 4) Grid Augmentation and (3 × 5) Grid Augmentation, the latter being used with transfer learning during model training. They combine 3 images horizontally and 4 and 5 images vertically, respectively. The combined images with their bounding boxes for (3 × 4) Grid Augmentation can be seen in Fig. 4⁷. It is a common practice, and sometimes a requirement, to choose the image size of a deep learning network as a multiple of 16 or 32 pixels. For this reason, for the (3 × 4) Grid Augmentation, each image is resized to (352 × 352), which is thought to be an optimal size considering the increase in width and height that we get after grid augmentation. Each image thus has a square shape. The images are then combined horizontally and vertically, resulting in images of size (1056 × 1408) that include the annotations of the 12 randomly selected images, resized accordingly.

Fig. 4. (3 × 4) grid-augmented image: (a) grid-augmented image; (b) grid-augmented image with bounding boxes.

For the (3 × 5) Grid Augmentation, a random batch of 3 images is selected at a time. The 3 images are resized to the minimum image height in the given batch, and the widths are readjusted according to the new image height. This results in horizontally combined images. A random batch of 5 horizontally combined images is then selected to be combined vertically. The images are now resized based on the minimum image width of the given batch, and the heights are adjusted accordingly. This method allows us to preserve the relative sizes of the images, as well as different widths and heights. These differences, however, are sometimes not visible, as most of the images have relatively similar shapes, and this may create grid augmentations where each image in the grid appears to be square in shape. For the (3 × 5) Grid Augmentation, we join images that have undergone a Blurring Sequence transformation, unlike the (3 × 4) Grid Augmentation, where the images are selected from the subset of the original VisDrone dataset. Grid augmentation methods significantly increase the number of small objects in a given image, and most of the grid-image combinations remain realistically analogous to the real world, as the drone sees objects in various settings during a real flight, especially from higher altitudes, an effect reproduced by the grid transformation. Though technically the proposed technique allows us to combine more images both horizontally and vertically, we should be careful about the computational costs, because if a given image contains, on average, a annotations, the image after grid transformation contains a × w × h annotations on average, where w and h are the numbers of images combined horizontally and vertically.

⁷ A visual example of (3 × 5) Grid Augmentation is not provided, as it follows a transformation pattern similar to (3 × 4) Grid Augmentation.
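The sketch below shows how a (3 × 4) grid image and its merged annotations could be assembled. It is a simplified illustration written for this summary rather than the paper's actual implementation, and the COCO-style (x, y, width, height) box format and OpenCV resizing are assumptions.

```python
import numpy as np
import cv2

def grid_augment_3x4(images, annotations, tile=352):
    """Stitch 12 randomly chosen images into a 3-wide, 4-high mosaic of tile x tile
    cells and rescale/shift their (x, y, w, h, label) boxes into mosaic coordinates."""
    indices = np.random.choice(len(images), size=12, replace=False)
    rows, merged_boxes = [], []
    for r in range(4):                      # 4 rows of the grid
        row_tiles = []
        for c in range(3):                  # 3 columns of the grid
            i = indices[r * 3 + c]
            img = images[i]
            sx, sy = tile / img.shape[1], tile / img.shape[0]
            row_tiles.append(cv2.resize(img, (tile, tile)))
            for (x, y, w, h, label) in annotations[i]:
                merged_boxes.append(
                    (x * sx + c * tile, y * sy + r * tile, w * sx, h * sy, label)
                )
        rows.append(np.hstack(row_tiles))
    # Final mosaic is 3 * 352 = 1056 pixels wide and 4 * 352 = 1408 pixels high.
    return np.vstack(rows), merged_boxes
```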
2.3.6. Blurring Sequence and Grid Augmentation

For the model that provides the best mAP 0.5 score, we test an augmentation sequence where, instead of blurring the images and then combining them into a grid-like image as done for the (3 × 5) Grid Augmentation, we apply a Blurring Sequence augmentation to the whole grid-augmented image.

2.3.7. All Augmentation Sequences and Grid Augmentation

As a final experiment, we select 700 images that have been transformed using (3 × 4) Grid Augmentation and divide them into 3 equal proportions. We transform the first, second and third proportions using the Blending Sequence, the Blurring Sequence and the Geometric Sequence, respectively. The results for the Blending Sequence and the Geometric Sequence can be seen in Fig. 5. For training the model, we use only the transformed 700 images, unlike the rest of the experiments, which use 1400 images. Combining images into a grid-like form also combines all of their separate annotations into one. We provide a summary of the augmentation sequences, including the total number of annotations for each technique, in Table 1. We observe that grid augmentation techniques significantly increase the number of annotations in the dataset, most of them being generated by (3 × 5) Grid Augmentation, which contains 15.3 times more labels than the original dataset with no augmentation.

Fig. 5. Augmentation sequences applied to (3 × 4) grid-augmented images: (a) Blending Sequence applied to a (3 × 4) grid-augmented image; (b) Geometric Sequence applied to a (3 × 4) grid-augmented image.

Table 1: Augmentation sequences applied to the original 700 training images.

Augmentation Sequence                   No. of Bounding Box Annotations   Total No. of Images
No Augmentation                         40309                             700
Blending Sequence                       40309                             1400
Blurring Sequence                       40309                             1400
Geometric Sequence                      37967                             1400
(3 × 4) Grid                            471233                            1400
(3 × 5) Grid                            618256                            1400
Blurring Sequence and (3 × 4) Grid      471233                            1400
All Sequences and (3 × 4) Grid          458044                            700

a. Blending, Blurring and Geometric Sequences, (3 × 4) Grid and (3 × 5) Grid are applied to the original 700 images.
b. Blurring Sequence and (3 × 4) Grid, and All Sequences and (3 × 4) Grid are applied to 700 images that have been (3 × 4) grid-augmented.

2.4. Transfer Learning

The trainings begin from a pretrained checkpoint for both models. YOLOv3-SPP uses the yolov3-spp-ultralytics weights, and YOLOv5-Large uses the yolov5l weights. Using pretrained backbones helps to speed up learning and gives better performance than training the models from randomly initialized weights. However, we believe the performance can be further improved if we use weights that have been trained specifically on drone object detection footage, as the pretrained weights available for initialization work well with larger objects but show a significant performance reduction when evaluated on a small object detection task. For this reason, we conduct transfer learning based on the weights that we have obtained from our experiments, and as the results will later show, this helps to further improve performance and obtain a better model. We use the weights trained on the dataset without augmentation (700 images) as an initialization point for YOLOv3-SPP and YOLOv5-Large, and the weights trained on the dataset with (3 × 4) grid-augmented images as an initialization point for YOLOv3-SPP trained with the Geometric Sequence, as part of the performance-improvement experiment with the best model.
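The weight hand-off described above can be pictured as loading the best checkpoint of one run as the starting state of the next. The snippet below is a self-contained, generic PyTorch illustration of that idea; the tiny stand-in module, file name and checkpoint layout are invented for the example and do not reflect the internals of the ultralytics repositories.

```python
import torch
import torch.nn as nn

class TinyDetector(nn.Module):
    """Minimal stand-in for a detector, used only to illustrate the weight hand-off."""
    def __init__(self, num_classes=7):
        super().__init__()
        self.backbone = nn.Conv2d(3, 16, 3, padding=1)
        self.head = nn.Conv2d(16, 3 * (5 + num_classes), 1)

    def forward(self, x):
        return self.head(self.backbone(x))

# 1) A run trained with (3 x 4) Grid Augmentation saves its weights (path is illustrative).
donor = TinyDetector()
torch.save(donor.state_dict(), "grid_3x4_best.pt")

# 2) The Geometric Sequence run starts from those weights instead of the generic
#    pretrained checkpoint; strict=False tolerates layers that do not match exactly.
recipient = TinyDetector()
recipient.load_state_dict(torch.load("grid_3x4_best.pt", map_location="cpu"), strict=False)
```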
3. Experiments and Results

3.1. Model Specifications

Table 2: Parameters of YOLOv3-SPP and YOLOv5-Large.

Parameter                 YOLOv3-SPP       YOLOv5-Large
Batch Size                2                2
Epoch Size                15               15
Channels                  3                3
Validation Interval       1                1
Initial Learning Rate     0.001            0.01
Image Size                (1056 × 1056)    (1056 × 1056)

3.2. YOLOv3-SPP

YOLOv3-SPP trained on high-resolution images provides the better results across the experiments. The base model provides an mAP 0.5 score of 39.7%. We further improve this performance to 46.2% with (3 × 4) Grid Augmentation, which becomes the procedure providing the highest increase in mAP score amongst all the models without using transfer learning⁸. The Geometric Sequence and the Blending Sequence provide almost equal results in terms of mAP 0.5: 45.3% and 45.2%, respectively. We will later see that the Geometric Sequence and the pretrained weights of the (3 × 4) Grid Augmentation together provide the highest results obtained in this work. The remaining augmenter, the Blurring Sequence, provides a result of 45.1%. The result of (3 × 5) Grid Augmentation with Transfer Learning, however, is only about as good as that of the base model, despite containing about 620,000 more annotations than the base model and using pretrained weights as the initialization weights for training. This shows that an extensive augmentation procedure is not guaranteed to improve performance. Fig. 6 shows very smooth learning and a steady increase in precision. Summary results can be found in Table 3.

⁸ Without transfer learning means not using custom weights obtained from our base model.

Fig. 6. mAP 0.5: YOLOv3-SPP results with (1056 × 1056) image size.

Table 3: Best results summary for YOLOv3-SPP, (1056 × 1056) image size.

Rank   Model                        mAP 0.5   F1 Score
1      (3 × 4) Grid Augmentation    46.2%     48.9%
2      Geometric Sequence           45.3%     47.1%
3      Blending Sequence            45.2%     47.7%
-      No Augmentation              39.7%     41.2%

3.3. YOLOv5-Large

With YOLOv5-Large, we obtain an mAP 0.5 of 29.2%, achieved by (3 × 5) Grid Augmentation with Transfer Learning. Further results can be found in Table 4, showing that the Blending Sequence and the Blurring Sequence are important augmentation procedures for YOLOv5-Large, with mAP 0.5 scores of 27.0% and 26.3%, respectively. We also observe an increase in accuracy provided by the augmentation sequences over the base model.

Fig. 7. mAP 0.5: YOLOv5-Large results with (1056 × 1056) image size.

Table 4: Best results summary for YOLOv5-Large, (1056 × 1056) image size.

Rank   Model                                               mAP 0.5   mAP 0.5:0.95
1      (3 × 5) Grid Augmentation with Transfer Learning    29.2%     14.0%
2      Blending Sequence                                   27.0%     14.6%
3      Blurring Sequence                                   26.3%     14.2%
-      No Augmentation                                     22.5%     12.9%
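For reference, a detection counts as correct under mAP 0.5 when its intersection over union (IoU) with a ground-truth box of the same class is at least 0.5, and mAP 0.5:0.95 averages the AP over IoU thresholds from 0.50 to 0.95 in steps of 0.05. The short function below is a standard reference implementation of the IoU test underlying these metrics; it is included only as a reminder and is not taken from the evaluated repositories.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2) corner coordinates."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - intersection
    return intersection / union if union > 0 else 0.0

# Example: two heavily overlapping boxes pass the 0.5-IoU test.
print(iou((0, 0, 100, 100), (10, 10, 110, 110)))   # ~0.68
```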
4. Further Improvements

We conduct a few more experiments to see if the model performance can be further improved: one more experiment with YOLOv5-Large and three more experiments with YOLOv3-SPP. By training a model with the Blending Sequence augmentation using pretrained weights from the base model training, we are able to obtain the best model for YOLOv5-Large, with an mAP 0.5 score of 36.5%, which is a 14% improvement over the base model. We choose the Blending Sequence as it is the second-best augmenter for YOLOv5-Large in terms of mAP 0.5 score. For the YOLOv3-SPP model, we take the weights obtained from (3 × 4) Grid Augmentation for YOLOv3-SPP and use them to initialize training on our second-best, Geometric Sequence-transformed dataset. We obtain an mAP 0.5 score of 48.6%, which is a 2.4% improvement over our previous best score and an 8.9% improvement over our base model. We also provide two examples where we train the models with the augmentation methods defined in Section 2.3.6, where we perform a Blurring Sequence transformation on the dataset containing 700 original and 700 (3 × 4) grid-augmented images, and another experiment where we take 700 training images that have undergone (3 × 4) grid augmentation and apply Blurring, Blending and Geometric Sequence transformations, each augmenter transforming 1/3 of the images, as described in Section 2.3.7. For the latter, which was our most complex transformation, the mAP decreases by 5.9% compared to the base model. We observe that more annotations do not strictly increase accuracy. As a note, these final two augmentation experiments, in which we did not observe improvements, contained no images from the original dataset, unlike the rest of the experiments. The experiments with the best overall results for YOLOv5-Large and YOLOv3-SPP can be observed in Fig. 8 and Fig. 9, respectively.

Fig. 8. mAP 0.5: YOLOv5-Large (1056 × 1056) size with improved performance.

Fig. 9. mAP 0.5: YOLOv3-SPP (1056 × 1056) size with improved performance.

5. Inference Results

From the inference results of the best YOLOv3-SPP model in Fig. 10a, we can observe that the model performs well in identifying almost all of the objects in the defined categories; however, we also see in Fig. 10b that there are pedestrians in the left-hand part of the image that are not detected, so we identify room for improvement.

Fig. 10. Inference results of YOLOv3-SPP: (a) Inference 1; (b) Inference 2.

6. Conclusion

This work explored the multiple-object detection task for drones using a subset of the VisDrone-DET2019 dataset. We constructed three image-processing sequences, each containing from five to seven augmenters, and named them Blending Sequence, Blurring Sequence and Geometric Sequence. In an effort to reduce overfitting through random resampling, we also developed grid augmentation techniques, combining images into a (3 × 4) form with square tiles and into a (3 × 5) form in which the relative dimensions of the images were preserved after combination. We named these techniques (3 × 4) Grid Augmentation and (3 × 5) Grid Augmentation. We conducted multiple experiments in which we tested how our augmentation techniques affect the performance of the models, with the aim of identifying the techniques that work best for the drone object detection task. We identified three augmentation procedures that improve model performance the most: the Geometric Sequence, (3 × 4) Grid Augmentation and (3 × 5) Grid Augmentation with Transfer Learning. We considered (3 × 4) Grid Augmentation the overall best method and the Geometric Sequence the second best, as well as the best among the three sequences. We further conducted a few more experiments, using transfer learning from custom weights and combining our sequences with the grid augmenters in different forms for our two best models, YOLOv5-Large and YOLOv3-SPP with (1056 × 1056) image size, achieving a further increase in accuracy. We also showed that some augmentation techniques did not work well and resulted in performance deterioration despite containing a significant number of annotations, as we saw when applying the Blurring Sequence to the (3 × 4) grid-augmented dataset and when applying all of our sequences to 700 (3 × 4) grid-transformed images.
These training datasets contained no images from the original dataset, and we can state that when the training and validation datasets do not come from the same setting, the augmentation procedures will most certainly not be successful. On the other hand, as a result of our further experiments, we obtained our overall best model by using the weights of the model trained with (3 × 4) Grid Augmentation to train a model with the Geometric Sequence augmentation, improving on the previous best result by 2.4% and reaching an mAP 0.5 score of 48.6% with a YOLOv3-SPP model. The dataset sizes differ and the original dataset contains 3 more classes, so it is challenging to compare this setting to the original one; however, we note that the winner of the VisDrone-DET2019 challenge achieved an AP 0.5 score of 54.0%, and 47.98% was the tenth-best score [8]. We take these results as benchmarks for future improvement.

References

[1] J. Lee, J. Wang, D. Crandall, S. Šabanović and G. Fox, "Real-time, cloud-based object detection for unmanned aerial vehicles", First IEEE International Conference on Robotic Computing (IRC), Taichung, Taiwan, pp. 36-43, 2017. DOI: 10.1109/IRC.2017.77.

[2] A. Carrio, C. Sampedro, A. Rodriguez-Ramos and P. Campoy, "A review of deep learning methods and applications for unmanned aerial vehicles", Journal of Sensors, 2017. https://doi.org/10.1155/2017/3296874.

[3] Zh. Wu, K. Suresh, P. Narayanan, H. Xu, H. Kwon and Zh. Wang, "Delving into robust object detection from unmanned aerial vehicles: A deep nuisance disentanglement approach", Proceedings of the IEEE International Conference on Computer Vision, pp. 1201-1210, 2019.

[4] J. Redmon and A. Farhadi, "YOLOv3: An incremental improvement", arXiv:1804.02767v1, 2018.

[5] Zh. Huang and J. Wang, "DC-SPP-YOLO: Dense connection and spatial pyramid pooling based YOLO for object detection", arXiv:1903.08589, 2019.

[6] G. Jocher, Ultralytics YOLOv3, 2019. [Online]. Available: https://github.com/ultralytics/yolov3

[7] G. Jocher, Ultralytics YOLOv5, 2020. [Online]. Available: https://github.com/ultralytics/yolov5

[8] D. R. Pailla et al., "VisDrone-DET2019: The vision meets drone object detection in image challenge results", IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), 2019. DOI: 10.1109/ICCVW.2019.00030.

[9] A. Jung, imgaug, 2020. [Online]. Available: https://github.com/aleju/imgaug

[10] A. Bochkovskiy, C.-Y. Wang and H.-Y. M. Liao, "YOLOv4: Optimal speed and accuracy of object detection", arXiv:2004.10934, 2020.

Submitted 22.07.2020, accepted 19.11.2020.