Original Article
International Journal of Fuzzy Logic and Intelligent Systems 2024; 24(3): 194-202

Published online September 25, 2024

https://doi.org/10.5391/IJFIS.2024.24.3.194

© The Korean Institute of Intelligent Systems

Features Exploitation of YOLOv5-Based Freeze Backbone for Performance Improvement of UAV Object Detection

Laily Nur Qomariyati, Nurul Jannah, Suryo Adhi Wibowo, and Thomhert Suprapto Siadari

Center of Excellence Artificial Intelligence for Learning and Optimization (CoE AILO), Telkom University, Bandung, Indonesia

Correspondence to: Suryo Adhi Wibowo (suryoadhiwibowo@telkomuniversity.ac.id)

Received: October 21, 2022; Revised: June 28, 2024; Accepted: August 24, 2024

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted noncommercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Object detection on unmanned aerial vehicles (UAVs) has steadily developed over time. Improving the performance of UAV-based object detection has proven challenging for researchers because of several problems, such as data imbalance, scale variation, prediction accuracy, and the memory and computation limits of real-time applications. In this study, feature exploitation using the YOLOv5 algorithm is carried out to improve the performance of UAV-based object detection. The best-performing model freezes the backbone, modifies the concatenation layers, and adds a prediction head to the algorithm. This approach achieved a mAP@0.5 of 22.4%. The mAP results show that feature exploitation affects the performance of UAV-based object detection.

Keywords: Exploitation, Object detection, UAV, YOLOv5

1. Introduction

One area of artificial intelligence (AI) is computer vision, particularly object detection, a technology that continues to be developed to support industrial needs and make it easier for humans to perform certain tasks. Object detection can be implemented in various systems, such as smart surveillance [1], autonomous driving [2], and unmanned aerial vehicles (UAVs) [3]. Object detection on UAVs is a rapidly growing technology that can be applied in fields such as agriculture, aerial photography, goods delivery, security and surveillance, and search and rescue. However, UAV-based object detection faces challenges such as data imbalance, prediction accuracy, and memory and computational limitations, particularly in real-time applications, which researchers continue to work to overcome.

Several studies have addressed these issues using different methods, such as Faster R-CNN, YOLO, and SSD. Zhang et al. [4] used the SlimYOLOv3 model with the VisDrone2018-DET dataset and showed that the model achieved a detection accuracy comparable to that of YOLOv3 with fewer FLOPs and a faster runtime than the original YOLOv3, making it suitable for real-time UAV applications. However, this study did not solve the data imbalance problem, so the mean average precision (mAP) score for the dominant object class was higher than those of the other classes [4]. This is because SlimYOLOv3 is a one-stage detector that samples regions over the entire image, whereas a two-stage detector avoids this problem through its region proposal network (RPN) mechanism [5]. Vaddi et al. [5] proposed a deep feature pyramid network (DFPN) and a modified loss function to overcome the VisDrone-2018 dataset imbalance and achieve real-time object detection. Comparing the mAP results of the two studies, the DFPN with a MobileNet base network achieved a higher score of 29.2 than SlimYOLOv3, which scored 23.9. Zhu et al. [6] used the same dataset with TPH-YOLOv5 as the object detector and obtained an even higher mAP of 39.18, indicating that TPH-YOLOv5 can effectively address the detection of small-scale, densely packed, and blurred objects [6]. This study introduces a novel approach for improving UAV-based object detection by exploiting the YOLOv5 algorithm. The key contributions of this study are as follows.

  • Exploitation of the YOLOv5-based frozen-backbone method by adding a prediction head.

  • Modification of the concatenation-layer process in the YOLOv5-based frozen-backbone algorithm.

  • Combination of the added prediction head and the modified concatenation layers to achieve a higher mAP value.

2. Theory

2.1 UAV-Based Object Detection

Object detection involves finding and classifying objects in an image and displaying a bounding box around each object with its confidence level [7]. There are two families of object detection methods: one-stage and two-stage detectors [8]. A one-stage detector uses a single, fully convolutional feed-forward network that directly outputs bounding boxes and object classifications, whereas a two-stage detector consists of a region proposal stage followed by a classification stage [8]. One-stage detectors include YOLO and SSD [9]. In this study, object detection was tested on aerial images captured by a UAV. Object detection from UAVs faces challenges such as occlusion, small objects, and scale variation, and it can serve various purposes such as aerial security, freight forwarding, aerial photography, and monitoring.

2.2 YOLOv5

You Only Look Once (YOLO) is an object detection algorithm that uses a convolutional neural network (CNN) with Darknet as the base network for training and inference [10]. YOLO has 24 convolutional layers followed by two fully connected layers, and it uses 1×1 reduction layers followed by 3×3 convolutional layers. Its convolutional layers are pretrained on the ImageNet classification task at half resolution (224 × 224), after which the resolution is doubled (to 448 × 448) for object detection. Several subsequent versions of YOLO have been developed, including YOLOv2 [10], YOLOv3 [11], YOLOv4 [12], and YOLOv5 [13]. YOLOv5 is the fifth version of the YOLO object detection algorithm.

The YOLOv5 architecture was adapted from the YOLOv4 model [12], and its details are shown in Figure 2. Based on its .yaml file, the YOLOv5 model consists of three parts: Backbone, Neck, and Head. The backbone (feature extractor) is built around the C3 module, a CSP bottleneck with three convolutions, which maintains features through propagation, encourages the network to reuse features, and passes fine-grained features on to deeper layers more efficiently. Before the features are forwarded to the neck, they pass through the spatial pyramid pooling fast (SPPF) layer, which increases the receptive field and separates the most important features via max pooling. The features are then processed by PANet as the neck, which generates feature pyramids; feature pyramids help the model scale well across objects and identify objects of the same type at different scales. The detection head of YOLOv5 performs the final detection step: it applies anchor boxes to the features and produces the final output vectors with class probabilities, objectness scores, and bounding boxes [15].
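As a concrete reference for these building blocks, the following is a minimal PyTorch sketch of the Conv (ConvBnSiLU), C3, and SPPF modules described above. It mirrors the structure of the public Ultralytics YOLOv5 code but is simplified (e.g., fixed expansion ratios), so it should be read as illustrative rather than as the exact library implementation.

```python
import torch
import torch.nn as nn

class Conv(nn.Module):
    # Standard YOLOv5 convolution block: Conv2d + BatchNorm + SiLU (ConvBnSiLU).
    def __init__(self, c1, c2, k=1, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c1, c2, k, s, k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c2)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class Bottleneck(nn.Module):
    # Residual bottleneck used inside C3.
    def __init__(self, c1, c2, shortcut=True):
        super().__init__()
        self.cv1 = Conv(c1, c2, 1)
        self.cv2 = Conv(c2, c2, 3)
        self.add = shortcut and c1 == c2

    def forward(self, x):
        y = self.cv2(self.cv1(x))
        return x + y if self.add else y

class C3(nn.Module):
    # CSP bottleneck with three convolutions: split, transform, and merge paths.
    def __init__(self, c1, c2, n=1):
        super().__init__()
        c_ = c2 // 2
        self.cv1 = Conv(c1, c_, 1)
        self.cv2 = Conv(c1, c_, 1)
        self.cv3 = Conv(2 * c_, c2, 1)
        self.m = nn.Sequential(*(Bottleneck(c_, c_) for _ in range(n)))

    def forward(self, x):
        return self.cv3(torch.cat((self.m(self.cv1(x)), self.cv2(x)), dim=1))

class SPPF(nn.Module):
    # Spatial pyramid pooling (fast): three chained 5x5 max-pools, concatenated.
    def __init__(self, c1, c2, k=5):
        super().__init__()
        c_ = c1 // 2
        self.cv1 = Conv(c1, c_, 1)
        self.cv2 = Conv(4 * c_, c2, 1)
        self.m = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)

    def forward(self, x):
        x = self.cv1(x)
        y1 = self.m(x)
        y2 = self.m(y1)
        return self.cv2(torch.cat((x, y1, y2, self.m(y2)), dim=1))
```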

3. Proposed Method

In this study, experiments were conducted to exploit the YOLOv5 architecture by freezing the backbone, modifying the processes in the neck section, and adding layers and a detection head, yielding five configurations: the original YOLOv5 model and exploitations I-IV. The results were compared to assess the performance improvement by measuring the mAP of each model. All experiments used the large YOLOv5 model (YOLOv5l) and were trained from scratch, without pretrained weights, for 50 epochs on 640×640 input images. All exploitation models were trained with a frozen backbone to reduce the computational load of training. The dataset used in this study is VisDrone-DET2019.

3.1 Dataset

For object detection, the required dataset consists of images and annotations. The aerial-image dataset used in this study is VisDrone-DET2019 [16], created by the AISKYEYE team at the Laboratory of Machine Learning and Data Mining, Tianjin University, China [17]. VisDrone-DET2019 uses the same data as VisDrone-DET2018 and consists of 8,599 drone-captured images taken at different locations and altitudes, with over 540,000 annotated bounding boxes across 10 categories. Sample images from the VisDrone dataset are shown in Figure 1.
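For readers reproducing this setup, the sketch below shows one way to read a VisDrone-DET annotation file. The comma-separated field layout and category indices are assumptions taken from the VisDrone toolkit documentation, and the helper name is ours, not from this paper.

```python
from pathlib import Path

# Category indices assumed from the VisDrone toolkit; index 0 marks ignored regions.
CATEGORIES = ["ignored", "pedestrian", "people", "bicycle", "car", "van",
              "truck", "tricycle", "awning-tricycle", "bus", "motor", "others"]

def load_visdrone_annotations(txt_path):
    """Return a list of (x, y, w, h, category_name) boxes for one image.

    Assumed line format: bbox_left,bbox_top,bbox_width,bbox_height,score,
    category,truncation,occlusion (one box per line).
    """
    boxes = []
    for line in Path(txt_path).read_text().strip().splitlines():
        x, y, w, h, score, cat, trunc, occ = map(int, line.split(",")[:8])
        if cat == 0:  # skip ignored regions, which are not training targets
            continue
        boxes.append((x, y, w, h, CATEGORIES[cat]))
    return boxes
```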

3.2 Original YOLOv5

The original YOLOv5 model was trained without any modifications to the architecture, serving as the reference against which the models with architectural exploits were compared. The original YOLOv5 structure consists of three prediction heads and 24 stages, with the detection heads operating on three feature-map sizes: 80×80, 40×40, and 20×20. The structure of the original YOLOv5 model is illustrated in Figure 2.

3.3 Exploitation I

The next experiment involved training the model by exploiting the model structure in the backbone. In this experiment, the YOLOv5l backbone was frozen by setting its gradients to zero. As shown in Figure 3, stages 0-9 of the backbone were frozen. The concatenation layers in YOLOv5 combine feature maps from different stages of the backbone network; when the backbone is frozen, the feature maps extracted by these stages remain fixed and are no longer optimized through backpropagation. The structure of this model has three prediction heads and 24 stages, the same as the original YOLOv5 model.
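A minimal sketch of this freezing step is shown below, following the Ultralytics transfer-learning tutorial [14]. The "model.<stage index>." parameter-name prefix and the equivalent command line are conventions of that repository, assumed here rather than reported in this paper.

```python
# In the YOLOv5 repository, the same effect is obtained with roughly:
#   python train.py --cfg yolov5l.yaml --weights '' --img 640 --epochs 50 --freeze 10

def freeze_backbone(model, num_stages: int = 10) -> None:
    """Disable gradient updates for stages 0..num_stages-1 (the backbone)."""
    freeze = [f"model.{i}." for i in range(num_stages)]
    for name, param in model.named_parameters():
        # Frozen parameters keep their initial values for the whole training run.
        param.requires_grad = not any(name.startswith(p) for p in freeze)
```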

3.4 Exploitation II

The next experiment trained the model by exploiting the model structure at the concatenation layer, which combines the upsampling output in the neck with the C3 output from the backbone. The difference between exploitations I and II lies in the concatenation layers, particularly at stages 12 and 16, as shown in Figure 4. In the original YOLOv5, each of these concatenation layers feeds into a C3 block; in the exploitation II model, that C3 block is replaced with a plain convolutional layer. This modification alters how the concatenated feature maps are processed, potentially improving the ability of the model to extract and use meaningful features for object detection. The structure of this model has three prediction heads and 24 stages, the same as the original YOLOv5 model.
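The sketch below illustrates one way such a substitution could be written, reusing the illustrative Conv and C3 classes from the earlier sketch. The stage indices and channel bookkeeping are assumptions for illustration; the authors presumably made the change at the architecture-definition level (cf. the .yaml structure noted in Section 2.2).

```python
def replace_fusion_c3_with_conv(model, stage_indices=(13, 17)):
    """Swap the C3 blocks that follow the stage-12/16 concatenations for Conv.

    Assumes `model.model` is the sequential list of YOLOv5 stages and that the
    Conv/C3 classes are the illustrative ones defined above; the indices depend
    on the exact .yaml layout and are given only as an example.
    """
    for i in stage_indices:
        old = model.model[i]
        if isinstance(old, C3):
            c_in = old.cv1.conv.in_channels      # channels entering the block
            c_out = old.cv3.conv.out_channels    # channels leaving the block
            model.model[i] = Conv(c_in, c_out, k=1)  # plain ConvBnSiLU instead of C3
```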

3.5 Exploitation III

The third experiment involved training the model by modifying the architecture of the neck and head sections while keeping the same concatenation layers as in the original YOLOv5 model. In this modified model, an additional detection head and an extra layer are incorporated, resulting in four detection heads. The newly added detection head operates on a larger feature map of 320 × 320 (scale 4), as shown in Figure 5. Integrating a detection head involves adding new layers dedicated to the detection task; these layers follow the feature-extraction backbone and process feature maps at multiple resolutions. Each detection head performs convolutional operations and applies activation functions to extract and refine features relevant to detection at its respective scale. Adding detection heads also introduces parallel processing paths within the model, allowing it to analyze feature maps at multiple resolutions simultaneously. The resulting model comprises 35 stages. This enhancement of the YOLOv5 architecture adds a finer detection scale, which is intended to improve detection accuracy, particularly for small objects.
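The grid sizes quoted here and in Section 3.2 follow directly from the input resolution divided by each head's stride; the short sketch below makes the arithmetic explicit. The stride of 2 for the added head is inferred from the stated 320×320 map at a 640×640 input, not stated in the paper.

```python
# Feature-map size per detection head for a 640x640 input: size = input // stride.
img = 640
head_strides = {
    "P3 head": 8,     # 80x80  (original YOLOv5)
    "P4 head": 16,    # 40x40  (original YOLOv5)
    "P5 head": 32,    # 20x20  (original YOLOv5)
    "added head": 2,  # 320x320, inferred from the "scale 4" map described above
}
for name, s in head_strides.items():
    print(f"{name}: {img // s}x{img // s} feature map (stride {s})")
```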

3.6 Exploitation IV

The fourth experiment leveraged the model architecture at the concatenation layers, specifically by altering the concatenation process from the original YOLOv5 Stage 2 to the Stage 1 ConvBnSiLU, in addition to introducing layers within the neck and head sections. In the modified model depicted in Figure 6, there are four detection heads with feature-map sizes similar to those of the exploitation III model. Like the third experiment, this model spans 35 stages, and it changes the process at the same concatenation layers as exploitation model II. These adjustments to the concatenation layers aim to optimize feature extraction and improve the ability of the model to detect objects across varying scales and complexities.

4. Experimental Result

We used the large version of YOLOv5 (YOLOv5l), known for its high mAP among the pretrained models, but trained all models from scratch to reduce overfitting during training. All experiments were run for 50 epochs to shorten training time and minimize GPU usage. We analyzed the improvements of each proposed method on the test-dev set of the VisDrone-DET2019 dataset, using mAP as the performance metric, and compared our models with other benchmark models on the same dataset [18]. The results are presented in Table 1.
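As a reminder of how this metric is computed, the sketch below shows the IoU test underlying mAP@0.5: a detection counts as a true positive when its IoU with a same-class ground-truth box is at least 0.5, and AP is the area under the resulting precision-recall curve, averaged over classes. This is the standard definition, not code from this study.

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes in (x1, y1, x2, y2) form."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

# ~0.143: this pair would NOT count as a match under the IoU >= 0.5 criterion.
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))
```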

4.1 Comparative Experiment

Comparison with benchmark models: Based on the mAP values presented in Table 1, our proposed exploitation models (I–IV) demonstrate varied performance relative to the benchmark models YOLOv5-SPD and modified cascade R-CNN. Exploitation model I, which trains YOLOv5 with a frozen backbone, shows a lower mAP of 8.1%, a reduction in performance compared to YOLOv5-SPD (10.47%) and the modified cascade R-CNN (19.98%). In contrast, exploitation model II, which modifies the concatenation-layer process, achieved an improved mAP of 10.3%, a slight gain over YOLOv5-SPD. Exploitation model III, which incorporates additional layers and a detection head, boosts performance significantly to an mAP of 20.9%, surpassing both benchmark models. Exploitation model IV, which combines the modifications of models II and III, further improves performance to an mAP of 22.4%, demonstrating the cumulative benefit of integrating these enhancements.

Effect of the frozen backbone: The mAP@0.5 values of the original YOLOv5 model and exploitation model I were 26.4% and 8.1%, respectively; the original YOLOv5 model thus outperformed exploitation model I by 18.3%. This is because the gradients of the backbone in exploitation model I are set to zero, so the backbone weights are not updated during training, whereas the original YOLOv5 model updates the weights of the entire network, including the backbone.

Effect of the modified concatenation layers: This change replaces the C3 block that follows the concatenation layer with a convolution layer in the neck. The mAP@0.5 values of exploitation models I and II are 8.1% and 10.3%, respectively: exploitation model II outperformed exploitation model I, indicating that changing the process at the concatenation layer affects the mAP of the model. In the original YOLOv5 structure, the concatenation step combines the feature maps of the previous layer and feeds them into a C3 block (CSP bottleneck with three convolutions), whereas exploitation model II feeds them into a standard convolution (Conv) instead. The performance increase arises because the feature map generated by the Conv process differs from that produced by the C3 process: the C3 block contains a bottleneck and three convolutions that can overexploit the feature map and potentially discard important information. In this study, changing the concatenation-layer process, particularly at stages 12 and 16, yielded a 2.2% improvement over exploitation model I.

Effect of the added prediction head: In this experiment, adding layers and an output detection head at a new scale improved model performance, making exploitation model III (20.9% mAP) superior to exploitation models I and II. The added layers extend the output detection head to a larger scale, with a feature map size of 320 × 320; on a larger feature map, small-object information is easier to process than on smaller feature maps. Therefore, adding layers and a larger-scale output detection head improved the model performance.

Effect of the added prediction head combined with modified concatenation layers: Exploitation model IV outperforms exploitation model I, with mAP values of 22.4% and 8.1%, respectively. This demonstrates that changing the concatenation process, combined with adding a layer and an output detection head, improves performance. The improvement can be attributed to the better features generated by the modified concatenation process in the neck, which lets the model process features more effectively than exploitation model I, and to the detection head with a larger feature map, which makes small objects easier to localize and identify.

Comparison between the exploitation models and the original YOLOv5: As shown in Figure 7, the original YOLOv5 has the highest mAP among all of the models. This is because the original YOLOv5 model trains the entire network, including the backbone, whereas the exploitation models use a frozen backbone whose weights retain their initial values and are not updated during training. This study shows that the performance of the frozen-backbone models (exploitation models I to IV) can be progressively improved by modifying the concatenation-layer process, adding layers, and adding an output detection head at a new scale. Exploitation model I trails the original YOLOv5 by 18.3% mAP, suggesting that training the entire network offers roughly a four-fold improvement over the frozen-backbone baseline. By the same rough extrapolation, exploitation model IV, which combines exploits II and III on a frozen backbone and reaches an mAP of 22.4%, could potentially achieve a higher mAP (up to 89.6%) if trained on the entire network.

Effect of exploitation versus the original YOLOv5 on per-class performance: As shown in Figure 8, each model attains its highest mAP in a specific category. The car category achieved a higher mAP than the other categories because of the dataset imbalance in VisDrone-DET2019: cars have the most labels, resulting in dominant accuracy for this category. The original YOLOv5 model had the highest AP across all categories because, being trained on the entire network, it outperforms the frozen-backbone exploitation models. Comparing exploitation models II and I shows that changing the concatenation layer in model II increased performance in all categories except bicycles. Model III outperforms models I and II by adding layers and a detection head at a larger feature-map scale, and combining the exploits in model IV produced the highest mAP in every category among the exploitation models, outperforming models I–III. This indicates that exploiting the concatenation layer and adding layers and detection heads improves accuracy compared with models without such exploitation. However, none of the models in this study fully addressed the dataset imbalance, leading to large differences in mAP across categories and hindering detection in non-dominant categories.

Limitation: Despite demonstrating notable improvements, the proposed method has several potential disadvantages that warrant further discussion. Freezing the backbone can help reduce overfitting and computational load; however, it also significantly reduces the adaptability of the model to different datasets because the backbone weights remain static and might not capture the nuances of the specific dataset used during training. Although modifications to the concatenated layers and the addition of detection heads have improved the performance, these changes increase the complexity of the model, posing challenges in scaling the model to different datasets or adapting it to various real-world applications without extensive tuning and validation.

5. Conclusion

In this study, we exploited features of the YOLOv5 model for UAV-based object detection by freezing the backbone, modifying the concatenation layers, and adding a layer and a detection head. Exploitation model IV achieved the highest performance among the exploitation models, with an mAP of 22.4%. Freezing the backbone reduced performance because the backbone weights were not updated during training; however, replacing the C3 process after the concatenation layer with a convolution layer improved performance, and adding a detection head improved it further, with exploitation model III outperforming models I and II. Combining these modifications in exploitation model IV yielded further gains, demonstrating that these architectural changes can significantly enhance object detection under a frozen backbone.

Future work will explore dynamic backbone-freezing strategies and additional architectural modifications to improve the performance and adaptability to new data. By extending the analysis to multiple datasets and various conditions, we aim to comprehensively assess the generalizability of the model and practical relevance, enhancing the impact of our findings.

This research was supported by the Adhi Wibowo Foundation Program (Grant No. 025/AWF-Scholarship/2022), which fulfilled the research needs of this study, and by a Basic Research Grant (2019–2020) from the Ministry of Education, Culture, Research and Technology, Indonesia, and Telkom University.
Fig. 1.

Dataset VisDrone [17].


Fig. 2.

YOLOv5 original architecture.


Fig. 3.

Exploitation I proposed method.


Fig. 4.

Exploitation II proposed method.


Fig. 5.

Exploitation III proposed method.


Fig. 6.

Exploitation IV proposed method.


Fig. 7.

Comparison of mAP performance for exploitation models and original YOLOv5.


Fig. 8.

Comparison of mAP performance for exploitation models and original YOLOv5 for each class.


Table 1. Performance improvement for each method in mAP (%).

Methods                       mAP@0.5 (%)
YOLOv5-SPD [?]                10.47
Modified cascade R-CNN [?]    19.98
Exploitation model I           8.1
Exploitation model II         10.3
Exploitation model III        20.9
Exploitation model IV         22.4

References

  1. Yang, CJ, Chou, T, Chang, FA, Ssu-Yuan, C, and Guo, JI. A smart surveillance system with multiple people detection, tracking, and behavior analysis. Proceedings of the 2016 International Symposium on VLSI Design, Automation and Test (VLSI-DAT), Hsinchu, Taiwan, 2016, pp. 1-4. https://doi.org/10.1109/VLSI-DAT.2016.7482569
  2. Choi, J, Chun, D, Kim, H, and Lee, HJ. Gaussian YOLOv3: an accurate and fast object detector using localization uncertainty for autonomous driving. Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, South Korea, 2019, pp. 502-511. https://doi.org/10.1109/ICCV.2019.00059
  3. Lee, J, Wang, J, Crandall, D, Sabanovic, S, and Fox, G. Real-time, cloud-based object detection for unmanned aerial vehicles. Proceedings of the 2017 1st IEEE International Conference on Robotic Computing (IRC), Taichung, Taiwan, 2017, pp. 36-43. https://doi.org/10.1109/IRC.2017.77
  4. Zhang, P, Zhong, Y, and Li, X. SlimYOLOv3: narrower, faster and better for real-time UAV applications. Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Seoul, South Korea, 2019, pp. 37-45. https://doi.org/10.1109/ICCVW.2019.00011
  5. Vaddi, S, Kumar, C, and Jannesari, A (2019). Efficient object detection model for real-time UAV applications. Available: https://arxiv.org/abs/1906.00786
  6. Zhu, X, Lyu, S, Wang, X, and Zhao, Q. TPH-YOLOv5: improved YOLOv5 based on transformer prediction head for object detection on drone-captured scenarios. Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Montreal, Canada, 2021, pp. 2778-2788. https://doi.org/10.1109/ICCVW54120.2021.00312
  7. Zhang, A, Lipton, ZC, Li, M, and Smola, AJ (2021). Dive into deep learning. Available: https://arxiv.org/abs/2106.11342
  8. Carranza-Garcia, M, Torres-Mateo, J, Lara-Benítez, P, and Garcia-Gutierrez, J (2020). On the performance of one-stage and two-stage object detectors in autonomous vehicles using camera data. Remote Sensing. 13, article no. 89
  9. Liu, W, Anguelov, D, Erhan, D, Szegedy, C, Reed, S, Fu, CY, and Berg, AC (2016). SSD: single shot multibox detector. Computer Vision–ECCV 2016. Cham, Switzerland: Springer, pp. 21-37. https://doi.org/10.1007/978-3-319-46448-0_2
  10. Redmon, J, Divvala, S, Girshick, R, and Farhadi, A. You only look once: unified, real-time object detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 2016, pp. 779-788. https://doi.org/10.1109/CVPR.2016.91
  11. Redmon, J, and Farhadi, A (2018). YOLOv3: an incremental improvement. Available: https://arxiv.org/abs/1804.02767
  12. Bochkovskiy, A, Wang, CY, and Liao, HYM (2020). YOLOv4: optimal speed and accuracy of object detection. Available: https://arxiv.org/abs/2004.10934
  13. Jocher, G, Stoken, A, Chaurasia, A, Borovec, J, Kwon, Y, and Michael, K (2021). ultralytics/yolov5: v6.0 - YOLOv5n 'Nano' models, Roboflow integration, TensorFlow export, OpenCV DNN support. Available: https://ui.adsabs.harvard.edu/abs/2021zndo...5563715J/abstract
  14. Ultralytics Inc. Transfer learning with frozen layers. Available: https://docs.ultralytics.com/yolov5/tutorials/transfer_learning_with_frozen_layers/
  15. Thuan, D (2021). Evolution of Yolo algorithm and Yolov5: the state-of-the-art object detection algorithm. Available: https://www.theseus.fi/handle/10024/452552
  16. Du, D, Zhu, P, Wen, L, Bian, X, Lin, H, and Hu, Q. VisDrone-DET2019: the vision meets drone object detection in image challenge results. Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Seoul, South Korea, 2019, pp. 213-226. https://doi.org/10.1109/ICCVW.2019.00030
  17. Zhu, P, Wen, L, Du, D, Bian, X, Fan, H, Hu, Q, and Ling, H (2022). Detection and tracking meet drones challenge. IEEE Transactions on Pattern Analysis and Machine Intelligence. 44, 7380-7399. https://doi.org/10.1109/TPAMI.2021.3119563
  18. AISKYEYE Group (2024). VisDrone 2019 Object Detection in Images Challenge. Available: http://aiskyeye.com/leaderboard/

Laily Nur Qomariyati received her bachelor’s degree from Telkom University, Indonesia, in 2021 and is currently pursuing her master’s degree at Telkom University, Bandung, Indonesia. She has been working on research related to deep learning at the Center of Excellence Artificial Intelligence Learning and Optimization, Telkom University. She is interested in conducting research in the fields of machine learning and deep learning.

Nurul Jannah received her S.T. (bachelor’s degree in Electrical Engineering, majoring in Telecommunication Engineering) from Telkom University, Indonesia, in 2022. She worked on research related to machine learning and computer vision at the Image Processing and Vision Laboratory at Telkom University, Bandung, Indonesia, from 2021 to 2022.

Suryo Adhi Wibowo received his bachelor's and master's degrees from Telkom University, Bandung, Indonesia, in 2009 and 2012, respectively. He then earned his Ph.D. from the Department of Electrical and Computer Engineering at Pusan National University, Republic of Korea, in 2018. His research interests include computer vision, computer graphics, pattern recognition, virtual reality, and machine learning. He currently serves as the Deputy Director of the Center of Excellence Artificial Intelligence for Learning and Optimization (CoE AILO) at Telkom University.

Thomhert Suprapto Siadari received his bachelor’s degree in 2011 from Telkom University, Bandung, Indonesia, and his master’s degree in 2013 from Kumoh National Institute of Technology, Gumi, Republic of Korea. He earned his Ph.D. in ICT from the University of Science & Technology-ETRI School, Daejeon, Republic of Korea, in 2020. His main research interests include machine learning, deep learning, computer vision, and medical healthcare.
