International Journal of Fuzzy Logic and Intelligent Systems 2024; 24(3): 194-202
Published online September 25, 2024
https://doi.org/10.5391/IJFIS.2024.24.3.194
© The Korean Institute of Intelligent Systems
Laily Nur Qomariyati, Nurul Jannah, Suryo Adhi Wibowo, and Thomhert Suprapto Siadari
Center of Excellence Artificial Intelligence for Learning and Optimization (CoE AILO), Telkom University, Bandung, Indonesia
Correspondence to: Suryo Adhi Wibowo (suryoadhiwibowo@telkomuniversity.ac.id)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted noncommercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Object detection for unmanned aerial vehicles (UAVs) has developed steadily over time. Improving the performance of UAV-based object detection remains challenging because of several problems, such as data imbalance, scale variation, prediction accuracy, and the memory and computation limits of real-time applications. In this study, feature exploitation using the YOLOv5 algorithm was carried out to improve the performance of UAV-based object detection. The best-performing model leverages a frozen backbone, modifies the concatenation layers, and adds a prediction head to the algorithm. This approach improved performance and achieved an mAP@0.5 of 22.4%. Based on the mAP results, the exploitation of features affected the performance of UAV-based object detection.
Keywords: Exploitation, Object detection, UAV, YOLOv5
Computer vision is one area of artificial intelligence (AI), and object detection in particular continues to be developed with various features to support industrial needs and to make certain human tasks easier. Object detection can be implemented in various systems, such as smart surveillance [1], autonomous driving [2], and object detection for unmanned aerial vehicles (UAVs) [3]. Object detection on UAVs is a rapidly growing technology that can be applied in various fields, such as agriculture, aerial photography, shipping goods, security and surveillance, and search and rescue. However, UAV-based object detection faces challenges that researchers aim to overcome, such as data imbalance, prediction accuracy, and memory and computational limitations, particularly in real-time applications.
Several studies have addressed these issues using different methods, such as Faster R-CNN, YOLO, and SSD. Zhang et al. [4] used the SlimYOLOv3 model on the VisDrone2018-DET dataset and showed that the model achieved detection accuracy comparable to that of YOLOv3 with fewer FLOPs and a faster runtime than the original, making it suitable for real-time UAV applications. However, that study did not address data imbalance, so the mean average precision (mAP) of the dominant object class was higher than that of the other classes [4]. This is because SlimYOLOv3 is a one-stage detector that samples regions over the entire image, whereas a two-stage detector avoids this problem through its region proposal network (RPN) [5]. Jannesari [5] proposed a deep feature pyramid network (DFPN) architecture and a modified loss function to overcome the imbalance of the VisDrone-2018 dataset and achieve real-time object detection. Comparing the mAP results of the two studies, the DFPN with a MobileNet base network achieved 29.2, higher than SlimYOLOv3's 23.9. Using the same dataset, Zhu et al. [6] obtained an even higher mAP of 39.18 with TPH-YOLOv5 as the object detector, indicating that TPH-YOLOv5 can effectively address the detection of small-scale, densely packed, and blurred objects [6].

This study introduces a novel approach for improving UAV-based object detection by exploiting the YOLOv5 algorithm. The key contributions of this study are as follows.
• Exploitation of the YOLOv5-based frozen-backbone method by adding a prediction head.
• Modification of the concatenation-layer process in the YOLOv5-based frozen-backbone algorithm.
• Combination of the added prediction head and the modified concatenation layers to achieve a higher mAP value.
Object detection involves finding and classifying objects in an image and displaying a bounding box around each object together with its confidence level [7]. There are two families of object detection methods: one-stage and two-stage detectors [8]. A one-stage detector uses a single fully convolutional feed-forward network that directly outputs bounding boxes and object classes, whereas a two-stage detector consists of a region proposal stage followed by a classification stage [8]. One-stage detectors include YOLO and SSD [9]. In this study, object detection was tested on aerial images captured by a UAV. Object detection from UAVs faces challenges such as occlusion, small objects, and scale variation, and it can be used for various purposes such as aerial security, freight forwarding, aerial photography, and monitoring.
You Only Look Once (YOLO) is an object detection algorithm that uses a convolutional neural network (CNN) with Darknet as the base network for training and inference [10]. YOLO has 24 convolutional layers followed by two fully connected layers, and it uses 1×1 reduction layers followed by 3×3 convolutional layers. YOLO pretrains its convolutional layers on the ImageNet classification task at half resolution (224 × 224) and then doubles the resolution for object detection. Several subsequent versions have been developed in other studies, including YOLOv2 [10], YOLOv3 [11], YOLOv4 [12], and YOLOv5 [13], the fifth version of the algorithm.
The YOLOv5 architecture was adapted from the YOLOv4 model [12], and its architectural details are shown in Figure 1. Based on its .yaml configuration file, the YOLOv5 model consists of three parts: backbone, neck, and head. The backbone uses C3 blocks (a CSP bottleneck with three convolutions) as the feature extractor; these maintain features through propagation, encourage the network to reuse features, and pass fine-grained features on to deeper layers more efficiently. Before the features are forwarded to the neck, they pass through a spatial pyramid pooling-fast (SPPF) layer, which enlarges the receptive field and isolates the most important features through repeated max pooling. The features are then processed by PANet as the neck, which generates feature pyramids; these help the model scale well across object sizes and identify objects of the same class at different scales. Finally, the YOLOv5 detection head performs the last detection step: it applies anchor boxes to the features and produces the output vectors containing class probabilities, objectness scores, and bounding boxes [15].
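For concreteness, the following is a minimal PyTorch sketch of the SPPF block, modeled on the public Ultralytics YOLOv5 implementation; it is a reference sketch, not the exact training code used here. The Conv helper is the ConvBnSiLU block (convolution, batch normalization, SiLU activation) referred to later in this paper.

```python
import torch
import torch.nn as nn

class Conv(nn.Module):
    # ConvBnSiLU: Conv2d + BatchNorm + SiLU, the standard YOLOv5 block.
    def __init__(self, c1, c2, k=1, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c1, c2, k, s, k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c2)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class SPPF(nn.Module):
    # Spatial pyramid pooling - fast: three chained 5x5 max-pools emulate
    # the 5/9/13 pooling kernels of the original SPP at lower cost.
    def __init__(self, c1, c2, k=5):
        super().__init__()
        c_ = c1 // 2
        self.cv1 = Conv(c1, c_, 1, 1)
        self.cv2 = Conv(c_ * 4, c2, 1, 1)
        self.m = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)

    def forward(self, x):
        x = self.cv1(x)
        y1 = self.m(x)
        y2 = self.m(y1)
        # Concatenating the input with three pooling depths enlarges the
        # receptive field while preserving the spatial resolution.
        return self.cv2(torch.cat((x, y1, y2, self.m(y2)), 1))
```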
In this study, experiments were conducted to exploit the YOLOv5 architecture by freezing the backbone, modifying the processes in the neck section, and adding layers and a detection head, yielding five models: the original YOLOv5 model and exploitation models I, II, III, and IV. The results of these experiments were compared to assess the performance improvement by measuring the mAP of each model. All experiments used the large model of the YOLOv5 series (YOLOv5l), trained from scratch without pretrained weights, for 50 epochs each, with an input image size of 640 × 640. All exploitation models were trained with a frozen backbone to reduce the computational load of training. The dataset used in this study was VisDrone-DET2019.
An object detection dataset consists of images and annotations. The aerial-image dataset used in this study was VisDrone-DET2019 [16], created by the AISKYEYE team at the Laboratory of Machine Learning and Data Mining, Tianjin University, China [17]. VisDrone-DET2019 uses the same data as VisDrone-DET2018 and consists of 8,599 images captured by drones at different locations and altitudes, with over 540,000 annotated bounding boxes across 10 categories. Sample images from the VisDrone dataset are shown in Figure 1.
The original YOLOv5 model was trained without any architectural modifications to serve as the reference against which the models with architectural exploits were compared. The original structure consists of three prediction heads and 24 stages. The detection heads operate on three feature-map sizes: 80 × 80, 40 × 40, and 20 × 20. The structure of the original YOLOv5 model is illustrated in Figure 2.
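These sizes follow directly from the P3-P5 pyramid strides of YOLOv5 (8, 16, and 32) applied to the 640 × 640 input used here; a quick arithmetic check:

```python
# Feature-map sizes of the three original detection heads for a 640x640
# input: each head operates at one stride of the P3/P4/P5 pyramid.
img_size = 640
for stride in (8, 16, 32):  # P3, P4, P5
    side = img_size // stride
    print(f"stride {stride}: {side} x {side}")
# stride 8: 80 x 80, stride 16: 40 x 40, stride 32: 20 x 20
```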
The next experiment (exploitation I) involved training the model while exploiting the structure of the backbone. The YOLOv5l backbone was frozen by setting its gradients to 0. As shown in Figure 3, the freeze covered stages 0-9 of the backbone. The concatenation layers in YOLOv5 combine feature maps from different stages of the backbone network; when the backbone is frozen, the feature maps extracted by these layers remain fixed and are not further optimized through backpropagation. The structure of this model has three prediction heads and 24 stages, the same as the original YOLOv5 structure.
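A minimal sketch of this freezing step, assuming a YOLOv5l network loaded through torch.hub (the loaded weights are only a placeholder, since the models here are trained from scratch); it mirrors the effect of passing `--freeze 10` to YOLOv5's train.py, which freezes layers 0-9:

```python
import torch

# Load YOLOv5l from the Ultralytics hub; any YOLOv5l nn.Module works here.
model = torch.hub.load("ultralytics/yolov5", "yolov5l")

# Stages 0-9 of yolov5l.yaml form the backbone; their parameter names
# contain the prefixes "model.0." ... "model.9." in the module tree.
freeze = [f"model.{i}." for i in range(10)]
for name, param in model.named_parameters():
    if any(prefix in name for prefix in freeze):
        param.requires_grad = False  # excluded from gradient updates
```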
The next experiment (exploitation II) trained the model while exploiting the structure at the concatenation layers, which combine the upsampled features of the neck with the C3 outputs of the backbone. The difference between exploitations I and II lies in the concatenation layers, particularly at stages 12 and 16, as shown in Figure 4. Originally, YOLOv5 follows each of these concatenation layers with a C3 block; in the exploitation II model, this C3 block was replaced with a plain convolutional layer. This modification alters how the concatenated feature maps are processed, potentially improving the model's ability to extract and utilize meaningful features for object detection. The structure of this model has three prediction heads and 24 stages, the same as the original YOLOv5 structure.
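A hypothetical sketch of the modified neck step (the NeckFuse name and channel arguments are illustrative, not taken from the paper's configuration; Conv is the ConvBnSiLU block from the SPPF sketch, repeated here for self-containment):

```python
import torch
import torch.nn as nn

class Conv(nn.Module):
    # ConvBnSiLU, as in the SPPF sketch above.
    def __init__(self, c1, c2, k=1):
        super().__init__()
        self.conv = nn.Conv2d(c1, c2, k, 1, k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c2)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class NeckFuse(nn.Module):
    # Upsample the neck feature, concatenate it with the backbone skip
    # connection, then fuse. The original YOLOv5 fuses with a C3 block;
    # exploitation II swaps in a single ConvBnSiLU layer instead.
    def __init__(self, c_neck, c_skip, c_out):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.fuse = Conv(c_neck + c_skip, c_out, k=1)  # was: C3(...)

    def forward(self, x_neck, x_skip):
        return self.fuse(torch.cat((self.up(x_neck), x_skip), dim=1))
```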
The third experiment (exploitation III) trained the model while exploiting the architecture of the neck and head sections, keeping the same concatenation layers as the original YOLOv5 model. An additional detection head and extra layers were incorporated, resulting in four detection heads. The newly added head operates on a larger feature map of 320 × 320 (scale 4), as shown in Figure 5. Integrating a detection head involves adding new layers dedicated to detection; these follow the feature-extraction backbone and process feature maps at multiple resolutions. Each detection head performs convolutional operations and applies activation functions to extract and refine the features relevant to detection at its scale. Adding detection heads also introduces parallel processing paths, allowing the model to analyze feature maps at multiple resolutions simultaneously. This experiment comprised 35 stages. This enhancement of the YOLOv5 architecture expands its capability to handle objects at more scales and improves detection accuracy.
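A hypothetical sketch of such a four-scale prediction stage (the MultiScaleDetect name and channel widths are illustrative, not the paper's exact configuration): each pyramid level receives its own 1 × 1 prediction convolution producing num_anchors × (num_classes + 5) output channels, as in the YOLOv5 detection head.

```python
import torch.nn as nn

class MultiScaleDetect(nn.Module):
    # One 1x1 prediction convolution per pyramid level; a fourth,
    # higher-resolution level is added compared with the original three.
    def __init__(self, num_classes, channels=(128, 256, 512, 1024), num_anchors=3):
        super().__init__()
        out = num_anchors * (num_classes + 5)  # 5 = box (4) + objectness (1)
        self.heads = nn.ModuleList([nn.Conv2d(c, out, 1) for c in channels])

    def forward(self, feats):
        # feats: list of 4 feature maps ordered fine to coarse; each head
        # predicts independently at its own scale.
        return [head(f) for head, f in zip(self.heads, feats)]
```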
The fourth experiment (exploitation IV) exploited the model architecture at the concatenation layers, specifically by changing the concatenation source from Stage 2 of the original YOLOv5 to the Stage 1 ConvBnSiLU layer, in addition to introducing the extra layers in the neck and head sections. In the modified model depicted in Figure 6, there are four detection heads with feature-map sizes similar to those of the exploitation III model. Like the third experiment, this model spans 35 stages, and it includes the same change at the concatenation layers as exploitation model II. These adjustments to the concatenation layers aim to optimize feature extraction and improve the model's ability to detect objects across varying scales and complexities.
We implemented the large version of YOLOv5 (YOLOv5l), known for its high mAP among the pretrained variants, but trained all models from scratch to reduce overfitting during training. All experiments were run for 50 epochs to shorten training and limit GPU usage. We analyzed the improvement of each proposed method on the testset-dev split of the VisDrone-DET2019 dataset, using mAP as the performance metric, and compared our models with other benchmark models on the same dataset [18]. The improvements are presented in Table 1.
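For readers reproducing the evaluation, mAP@0.5 can be computed with torchmetrics, shown below as a tooling assumption; the paper does not name its evaluation script, although YOLOv5's val.py reports the same metric.

```python
import torch
from torchmetrics.detection.mean_ap import MeanAveragePrecision

# Restricting the IoU thresholds to [0.5] makes "map" equal to mAP@0.5.
metric = MeanAveragePrecision(box_format="xyxy", iou_thresholds=[0.5])

# Toy prediction/target pair in xyxy pixel coordinates.
preds = [dict(boxes=torch.tensor([[10.0, 10.0, 50.0, 60.0]]),
              scores=torch.tensor([0.9]),
              labels=torch.tensor([0]))]
target = [dict(boxes=torch.tensor([[12.0, 8.0, 52.0, 58.0]]),
               labels=torch.tensor([0]))]

metric.update(preds, target)
print(metric.compute()["map"])  # mAP at IoU 0.5
```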
In this study, we exploited the features of the YOLOv5 model for UAV-based object detection by freezing the backbone, modifying the concatenation layers, and adding a layer to the detection head. Exploitation model IV achieved the highest performance, with an mAP of 22.4%, outperforming the other models in this experiment. Freezing the backbone alone reduced performance because the backbone weights were not updated during training. However, replacing the C3 process with a convolutional layer after the concatenation improved performance, and adding a detection head enhanced it further, with exploitation model III outperforming models I and II. Combining these modifications in exploitation model IV yielded a further improvement, demonstrating that these architectural changes can significantly enhance object detection.
Future work will explore dynamic backbone-freezing strategies and additional architectural modifications to improve performance and adaptability to new data. By extending the analysis to multiple datasets and varied conditions, we aim to comprehensively assess the model's generalizability and practical relevance, enhancing the impact of our findings.
No potential conflict of interest relevant to this article was reported.
Table 1. Performance improvement for each method in mAP.

Methods | mAP@0.5 (%)
---|---
YOLOv5-SPD [ ] | 10.47
Modified cascade R-CNN [ ] | 19.98
Exploitation model I | 8.1
Exploitation model II | 10.3
Exploitation model III | 20.9
Exploitation model IV | 22.4
YOLOv5 original architecture.
Exploitation I proposed method.
Exploitation II proposed method.
Exploitation III proposed method.
Exploitation IV proposed method.
Comparison of mAP Performance for exploitation models and original YOLOv5.
Comparison of mAP performance for exploitation models and original YOLOv5 for each class.
Dataset VisDrone [17].