International Journal of Fuzzy Logic and Intelligent Systems 2024; 24(1): 1-9
Published online March 25, 2024
https://doi.org/10.5391/IJFIS.2024.24.1.1
© The Korean Institute of Intelligent Systems
Xinzhi Hu1, Wang-Su Jeon2, Grzegorz Cielniak3, and Sang-Yong Rhee2
1Department of IT Convergence Engineering, Kyungnam University, Changwon, Korea
2Department of Computer Engineering, Kyungnam University, Changwon, Korea
3School of Computer Science, University of Lincoln, Lincoln, UK
Correspondence to: Sang-Yong Rhee (syrhee@kyungnam.ac.kr)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted noncommercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Sugar beet is a biennial herb with cold, drought, and salinity resistance and is one of the world’s major sugar crops. In addition to sugar, sugar beets are important raw materials for chemical and pharmaceutical products, and the residue after sugar extraction can be used to produce agricultural by-products, such as compound feed, giving the crop a high comprehensive utilization value [1]. Weeds in fields of crops such as sugar beet are harmful to crop growth because they compete with the crop for sunlight and nutrients; if they are not removed in time, they decrease crop yield and quality. Therefore, there is considerable interest in the development of automated machinery for selective weeding operations, whose core component is a vision system that distinguishes between crops and weeds. To address the problems of difficult weed extraction and poor detection and segmentation of region boundaries in traditional sugar beet detection, an end-to-end encoder–decoder segmentation model based on an improved UNet++ is proposed in this paper and applied to sugar beet and weed detection. UNet++ fuses feature maps from different layers through skip connections and can effectively preserve the details of sugar beet and weed images. The new model adds an attention mechanism to UNet++ by embedding an attention module into the upsampling path to suppress interference from extraneous noise. The improved model was evaluated on a sugar beet and weed dataset containing 1,026 images collected at the University of Bonn, Germany. According to the experimental results, the model significantly suppresses noise and improves segmentation accuracy.
Keywords: ATT-NestedUNet, Deep learning, Weed detection, Semantic segmentation, Sugar beet
Computer vision is now widely used in many fields, and in agriculture it can reduce the resources wasted on weed control and increase crop yield. Excessive use of herbicides for weed control not only pollutes the environment and increases agricultural costs but also affects food safety [2]. The precise detection, recognition, and analysis capabilities of computer vision have made it popular in agricultural robotics; in particular, selective weeding machines depend on such vision systems to distinguish crops from weeds.
Ronneberger et al. [3] proposed UNet as an improved network based on the fully convolutional network (FCN). The structure of UNet is symmetrical and resembles the letter U, as shown in Figure 1; the model was originally proposed for the segmentation of medical images. To satisfy the demand for more accurate image segmentation, Zhou et al. [4] proposed the UNet++ model, shown in Figure 2. By constructing nested UNet networks, they gradually enriched the high-resolution feature maps from the encoder network before fusing them with the corresponding feature maps from the decoder network. Consequently, the model captures the fine-grained details of foreground objects more efficiently.
In the nested UNet, the feature maps of the encoder and decoder networks must be fused semantically. After continuous development and improvement, UNet++ emerged as a powerful image-segmentation architecture. Its advantages are improved accuracy and a flexible network structure with deep supervision, which allows a deep network with a large number of parameters to be reduced drastically while remaining within an acceptable accuracy range.
Attention mechanisms are now widely applied to image processing tasks [5]. Fu et al. [6] proposed a framework called the Dual Attention Network (DANet) in 2019, which introduces a self-attention mechanism that captures feature dependencies in both the spatial and channel dimensions. Oktay et al. [7] used an attention mechanism to suppress feature responses in irrelevant background regions, which effectively reduced false-alarm predictions in image segmentation. Our experimental results show that combining the nested UNet with an attention mechanism can effectively eliminate noise in images and improve segmentation performance.
The remainder of this paper is organized as follows: in Section 2, we present related studies, including the achievements of UNet in semantic segmentation and UNet++ in weed identification, as proposed by scholars in recent years. In Section 3, we present the proposed approach, including our modifications to the overall framework. Section 4 presents the experimental data, operating environment, and mean Intersection over Union (MIOU) test results. The final section concludes this paper.
Shang et al. [8] proposed a deep-learning-based weed recognition system built on Res-UNet, an image segmentation network that improves on UNet by replacing its backbone with ResNet-50. This addressed the problems of difficult crop and weed extraction, poor detection of small plants, oscillation of segmentation edges, and distortion in complex backgrounds. The mean intersection over union, pixel accuracy, and training time were used as evaluation indices. Res-UNet achieved a mean intersection over union of 82.25% and an average pixel accuracy of 98.67%; its mean intersection over union was 4.74% higher than that of UNet and 10.68% higher than that of SegNet, and its training time was 3 hours shorter. These results show that UNet-based models are effective for detecting sugar beet and weeds in complex backgrounds and can serve as a reference for robotic precision weeding.
Wang and Chen [9] proposed PCAW-UNet, a real-time segmentation method for field weeds based on the UNet network. They used UNet as the backbone, fused multiscale information, and added a dual attention module at the end of the model to capture both the dependencies between pixel positions and the connections between channels when classifying each pixel of an RGB image. Dynamic weight coefficients were introduced to address the low classification accuracy caused by the unbalanced proportions of the sample categories.
Yu et al. [10] proposed an image-based deep learning method for lawn weed detection. Among the architectures they evaluated, DetectNet was the most successful deep convolutional neural network (DCNN) for detecting annual bluegrass (Poa annua L.) and various broadleaf weeds growing in dormant bermudagrass, achieving the highest F1-score.
Zhao et al. [11] proposed extracting the centerlines of ridges from remotely sensed unmanned aerial vehicle (UAV) images using an FCN. Based on high-precision visible remote sensing images acquired by UAVs, they designed a dataset annotation method that uses a sliding window to extract the centerlines of farmland ridges. The images were decomposed into blocks, and a ridge region 7–17 pixels wide around the ridge centerline was extracted using a deep learning semantic segmentation network. The results show that processing UAV remote sensing images with an FCN can yield a grid map of global ridge centerlines, which is convenient for global path planning by agricultural robots.
In this study, the classical image segmentation model UNet was improved and named ATT-NestedUNet (NestedUNet with an attention mechanism). Its network structure is a nested UNet with an attention module added to the upsampling path, as shown in Figure 4, and it is composed of four parts: downsampling, upsampling, the attention mechanism, and skip connections. Using long and short skip connections, the features of earlier layers can be effectively fused.
The convolutional block attention module (CBAM) [12] is an attention module for convolutional networks that combines spatial and channel attention. Compared with SENet, which attends only to the channel dimension, CBAM achieves better results.
Figure 3 shows the structure of the CBAM, where the dimensions of the input feature are C×H×W, the dimensions of the channel attention (CA) map are C×1×1, and the dimensions of the spatial attention map are 1×H×W.
CA is effective for capturing a wide range of information without introducing significant overhead.
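To make the module concrete, the following is a minimal PyTorch sketch of CBAM as described above. The reduction ratio of 16 and the 7×7 spatial-attention kernel are the defaults from [12]; the paper itself does not report its exact hyperparameters.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention: squeeze spatial dims with avg- and max-pooling,
    then share one MLP to produce a C x 1 x 1 weight per channel."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )

    def forward(self, x):
        avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))
        mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))
        return torch.sigmoid(avg + mx)  # C x 1 x 1 attention weights

class SpatialAttention(nn.Module):
    """Spatial attention: pool across channels, then a 7x7 conv yields a 1 x H x W map."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):
        avg = torch.mean(x, dim=1, keepdim=True)
        mx, _ = torch.max(x, dim=1, keepdim=True)
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))  # 1 x H x W

class CBAM(nn.Module):
    """Apply channel attention first and spatial attention second, as in [12]."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.ca = ChannelAttention(channels, reduction)
        self.sa = SpatialAttention()

    def forward(self, x):
        x = x * self.ca(x)
        return x * self.sa(x)
```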
UNet++ [13] makes three additions to the original UNet:
1. It combines a DenseNet-like structure with dense skip connections to improve gradient flow.
2. It fills the hollow structure of UNet to bridge the semantic gap between the encoder and decoder feature maps.
3. It enables pruning through deep supervision.
The advantages of UNet++ are as follows: by embedding UNet models of different depths in UNet++, improved segmentation performance can be achieved for objects of different sizes, which is an improvement over the fixed-depth UNet. All of the embedded UNets share a single encoder, and their decoders are interwoven, so UNet++ can train all of them simultaneously. By pruning the trained UNet++, the inference speed is accelerated while performance is maintained.
In ATT-NestedUNet, the encoder is densely connected to the decoder via skip connections to fuse the shallow and deep features. Finally, a 1×1 convolutional layer and a sigmoid activation function follow nodes x0,1, x0,2, x0,3, and x0,4, producing a segmentation map of the sugar beet and weed images with the same size as the original input, as shown in Figure 4.
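As an illustration of how one nested decoder node combines upsampling, attention, and skip features, here is a simplified PyTorch sketch. It reuses the CBAM class from the sketch above; the channel sizes, block internals, and single-channel head are illustrative assumptions rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvBlock(nn.Module):
    """Two 3x3 conv + BN + ReLU layers: the basic block of UNet-style nodes."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

class AttentionUpNode(nn.Module):
    """One nested decoder node x^{i,j}: upsample the deeper feature, apply CBAM
    to suppress background noise, concatenate all skip features, then convolve."""
    def __init__(self, deep_ch, skip_ch, out_ch):
        super().__init__()
        self.cbam = CBAM(deep_ch)  # CBAM class from the earlier sketch
        self.conv = ConvBlock(deep_ch + skip_ch, out_ch)

    def forward(self, deep, skips):
        up = F.interpolate(deep, scale_factor=2, mode="bilinear", align_corners=False)
        up = self.cbam(up)  # attention applied on the upsampling path
        return self.conv(torch.cat([up, *skips], dim=1))

# Final head after node x^{0,4}: a 1x1 convolution and a sigmoid, as described above
# (32 input channels and 1 output channel are assumed values).
head = nn.Sequential(nn.Conv2d(32, 1, kernel_size=1), nn.Sigmoid())
```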
ATT-NestedUNet can be pruned into four structures of different depths: ATT-NestedUNet (L1), ATT-NestedUNet (L2), ATT-NestedUNet (L3), and ATT-NestedUNet (L4), as shown in Figure 5.
An advantage of this design over UNet is that the network depth does not have to be chosen explicitly, because UNets of different depths are embedded in the structure. All of these UNets partially share a single encoder, whereas their decoders are intertwined.
By training ATT-NestedUNet under deep supervision, all constituent UNets can be trained simultaneously while benefiting from a shared image representation. This improves the overall segmentation performance and allows the model to be pruned during inference. As shown in Table 1, deep supervision exposes the outputs of the models at different depths so that they can be compared, and the depth of the network can therefore be designed more appropriately.
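The following hedged sketch shows how deep supervision makes pruning possible: each nested output x0,j receives its own 1×1 head during training, and at inference reading an earlier head is equivalent to pruning the network to that depth. The class and argument names are illustrative assumptions.

```python
import torch.nn as nn

class DeepSupervisionHeads(nn.Module):
    """One 1x1 output head per nested depth; training uses all heads,
    while inference may read a single head, which prunes the deeper decoder."""
    def __init__(self, channels=32, num_classes=1, depths=4):
        super().__init__()
        self.heads = nn.ModuleList(
            [nn.Conv2d(channels, num_classes, kernel_size=1) for _ in range(depths)]
        )

    def forward(self, nested_feats, prune_depth=None):
        # nested_feats = [x01, x02, x03, x04] from the nested decoder
        if prune_depth is not None:  # inference at depth Lj; deeper nodes are unused
            return self.heads[prune_depth - 1](nested_feats[prune_depth - 1])
        return [h(f) for h, f in zip(self.heads, nested_feats)]  # training: all outputs
```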
The image dataset used in this study was obtained from sugar beet and weed images collected at the University of Bonn, Germany [14]. The images were acquired on May 23, 2016, using a multifunctional robot manufactured by Bosch Deepfield Robotics, with a JAI AD-130GE camera providing images at a maximum resolution of 480×360 pixels. Figure 6 shows the farm information collection robot BoniRob, and Figure 7 shows a collected sample and its label map, in which the green labels represent sugar beets and the red labels represent weeds.
Eighty percent of the beet dataset was used for model training and 20% for evaluating the accuracy of the model, giving 820 images in the training set and 206 images in the validation set, all in PNG format with a resolution of 480×360 pixels. As shown in Figure 8, the dataset contains four categories of image: sugar beets only, weeds only, a mixture of sugar beets and weeds, and neither sugar beets nor weeds. We first preprocessed each image to obtain a normalized average intensity for each channel and separated the vegetation from the rest of the image (mainly soil).
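The paper does not specify how the vegetation was separated; a common approach consistent with the per-channel normalization described above is excess-green thresholding. The sketch below is therefore an assumption, including the threshold value.

```python
import numpy as np

def vegetation_mask(img_rgb: np.ndarray, thresh: float = 0.1) -> np.ndarray:
    """Separate vegetation from soil using the excess-green (ExG) index.
    ExG and the threshold are illustrative assumptions, not the authors' exact method."""
    img = img_rgb.astype(np.float32)
    s = img.sum(axis=2, keepdims=True) + 1e-6
    n = img / s                              # per-pixel channel normalization
    r, g, b = n[..., 0], n[..., 1], n[..., 2]
    exg = 2.0 * g - r - b                    # excess-green index, high for plants
    return exg > thresh                      # boolean vegetation mask
```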
Semantic segmentation based on deep learning requires a sufficiently large sample dataset, and the number of images in our training set could not meet this requirement on its own. Therefore, data augmentation was used to enlarge the dataset and reduce overfitting between the image samples and the training network. In our experiments, one of four augmentation transforms (a random 90° rotation, a random flip, a transposition, or a combination of hue and brightness adjustment and cropping) was selected for each image according to a normalized probability. This effectively expanded the dataset and reduced the distortion and overfitting caused by an insufficient number of images. Finally, the images were resized to 480×480 pixels.
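A minimal sketch of such a pipeline, assuming the Albumentations library, is shown below; the individual probabilities are illustrative, since the paper specifies only the set of transforms and the final 480×480 size.

```python
import albumentations as A

# One of the four transform branches is drawn per image; branch probabilities
# are illustrative assumptions, not values reported in the paper.
train_transform = A.Compose([
    A.OneOf([
        A.RandomRotate90(p=1.0),
        A.Flip(p=1.0),                      # random horizontal/vertical flip
        A.Transpose(p=1.0),
        A.Compose([                         # hue/brightness/crop branch
            A.HueSaturationValue(p=0.5),
            A.RandomBrightnessContrast(p=0.5),
            A.RandomCrop(height=360, width=360, p=0.5),
        ]),
    ], p=1.0),
    A.Resize(height=480, width=480),        # final size used in the paper
])

# Albumentations applies the same geometric ops to the image and its label map.
augmented = train_transform(image=image, mask=mask)
```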
The experiments were run on a Windows 10 (64-bit) operating system with Anaconda 4.10.3, Python 3.9, CUDA 11.3, and cuDNN 8.2.1, on an AMD Ryzen 9 5900X 12-core 3.70 GHz processor with 32 GB of memory, using PyTorch as the deep learning framework.
To improve the performance of the model and enhance its generalization ability, the training parameters of the semantic segmentation model were set as follows. The stochastic gradient descent (SGD) optimizer was used with momentum: one drawback of plain SGD is that its update direction depends entirely on the gradient of the current batch and is therefore unstable, whereas the momentum algorithm borrows the physical concept of momentum to simulate the inertia of a moving object. The momentum was set to 0.9, the learning rate to $10^{-2}$, the minimum learning rate to $10^{-4}$, and the weight decay to $10^{-4}$; the loss function was BCEDiceLoss, and the number of epochs was 200.
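A hedged sketch of this training setup follows. The cosine annealing schedule is an assumption: the paper states only the initial ($10^{-2}$) and minimum ($10^{-4}$) learning rates, not the decay rule, and `model` and `train_loader` are placeholders for the network and data pipeline.

```python
import torch

# model and train_loader are assumed to be defined elsewhere;
# criterion = BCEDiceLoss() as implemented in the next sketch.
optimizer = torch.optim.SGD(
    model.parameters(), lr=1e-2, momentum=0.9, weight_decay=1e-4
)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=200, eta_min=1e-4  # 200 epochs, floor at the minimum LR
)

for epoch in range(200):
    for images, masks in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), masks)
        loss.backward()
        optimizer.step()
    scheduler.step()  # anneal the learning rate once per epoch
```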
The loss function [15, 16] used in this study combines two loss functions, binary cross-entropy (BCE) and the Dice coefficient; the experimental results show that this combination effectively accelerates the convergence of the model. The loss function is

$$L = L_{BCE} + L_{Dice},$$

where

$$L_{BCE} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log \hat{y}_i + (1-y_i)\log(1-\hat{y}_i)\right], \qquad L_{Dice} = 1 - \frac{2\sum_{i=1}^{N} y_i \hat{y}_i}{\sum_{i=1}^{N} y_i + \sum_{i=1}^{N} \hat{y}_i},$$

with $y_i$ the ground-truth label of pixel $i$, $\hat{y}_i$ the predicted probability, and $N$ the number of pixels.
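A compact PyTorch implementation of this combined loss, written as a sketch: the smoothing constant is an assumed implementation detail, and the module takes raw logits rather than sigmoid outputs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BCEDiceLoss(nn.Module):
    """Sum of binary cross-entropy and Dice loss, matching the formula above."""
    def __init__(self, smooth: float = 1.0):
        super().__init__()
        self.smooth = smooth  # assumed smoothing term to avoid division by zero

    def forward(self, logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        bce = F.binary_cross_entropy_with_logits(logits, target)
        prob = torch.sigmoid(logits)
        inter = (prob * target).sum()
        dice = 1 - (2 * inter + self.smooth) / (prob.sum() + target.sum() + self.smooth)
        return bce + dice
```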
MIOU is a standard measure for semantic segmentation [17, 18]. IOU is the ratio of the intersection to the union of the ground-truth and predicted segments, and MIOU averages the IOU over all classes. MIOU was calculated using the following formula:

$$\mathrm{MIOU} = \frac{1}{k+1}\sum_{i=0}^{k}\frac{TP_i}{TP_i + FP_i + FN_i},$$

where, for each class $i$, TP (true positive) means that the true value is positive and the prediction is positive; FN (false negative) means that the true value is positive and the prediction is negative; FP (false positive) means that the true value is negative and the prediction is positive; and TN (true negative) means that the true value is negative and the prediction is negative.
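The metric can be computed directly from label maps. The sketch below assumes integer class maps with 0 = background, 1 = weed, and 2 = sugar beet, matching the three classes reported in Table 2.

```python
import numpy as np

def mean_iou(pred: np.ndarray, gt: np.ndarray, num_classes: int = 3) -> float:
    """Per-class IoU = TP / (TP + FP + FN), averaged over classes.
    pred and gt are integer label maps of the same shape."""
    ious = []
    for c in range(num_classes):
        tp = np.sum((pred == c) & (gt == c))
        fp = np.sum((pred == c) & (gt != c))
        fn = np.sum((pred != c) & (gt == c))
        union = tp + fp + fn
        if union > 0:  # skip classes absent from both maps
            ious.append(tp / union)
    return float(np.mean(ious))
```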
To select the model best suited to the research object of this study, this experiment compared the UNet, UNet++, and ATT-NestedUNet models, using MIOU as the evaluation index. As listed in Table 2, ATT-NestedUNet achieved MIOU gains of 0.6% and 0.38% over UNet and UNet++, respectively, demonstrating a better training effect than the other two algorithms. Comparing the three models, the MIOU values for the background did not differ significantly, whereas the MIOU values for beets and weeds were relatively high and the segmentation results were correspondingly better.
As shown in Figures 9–12, the segmentation results of ATT-NestedUNet were compared with those of the other two algorithms. To reduce the effect of background complexity, the background of the dataset was rendered black, the beets green, and the weeds red; to display the segmentation results more clearly, they were converted to RGB format. The outputs were divided into three groups labeled background, beet, and weed. The experimental results show that, compared with the other algorithms, ATT-NestedUNet produces fewer false pixels and segments beet and weed images more accurately.
Segmentation results are shown for beets only, weeds only, background only, and beets and weeds together in Figures 9–12.
To segment beet and weed images more accurately, an improved algorithm called ATT-NestedUNet (NestedUNet with an attention mechanism) was proposed in this study. The model follows the design concept of UNet++ and adds an attention module to the upsampling path. To verify its segmentation performance, the model was trained on 820 sugar beet and weed images and tested on a test set of 206 sugar beet and weed images. The experimental results showed that ATT-NestedUNet achieved better results than the comparison algorithms in both subjective visual perception and objective evaluation metrics.
Because the proposed network is not easy to deploy on low-power mobile devices, future work will include optimizing the network parameters and reducing the computational cost. Our current goal is to reduce the computation time without degrading the segmentation accuracy.
No potential conflict of interest relevant to this article was reported.
Table 1. MIOU values of ATT-NestedUNet pruned to depths L1–L4.

| Model | MIOU |
|---|---|
| ATT-NestedUNet L1 | 90.03% |
| ATT-NestedUNet L2 | 91.14% |
| ATT-NestedUNet L3 | 91.43% |
| ATT-NestedUNet L4 | 91.42% |
Table 2. MIOU values for the three trained models.

| Model | MIOU | Background | Weed | Sugar beet |
|---|---|---|---|---|
| UNet | 90.82% | 95.29% | 95.72% | 98.95% |
| NestedUNet | 91.04% | 95.29% | 95.73% | 98.96% |
| ATT-NestedUNet | 91.42% | 95.29% | 95.75% | 98.98% |
E-mail: huxinzhi0326@gmail.com
E-mail: jws2218@naver.com
E-mail: Grzegorz.Cielniak@gmail.com
E-mail: syrhee@kyungnam.ac.kr
Figure 1. The structure diagram of UNet.
Figure 2. The structure diagram of UNet++.
Figure 3. Structure diagram of CBAM.
Figure 4. Structure diagram of ATT-NestedUNet.
Figure 5. Four ATT-NestedUNet structures of different depths.
Figure 6. Farmland information collection robot BoniRob.
Figure 7. Collection sample and labeling map: sugar beet (green), weed (red).
Figure 8. Example instances from the dataset: (a) weeds only, (b) beets only, (c) beets and weeds, (d) soil free of beets and weeds.
Figure 9. Experimental results with beets only.
Figure 10. Experimental results with weeds only.
Figure 11. Experimental results with background only.
Figure 12. Experimental results with beets, weeds, and background.