Original Article

International Journal of Fuzzy Logic and Intelligent Systems 2024; 24(1): 1-9

Published online March 25, 2024

https://doi.org/10.5391/IJFIS.2024.24.1.1

© The Korean Institute of Intelligent Systems

ATT-NestedUnet: Sugar Beet and Weed Detection Using Semantic Segmentation

Xinzhi Hu1, Wang-Su Jeon2, Grzegorz Cielniak3, and Sang-Yong Rhee2

1Department of IT Convergence Engineering, Kyungnam University, Changwon, Korea
2Department of Computer Engineering, Kyungnam University, Changwon, Korea
3School of Computer Science, University of Lincoln, Lincoln, UK

Correspondence to:
Sang-Yong Rhee (syrhee@kyungnam.ac.kr)

Received: April 17, 2023; Accepted: March 20, 2024

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted noncommercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Sugar beet is a biennial herb with cold, drought, and salinity resistance and is one of the world’s major sugar crops. In addition to sugar, sugar beets are important raw materials for chemical and pharmaceutical products, and the residue after sugar extraction can be used to produce agricultural by-products, such as compound feed, which has a high comprehensive utilization value [1]. Weeds in fields of crops such as sugar beet are harmful to crop growth because they compete with the crop for sunlight and nutrients. If weeds are not removed in time during crop growth, they reduce crop yield and quality. Therefore, there is considerable interest in the development of automated machinery for selective weeding operations. The core component of this technology is a vision system that distinguishes between crops and weeds. To address the problems of difficult weed extraction, poor detection, and poor segmentation of region boundaries in traditional sugar beet detection, an end-to-end encoder–decoder model based on an improved UNet++ is proposed in this paper and applied to sugar beet and weed detection. UNet++ fuses feature maps from different layers through skip connections and can effectively preserve the details of sugar beet and weed images. The new model adds an attention mechanism to UNet++ by embedding the attention module into the upsampling process to suppress interference from extraneous noise. The improved model was evaluated on a sugar beet and weed dataset containing 1026 images collected at the University of Bonn, Germany. According to the experimental results, the model significantly reduces noise and improves segmentation accuracy.

Keywords: ATT-NestedUNet, Deep learning, Weed detection, Semantic segmentation, Sugar beet

1. Introduction

Computer vision is now widely used in many fields, including agriculture, because it offers an efficient and simple way to reduce the waste of agricultural resources in weed control and to increase crop yield. Excessive use of herbicides for weed control not only pollutes the environment and increases agricultural costs, but also affects food safety [2]. The precise detection, recognition, and analysis capabilities of computer vision have made it popular in agricultural robotics, and selective weeding machines in particular require a vision system that can distinguish crops from weeds.

Ronneberger et al. [3] proposed UNet as an improved network based on the fully convolutional network (FCN). The structure of UNet is symmetrical and resembles the letter “U”, as shown in Figure 1, and the model was first proposed for the segmentation of medical images. To satisfy the demand for more accurate image segmentation, Zhou et al. [4] proposed the UNet++ model, as shown in Figure 2. By constructing nested UNet networks, they gradually enriched the high-resolution feature maps from the encoder network before fusing them with the corresponding feature maps from the decoder network. Consequently, the model captures the fine-grained details of foreground objects more efficiently.

The nested structure brings the encoder and decoder feature maps that are to be fused semantically closer together. After continuous development and improvement, UNet++ has become a powerful image-segmentation architecture. Its advantages are improved accuracy and a flexible network structure with deep supervision, which allows a deep network with a large number of parameters to be pruned to a much smaller one within an acceptable accuracy range.

Attention mechanisms are now widely applied to image processing tasks [5]. Fu et al. [6] proposed a framework called the Dual Attention Network (DANet) in 2019, which introduces a self-attention mechanism that captures feature dependencies in the spatial and channel dimensions. Oktay et al. [7] used an attention mechanism to suppress feature responses in irrelevant background regions, which effectively reduced false alarm predictions in image segmentation. Our experimental results show that combining the nested UNet with an attention mechanism can effectively eliminate noise in images and improve segmentation performance.

The remainder of this paper is organized as follows: in Section 2, we present related studies, including the achievements of UNet in semantic segmentation and UNet++ in weed identification, as proposed by scholars in recent years. In Section 3, we present the proposed approach, including our modifications to the overall framework. Section 4 presents the experimental data, operating environment, and mean Intersection over Union (MIOU) test results. The final section concludes this paper.

2. Related Works

Shang et al. [8] proposed a deep-learning-based weed recognition system built on Res-UNet, an image segmentation network that improves on UNet. By using ResNet-50 in place of the UNet backbone network, the problems of difficult crop and weed extraction, poor detection of small plants, oscillation of segmentation edges, and distortion under complex backgrounds were alleviated. The average intersection rate, pixel accuracy, and training time were used as evaluation indices. With the Res-UNet model, the average intersection rate was 82.25% and the average pixel accuracy was 98.67%; the average intersection rate was 4.74% higher than that of UNet and 10.68% higher than that of SegNet, and the training time was reduced by 3 hours. These results show that UNet-based models are effective for detecting sugar beet weeds in complex backgrounds and can serve as a reference for robotic precision weeding.

Wang and Chen [9] proposed PCAW-UNet, a real-time segmentation method for field weeds based on the UNet network. They used UNet as the backbone network, fused multiscale information, and added a dual attention module at the end of the model to account for the dependencies between pixel positions and the connections between different channels, classifying each pixel of the RGB image into its category. Dynamic weight coefficients were introduced to address the low classification accuracy caused by the unbalanced proportion of sample categories.

Yu et al. [10] proposed an image-based deep learning method for weed detection in turfgrass. In their study, DetectNet was the most successful deep convolutional neural network (DCNN) architecture for detecting annual bluegrass (Poa annua L.) and various broadleaf weeds growing in dormant bermudagrass, achieving an F1-score above 0.99.

Zhao et al. [11] proposed extracting the centerlines of field ridges from unmanned aerial vehicle (UAV) remote sensing images using an FCN. Based on high-precision visible remote sensing images acquired by UAVs, they designed a dataset annotation method using a sliding window to extract the centerlines of farmland ridges. The images were decomposed into blocks, and a ridge region 7–17 pixels wide near the ridge centerline was extracted using a deep learning semantic segmentation network. The results show that processing UAV remote sensing images with an FCN can yield a grid map of global ridge centerlines, which is convenient for global path planning by agricultural robots.

3. Methodology and Results

In this study, the classical image segmentation model UNet was improved and named ATT-NestedUNet (NestedUNet with an attention mechanism). Its network structure is a nested UNet with an attention mechanism module added to the upsampling path, as shown in Figure 4, and it is composed of four parts: downsampling, upsampling, the attention mechanism, and skip connections. Through long and short skip connections, the features of the previous layers can be effectively fused.

3.1 Methodology

The convolutional block attention module (CBAM) [12] is an attention module for convolutional networks that combines spatial and channel attention. Compared with SENet, which focuses only on channel attention, CBAM achieves better results.

Figure 3 shows the structure of the CBAM, where the dimensions of the input feature are C×H×W, the dimensions of the channel attention (CA) map are C×1×1, and the dimensions of the spatial attention map are 1×H×W.

CA is effective for capturing information over a larger range without introducing significant overhead.
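
To make the module concrete, the following is a minimal PyTorch sketch of a CBAM-style block, assuming the standard formulation of Woo et al. [12]; the reduction ratio of 16 and the 7×7 spatial kernel are common defaults and are assumptions here, not the exact settings used in our experiments.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttention(nn.Module):
    """Channel attention: pool H x W with average and max pooling,
    pass both descriptors through a shared MLP, and sum the results."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )

    def forward(self, x):
        avg = self.mlp(F.adaptive_avg_pool2d(x, 1))
        mx = self.mlp(F.adaptive_max_pool2d(x, 1))
        return torch.sigmoid(avg + mx)                     # (N, C, 1, 1)

class SpatialAttention(nn.Module):
    """Spatial attention: concatenate channel-wise average and max maps,
    then reduce them to a single map with a 7x7 convolution."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)
        mx, _ = x.max(dim=1, keepdim=True)
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))  # (N, 1, H, W)

class CBAM(nn.Module):
    """Apply channel attention followed by spatial attention."""
    def __init__(self, channels, reduction=16, kernel_size=7):
        super().__init__()
        self.ca = ChannelAttention(channels, reduction)
        self.sa = SpatialAttention(kernel_size)

    def forward(self, x):
        x = x * self.ca(x)
        return x * self.sa(x)
```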

UNet++ [13] makes three additions to the original UNet:

  1. It combines a DenseNet-like structure with dense skip connections to improve gradient flow.

  2. It fills the hollow structure of UNet to bridge the semantic gap between the encoder and decoder feature maps.

  3. It enables pruning through deep supervision.

The advantages of UNet++ are as follows. By embedding UNets of different depths in UNet++, improved segmentation performance can be achieved for objects of different sizes, which is an improvement over the fixed-depth UNet. All of the embedded UNets share a single encoder, and their decoders are intertwined, so UNet++ can train all of them simultaneously. By pruning the trained UNet++, the inference speed can be accelerated while maintaining performance.

In ATT-NestedUNet, the encoder is densely connected to the decoder via skip connections to fuse shallow and deep features. Finally, a 1×1 convolutional layer and a sigmoid activation function follow nodes x0,1, x0,2, x0,3, and x0,4 to output a segmentation map of the sugar beet and weed image with the same size as the original input, as shown in Figure 4.
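
As a rough illustration of how the attention module can sit in the upsampling path and how the segmentation head is formed, the sketch below shows a single hypothetical decoder node and the 1×1 output head; the channel widths, the reuse of the CBAM class from the sketch above, and the exact placement of the attention module are illustrative assumptions rather than the published implementation.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Two 3x3 convolutions with batch normalization and ReLU."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

class AttentionUpNode(nn.Module):
    """One nested decoder node x(i,j): upsample the deeper feature map,
    apply CBAM, concatenate with the skip features from the same row,
    and fuse everything with a convolution block."""
    def __init__(self, deep_ch, skip_ch, out_ch):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=True)
        self.att = CBAM(deep_ch)                 # CBAM class from the sketch above
        self.conv = ConvBlock(deep_ch + skip_ch, out_ch)

    def forward(self, deep, skips):
        d = self.att(self.up(deep))
        return self.conv(torch.cat([d, *skips], dim=1))

# Segmentation head following node x(0,4): a 1x1 convolution and a sigmoid.
# The 32 input channels and 3 output classes below are illustrative assumptions.
head = nn.Conv2d(32, 3, kernel_size=1)
x04 = torch.randn(1, 32, 480, 480)               # stand-in for the x(0,4) feature map
probs = torch.sigmoid(head(x04))                  # per-pixel class probabilities
```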

3.2 Four Different Depths of the ATT-NestedUNet Structures

The ATT-NestedUNet model can be pruned into four structures of different depths: ATT-NestedUNet (L1), ATT-NestedUNet (L2), ATT-NestedUNet (L3), and ATT-NestedUNet (L4), as shown in Figure 5.

An advantage of UNet++ over UNet is that the network depth does not have to be chosen explicitly, because UNets of different depths are embedded in the structure. All of these UNets partially share a single encoder, whereas their decoders are intertwined.

By training ATT-NestedUNet under deep supervision, all constituent UNets can be trained simultaneously while benefiting from a shared image representation. This improves the overall segmentation performance and allows the model to be pruned during inference. As shown in Table 1, deep supervision exposes the outputs of models of different depths, so the performance at each depth can be observed and the network depth can be designed accordingly.
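
The following is a minimal sketch of how deep supervision and pruning can be realized, assuming one segmentation output per node x0,1–x0,4 and the BCE-Dice loss of Section 4.4; whether the four losses are averaged or summed is not stated in the text, so the averaging below is an assumption.

```python
# `outputs` holds the four segmentation maps produced from x(0,1) ... x(0,4),
# and `criterion` is assumed to be the BCE-Dice loss described in Section 4.4.

def deep_supervision_loss(outputs, target, criterion):
    """Training: average the loss over the outputs of all four depths (L1-L4)."""
    losses = [criterion(out, target) for out in outputs]
    return sum(losses) / len(losses)

def pruned_predict(outputs, depth):
    """Pruned inference: keep only the output of the chosen depth, e.g.
    depth=3 uses the prediction from x(0,3) and discards the deeper branch."""
    return outputs[depth - 1]
```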

4. Experimental Setup

4.1 Dataset

The image dataset used in this study was obtained from sugar beet and weed images collected at the University of Bonn, Germany [14]. Images were acquired using a multifunctional robot manufactured by Bosch Deepfield Robotics. The acquisition device was a JAI AD-130GE camera, which provided images with a maximum resolution of 480×360 pixels, and the acquisition was conducted on May 23, 2016. Figure 6 shows the farmland information collection robot BoniRob, and Figure 7 shows a collected sample and its label map from the dataset. Here, the green labels represent sugar beets, and the red labels represent weeds.

Eighty percent of the data in the beet dataset were used for model training, and 20% were used to evaluate the accuracy of the model. There were 820 images in the training set and 206 images in the validation set, all in PNG format with an image resolution of 480×360 pixels. As shown in Figure 8, there were four categories in the dataset: sugar beets only, weeds only, a mixture of sugar beets and weeds, and neither sugar beets nor weeds. We first preprocessed each image to obtain a normalized average intensity for each channel and separated the vegetation from the rest of the image (mainly soil).

4.2 Data Preprocessing

Semantic segmentation based on deep learning requires a sufficiently large sample dataset, and the number of sample images in the training set alone cannot meet the experimental requirements. Therefore, data augmentation was used to increase the amount of training data and to reduce overfitting of the network to the image samples. In our experiments, one of the geometric transforms (random 90° rotation, random flipping, or transposition) was selected at random, and hue, brightness, and cropping transforms were applied according to normalized probabilities. This effectively expands the dataset and mitigates the distortion and overfitting caused by an insufficient number of images. Finally, the images were resized to 480×480 pixels.
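
The pipeline described above can be expressed, for example, with the Albumentations library; the sketch below is one plausible configuration, and the individual probabilities and crop size are assumptions rather than the exact values used in our experiments.

```python
import numpy as np
import albumentations as A

# Randomly pick one geometric transform, then apply hue, brightness, and
# cropping transforms with their own probabilities, and resize to 480x480.
train_transform = A.Compose([
    A.OneOf([
        A.RandomRotate90(p=1.0),
        A.HorizontalFlip(p=1.0),
        A.VerticalFlip(p=1.0),
        A.Transpose(p=1.0),
    ], p=0.75),
    A.HueSaturationValue(p=0.3),
    A.RandomBrightnessContrast(p=0.3),
    A.RandomCrop(height=360, width=360, p=0.3),
    A.Resize(height=480, width=480),
])

image = np.zeros((360, 480, 3), dtype=np.uint8)   # placeholder image (H, W, C)
mask = np.zeros((360, 480), dtype=np.uint8)       # placeholder label mask
augmented = train_transform(image=image, mask=mask)
image_aug, mask_aug = augmented["image"], augmented["mask"]
```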

4.3 Experimental Environment Setup

The experiments were run on a Windows 10 (64-bit) operating system with an AMD Ryzen 9 5900X 12-core processor (3.70 GHz) and 32 GB of memory, using Anaconda 4.10.3, Python 3.9, CUDA 11.3, cuDNN 8.2.1, and the PyTorch deep learning framework as the development environment.

To improve the performance of the model and enhance its generalization ability, the training parameters of the semantic segmentation model were set as follows. We used the stochastic gradient descent (SGD) optimizer with momentum: one drawback of plain SGD is that its update direction depends entirely on the gradient computed from the current batch and is therefore unstable, whereas the momentum algorithm borrows the physical concept of momentum to simulate the inertia of a moving object. The momentum was set to 0.9, the learning rate to 10−2, the minimum learning rate to 10−4, and the weight decay to 10−4; the loss function was BCEDiceLoss, and the number of epochs was 200.
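
With the hyperparameters listed above, the training configuration can be sketched in PyTorch as follows; the model class name, the data loader, and the use of a cosine schedule to reach the stated minimum learning rate are assumptions, since the paper does not name the scheduler.

```python
import torch
from torch.optim import SGD
from torch.optim.lr_scheduler import CosineAnnealingLR

model = ATTNestedUNet()                      # hypothetical model class name
criterion = BCEDiceLoss()                    # combined loss from Section 4.4
optimizer = SGD(model.parameters(),
                lr=1e-2,                     # initial learning rate
                momentum=0.9,                # momentum stated in Section 4.3
                weight_decay=1e-4)
# One way (an assumption) to realize the stated minimum learning rate of 1e-4
# is a cosine schedule that decays to that value over the 200 epochs.
scheduler = CosineAnnealingLR(optimizer, T_max=200, eta_min=1e-4)

for epoch in range(200):
    for images, masks in train_loader:       # train_loader: hypothetical DataLoader
        optimizer.zero_grad()
        loss = criterion(model(images), masks)
        loss.backward()
        optimizer.step()
    scheduler.step()
```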

4.4 Loss Function

The loss function [15, 16] used in this study combines two losses, the binary cross-entropy (BCE) and the Dice loss, and the experimental results show that this combination effectively accelerates the convergence of the model. The loss functions are as follows:

\mathrm{BCE} = -\frac{1}{2}\left(Y \times \log(\sigma(X)) + (1-Y) \times \log(1-\sigma(X))\right),

\mathrm{Dice} = 1 - \frac{2XY}{X+Y+S},

\mathrm{loss} = \omega \times \mathrm{BCE} + (1-\omega) \times \mathrm{Dice},

where σ(·) is the sigmoid function, X is the predicted output of the model, Y is the label, S = 1 is a smoothing term, and ω is the balance coefficient between the two losses, set here to 0.5.
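
A minimal PyTorch implementation of this combined loss might look as follows, with ω = 0.5 and smoothing term S = 1 as stated above; computing the Dice term over the whole batch and averaging the BCE over pixels (rather than using the explicit 1/2 factor) are simplifying assumptions of the sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BCEDiceLoss(nn.Module):
    """Weighted sum of binary cross-entropy and Dice loss: omega*BCE + (1-omega)*Dice."""
    def __init__(self, omega=0.5, smooth=1.0):
        super().__init__()
        self.omega = omega
        self.smooth = smooth          # the S term in the Dice formula

    def forward(self, logits, target):
        # BCE computed on raw logits (the sigmoid is applied internally for stability).
        bce = F.binary_cross_entropy_with_logits(logits, target)
        # Dice = 1 - 2*X*Y / (X + Y + S), computed on sigmoid probabilities.
        probs = torch.sigmoid(logits)
        inter = (probs * target).sum()
        dice = 1.0 - (2.0 * inter) / (probs.sum() + target.sum() + self.smooth)
        return self.omega * bce + (1.0 - self.omega) * dice
```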

4.5 Evaluation Metrics

MIOU is a standard measure of semantic segmentation [17, 18]. IOU refers to the ratio of the intersection and union of the ground truth and predicted segments. The MIOU is the average of the IOUs of all images. MIOU was calculated using the following formula:

\mathrm{MIOU} = \frac{1}{n}\sum_{i=1}^{n}\left[\frac{TP}{FP+FN+TP}\right]_i,

where TP (true positive) indicates that the true value is positive and the prediction is positive, FN (false negative) means that the true value is positive and the prediction is negative, FP (false positive) means that the true value is negative and the prediction is positive, and TN (true negative) means that the true value is negative and the prediction is negative.
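
As an illustration, the per-image IOU and its mean can be computed as in the sketch below, assuming binary prediction and ground-truth masks for a single class; extending this to the per-class values reported in Table 2 only requires repeating the computation per class.

```python
import numpy as np

def iou(pred, truth):
    """IOU = TP / (TP + FP + FN) for a pair of binary masks."""
    pred = pred.astype(bool)
    truth = truth.astype(bool)
    tp = np.logical_and(pred, truth).sum()
    fp = np.logical_and(pred, ~truth).sum()
    fn = np.logical_and(~pred, truth).sum()
    return tp / (tp + fp + fn + 1e-10)       # small epsilon avoids division by zero

def mean_iou(preds, truths):
    """MIOU: average of the per-image IOU values over all n images."""
    return float(np.mean([iou(p, t) for p, t in zip(preds, truths)]))
```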

4.6 Experimental Results

To select a model better suited to the research object of this study, this experiment compared the UNet, UNet++, and ATT-NestedUNet models, using the MIOU as the main evaluation index. As listed in Table 2, ATT-NestedUNet achieved MIOU gains of 0.6% and 0.38% over UNet and UNet++, respectively. A comparison between ATT-NestedUNet and the other two algorithms shows that ATT-NestedUNet has a better training effect. Across the three models, the MIOU values for the background did not differ significantly, whereas the MIOU values for beets and weeds were relatively high and the segmentation effect was better.

As shown in Figures 9–12, the results of ATT-NestedUNet were compared with those of the other two algorithms. To reduce the effect of background complexity, the background of the dataset was rendered black, the beets green, and the weeds red; to better display the segmentation results, we converted them to RGB format. The output was divided into three folders labeled background, beet, and weed. The experimental results show that, compared with the other algorithms, ATT-NestedUNet produces fewer false pixels and segments beet and weed images more accurately.

The segmentation results for beets only, weeds only, background only, and both beets and weeds are shown in Figures 9–12, respectively. To better present the experimental segmentation results, they were divided into different folders by category. The first row in each figure shows the segmentation results for the beets, and the second row shows the segmentation results for the weeds. The figures show that UNet produces misdetected pixels, whereas ATT-NestedUNet segments better and produces fewer misdetected pixels.

5. Conclusion

To segment beet and weed images more accurately, an improved algorithm called ATT-NestedUNet (NestedUNet with an attention mechanism) is proposed in this study. The model follows the design concept of UNet++ and adds an attention mechanism module to the upsampling path. To verify its segmentation performance, the model was trained on 820 sugar beet and weed images and tested on a set of 206 sugar beet and weed images. The experimental results showed that ATT-NestedUNet achieved better results than the comparison algorithms in both subjective visual perception and objective evaluation metrics.

Considering that the proposed network is not easy to implement on low-power mobile devices, future work will include optimizing the network parameters and reducing the computational costs. Currently, our goal is to improve the computation time without degrading the segmentation accuracy.

Acknowledgments

This research was supported by the “Regional Innovation Strategy (RIS)” project through the National Research Foundation of Korea (NRF), funded by the Ministry of Education (MOE) (No. 2021RIS-003).

Fig. 1.

The structure diagram of UNet.


Fig. 2.

The structure diagram of UNet++.


Fig. 3.

Structure diagram of CBAM.


Fig. 4.

Structure diagram of ATT-NestedUNet.


Fig. 5.

Four ATT-NestedUNet structures of different depths.


Fig. 6.

Farmland information collection robot BoniRob.


Fig. 7.

Collected sample and labeling map: sugar beet (green) and weed (red).


Fig. 8.

Example instances from the dataset (a) weeds only, (b) beets only, (c) beets and weeds, (d) soil free of beets and weeds.


Fig. 9.

Experiment results with beets only.


Fig. 10.

Experiment results with weeds only.


Fig. 11.

Background-only experimental results.


Fig. 12.

Experiment results with beets, weeds, background.


Table 1. MIOU values of L1, L2, L3, and L4.

Model                 MIOU
ATT-NestedUNet L1     90.03%
ATT-NestedUNet L2     91.14%
ATT-NestedUNet L3     91.43%
ATT-NestedUNet L4     91.42%

Table 2. MIOU values for the three trained models.

Model             MIOU      Background   Weed      Sugar beet
UNet              90.82%    95.29%       95.72%    98.95%
Nested UNet       91.04%    95.29%       95.73%    98.96%
ATT-NestedUNet    91.42%    95.29%       95.75%    98.98%

References

  1. Chen, Y, Li, Y, and Yu, L (2017). Analysis of sugar beet industry development in three major regions of China. China Sugar.
  2. Tyagi, AC (2016). Towards a second green revolution. Irrigation and Drainage. 65, 388-389. https://doi.org/10.1002/ird.2076
  3. Ronneberger, O, Fischer, P, and Brox, T (2015). U-Net: convolutional networks for biomedical image segmentation. Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015. Cham, Switzerland: Springer, pp. 234-241. https://doi.org/10.1007/978-3-319-24574-4_28
  4. Zhou, Z, Rahman Siddiquee, MM, Tajbakhsh, N, and Liang, J (2018). UNet++: a nested U-Net architecture for medical image segmentation. Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support. Cham, Switzerland: Springer, pp. 3-11. https://doi.org/10.1007/978-3-030-00889-5_1
  5. Vaswani, A, Shazeer, N, Parmar, N, Uszkoreit, J, Jones, L, Gomez, AN, Kaiser, L, and Polosukhin, I (2017). Attention is all you need. Advances in Neural Information Processing Systems. 30, 5998-6008.
  6. Fu, J, Liu, J, Tian, H, Li, Y, Bao, Y, Fang, Z, and Lu, H (2019). Dual attention network for scene segmentation. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, pp. 3146-3154. https://doi.org/10.1109/CVPR.2019.00326
  7. Oktay, O, Schlemper, J, Folgoc, LL, Lee, M, Heinrich, M, and Misawa, K (2018). Attention U-Net: learning where to look for the pancreas. Available: https://arxiv.org/abs/1804.03999
  8. Shang, J, Jiang, H, Yu, G, Chen, Z, Wang, B, Li, Z, and Zhang, W (2020). Weed identification system based on deep learning. Software Guide. 19, 127-130.
  9. Wang, H, and Chen, G (2021). Real-time segmentation of field weeds based on PCAW-UNet. Journal of Xi’an University of Arts and Sciences (Natural Science Edition). 24, 27-37. https://doi.org/10.3969/j.issn.1008-5564.2021.02.006
  10. Yu, J, Sharpe, SM, Schumann, AW, and Boyd, NS (2019). Deep learning for image-based weed detection in turfgrass. European Journal of Agronomy. 104, 78-84. https://doi.org/10.1016/j.eja.2019.01.004
  11. Zhao, J, Cao, D, Lan, Y, Pan, F, Wen, Y, Yang, D, and Lu, L (2021). Extraction of maize field ridge centerline based on FCN with UAV remote sensing images. Transactions of the Chinese Society of Agricultural Engineering. 37, 72-80. https://doi.org/10.11975/j.issn.1002-6819.2021.09.009
  12. Woo, S, Park, J, Lee, JY, and Kweon, IS (2018). CBAM: convolutional block attention module. Computer Vision – ECCV 2018. Cham, Switzerland: Springer, pp. 3-19. https://doi.org/10.1007/978-3-030-01234-2_1
  13. Fan, X, Cao, P, Shi, P, Wang, J, Xin, Y, and Huang, W (2021). A nested UNet with attention mechanism for road crack image segmentation. Proceedings of 2021 IEEE 6th International Conference on Signal and Image Processing (ICSIP), Nanjing, China, pp. 189-193. https://doi.org/10.1109/ICSIP52628.2021.9688782
  14. Chebrolu, N, Lottes, P, Schaefer, A, Winterhalter, W, Burgard, W, and Stachniss, C (2017). Agricultural robot dataset for plant classification, localization and mapping on sugar beet fields. The International Journal of Robotics Research. 36, 1045-1052. https://doi.org/10.1177/0278364917720510
  15. Christoffersen, P, and Jacobs, K (2004). The importance of the loss function in option valuation. Journal of Financial Economics. 72, 291-318. https://doi.org/10.1016/j.jfineco.2003.02.001
  16. Barron, JT (2019). A general and adaptive robust loss function. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, pp. 4331-4339. https://doi.org/10.1109/CVPR.2019.00446
  17. Dalianis, H (2018). Evaluation metrics and evaluation. Clinical Text Mining. Cham, Switzerland: Springer, pp. 45-53. https://doi.org/10.1007/978-3-319-78503-5_6
  18. Wang, Z, Wang, E, and Zhu, Y (2020). Image segmentation evaluation: a survey of methods. Artificial Intelligence Review. 53, 5637-5674. https://doi.org/10.1007/s10462-020-09830-9
  19. He, H, Zhang, C, Chen, J, Geng, R, Chen, L, Liang, Y, Lu, Y, Wu, J, and Xu, Y (2021). A hybrid-attention nested UNet for nuclear segmentation in histopathological images. Frontiers in Molecular Biosciences. 8, article no. 614174. https://doi.org/10.3389/fmolb.2021.614174
  20. Huo, G, Lin, D, and Yuan, M (2022). Iris segmentation method based on improved UNet++. Multimedia Tools and Applications. 81, 41249-41269. https://doi.org/10.1007/s11042-022-13198-z
  21. Miao, F, Zheng, S, and Tao, B (2019). Crop weed identification system based on convolutional neural network. Proceedings of 2019 IEEE 2nd International Conference on Electronic Information and Communication Technology (ICEICT), Harbin, China, pp. 595-598. https://doi.org/10.1109/ICEICT.2019.8846268

Xin-Zhi Hu received her B.S. degree in Computer Engineering from Kyungnam University, Masan, South Korea, in 2021, and is currently pursuing an M.S. degree in IT Convergence Engineering at Kyungnam University. Her research interests include computer vision and pattern recognition.

E-mail: huxinzhi0326@gmail.com

Wang-Su Jeon received his B.S. and M.S. degrees in Computer Engineering and IT Convergence Engineering from Kyungnam University, Masan, South Korea, in 2016 and 2018, respectively, and is currently pursuing a Ph.D. degree in IT Convergence Engineering at Kyungnam University. His research interests include computer vision, pattern recognition, and machine learning.

E-mail: jws2218@naver.com

Grzegorz Cielniak is a senior lecturer at the School of Computer Science, University of Lincoln. He received his M.Sc. in Robotics from the Wrocław University of Technology in 2000 and his Ph.D. in Computer Science from Örebro University in 2006. His research interests include mobile robotics, artificial intelligence, real-time computer vision systems, and multisensor fusion, with a particular focus on robotic applications in agriculture.

E-mail: Grzegorz.Cielniak@gmail.com

Sang-Yong Rhee received his B.S. and M.S. degrees in Industrial Engineering from Korea University, Seoul, South Korea, in 1982 and 1984, respectively, and his Ph.D. degree in Industrial Engineering from Pohang University, Pohang, South Korea. He is currently a professor in the Department of Computer Engineering, Kyungnam University, Masan, South Korea. His research interests include computer vision, augmented reality, neuro-fuzzy systems, and human-robot interfaces.

E-mail: syrhee@kyungnam.ac.kr
