Original Article

International Journal of Fuzzy Logic and Intelligent Systems 2021; 21(4): 401-408

Published online December 25, 2021

https://doi.org/10.5391/IJFIS.2021.21.4.401

© The Korean Institute of Intelligent Systems

MPFANet: Semantic Segmentation Using Multiple Path Feature Aggregation

Wang-Su Jeon1 and Sang-Yong Rhee2

1Department of IT Convergence Engineering, Kyungnam University, Changwon, Korea
2Department of Computer Engineering, Kyungnam University, Changwon, Korea

Correspondence to: Sang-Yong Rhee (syrhee@kyungnam.ac.kr)
*These authors contributed equally to this work.

Received: November 4, 2020; Revised: September 23, 2021; Accepted: November 8, 2021

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted noncommercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Image segmentation is the process of dividing an image into sets of multiple pixels in order to simplify a scene into a form that is meaningful and easier to analyze. The multiple path feature aggregation (MPFA) method proposed in this paper aims to extract various types of information about an object using conventional pyramid pooling and the extraction of features of various sizes. This information can be combined with different regional features to obtain the overall feature information. We split the network into four paths to extract numerous local features, and the results showed a mean intersection over union (mIOU) of 81.6% on the validation data of the PASCAL VOC 2012 dataset, a better performance than that of the existing DeepLab model.

Keywords: MPFA, Semantic segmentation, Feature aggregation, CNN, Inverted residual block, Local context

1. Introduction

The performance of robots and self-driving cars is improving owing to recently developed hardware, big data technology, and artificial intelligence. Robots and autonomous driving fundamentally require visual recognition, which refers to the ability to detect, recognize, and classify objects by extracting meaningful feature data from images captured through a camera.

Visual perception basically categorizes a scene into foreground and background, and in traditional computer vision this was performed through image segmentation. Image segmentation refers to the process of dividing an image into several sets of pixels. The purpose of segmentation is not simply to classify a scene but to simplify it based on its meaning or interpretation; in particular, the boundary of each object or the set of pixels belonging to it is determined.

Image segmentation can be divided into three types, as shown in Figure 1 [1]: instance segmentation, semantic segmentation, and panoptic segmentation [2]. Instance segmentation classifies the individual object instance to which each pixel belongs, whereas semantic segmentation assigns a class label to each pixel. Panoptic segmentation treats the foreground with instance segmentation and the background with semantic segmentation, classifying pixels as either “stuff” or “things”: stuff refers to uncountable background regions, and things refers to countable objects.

Existing computer vision techniques did not achieve good performance in the field of image segmentation, but since the adoption of deep learning, outstanding performance has been demonstrated [3,4]. Through the data provided by recent competitions such as Cityscapes [5], PASCAL VOC [6], and COCO [7], it can be seen that performance has improved significantly.

When applying a convolutional neural network (CNN) [8–10], three problems occur: a reduction in resolution, a loss of object information, and a lack of spatial invariance. The first occurs during the pooling process; the second is generated between layers or between extracted feature maps as information about an object is passed along. Third, because the network literally learns the spatial information of the entire image, it is less affected by the surrounding environment when the object is located in the center, but it is vulnerable to rotation or noise. Although this can be mitigated using techniques such as data augmentation, doing so creates another problem of long training times.

In this paper, we propose a multiple path feature aggregation (MPFA) method to solve these problems. This method improves performance by learning feature maps and spatial information at a variety of scales.

The remainder of this paper is organized as follows: Section 2 discusses related work, and Section 3 describes how the proposed model applies MPFA. Section 4 details the experimental results and analysis of the proposed method, and Section 5 provides some concluding remarks and areas of future research.

2. Related Work

Semantic segmentation can use, as a backbone, the four models [11–14] whose performance has been proven in image classification challenges, and achieves better performance through fine-tuning. The fully convolutional network was the starting point for improved semantic segmentation performance, replacing the existing fully connected layers with fully convolutional ones [3]. In addition, methods that increase the receptive field with which the neural network views objects [15–17] use dilated convolution or deconvolution [18]. DeepLab [16] and PSPNet [19], the models most frequently used for semantic segmentation, employ multi-scale features.

When the size of a layer is reduced or increased, a loss of information occurs; in this case, the loss can be reduced by using information from the preceding higher-resolution layer. Therefore, a multi-scale approach is applied because it combines complementary features of different sizes [20–24] and can improve performance by increasing the class and location information available for image segmentation. Additionally, a conditional random field (CRF) is used as post-processing to improve the predicted result.

Other studies [25–27] improved the network by training it end-to-end rather than applying the CRF as a post-processing step; even so, the location information of an object performs poorly when objects overlap or in complex scenes. Therefore, non-traditional methods [28,29] that take better advantage of features of various sizes have been combined with deep learning, improving performance in detecting objects [30]. Liu et al. [31] found that global average pooling (GAP) can be used to improve the fully convolutional network (FCN).

It is therefore possible to use global spatial feature information through GAP while combining different feature maps for local information through a pyramid network.

3. MPFA

The backbone network of the proposed model is a fine-tuned ResNet-50, and MPFA consists of four paths, three of which have four sub-paths. As shown in Figure 2, a feature map of size 41 × 41 × 512 is made by concatenating the outputs of the three paths with that of the backbone. Next, the number of channels is reduced to 21 (the 20 classes plus background) by a 1 × 1 convolution to predict the classes, and training and prediction are performed.
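
As a rough illustration of this aggregation step, the following TensorFlow/Keras sketch (TF 2-style API, not the authors' TF 1.14 code) fuses the path outputs and applies the 21-channel 1 × 1 convolution; the 128-channel per-path width and the 321 × 321 input size are assumptions, since only the fused 41 × 41 × 512 size is stated in the text:

```python
import tensorflow as tf

def mpfa_head(backbone_feat, path_feats, num_classes=21):
    """Fuse the backbone output with the three MPFA path outputs.

    backbone_feat: (B, 41, 41, 128) feature map from the ResNet-50 backbone.
    path_feats: list of three (B, 41, 41, 128) tensors (ASPP,
        multi-scale residual, and pyramid pooling paths).
    """
    # Concatenating four 128-channel maps gives the 41 x 41 x 512 map
    # described in Figure 2.
    fused = tf.keras.layers.Concatenate(axis=-1)([backbone_feat] + path_feats)
    # A 1 x 1 convolution reduces the channels to the 21 class scores.
    logits = tf.keras.layers.Conv2D(num_classes, kernel_size=1)(fused)
    # Upsample to the (assumed) 321 x 321 input size for per-pixel prediction.
    return tf.image.resize(logits, size=(321, 321))
```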

3.1 Multi-Scale Feature

The first path of the proposed model uses the atrous spatial pyramid pooling (ASPP) of DeepLab, as shown in Figure 3. ASPP can maintain the number of computations while obtaining receptive fields of various sizes. The atrous rates used in this study are {6, 8, 16, 24}.
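
A minimal sketch of this path is given below; the atrous rates {6, 8, 16, 24} come from the text, while the 3 × 3 kernel and 128-filter branch width are assumptions following common ASPP practice:

```python
import tensorflow as tf

def aspp_path(x, filters=128, rates=(6, 8, 16, 24)):
    # Each branch enlarges the receptive field via its dilation rate
    # without changing the spatial resolution or the computation count
    # of a standard 3 x 3 convolution.
    branches = [
        tf.keras.layers.Conv2D(filters, 3, padding="same",
                               dilation_rate=r, activation="relu")(x)
        for r in rates
    ]
    # Merge the four multi-rate branches along the channel axis.
    return tf.keras.layers.Concatenate(axis=-1)(branches)
```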

3.2 Multi-Scale Residual Feature

The second path of the proposed model is composed of layers that perform the residual operation in a pyramidal format, as shown in Figure 4. The overall structure uses the inverted residual block shown in Figure 5(b), the block used in MobileNet v2 [32]. Based on the hypothesis that the manifold of interest can be expressed in low-dimensional channels, the inverted residual block expands the input to high-dimensional channels and then projects it back to a low-dimensional representation, which has the advantage of minimizing the information loss caused by ReLU.
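
The following sketch shows the standard MobileNet v2 inverted residual block for reference (batch normalization omitted for brevity; the expansion factor 6 is MobileNet v2's default, not a value stated in this paper):

```python
import tensorflow as tf

def inverted_residual(x, out_channels, expansion=6, stride=1):
    in_channels = x.shape[-1]
    # 1 x 1 expansion to a high-dimensional space (ReLU6, as in MobileNet v2).
    h = tf.keras.layers.Conv2D(expansion * in_channels, 1,
                               activation=tf.nn.relu6)(x)
    # 3 x 3 depthwise convolution filters each channel separately.
    h = tf.keras.layers.DepthwiseConv2D(3, strides=stride, padding="same",
                                        activation=tf.nn.relu6)(h)
    # Linear 1 x 1 projection back to low-dimensional channels; omitting the
    # activation here avoids the information loss ReLU causes at low dimension.
    h = tf.keras.layers.Conv2D(out_channels, 1, activation=None)(h)
    # Residual connection only when the input and output shapes match.
    if stride == 1 and in_channels == out_channels:
        h = tf.keras.layers.Add()([x, h])
    return h
```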

3.3 Pyramid Pooling Module

The third path of the proposed model is the pyramid pooling module used in PSPNet. In the average pooling used here, after the channel averages are obtained, partial information at 1/4 size is extracted through each pooling operation, and the results are then concatenated. The average pooling parameters are set to 7, 9, 11, and 15, respectively, and the structure is shown in Figure 6.
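
A minimal sketch of this path is shown below, assuming that 7, 9, 11, and 15 are the average-pooling kernel sizes and that the "1/4 size" refers to each branch reducing the channel count to one quarter before concatenation:

```python
import tensorflow as tf

def pyramid_pooling_path(x, pool_sizes=(7, 9, 11, 15)):
    in_channels = x.shape[-1]
    size = tf.shape(x)[1:3]
    branches = []
    for p in pool_sizes:
        # Average pooling summarizes each p x p neighborhood.
        h = tf.keras.layers.AveragePooling2D(pool_size=p, strides=p,
                                             padding="same")(x)
        # A 1 x 1 convolution keeps 1/4 of the input channels per branch.
        h = tf.keras.layers.Conv2D(in_channels // 4, 1, activation="relu")(h)
        # Upsample back to the input resolution before concatenation.
        h = tf.image.resize(h, size)
        branches.append(h)
    return tf.keras.layers.Concatenate(axis=-1)(branches)
```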

4. Experiments

4.1 Experiment Details

The experiment environment used in this study was as follows: an Intel Xeon Gold 5120 CPU running Ubuntu 18.04.4, 32 GB of memory, and an RTX TITAN 24 GB graphics card. TensorFlow 1.14.0 was used as the deep learning framework, and PASCAL VOC 2012 was used for training. The PASCAL VOC 2012 image segmentation dataset consists of 10,582 training images and 1,449 test images, with a total of 20 classes. The model was trained for 20,000 iterations with a batch size of 8 and a learning rate of 0.00001, using Adam as the stochastic gradient descent method. Then, using the constructed model, the change in performance was measured by varying the MPFA feature size, the batch size, and the number of iterations.
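
For reference, the stated optimizer settings might be wired up as in the following sketch; the one-layer stand-in model and random tensors are placeholders for MPFANet and PASCAL VOC, and only the hyperparameters (Adam, learning rate 1e-5, batch size 8) come from the text:

```python
import tensorflow as tf

num_classes = 21  # 20 PASCAL VOC classes plus background

# Stand-in model: a single 1 x 1 convolution over pre-fused 41x41x512 features.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(num_classes, 1, input_shape=(41, 41, 512)),
])
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)

# One dummy batch of size 8: random features and random per-pixel labels.
x = tf.random.normal((8, 41, 41, 512))
y = tf.random.uniform((8, 41, 41), maxval=num_classes, dtype=tf.int32)
model.fit(x, y, batch_size=8, epochs=1)
```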

4.2 Experiment Result

The experiment results obtained using the proposed method are shown in Figures 7 and 8; the input image, ground truth, DeepLab [16] result, and proposed model result are shown in order from the left. In Figure 7, bottles, persons, buses, horses, and TVs are classified. Comparing DeepLab and the proposed model, the results are similar to those of DeepLab or, in some cases, show more misclassification than DeepLab; this problem is judged to be a classifier error owing to the large amount of scale information. Figure 8 shows the classification of people, bottles, horses, and cows; here, the overall segmentation performance is excellent, although in the second image part of a horse is misidentified as a cow.

The results of the evaluation and comparison of the overall performance are shown in Table 1, which reports the change in performance as the feature size of the proposed method is adjusted. Performance is given as accuracy (mean intersection over union [mIOU]) and frames per second (fps). The performance is best when the feature size is largest; at a feature size of 256, the deterioration in performance is modest and 24 frames per second can be processed.
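
The mIOU figures in Tables 1–3 follow the standard definition of per-class intersection over union averaged over classes; a minimal NumPy sketch of that computation (not the authors' evaluation code) is:

```python
import numpy as np

def mean_iou(pred, target, num_classes=21):
    """pred, target: integer label arrays of the same shape."""
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    # Accumulate a confusion matrix of (true class, predicted class) counts.
    np.add.at(cm, (target.ravel(), pred.ravel()), 1)
    tp = np.diag(cm).astype(np.float64)
    # IOU per class = TP / (TP + FP + FN); union = row sum + column sum - TP.
    union = cm.sum(axis=0) + cm.sum(axis=1) - tp
    valid = union > 0  # average only over classes that actually appear
    return (tp[valid] / union[valid]).mean()
```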

As shown in Table 2, the performance was compared while adjusting the batch size and the number of iterations, and the highest performance was obtained with a batch size of 10. Table 3 compares the proposed method, evaluated on the PASCAL VOC 2012 validation data, with five models from the validation leaderboard: DeepLab-CRF [16], FastDenseNas-arch0 [33], DeepLab v3+ [34], DFN [35], and ResNet-GCN [36].

5. Conclusion

In this paper, we proposed an MPFA method for semantic segmentation based on the PASCAL VOC 2012 data and the DeepLab model. The proposed method achieved an accuracy of 81.6%, demonstrating better performance than the DeepLab models. However, there was a problem in extracting non-labeled regions, and degraded classification performance occurred; thus, when the model carries too much scale information, the performance suffers and the approach is judged to be unsuitable.

In future research, we plan to develop a lightweight model suitable for real-time application while improving the classification performance.

This work was supported by a National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIT) (No. 2020R1F1A1075968).

Fig. 1. Type of segmentation: (a) input, (b) instance segmentation, (c) semantic segmentation, and (d) panoptic segmentation. The images are from [1].

Fig. 2. Multiple path feature aggregation network architecture.

Fig. 3. Atrous spatial pyramid pooling structure.

Fig. 4. Multi-scale residual feature structure.

Fig. 5. Linear bottleneck structure: (a) residual block and (b) inverted residual block.

Fig. 6. Pyramid pooling module structure.

Fig. 7. Visual measurement on PASCAL VOC 2012 data. The results of the proposed method are compared with the baseline.

Fig. 8. Visual measurement on PASCAL VOC 2012 data. The results of the proposed method are compared with the baseline.

Table 1. Mean IOU and fps measurement according to MPFA size.

Method              Mean IOU (%)    fps
ResNet50-MPFA512    81.6            12
ResNet50-MPFA256    78.3            24
ResNet50-MPFA128    75.5            32
ResNet50-MPFA64     72.3            40

Table 2. Mean IOU measurement according to batch size and iterations.

Method              Batch size    Iterations    Mean IOU (%)
ResNet50-MPFA512    4             10k           77.0
ResNet50-MPFA512    4             20k           77.7
ResNet50-MPFA512    10            20k           78.3
ResNet50-MPFA512    10            20k           81.6

Table 3. Mean IOU of each model on the PASCAL VOC 2012 validation data.

Method                      Mean IOU (%)
ResNet-GCN [36]             81.0
DFN [35]                    80.6
DeepLab v3+ [34]            79.3
FastDenseNas-arch0 [33]     78.0
DeepLab-CRF [16]            77.69
Proposed method             81.6

References

  1. Mechea, D. (2019) . What is Panoptic Segmentation and why you should care. Available: https://medium.com/@danielmechea/what-is-panoptic-segmentation-and-why-you-should-care-7f6c953d2a6a
  2. Kirillov, A, He, K, Girshick, R, Rother, C, and Dollar, P . Panoptic segmentation., Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, Long Beach, CA, pp.9404-9413.
  3. Sermanet, P, Eigen, D, Zhang, X, Mathieu, M, Fergus, R, and LeCun, Y. (2013) . Overfeat: integrated recognition, localization and detection using convolutional networks. Available: https://arxiv.org/abs/1312.6229
  4. Long, J, Shelhamer, E, and Darrell, T . Fully convolutional networks for semantic segmentation., Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp.3431-3440.
  5. Cordts, M, Omran, M, Ramos, S, Rehfeld, T, Enzweiler, M, Benenson, R, Franke, U, Roth, S, and Schiele, B . The cityscapes dataset for semantic urban scene understanding., Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, Las Vegas, NV, pp.3213-3223.
  6. Mottaghi, R, Chen, X, Liu, X, Cho, NG, Lee, SW, Fidler, S, Urtasun, R, and Yuille, A . The role of context for object detection and semantic segmentation in the wild., Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, Columbus, OH, pp.891-898.
  7. Lin, TY, Maire, M, Belongie, S, Hays, J, Perona, P, Ramanan, D, Dollar, P, and Zitnick, CL. (2014) . Microsoft COCO: common objects in context. Computer Vision – ECCV 2014, 740-755. https://doi.org/10.1007/978-3-319-10602-1_48
  8. Choee, HW, Park, SM, and Sim, KB (2019). CNN-based speech emotion recognition using transfer learning. Journal of Korean Institute of Intelligent Systems. 29, 339-344. https://doi.org/10.5391/JKIIS.2019.29.5.339
  9. Lee, WY, Ko, KE, Geem, ZW, and Sim, KB (2017). Method that determining the Hyperparameter of CNN using HS algorithm. Journal of Korean Institute of Intelligent Systems. 27, 22-28. https://doi.org/10.5391/JKIIS.2017.27.1.022
  10. Kim, S, and Cho, Y (2020). An artificial intelligence Othello game agent using CNN based MCTS and reinforcement learning. Journal of Korean Institute of Intelligent Systems. 30, 40-46. https://doi.org/10.5391/JKIIS.2020.30.1.40
  11. Krizhevsky, A, Sutskever, I, and Hinton, GE (2012). ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems. 25, 1097-1105.
  12. Simonyan, K, and Zisserman, A. (2014) . Very deep convolutional networks for large-scale image recognition. Available: https://arxiv.org/abs/1409.1556
  13. Szegedy, C, Liu, W, Jia, Y, Sermanet, P, Reed, S, Anguelov, D, Erhan, D, Vanhoucke, V, and Rabinovich, A . Going deeper with convolutions., Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, Boston, MA, pp.1-9.
  14. He, K, Zhang, X, Ren, S, and Sun, J . Deep residual learning for image recognition., Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, Las Vegas, NV, pp.770-778.
  15. Chen, LC, Papandreou, G, Kokkinos, I, Murphy, K, and Yuille, AL. (2014) . Semantic image segmentation with deep convolutional nets and fully connected CRFs. Available: https://arxiv.org/abs/1412.7062
  16. Chen, LC, Papandreou, G, Kokkinos, I, Murphy, K, and Yuille, AL. (2016) . DeepLab: semantic image segmentation with deep convolutional nets, Atrous convolution, and fully connected CRFs. Available: https://arxiv.org/abs/1606.00915
  17. Yu, F, and Koltun, V. (2015) . Multi-scale context aggregation by dilated convolutions. Available: https://arxiv.org/abs/1511.07122
  18. Noh, H, Hong, S, and Han, B . Learning deconvolution network for semantic segmentation., Proceedings of the IEEE International Conference on Computer Vision, 2015, Santiago, Chile, pp.1520-1528.
  19. Zhao, H, Shi, J, Qi, X, Wang, X, and Jia, J . Pyramid scene parsing network., Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, Honolulu, HI, pp.2881-2890.
  20. Xia, F, Wang, P, Chen, LC, and Yuille, AL (2016). Zoom better to see clearer: human and object parsing with hierarchical auto-zoom net. Computer Vision – ECCV 2016. Cham, Switzerland: Springer, pp. 648-663 https://doi.org/10.1007/978-3-319-46454-1_39
  21. Hariharan, B, Arbelaez, P, Girshick, R, and Malik, J . Hypercolumns for object segmentation and fine-grained localization., Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, Boston, MA, pp.447-456.
  22. Li, H, Xiong, P, Fan, H, and Sun, J . DFANet: deep feature aggregation for real-time semantic segmentation., Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, Long Beach, CA, pp.9522-9531.
  23. Choi, H, Lee, HJ, You, HJ, Rhee, SY, and Jeon, WS (2019). Comparative analysis of generalized intersection over union and error matrix for vegetation cover classification assessment. Sensors and Materials. 31, 3849-3858. https://doi.org/10.18494/sam.2019.2584
  24. Zhang, J, Lin, S, Ding, L, and Bruzzone, L (2020). Multi-scale context aggregation for semantic segmentation of remote sensing images. Remote Sensing. 12, article no. 701.
  25. Liu, Z, Li, X, Luo, P, Loy, CC, and Tang, X . Semantic image segmentation via deep parsing network., Proceedings of the IEEE International Conference on Computer Vision, 2015, Santiago, Chile, pp.1377-1385.
  26. Arnab, A, Jayasumana, S, Zheng, S, and Torr, PH (2016). Higher order conditional random fields in deep neural networks. Computer Vision – ECCV 2016. Cham, Switzerland: Springer, pp. 524-540 https://doi.org/10.1007/978-3-319-46475-6_33
  27. Zheng, S, Jayasumana, S, Romera-Paredes, B, Vineet, V, Su, Z, Du, D, Huang, C, and Torr, PH . Conditional random fields as recurrent neural networks., Proceedings of the IEEE International Conference on Computer Vision, 2015, Santiago, Chile, pp.1529-1537.
  28. Lazebnik, S, Schmid, C, and Ponce, J . Beyond bags of features: spatial pyramid matching for recognizing natural scene categories., Proceedings of 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2006, New York, NY, pp.2169-2178.
  29. Lucchi, A, Li, Y, Boix, X, Smith, K, and Fua, P . Are spatial and global constraints really necessary for segmentation?., Proceedings of 2011 International Conference on Computer Vision, 2011, Barcelona, Spain, pp.9-16.
  30. Szegedy, C, Reed, S, Erhan, D, Anguelov, D, and Ioffe, S. (2014) . Scalable, high-quality object detection. Available: https://arxiv.org/abs/1412.1441
  31. Liu, W, Rabinovich, A, and Berg, AC. (2015) . ParseNet: looking wider to see better. Available: https://arxiv.org/abs/1506.04579
  32. Sandler, M, Howard, A, Zhu, M, Zhmoginov, A, and Chen, LC . MobileNetV2: inverted residuals and linear bottlenecks., Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, Salt Lake City, UT, pp.4510-4520.
  33. Nekrasov, V, Chen, H, Shen, C, and Reid, I. (2019) . Fast neural architecture search of compact semantic segmentation models via auxiliary cells. Available: https://arxiv.org/abs/1810.10804
  34. Gao, S, Cheng, MM, Zhao, K, Zhang, XY, Yang, MH, and Torr, PH (2021). Res2Net: a new multi-scale backbone architecture. IEEE Transactions on Pattern Analysis and Machine Intelligence. 43, 652-662. https://doi.org/10.1109/TPAMI.2019.2938758
  35. Yu, C, Wang, J, Peng, C, Gao, C, Yu, G, and Sang, N . Learning a discriminative feature network for semantic segmentation., Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, Salt Lake City, UT, pp.1857-1866.
  36. Peng, C, Zhang, X, Yu, G, Luo, G, and Sun, J . Large kernel matters–improve semantic segmentation by global convolutional network., Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, Honolulu, HI, pp.4353-4361.

Wang-Su Jeon received his B.S. and M.S. degrees in Computer Engineering and IT Convergence Engineering from Kyungnam University, Masan, Korea, in 2016 and 2018, respectively, and is currently pursuing a Ph.D. degree in IT Convergence Engineering at Kyungnam University. His present interests include computer vision.

E-mail : jws2218@naver.com

Sang-Yong Rhee received his B.S. and M.S. degrees in Industrial Engineering from Korea University, Seoul, Korea, in 1982 and 1984, respectively, and his Ph.D. degree in Industrial Engineering from Pohang University, Pohang, Korea. He is currently a professor in the Department of Computer Engineering, Kyungnam University, Masan, Korea. His research interests include computer vision, augmented reality, deep learning, and human-robot interfaces.

E-mail : syrhee@kyungnam.ac.kr
