International Journal of Fuzzy Logic and Intelligent Systems 2021; 21(4): 401-408
Published online December 25, 2021
https://doi.org/10.5391/IJFIS.2021.21.4.401
© The Korean Institute of Intelligent Systems
Wang-Su Jeon1 and Sang-Yong Rhee2
1Department of IT Convergence Engineering, Kyungnam University, Changwon, Korea
2Department of Computer Engineering, Kyungnam University, Changwon, Korea
Correspondence to: Sang-Yong Rhee (syrhee@kyungnam.ac.kr)
*These authors contributed equally to this work.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted noncommercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Image segmentation is the process of dividing an image into sets of pixels in order to simplify its analysis in terms of meaning or foreground and background. The multiple path feature aggregation (MPFA) method proposed in this paper aims to extract diverse information about an object by combining conventional pyramid pooling with the extraction of features at various sizes. This information can be combined with different regional features to obtain the overall feature information. We split the network into four paths to extract numerous local features, and the results showed a mean intersection over union (mIOU) of 81.6% on the validation data of the PASCAL VOC 2012 dataset, a better performance than that of the existing DeepLab model.
Keywords: MPFA, Semantic segmentation, Feature aggregation, CNN, Inverted residual block, Local context
The performance of robots and self-driving cars is improving owing to recently developed hardware, big data technology, and artificial intelligence. Both robots and autonomous driving fundamentally require visual recognition, which is the ability to extract meaningful feature data from images captured by a camera and to detect, recognize, and classify objects.
Visual perception basically categorizes a scene into foreground and background, and in traditional computer vision this was done through image segmentation. Image segmentation refers to the process of dividing an image into several sets of pixels. The purpose of segmentation is to simplify a scene based on its meaning or interpretation rather than simply to classify it; in particular, the boundary of, or the set of pixels belonging to, each object is determined.
Image segmentation can be divided into three types, as shown in Figure 1 [1]: instance segmentation, semantic segmentation, and panoptic segmentation [2]. Instance segmentation classifies the individual object to which each pixel belongs, whereas semantic segmentation assigns a class label to every pixel. Panoptic segmentation handles the foreground with instance segmentation and the background with semantic segmentation, classifying pixels as “stuff” or “things.” The term stuff refers to uncountable background regions, and things refers to countable objects.
Traditional computer vision techniques did not achieve good performance in image segmentation, but since the adoption of deep learning, outstanding performance has been demonstrated [3,4]. Through the data provided by recent benchmark competitions such as Cityscapes [5], PASCAL VOC [6], and COCO [7], it can be seen that performance has improved significantly.
When applying a convolutional neural network (CNN) [8–10], three problems occur: a reduction in resolution, a loss of object information, and spatial invariance. The first arises in the pooling process; the second occurs between layers or between the extracted feature maps, where information about objects is lost; and the third is that, because the network learns the spatial information of the entire image, an object that is not centered or that is affected by its surroundings is vulnerable to rotation or noise. Although these problems can be alleviated using techniques such as data augmentation, doing so creates another problem of long training times.
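The resolution problem can be seen directly by tracking feature-map sizes through repeated pooling; the following minimal tf.keras sketch uses an arbitrary 320 × 320 input chosen purely for illustration:

```python
import tensorflow as tf

# A hypothetical 320x320 RGB input; each 2x2 max pooling halves the spatial size.
x = tf.zeros([1, 320, 320, 3])
for stage in range(5):
    x = tf.keras.layers.MaxPool2D(pool_size=2)(x)
    print("after pooling stage", stage + 1, "->", x.shape)
# 320 -> 160 -> 80 -> 40 -> 20 -> 10: the fine spatial detail needed to
# delineate object boundaries is progressively lost.
```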
In this paper, we propose a multiple path feature aggregation (MPFA) method to solve these problems. The method improves performance by learning feature maps and spatial information at a variety of sizes.
The remainder of this paper is organized as follows: Section 2 discusses related work, and Section 3 describes how the proposed model applies MPFA. Section 4 details the experimental results and analysis of the proposed method, and Section 5 provides some concluding remarks and areas of future research.
Semantic segmentation models typically use networks [11–14] whose performance has been proven on the major challenge datasets as a backbone and obtain better performance through fine-tuning. The fully convolutional network, which replaced the fully connected layers of existing classification networks with convolutional layers, was the starting point of improved semantic segmentation performance [3]. In addition, methods that enlarge the receptive field with which the network views objects [15–17] use dilated (atrous) convolution or deconvolution [18]. DeepLab [16] and PSPNet [19], the most frequently used semantic segmentation models, employ multi-scale processing.
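For reference, a dilated (atrous) convolution enlarges the receptive field without further downsampling, as in the following minimal tf.keras sketch; the 41 × 41 × 256 feature map and layer widths are illustrative, not the configuration of any cited model:

```python
import tensorflow as tf

features = tf.zeros([1, 41, 41, 256])   # an example backbone feature map

# Standard 3x3 convolution: 3x3 receptive field on this map.
standard = tf.keras.layers.Conv2D(256, 3, padding="same")(features)

# Dilated 3x3 convolution with rate 6: same number of weights, but the kernel
# taps are spaced 6 pixels apart, giving an effective 13x13 receptive field
# while the output stays at 41x41.
dilated = tf.keras.layers.Conv2D(256, 3, padding="same", dilation_rate=6)(features)

print(standard.shape, dilated.shape)    # both (1, 41, 41, 256)
```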
Reducing or enlarging the size of a layer is accompanied by a loss of information; in this case, the loss can be reduced by reusing information from the preceding higher-resolution layers. Multi-scale approaches are therefore applied: by letting features of different sizes complement one another [20–24], they improve performance and enrich the class and location information available for image segmentation. Additionally, a conditional random field (CRF) is often used as post-processing to refine the predicted result.
Other studies train end-to-end and improve the network itself rather than applying a CRF as post-processing [25–27]; however, the location information of objects degrades, and performance is poor when objects overlap or scenes are complex. Methods that better exploit features of various sizes through deep learning rather than traditional techniques [28,29] have therefore shown improved performance in detecting objects [30]. Liu et al. [31] showed that global average pooling (GAP) can be used to improve the fully convolutional network (FCN).
It is therefore possible to exploit global spatial information obtained through GAP while obtaining local information by combining feature maps of different sizes through a pyramid network.
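A rough sketch of this idea is shown below; the names, shapes, and the simple tile-and-concatenate fusion are our own illustration rather than code from any of the cited works:

```python
import tensorflow as tf

local_feats = tf.zeros([1, 41, 41, 256])    # local features from one pyramid path

# Global context: average over the whole spatial grid, then broadcast it back.
gap = tf.keras.layers.GlobalAveragePooling2D()(local_feats)   # shape (1, 256)
gap = tf.reshape(gap, [1, 1, 1, 256])
gap = tf.tile(gap, [1, 41, 41, 1])                            # shape (1, 41, 41, 256)

# Fuse global and local information channel-wise.
fused = tf.keras.layers.Concatenate()([local_feats, gap])     # shape (1, 41, 41, 512)
print(fused.shape)
```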
The backbone network of the model proposed in this research is a fine-tuned ResNet-50, and MPFA consists of four paths, three of which have four sub-paths each. As shown in Figure 2, a feature map of size 41 × 41 produced by the backbone is fed into each of the paths.
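The excerpt does not specify exactly how the four path outputs are fused, so the following is only a schematic sketch under the assumption that the 41 × 41 outputs are concatenated channel-wise before a per-pixel classifier; all names, channel widths, and the 21-class output (20 PASCAL VOC classes plus background) are assumptions:

```python
import tensorflow as tf

def mpfa_fuse(path_outputs, num_classes=21):
    """Schematic fusion of the 41x41 path outputs (concatenation is assumed)."""
    x = tf.keras.layers.Concatenate()(path_outputs)
    x = tf.keras.layers.Conv2D(256, 1, activation="relu")(x)    # mix channels
    return tf.keras.layers.Conv2D(num_classes, 1)(x)            # per-pixel logits

paths = [tf.zeros([1, 41, 41, 256]) for _ in range(4)]          # dummy path outputs
print(mpfa_fuse(paths).shape)                                   # (1, 41, 41, 21)
```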
The first path of the proposed model uses the atrous spatial pyramid pooling (ASPP) of DeepLab, as shown in Figure 3. ASPP maintains the amount of computation while providing receptive fields of various sizes. The atrous rates used in this study are {6, 8, 16, 24}.
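A minimal tf.keras sketch of an ASPP block with these rates is given below; the channel width and the additional 1 × 1 branch are our assumptions rather than details stated in the paper:

```python
import tensorflow as tf

def aspp(x, rates=(6, 8, 16, 24), filters=256):
    """Atrous spatial pyramid pooling: parallel dilated 3x3 convolutions."""
    branches = [tf.keras.layers.Conv2D(filters, 1, activation="relu")(x)]  # 1x1 branch (assumed)
    for r in rates:
        branches.append(
            tf.keras.layers.Conv2D(filters, 3, padding="same",
                                   dilation_rate=r, activation="relu")(x))
    return tf.keras.layers.Concatenate()(branches)

x = tf.zeros([1, 41, 41, 256])
print(aspp(x).shape)    # (1, 41, 41, 1280)
```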
The second path of the proposed model is composed of layers that perform the residual operation in a pyramidal form, as shown in Figure 4. The overall structure uses the inverted residual block shown in Figure 5(b), which is the building block of MobileNet v2 [32]. Based on the hypothesis that the relevant information lies on a low-dimensional manifold, the inverted residual block expands the low-dimensional channels to a higher-dimensional representation, applies ReLU there, and projects back with a linear layer, which has the advantage of minimizing the information loss caused by ReLU.
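A minimal sketch of an inverted residual block in the MobileNet v2 style is shown below; the expansion factor of 6 and the channel counts are illustrative defaults, not values taken from the paper:

```python
import tensorflow as tf

def inverted_residual(x, out_channels, expansion=6, stride=1):
    """Expand to a wide representation, filter depthwise, then project back
    through a linear bottleneck (no ReLU after the projection)."""
    in_channels = x.shape[-1]
    h = tf.keras.layers.Conv2D(in_channels * expansion, 1,
                               activation=tf.nn.relu6)(x)           # expand
    h = tf.keras.layers.DepthwiseConv2D(3, strides=stride, padding="same",
                                        activation=tf.nn.relu6)(h)  # depthwise filter
    h = tf.keras.layers.Conv2D(out_channels, 1)(h)                  # linear projection
    if stride == 1 and in_channels == out_channels:
        h = tf.keras.layers.Add()([x, h])                           # residual connection
    return h

x = tf.zeros([1, 41, 41, 64])
print(inverted_residual(x, 64).shape)    # (1, 41, 41, 64)
```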
The third path of the proposed model uses the pyramid pooling module of PSPNet. In the average pooling used here, the channel-wise average is taken, partial information reduced to 1/4 size is extracted after each pooling operation, and the results are then concatenated. The average pooling parameters are set to 7, 9, 11, and 15, respectively, and the structure is shown in Figure 6.
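The sketch below follows this description using the pooling sizes 7, 9, 11, and 15; the 1 × 1 convolution for the 1/4 channel reduction, the pooling strides, and the bilinear upsampling are our assumptions about unstated details:

```python
import tensorflow as tf

def pyramid_pooling(x, pool_sizes=(7, 9, 11, 15)):
    """Average-pool at several window sizes, reduce each result to 1/4 of the
    input channels, resize back to the input resolution, and concatenate."""
    channels = x.shape[-1]
    height, width = x.shape[1], x.shape[2]
    branches = [x]
    for p in pool_sizes:
        b = tf.keras.layers.AveragePooling2D(pool_size=p, strides=p,
                                             padding="same")(x)
        b = tf.keras.layers.Conv2D(channels // 4, 1, activation="relu")(b)
        b = tf.image.resize(b, [height, width])      # bilinear upsample back
        branches.append(b)
    return tf.keras.layers.Concatenate()(branches)

x = tf.zeros([1, 41, 41, 256])
print(pyramid_pooling(x).shape)    # (1, 41, 41, 512)
```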
The experimental environment used in this study was as follows: Ubuntu 18.04.4, an Intel Xeon Gold 5120 CPU, 32 GB of memory, and a TITAN RTX graphics card with 24 GB of memory. TensorFlow 1.14.0 was used as the deep learning framework, and PASCAL VOC 2012 was used for training. The PASCAL VOC 2012 image segmentation dataset consists of 10,582 training images and 1,449 validation images over 20 classes. The model was trained for 20,000 iterations with a batch size of 8 and a learning rate of 0.00001, using the Adam optimizer, a stochastic gradient descent method. Using the constructed model, we then measured the change in performance while varying the MPFA feature size, the batch size, and the number of iterations.
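The reported hyperparameters can be summarized in a short training-step sketch; it is written against tf.keras (TF 2.x style) rather than the authors' TensorFlow 1.14 code, and the per-pixel cross-entropy loss and the placeholder `model` are assumptions:

```python
import tensorflow as tf

# Hyperparameters reported in the text; the data pipeline and the loop that
# consumes BATCH_SIZE and ITERATIONS are omitted here.
BATCH_SIZE = 8
ITERATIONS = 20000
LEARNING_RATE = 1e-5

optimizer = tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

def train_step(model, images, labels):
    """One optimization step; `model` stands in for the MPFA network above."""
    with tf.GradientTape() as tape:
        logits = model(images, training=True)   # (batch, H, W, num_classes)
        loss = loss_fn(labels, logits)          # per-pixel cross-entropy (assumed)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```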
The experimental results obtained using the proposed method are shown in Figures 7 and 8, where the input image, the ground truth, the DeepLab [16] result, and the result of the proposed model are shown in order from the left. In Figure 7, bottles, persons, buses, horses, and TVs are classified. Comparing DeepLab and the proposed model, the results are either similar to those of DeepLab or show more misclassification errors than DeepLab; this problem is judged to be a classifier error caused by the large amount of scale information. Figure 8 shows the classification of people, bottles, horses, and cows, and the overall segmentation performance is excellent. In the second image, however, part of a horse is erroneously recognized as a cow.
The results of the overall performance evaluation and comparison are shown in Table 1, which reports the change in performance when the feature size of the proposed method is adjusted. Performance is reported as accuracy (mean intersection over union, mIOU) and frames per second (fps). Performance is best when the feature size is largest; when the feature size is 256, there is little deterioration in performance, and 32 frames per second can be processed.
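For reference, mIOU can be computed from a per-class confusion matrix as in the generic sketch below (this is the standard definition, not the authors' evaluation script):

```python
import numpy as np

def mean_iou(confusion):
    """confusion[i, j] = number of pixels of true class i predicted as class j."""
    confusion = confusion.astype(np.float64)
    tp = np.diag(confusion)
    union = confusion.sum(axis=0) + confusion.sum(axis=1) - tp
    iou = tp / np.maximum(union, 1.0)    # guard against empty classes
    return iou.mean()

# Toy 3-class example.
cm = np.array([[50, 2, 1],
               [3, 40, 5],
               [0, 4, 45]])
print(round(mean_iou(cm), 3))
```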
As shown in Table 2, the performance was compared by adjusting the batch size and the number of iterations, and the highest performance was obtained with a batch size of 10. Table 3 compares the proposed model with five models evaluated on the PASCAL VOC 2012 validation leaderboard, DeepLab-CRF [16], FastDenseNas-arch0 [33], DeepLab v3+ [34], DFN [35], and ResNet-GCN [36], using the validation data.
In this paper, we proposed an MPFA method, based on the DeepLab model and the PASCAL VOC 2012 data, for semantic segmentation. The proposed method achieved an accuracy of 81.6%, better than that of the DeepLab-based models. However, there was a problem in extracting non-labeled regions, and degraded classification performance occurred. Therefore, when the model has to handle a large amount of scale information, the performance deteriorates, and the approach is judged to be unsuitable in such cases.
In future research, we plan to improve the classification performance and develop a lightweight model suitable for real-time applications.
No potential conflicts of interest relevant to this article are reported.
Figure 1. Type of segmentation: (a) input, (b) instance segmentation, (c) semantic segmentation, and (d) panoptic segmentation. The images are from [1].
Figure 2. Multiple path feature aggregation network architecture.
Figure 3. Atrous spatial pyramid pooling structure.
Figure 4. Multi-scale residual feature structure.
Figure 5. Linear bottleneck structure: (a) residual block and (b) inverted residual block.
Figure 6. Pyramid pooling module structure.
Figure 7. Visual measurement of PASCAL VOC 2012 data. The results of the proposed method are compared with the baseline.
Figure 8. Visual measurement of PASCAL VOC 2012 data. The results of the proposed method are compared with the baseline.
Table 1. Mean IOU and fps measurement according to MPFA size.

| Method | Mean IOU (%) | fps |
|---|---|---|
| ResNet50-MPFA512 | | 12 |
| ResNet50-MPFA256 | 78.3 | 24 |
| ResNet50-MPFA128 | 75.5 | 32 |
| ResNet50-MPFA64 | 72.3 | |
Table 2. Mean IOU measurement according to batch size and iterations.

| Method | Batch size | Iteration | Mean IOU (%) |
|---|---|---|---|
| ResNet50-MPFA512 | 4 | 10k | |
| ResNet50-MPFA512 | 4 | 20k | 77.7 |
| ResNet50-MPFA512 | 10 | 20k | 78.3 |
| ResNet50-MPFA512 | 10 | 20k | |